This video challenges the widely accepted idea that AI capabilities are inherently "jagged" — excellent at some things and terrible at others. The presenter argues that jaggedness was never a property of AI intelligence itself, but rather an artifact of how we asked AI to work: single-turn, one-shot interactions with no organizational structure. As inference computing, agent harnesses, and multi-agent coordination have matured, AI capabilities have smoothed dramatically — especially for practical workplace tasks.
The central proof point is Cursor's coding harness solving a novel research-grade mathematics problem (Problem 6 from an unpublished Stanford/MIT/Berkeley competition) — using the same system designed to build web browsers. Four independent organizations — Anthropic, Google DeepMind, OpenAI, and Cursor — have converged on the same multi-agent architecture pattern without coordination, suggesting this organizational approach to AI is a fundamental solution, not a coincidence.
Since 2022, the AI community has operated under the assumption that AI capabilities follow a "jagged frontier" — brilliant at some tasks, unreliable at others. This framing shaped how organizations deployed AI, how educators taught about AI, and how individuals decided what to delegate.
The presenter argues this jaggedness was actually caused by a structural limitation: we were asking AI to solve every problem in a single turn, with no notes, no colleagues, no ability to retry, and no organizational scaffolding. As the video puts it: "We have been asking a capable analyst to solve every problem in 30 seconds with no notes, no colleagues, no ability to try something and no ability to retry."
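The contrast can be sketched in a few lines. This is a toy model, not the video's system: `model_call` stands in for a task too large for one turn, succeeding only once its scratchpad holds an intermediate result from an earlier attempt.

```python
# Hypothetical stand-in for a model call on a multi-step task: it can only
# finish if a prior turn already left an intermediate result in its notes.
def model_call(task, scratchpad):
    if "intermediate result" in scratchpad:
        return f"solved: {task}"
    scratchpad.append("intermediate result")  # partial progress, no answer yet
    return None

def one_shot(task):
    # Single turn, no notes, no retry: fails on any multi-step task.
    return model_call(task, scratchpad=[])

def scaffolded(task, max_turns=5):
    # Persistent notes plus retries: the same "model" now succeeds.
    scratchpad = []
    for _ in range(max_turns):
        answer = model_call(task, scratchpad)
        if answer is not None:
            return answer
    return None

print(one_shot("prove lemma"))    # None: the capability looks "jagged"
print(scaffolded("prove lemma"))  # solved: prove lemma
```

The point of the sketch is that nothing about the "model" changed between the two runs; only the scaffolding around it did.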
Three developments have smoothed AI performance: scaled inference-time compute, mature agent harnesses, and multi-agent coordination.
The key insight: two learning curves have been running in parallel. We track the intelligence curve (model capability), but have largely ignored the fluency curve — our own improving ability to use, structure, and scaffold AI work. The fluency curve may now matter more than the intelligence curve for practical applications.
On March 3, Cursor CEO Michael Truell announced that Cursor's coding agent had solved Problem 6 of an unpublished research-grade math competition, producing a solution with stronger bounds and better coverage than the official human-written answer. The system ran for four days with zero hints and zero human guidance.
Why this matters more than typical AI math benchmarks: Cursor is not a math company. A system designed to write code — one that had previously built a web browser from scratch in Rust — looked at a problem in spectral graph theory and produced mathematics the problem's own authors had not found. This demonstrates generalization, not narrow specialization.
Wilson Lynn's January blog post on scaling long-running autonomous coding described the harness architecture and the projects used to stress-test it: a web browser built from scratch in Rust, a Solid-to-React migration, a Java language server, a Windows 7 emulator (1.2M lines), and an Excel clone (1.6M lines).
Four organizations independently built the same multi-agent coordination structure: Anthropic, Google DeepMind, OpenAI, and Cursor.
"This convergence is not a coincidence. It is a solution to a real problem: how do you get useful work from units of intelligence with finite context, finite per-step reliability, and no persistent memory?"
The deeper observation: these agent architectures replicate the same organizational structures humans developed to scale their own intelligence — roles, handoffs, verification, and restart procedures. We invented these patterns for human teams, and it turns out they generalize naturally to autonomous agents. We have essentially given agents our organizational intelligence, and it scales.
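The convergent pattern described above can be illustrated with a minimal sketch. All names here are hypothetical, and the deterministic "unreliable worker" is a stand-in for finite per-step reliability; the structural point is that a planner decomposes the goal, workers run in isolation, a verifier gates every handoff, and failures restart cleanly instead of corrupting shared state.

```python
# Toy orchestrator/worker/verifier loop illustrating the shared pattern:
# roles, handoffs, verification, and restart procedures.
def plan(goal):
    # Planner role: decompose the goal into subtasks.
    return [f"{goal}: part {i}" for i in range(1, 4)]

def worker(subtask, attempt):
    # Worker role with finite per-step reliability: the first attempt
    # at any subtask produces an unusable result.
    return f"draft of {subtask}" if attempt > 0 else "garbage"

def verify(result):
    # Verifier role: a cheap check that gates each handoff.
    return result.startswith("draft of")

def orchestrate(goal, max_restarts=3):
    results = []
    for subtask in plan(goal):
        for attempt in range(max_restarts):
            result = worker(subtask, attempt)
            if verify(result):           # handoff only after verification
                results.append(result)
                break                    # restart procedure: retry the
        else:                            # subtask, not the whole run
            raise RuntimeError(f"gave up on {subtask}")
    return results

print(orchestrate("write report"))
```

Because verification sits between every worker step and the shared result list, a bad step costs one retry rather than the whole run; that is the same reason human teams use reviews and handoffs.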
The presenter introduces a tiered framework for understanding which work AI can tackle through agent harnesses.
The key claim: Tier 2 is much larger than we think. Even "soft" work like product strategy has surprisingly high verifiability — bring a product strategy to four experienced product leaders and their assessments will be remarkably consistent.
As agent harnesses improve and execution costs drop, the most valuable professional skill is shifting from "I can do the work" to "I can tell if the work is correct." The presenter calls these evaluation competencies meta-skills.
These evaluation competencies become more valuable as execution gets cheaper with agents, not less. The Anthropic 2026 coding trends report describes engineers "delegating tasks where they can easily sniff-check on correctness" — and the presenter argues this applies far beyond engineering.
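The asymmetry behind "sniff-checking" shows up even in elementary math: producing an answer can be expensive while checking a claimed answer is trivial. A small sketch (my example, not the presenter's) using integer factorization:

```python
# Generation vs. verification asymmetry in miniature: factoring n takes
# trial division, but checking a claimed factorization is one product.
def factor(n):
    # Expensive "execution": trial division up to sqrt(n).
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def verify_factorization(n, factors):
    # Cheap "sniff-check": multiply the claimed factors back together.
    product = 1
    for f in factors:
        product *= f
    return product == n

claimed = factor(589)
print(claimed, verify_factorization(589, claimed))  # [19, 31] True
```

Work whose outputs admit a check like `verify_factorization`, far cheaper than the work itself, is exactly the work that is safe to delegate.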
The relevant question for knowledge workers is no longer "Can AI do a specific task in my job?" but rather: "Can my work be decomposed into verifiable sub-problems?" The answer is yes far more often than most people are comfortable with.