Nate B. Jones puts ChatGPT 5.4 through a blind eval suite against Claude Opus 4.6 and Gemini 3.1 Pro: six structured evaluations with independent judging. The results reveal a model with genuine strengths in quantitative modeling and file processing, but critical weaknesses in common-sense reasoning, writing quality, and data judgment. The video also examines OpenAI's strategic direction toward agentic infrastructure and what Peter Steinberger's hire signals about the company's roadmap.
Jones opens with a deceptively simple prompt: "I need to wash my car. The car wash is 100 meters away. Should I walk or drive?" ChatGPT 5.4 in thinking mode said walk — a careful, well-structured, completely wrong answer. Claude Opus 4.6 answered in one sentence: "Drive. You need the car at the car wash." Gemini also got it right. This sets the tone: 5.4 can build elaborate reasoning chains that miss the obvious.
The review is based on blind evals — outputs labeled by number, judged independently, with an AI-fluent second reviewer. Six structured evaluations plus real-world task testing. No benchmark scores are cited; all results come from practical, work-relevant tests.
5.4 is a big upgrade from 5.2 but still loses to Opus 4.6 on both business and creative writing. It has a "tin ear" for tone — when asked to mimic challenging authors like Shakespeare or P.G. Wodehouse, results are weak. Business writing also fell short on clarity and articulation.
Given a boring JP Morgan financial deck, models had to find the funniest phrase, identify the pun, and rewrite it. Opus 4.6 delivered a pun that worked across three semantic layers. ChatGPT 5.4 found a real pun and delivered a competent rewrite. Gemini fabricated sources and URLs — twice.
In thinking mode, 5.4 tied for first on factual accuracy and calibration. In auto mode, it hallucinated future Nobel laureates and dropped to dead last. This is the single most important finding: the thinking mode vs. auto mode gap is enormous.
Jones also ran a "digital shoebox" migration test: 465 files of mixed formats. ChatGPT 5.4 processed 461 of the 465 (99.1% coverage), including CSVs, Excel, JSON, PDFs, VCF contacts, handwritten receipts via OCR, and corrupted JSON. But it let fake data through — a customer named "Mickey Mouse" and a $25,000 car wash — straight into the production database. It also failed to deduplicate (278 customers vs. the correct 176) and over-created business status values (13 vs. the needed 4-5).
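To make the failure concrete, here is a minimal sketch of the kind of plausibility-check and deduplication pass the migration skipped. The field names (`name`, `email`, `amount`), the blocklist, and the dollar threshold are all illustrative assumptions, not part of Jones's test harness.

```python
# Hypothetical guardrails for a customer migration: quarantine implausible
# records and collapse duplicates before anything reaches production.

def validate_record(rec, max_amount=10_000, blocklist=("mickey mouse",)):
    """Return a list of issues; an empty list means the record may load."""
    issues = []
    if rec.get("name", "").strip().lower() in blocklist:
        issues.append("suspicious name")
    if rec.get("amount", 0) > max_amount:
        issues.append("amount exceeds plausibility threshold")
    return issues

def dedupe(records):
    """Collapse duplicates on a normalized (name, email) key, keeping the first."""
    seen = {}
    for rec in records:
        key = (rec.get("name", "").strip().lower(),
               rec.get("email", "").strip().lower())
        seen.setdefault(key, rec)
    return list(seen.values())

records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "amount": 45},
    {"name": "ada lovelace ", "email": "ADA@example.com", "amount": 45},
    {"name": "Mickey Mouse", "email": "mm@example.com", "amount": 25_000},
]
clean = [r for r in dedupe(records) if not validate_record(r)]
print(len(clean))  # → 1 (the duplicate collapses, the fake record is quarantined)
```

The point is not sophistication — a dozen lines of normalization and thresholds would have caught both the $25,000 car wash and the inflated customer count.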
Jones returns to the review's central finding: ChatGPT 5.4's performance plummets outside thinking mode, and most users (an estimated 99%) will encounter auto mode by default. There, factual accuracy dropped sharply, the model hallucinated, and it ranked dead last among frontier models. Jones warns that trainers and AI enthusiasts will have to teach everyone to toggle to thinking mode every single time — a UX burden that shouldn't exist, because the auto-switcher should invoke thinking wherever thinking is needed.
Given a prompt to project Seattle Seahawks 2026 season win probabilities, 5.4 produced a six-tab workbook with Pythagorean win expectation, Elo-like ratings with offseason retention decay, and a Poisson binomial season distribution. It also wrote a self-critique more honest than most consulting deliverables. Opus 4.6 produced a cleaner, better-formatted three-tab workbook but with less statistical rigor.
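Two of the building blocks Jones names can be sketched briefly: Pythagorean win expectation (expected win fraction from points scored and allowed) and a Poisson binomial distribution over season wins (each game has its own win probability). This is an illustrative reconstruction under assumed parameters, not 5.4's actual workbook; the exponent and per-game probabilities are made up for the example.

```python
# Pythagorean expectation: win fraction implied by scoring margin.
def pythagorean_expectation(points_for, points_against, exponent=2.37):
    return points_for**exponent / (points_for**exponent + points_against**exponent)

# Poisson binomial: P(k wins) across independent games with differing
# win probabilities, built up one game at a time by convolution.
def season_win_distribution(game_probs):
    dist = [1.0]  # P(0 wins) before any games are played
    for p in game_probs:
        new = [0.0] * (len(dist) + 1)
        for k, pk in enumerate(dist):
            new[k] += pk * (1 - p)   # lose this game
            new[k + 1] += pk * p     # win this game
        dist = new
    return dist

probs = [0.6] * 17  # illustrative: a 17-game schedule at 60% per game
dist = season_win_distribution(probs)
print(round(sum(k * p for k, p in enumerate(dist)), 2))  # expected wins: 10.2
```

In a real projection the per-game probabilities would come from the Elo-like ratings (with offseason retention decay) rather than a flat 60%.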
5.4 processes more file types with less friction. In the schema migration eval, it handled every format thrown at it. Claude discovered all files but chose not to install OpenPyXL (a three-second pip install), silently skipping Excel files — a judgment failure, not an environment limitation. Box also published scores confirming 5.4's lead in document processing.
5.4 knows the AI competitive landscape better than its competitors do. When people use AI to learn about AI, models often don't even know which model they are. 5.4 shows a clear jump here: it understands the frontier and how models work.
Peter Steinberger was hired weeks before this release. He built OpenClaw using Codex, yet most of OpenClaw's users on GitHub prefer pairing it with Claude. Jones reads the hire as a signal that Steinberger will build a secure, stable, enterprise-grade version of OpenClaw at OpenAI. The model's emphasis on computer use, long-running tasks, and tool discovery points in the same direction: agentic infrastructure.
In the release notes, the word that appears most often is agents — not intelligence, not reasoning. The model is positioned as infrastructure for agentic systems that operate software, manage tools, sustain workflows across hours, and coordinate with external services.
5.4 introduces progressive tool discovery — the ability to discover tools at runtime rather than loading all definitions up front. This changes the cost structure for massive tool ecosystems at the enterprise level. If you've been building agents that juggle dozens of MCP servers, this is directly relevant.
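The cost argument can be made concrete with a sketch. In the usual pattern, every tool's full JSON schema is injected into the prompt on every turn; with progressive discovery, the model sees cheap one-line summaries and fetches a full definition only when it commits to a tool. Everything below (`ToolRegistry`, `describe`, `resolve`) is a hypothetical illustration of the pattern, not OpenAI's or MCP's actual API.

```python
# Hypothetical two-tier tool registry: summaries are always visible,
# full schemas are resolved lazily at call time.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, summary, schema):
        self._tools[name] = {"summary": summary, "schema": schema}

    def describe(self):
        """Cheap: one-line summaries, small enough to send every turn."""
        return {name: t["summary"] for name, t in self._tools.items()}

    def resolve(self, name):
        """Expensive: the full schema, loaded only when the tool is chosen."""
        return self._tools[name]["schema"]

registry = ToolRegistry()
registry.register("search_orders", "Find orders by customer or date",
                  {"params": {"customer_id": "string", "since": "date"}})
registry.register("refund", "Issue a refund for an order",
                  {"params": {"order_id": "string", "amount": "number"}})

print(registry.describe())          # tiny payload in every prompt
print(registry.resolve("refund"))   # full schema fetched on demand
```

With dozens of MCP servers exposing hundreds of tools, the context saved by deferring schemas until call time is what changes the cost structure Jones is pointing at.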
OpenAI is shipping monthly (a cadence no other frontier lab has committed to publicly). The labs are converging on agentic capability while diverging on philosophy. Jones encourages paying less attention to leaderboard winners and more to what's actually being measured for real work; features like progressive tool discovery matter more than benchmark scores. The future OpenAI is building toward is agentic, tool-heavy, about sustained workflows rather than single turns, and about operating software rather than generating text.