GPT-5.4 Got the Best Score I've Ever Seen — Then I Found Something Stranger

Study Guide

Overview

Matt Maher shares results from his custom planning benchmark, testing GPT-5.4, Opus 4.6, Sonnet 4.6, and Gemini 3.1 Pro across multiple AI coding tools (Codex CLI, Claude Code, Gemini CLI, and Cursor). GPT-5.4 in "extra high" mode scores 95% — the highest he has ever measured — but the more surprising finding is that the tool you use and how you configure it can matter as much as which model you choose.

Key Concepts

The Planning Benchmark

Maher created a benchmark that measures planning quality, not coding ability. It hands an AI model a real Product Requirements Document (PRD) with roughly 100 features across 10 files and asks it to build an implementation plan. The score reflects how much of the original request actually makes it into the plan. This matters because features dropped during planning never get built: a plan that captures only 70% of the PRD yields a system that silently omits the other 30%.
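Maher hasn't published his scoring method, but the idea — features captured divided by features requested — can be sketched as follows. The `coverage_score` function and its simple substring matching are illustrative assumptions, not his actual implementation:

```python
# Minimal sketch of a coverage-style planning score. ASSUMPTION: the PRD's
# features can be reduced to short, recognizable labels, and a feature
# counts as "captured" if its label appears anywhere in the plan text.
# Maher's real benchmark presumably uses more robust matching.

def coverage_score(prd_features: list[str], plan_text: str) -> float:
    """Fraction of requested features that appear in the generated plan."""
    if not prd_features:
        return 0.0
    plan = plan_text.lower()
    captured = [f for f in prd_features if f.lower() in plan]
    return len(captured) / len(prd_features)

features = ["user login", "password reset", "audit logging", "csv export"]
plan = "Phase 1: user login and password reset. Phase 2: CSV export."
print(coverage_score(features, plan))  # 3 of 4 features captured -> 0.75
```

A score of 95% on a ~100-feature PRD means roughly five requested features never made it into the plan at all.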

Model Scores (No Planning Mode)

  • GPT-5.4 (Codex CLI, high): 82%
  • GPT-5.4 (Codex CLI, extra high): 95% — best score ever recorded
  • Opus 4.6 (Claude Code, high): 92.9%
  • Sonnet 4.6 (Claude Code): 92.4% — a massive jump from its previous ~77% range
  • Gemini 3.1 Pro (Gemini CLI): 52% — strong coder but poor at long-context planning

The Cursor Effect

When the same models are run through Cursor instead of their native CLIs, planning scores consistently improve across every model tested:

  • Gemini 3.1 Pro: 52% → 57%
  • Opus 4.6 (planning mode): 77% → 93%
  • Sonnet 4.6 (planning mode): 87.4% → 92%
  • GPT-5.4 (high): 82% → 88.4%

Maher observed Cursor performing a built-in verification pass — after completing a plan, it went back to evaluate the output against the original request, looking for gaps and filling in what was missed. This is the exact strategy Maher has long advocated as the best way to close the coverage gap.
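The observed behavior — plan, then re-check the plan against the original request and patch the gaps — can be sketched as a simple loop. The `complete` callable here is a hypothetical stand-in for any LLM call; this is not Cursor's actual internals, only the strategy Maher describes:

```python
# Sketch of a plan -> verify -> revise pass, as Maher describes observing
# in Cursor. ASSUMPTION: `complete` is any prompt-in, text-out LLM call
# (hypothetical; not a real Cursor or model-provider API).

def plan_with_verification(prd: str, complete) -> str:
    # 1. Draft an implementation plan from the PRD.
    plan = complete(f"Create an implementation plan for this PRD:\n{prd}")
    # 2. Gap analysis: compare the draft plan against the original request.
    gaps = complete(
        "Compare this plan against the original PRD and list any "
        f"requirements the plan misses.\n\nPRD:\n{prd}\n\nPlan:\n{plan}"
    )
    # 3. Fold the missed requirements back into the plan.
    return complete(
        f"Revise the plan to cover these gaps:\n{gaps}\n\nPlan:\n{plan}"
    )
```

The design point is that step 2 re-reads the *original* PRD rather than trusting the plan's own summary of it, which is what catches silently dropped features.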

The Planning Mode Paradox

In Claude Code specifically, using the dedicated "planning mode" actually hurts performance. Opus 4.6 in planning mode scores ~77%, but the same model in execution mode — simply told to create a plan — scores ~92.9%. That nearly 16-point swing is larger than the gap between most models. Maher theorizes that execution mode gives the model more freedom to structure its own thinking, potentially spawning sub-agents or performing internal gap analysis that planning mode's workflow restricts.

Key Takeaways

  1. GPT-5.4 extra high is the new planning champion at 95%, but only in Codex CLI extra-high mode — the standard high mode scores a more modest 82%.
  2. Sonnet 4.6 has closed the gap with Opus 4.6 on planning tasks (92.4% vs 92.9%), a dramatic improvement from its previous ~77% range.
  3. Cursor consistently outperforms native CLIs across every model tested, likely due to a built-in verification/gap-analysis pass.
  4. Avoid Claude Code planning mode for planning tasks. Telling the model to plan within execution mode produces significantly better results.
  5. The "tool race" matters as much as the "model race." The combination of model + tool + configuration determines your outcome — not the model alone.

Discussion Questions

  1. If Cursor appears to do a built-in verification pass, should Claude Code and Codex CLI adopt a similar approach? What are the tradeoffs?
  2. Why might a dedicated "planning mode" constrain a model's planning ability rather than enhance it? What does this suggest about how AI tool developers should design specialized modes?
  3. This benchmark measures planning coverage (features captured), not plan quality (correctness, feasibility, architecture). How might results differ if quality dimensions were also measured?
  4. GPT-5.4's million-token context window is cited as potentially relevant to its high score. How important is raw context length vs. how the model uses available context?
  5. Maher notes these findings are specific to planning, not overall coding. How should practitioners weigh planning benchmarks against coding benchmarks like SWE-bench when choosing tools?