Anthropic released a 1-million-token context window for both Opus 4.6 and Sonnet 4.6, and the real story is more than the size increase. The breakthrough is that these models maintain strong performance across the entire context window, potentially solving the long-standing problem of "context rot," where model quality degrades sharply as input length grows.
Context rot refers to the well-documented degradation in large language model performance as the number of input tokens increases. Previous studies, notably the Chroma context rot study from summer 2025, showed that models experienced massive performance drop-offs as input tokens climbed past 100,000-200,000 tokens. Larger context windows were essentially "fool's gold": you got more budget, but the model could not use it effectively.
This meant that practitioners had to aggressively manage their context windows, clearing sessions at around 100,000-120,000 tokens to maintain output quality. Failing to do so resulted in noticeably worse responses.
Anthropic published results from the eight-needle retrieval test, a variant of the "needle in a haystack" benchmark. In this test, eight specific pieces of information (the "needles") are scattered throughout a massive conversation. The model is then asked to retrieve each needle exactly, from different points in the context window.
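To make the setup concrete, here is a minimal sketch of how an eight-needle benchmark could be constructed. The function names, the scoring scheme, and the exact-match criterion are all illustrative assumptions, not Anthropic's actual harness.

```python
import random

def build_haystack(filler_sentences, needles, total_sentences=10_000, seed=0):
    """Scatter each needle at a distinct random position in a long filler document."""
    rng = random.Random(seed)
    doc = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    positions = rng.sample(range(total_sentences), len(needles))
    for pos, needle in zip(positions, needles):
        doc[pos] = needle  # overwrite filler with the needle sentence
    return " ".join(doc), positions

def score_retrieval(model_answers, needles):
    """Fraction of needles the model reproduced exactly (a hypothetical metric)."""
    return sum(a == n for a, n in zip(model_answers, needles)) / len(needles)
```

The exact-match scoring is what makes the test demanding: with many near-identical filler sentences, the model must pull back precisely the right one, which mirrors the coding scenario described below.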
This is particularly relevant for coding use cases where a large codebase may contain many similar patterns, and the model needs to distinguish and retrieve the exact right one.
The jump from Opus 4.5 to Opus 4.6 represents both a 5x increase in usable context window (200K to 1M) and roughly a 3x improvement in retrieval effectiveness. The performance drop from 256K tokens all the way to 1 million tokens is only about 14%, a dramatic improvement over previous models.
Given the Chroma study's findings, the conventional wisdom was to clear your Claude Code session at around 100K-120K tokens. The new data suggests a different approach: with only about a 14% drop out to 1 million tokens, sessions can run far longer before a reset is needed.
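The shift in guidance can be expressed as a simple threshold heuristic. The 120K figure comes from the article; the 900K ceiling is an assumed safety margin below the 1M window, and the function itself is an illustrative sketch, not part of Claude Code.

```python
OLD_RESET_THRESHOLD = 120_000   # pre-4.6 conventional wisdom from the Chroma-era guidance
NEW_RESET_THRESHOLD = 900_000   # assumed margin below the 1M window (hypothetical)

def should_reset(session_tokens, long_context_model=True):
    """Return True when the session should be cleared to preserve output quality."""
    limit = NEW_RESET_THRESHOLD if long_context_model else OLD_RESET_THRESHOLD
    return session_tokens >= limit
```

Under this heuristic, a 150K-token session that would have warranted a reset on older models can simply keep running on a long-context one.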
The headline number of "1 million tokens" is impressive, but the real significance is the quality of performance at scale. At 1 million tokens, Opus 4.6 still outperforms Gemini 3.1 and matches GPT 5.4's performance at much smaller context sizes. This is not just a bigger window; it is a fundamentally more capable one.
For developers using Claude Code, this means longer uninterrupted coding sessions, fewer manual context resets, and the ability to work with larger codebases without performance hacks. The combination of a massive context window with minimal degradation represents one of the most practically significant improvements Anthropic has shipped.