IndyDevDan tackles a practical engineering question: when you plan a new feature for a product, what kind of spec should you write the plan in? Markdown, HTML, or something richer? His answer is the principle he says has never changed: more useful tokens outperform fewer useful tokens, with the emphasis on useful. But the deeper point of the video is that you cannot know which spec is actually better unless you measure it.
To measure, he runs three Pi coding agents on Gemini 3.5 Flash, each building the exact same product from a different spec type (markdown, HTML, and an enhanced "V" / visual spec), and watches everything through a Pi agent observability dashboard. The recurring thesis: if you don't measure, you won't improve.
Dan opens by referencing two recent industry moments: the viral Anthropic engineers' post on the "unreasonable effectiveness of HTML" for specs, and OpenAI's release of a benchmark-topping image model that generates near-perfect images. Given those, should you spec in markdown, HTML, or something else? He reframes the choice from "which is best" to a trade-off between the performance, speed, and cost of the resulting agentic solution.
Takeaway: Treat spec format as an engineering trade-off to be measured, not a style preference to be argued.
The observability app is deliberately simple: agents stream every event, turn, and tool call to a centralized server; the server stores them in a database (so a refresh restores everything); a UI displays them. Early results surprised him. The markdown agent used more tokens than the HTML agent (markdown took 25 then 29 turns in different runs; HTML took 17). He notes this could be model variance or could reflect how the specs were written, and that the markdown agent may simply have done more research / understood the codebase better. The point: without looking, you'd never know.
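The stream-store-display loop described above can be sketched in a few lines. This is a minimal illustration, not Dan's implementation: the table layout, the function names (`open_store`, `record_event`, `load_events`), and the sample payloads are all assumptions, and SQLite stands in for whatever database the real server uses.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Create (or reopen) the event table. A file path would survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               agent TEXT NOT NULL,
               turn INTEGER NOT NULL,
               kind TEXT NOT NULL,
               payload TEXT NOT NULL,
               ts REAL NOT NULL
           )"""
    )
    return conn


def record_event(conn, agent, turn, kind, payload, ts=None):
    """Append one streamed event; the payload is stored as JSON text."""
    conn.execute(
        "INSERT INTO events (agent, turn, kind, payload, ts) VALUES (?, ?, ?, ?, ?)",
        (agent, turn, kind, json.dumps(payload), time.time() if ts is None else ts),
    )
    conn.commit()


def load_events(conn, agent):
    """Return one agent's events in arrival order, as a UI would after a refresh."""
    rows = conn.execute(
        "SELECT turn, kind, payload FROM events WHERE agent = ? ORDER BY id",
        (agent,),
    )
    return [(turn, kind, json.loads(payload)) for turn, kind, payload in rows]


# Hypothetical events from two agents (names and payloads are illustrative only).
conn = open_store()
record_event(conn, "markdown", 1, "tool_call", {"tool": "read", "path": "plan.md"})
record_event(conn, "html", 1, "tool_call", {"tool": "read", "path": "plan.html"})
record_event(conn, "markdown", 2, "turn_end", {"tokens": 1200})
print(load_events(conn, "markdown"))
```

Because every event is written before it is displayed, the UI can rebuild its state from `load_events` alone, which is what makes a page refresh lossless.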
The product being specced is "Steel Man," a counter-positional agent. Given a bull thesis (Apple as an underappreciated AI distribution winner), it builds the bear case, generating UI components (a quote block, a catalyst timeline, a valuation gauge, plus tables/bar charts and score cards) to support its argument. Dan frames this as a deliberate antidote to sycophancy: agents default to telling you what you want to hear, so a bear/counter agent forces the opposite perspective. In the demo it finds that the Mac Mini Claude craze represents less than 2% of Apple's total revenue, and backs the analysis with about 40 tool-call references rather than fabrication.
Takeaway: Observability surfaces counter-intuitive facts (markdown costing more), and a counter-positional product agent is a powerful, research-grounded way to pressure-test a thesis.
The dashboard offers multiple views: a swim lane view, a single-lane view for detailed per-agent breakdowns, a compressed function mode, a more aesthetic form mode, and a race mode that places agents left-to-right with events chunked by turn so you can compare them side by side and stack runs. Each run shows cost, tokens-per-second, context usage, event counts, and duration (one Steel Man run was 79 events over about a minute). You can click into the start event to inspect the agent's full system prompt and the exact tools loaded on boot, which matters because production system prompts are often templated and skill-loading silently inflates the prompt.
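The race mode idea, events chunked by turn and laid out side by side, might look roughly like the following. The event shape (`(turn, kind)` pairs), the function names, and the sample runs are assumptions made for illustration; the dashboard's actual data model is not shown in the video summary.

```python
from collections import defaultdict


def chunk_by_turn(events):
    """Group (turn, kind) pairs into {turn: [kinds...]}, keeping event order."""
    turns = defaultdict(list)
    for turn, kind in events:
        turns[turn].append(kind)
    return dict(sorted(turns.items()))


def race_rows(runs):
    """Yield (turn, {agent: [kinds...]}) so agents line up turn by turn.

    runs maps an agent name to its list of (turn, kind) events. An agent
    that finished earlier gets an empty list for the later turns.
    """
    chunked = {agent: chunk_by_turn(events) for agent, events in runs.items()}
    last_turn = max((max(c) for c in chunked.values() if c), default=0)
    for turn in range(1, last_turn + 1):
        yield turn, {agent: c.get(turn, []) for agent, c in chunked.items()}


# Hypothetical runs: the markdown agent needs one more turn than the HTML agent.
runs = {
    "markdown": [(1, "read"), (1, "bash"), (2, "write"), (3, "bash")],
    "html": [(1, "read"), (2, "write")],
}
for turn, row in race_rows(runs):
    print(turn, row)
```

Aligning on turn number rather than wall-clock time is a design choice: it makes it obvious when one agent simply needed more turns, independent of how fast each turn ran.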
On the model itself: Dan is enthusiastic about Gemini 3.5 Flash, calling it near state of the art with excellent tokens-per-second. He acknowledges the price came in higher than expected ("not so much a flash model anymore") but argues for thinking in cost per intelligence rather than cost per token. He notes his tokens-per-second figure required some inference because the Pi coding agent doesn't expose it out of the box.
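One plausible way to infer tokens-per-second when the agent does not report it is to divide output tokens by the time spent generating them. The sketch below assumes each turn records an output token count plus start and end timestamps; those field names are hypothetical, and this is not necessarily how Dan derived his figure.

```python
def tokens_per_second(turns):
    """Estimate throughput as total output tokens over total generation time.

    turns is a list of dicts with 'output_tokens', 'started_at', and
    'ended_at' (seconds). Turns with a non-positive duration are skipped.
    """
    total_tokens = 0
    total_seconds = 0.0
    for turn in turns:
        seconds = turn["ended_at"] - turn["started_at"]
        if seconds <= 0:
            continue
        total_tokens += turn["output_tokens"]
        total_seconds += seconds
    return total_tokens / total_seconds if total_seconds else 0.0


# Hypothetical turns: the 3-second gap between them (tool execution) is excluded.
sample_turns = [
    {"output_tokens": 600, "started_at": 0.0, "ended_at": 2.0},
    {"output_tokens": 300, "started_at": 5.0, "ended_at": 6.0},
]
print(tokens_per_second(sample_turns))  # 900 tokens over 3 seconds -> 300.0
```

Summing only per-turn generation time keeps tool latency out of the estimate; dividing by total run duration instead would understate the model's speed.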
Takeaway: Pick views that fit the question (race mode for head-to-head, single lane for depth), always inspect the real system prompt and loaded tools, and evaluate models on cost-per-intelligence and the full speed/cost trade-off rather than headline token price.
Comparing the three plans directly: the markdown spec is tried and true, simple, and best for preserving tokens with a text-focused result. The HTML spec is essentially the same markdown content expressed in HTML, spending more tokens to produce visually richer UI mockups (a mocked quote component, timeline component, and valuation gauge) that humans, teammates, and agents can all understand better. He stresses you could push HTML specs much further and be more prescriptive. In a focused markdown-vs-HTML comparison, markdown used 170 events versus HTML's 100 with similar context, which raises the question of whether this is run-to-run variance or a real difference. His prescribed answer: turn it into an eval and run it repeatedly to find average performance, the natural next step after observability and product agents.
Takeaway: A single run can mislead; convert spec comparisons into repeatable evals to see true average performance, speed, and cost.
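As a rough sketch of what "turn it into an eval" could mean in practice, the snippet below aggregates a metric across repeated runs of each spec. The `summarize` helper and the record shape are assumptions; the turn counts are the ones quoted earlier (markdown 25 and 29, HTML 17), which also shows how little one or two runs can say about spread.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return {metric: (mean, sample stdev)} across a list of run records.

    With a single run the spread is reported as 0.0, which really means
    "unknown": more runs are needed before comparing specs.
    """
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[metric] = (mean(values), spread)
    return summary


# Turn counts from the runs mentioned in the text; a real eval would add
# many more runs per spec and more metrics (cost, duration, tokens).
markdown_runs = [{"turns": 25}, {"turns": 29}]
html_runs = [{"turns": 17}]

print("markdown", summarize(markdown_runs))
print("html", summarize(html_runs))
```

Only once each spec has enough runs for a meaningful mean and standard deviation does a gap between them start to look like a real difference rather than model variance.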
The enhanced spec is the V spec (visual spec): an HTML plan with embedded images generated by GPT Image 2.0. Dan updated his build prompt so agents must read any images inside a plan, because image tokens are highly useful for powerful multimodal models, and Gemini 3.5 Flash specifically wins on multimodal benchmarks (MMMU and related). He says all his plans are now V specs, with HTML as an optional enhancement and markdown-with-referenced-images as a lighter alternative; the visual experience makes plans dramatically easier to understand. The trade-off he flags honestly: image generation costs roughly $1 to $3 per plan, and his observability tool currently misses that cost entirely.
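To make the hidden cost concrete, the arithmetic can be written out as below. The $1 to $3 range comes from the text; the agent run cost passed in is a hypothetical placeholder, and `full_plan_cost` is an illustrative helper, not part of the observability tool.

```python
def full_plan_cost(agent_run_cost, image_cost_low=1.0, image_cost_high=3.0):
    """Return (low, high) total cost once image generation is added back in."""
    return agent_run_cost + image_cost_low, agent_run_cost + image_cost_high


# Hypothetical: the dashboard reports $0.50 for the build run itself.
low, high = full_plan_cost(0.50)
print(f"reported: $0.50, actual: ${low:.2f} to ${high:.2f}")
```

If the build run itself is cheap, the untracked image cost can dominate the total, which is why it matters that the dashboard does not yet capture it.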
He closes by naming the two constraints of agentic engineering, planning and reviewing, and arguing visual/HTML specs help most on the planning side. The bigger frame is tokenomics and the agent value chain: it's not enough to run agents (token-maxing at the bottom); you move up the chain by generating value from tokens and ultimately arbitraging that value into revenue. Observability is what lets you understand the internals, traits, and true cost of how an agent reached its result, which is the foundation for making the right trade-offs. He notes the whole thing is delivered as a single composable Pi coding agent extension that plugs observability into every agent, part of a stack of weekly-built, composable customizations (including a flat Pi-to-Pi agent communication network).
Takeaway: Plan with visual/image specs for clarity and multimodal leverage, but track the hidden costs (image generation) your tooling may miss, and treat observability as the prerequisite for climbing the tokenomics value chain from using tokens to capturing revenue.