Pi Coding Agent Observability: HTML Specs with Gemini 3.5 Flash and GPT Image 2

Study Guide

Overview

IndyDevDan tackles a practical engineering question: when you plan a new feature for a product, what kind of spec should you write the plan in? Markdown, HTML, or something richer? His answer is the principle he says has never changed: more useful tokens outperform fewer useful tokens, with the emphasis on useful. But the deeper point of the video is that you cannot know which spec is actually better unless you measure it.

To measure, he runs three Pi coding agents on Gemini 3.5 Flash, each building the exact same product from a different spec type (markdown, HTML, and an enhanced "V" / visual spec), and watches everything through a Pi agent observability dashboard. The recurring thesis: if you don't measure, you won't improve.

Key Concepts

The trade-off triangle (trifecta): Performance, speed, and cost. The question is never just "what's better," but what trade-off a given agentic solution makes across these three axes.
Useful tokens: Giving agents more tokens helps only when those tokens are useful. HTML and image-enriched specs add tokens that can carry more accurate, more communicable information.
Agent observability: A centralized server receives streamed events from every agent; a UI reads them; the engineer ingests the information. This closes the loop so you can improve both engineering agents and out-of-the-loop product agents.
Product agents vs engineering agents: Engineering agents in the terminal / software factory are "table stakes." Observability becomes exponentially more important for product agents that run thousands to hundreds of thousands of times.
Tokenomics / the agent value chain: Step one, use the tokens; step two, generate value from them; step three, capture (arbitrage) that value as revenue.

The Spec Question and the Trade-off Triangle

Dan opens by referencing two recent industry moments: a viral post by Anthropic engineers on the "unreasonable effectiveness of HTML" for specs, and OpenAI's release of a benchmark-topping image model that generates near-perfect images. Given those, should you spec in markdown, HTML, or something else? He moves the choice away from "which is best" and toward the trade-off between performance, speed, and cost of the resulting agentic solution.

Takeaway: Treat spec format as an engineering trade-off to be measured, not a style preference to be argued.

Agent Observability Architecture and First Results

The observability app is deliberately simple: agents stream every event, turn, and tool call to a centralized server; the server stores them in a database (so a refresh restores everything); a UI displays them. Early results surprised him. The markdown agent used more tokens than the HTML agent and took more turns to finish (25 and then 29 turns across two runs, against 17 for HTML). He notes this could be model variance or could reflect how the specs were written, and that the markdown agent may simply have done more research and understood the codebase better. The point: without looking, you'd never know.
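The stream-store-display loop described above can be sketched with Python's standard library. This is a minimal illustration, not Dan's implementation: the table layout, the event kinds, and the function names open_store, record_event, and events_for are all assumptions made for the example.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the event database. A file path survives restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               agent_id TEXT NOT NULL,
               turn INTEGER NOT NULL,
               kind TEXT NOT NULL,      -- hypothetical kinds: start, turn, tool_call
               payload TEXT NOT NULL,   -- JSON-encoded event body
               ts REAL NOT NULL
           )"""
    )
    return conn


def record_event(conn, agent_id, turn, kind, payload, ts=None):
    """Persist one streamed event; the server would call this per event received."""
    conn.execute(
        "INSERT INTO events (agent_id, turn, kind, payload, ts) VALUES (?, ?, ?, ?, ?)",
        (agent_id, turn, kind, json.dumps(payload), time.time() if ts is None else ts),
    )
    conn.commit()


def events_for(conn, agent_id):
    """Return one agent's events in arrival order, as a UI would after a refresh."""
    rows = conn.execute(
        "SELECT turn, kind, payload FROM events WHERE agent_id = ? ORDER BY id",
        (agent_id,),
    ).fetchall()
    return [(turn, kind, json.loads(payload)) for turn, kind, payload in rows]
```

Because every event is committed as it arrives, the UI never holds the only copy of a run; reloading the page simply calls events_for again.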

The product being specced is "Steel Man," a counter-positional agent. Given a bull thesis (Apple as an underappreciated AI distribution winner), it builds the bear case, generating UI components (a quote block, a catalyst timeline, a valuation gauge, plus tables/bar charts and score cards) to support its argument. Dan frames this as a deliberate antidote to sycophancy: agents default to telling you what you want to hear, so a bear/counter agent forces the opposite perspective. In the demo it finds that the Mac Mini Claude craze represents less than 2% of Apple's total revenue, and backs the analysis with about 40 tool-call references rather than fabrication.

Takeaway: Observability surfaces counter-intuitive facts (markdown costing more), and a counter-positional product agent is a powerful, research-grounded way to pressure-test a thesis.

Inside the Dashboard: Views, Costs, and Gemini 3.5 Flash

The dashboard offers multiple views: a swim lane view, a single (single-lane) view for detailed per-agent breakdowns, a compressed function mode, a more aesthetic form mode, and a race mode that places agents left-to-right with events chunked by turn so you can compare them side by side and stack runs. Each run shows cost, tokens-per-second, context usage, event counts, and duration (one Steel Man run was 79 events over about a minute). You can click into the start event to inspect the agent's full system prompt and the exact tools loaded on boot, which matters because production system prompts are often templated and skill-loading silently inflates the prompt.
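One way to picture race mode is as a table with one row per turn and one column per agent. The sketch below assumes each event is a dict with hypothetical "turn" and "kind" keys and that turns are numbered from 0; the dashboard's real data model is not described in the video.

```python
from collections import defaultdict


def chunk_by_turn(events):
    """Group one agent's events by turn number, preserving arrival order."""
    turns = defaultdict(list)
    for event in events:
        turns[event["turn"]].append(event["kind"])
    return dict(sorted(turns.items()))


def race_rows(runs):
    """Align several runs turn by turn: one row per turn, one column per agent."""
    chunked = {name: chunk_by_turn(events) for name, events in runs.items()}
    # Highest turn number reached by any agent; -1 when no agent has events.
    last_turn = max((max(c) for c in chunked.values() if c), default=-1)
    return [
        {name: chunked[name].get(turn, []) for name in runs}
        for turn in range(last_turn + 1)
    ]
```

An agent that finishes early shows empty cells in the later rows, which makes a gap in turn count between two specs visible at a glance.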

On the model itself: Dan is enthusiastic about Gemini 3.5 Flash, calling it near state of the art with excellent tokens-per-second. He acknowledges the price came in higher than expected ("not so much a flash model anymore") but argues for thinking in cost per intelligence rather than cost per token. He notes his tokens-per-second figure required some inference because the Pi coding agent doesn't expose it out of the box.
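Both ideas in this paragraph reduce to simple ratios. The sketch below is one possible reading, not Dan's formula: tokens-per-second is estimated from timestamps the observer records itself, and "cost per intelligence" is approximated as dollars per point of a quality score you define. The function names and the scoring scale are assumptions.

```python
def tokens_per_second(output_tokens, started_at, ended_at):
    """Estimate generation speed from timestamps when the agent does not report it."""
    elapsed = ended_at - started_at
    if elapsed <= 0:
        raise ValueError("ended_at must be later than started_at")
    return output_tokens / elapsed


def cost_per_point(run_cost_usd, quality_score):
    """Dollars spent per point of a quality score you define (hypothetical metric)."""
    if quality_score <= 0:
        raise ValueError("quality_score must be positive")
    return run_cost_usd / quality_score
```

With made-up numbers, a run costing $0.30 that scores 75 comes to $0.004 per point, while a run costing $0.10 that scores 20 comes to $0.005 per point. Under this measure the pricier run is the better buy, which is the argument for looking past headline token price.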

Takeaway: Pick views that fit the question (race mode for head-to-head, single lane for depth), always inspect the real system prompt and loaded tools, and evaluate models on cost-per-intelligence and the full speed/cost trade-off rather than headline token price.

Comparing Spec Types: Markdown, HTML, and V Specs

Comparing the three plans directly: the markdown spec is tried and true, simple, and best for conserving tokens, and it yields a text-focused result. The HTML spec is essentially the same markdown content expressed in HTML, spending more tokens to produce visually richer UI mockups (a mocked quote component, timeline component, and valuation gauge) that humans, teammates, and agents can all understand better. He stresses you could push HTML specs much further and be more prescriptive. In a focused markdown-vs-HTML comparison, markdown used 170 events versus HTML's 100 with similar context, raising the question of whether that is variance or a real difference. His prescribed answer: turn it into an eval and run it repeatedly to find average performance, the natural next step after observability and product agents.
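Turning a comparison into an eval mostly means collecting the same metric over many runs and summarizing it per spec type. A minimal sketch follows; the record shape (dicts with a "spec" key plus metric keys) and the function names are assumptions for illustration.

```python
from statistics import mean, stdev


def summarize(samples):
    """Mean and sample standard deviation for one metric across repeated runs."""
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return {"n": len(samples), "mean": mean(samples), "stdev": spread}


def compare_specs(results, metric):
    """Summarize one metric per spec type from a list of run records."""
    by_spec = {}
    for run in results:
        by_spec.setdefault(run["spec"], []).append(run[metric])
    return {spec: summarize(values) for spec, values in by_spec.items()}
```

Fed the turn counts reported earlier in the guide (markdown at 25 and 29 turns, HTML at 17), this gives a markdown mean of 27 over two runs and an HTML mean of 17 over a single run. The n column makes it plain how thin that evidence is: one HTML sample cannot separate variance from a real difference.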

Takeaway: A single run can mislead; convert spec comparisons into repeatable evals to see true average performance, speed, and cost.

Visual Specs, GPT Image 2, and Tokenomics

The enhanced spec is the V spec (visual spec): an HTML plan with embedded images generated by GPT Image 2. Dan updated his build prompt so agents must read any images inside a plan, because image tokens are highly useful for powerful multimodal models, and Gemini 3.5 Flash specifically wins on multimodal benchmarks (MMMU and related). He says all his plans are now V specs, with HTML as an optional enhancement and markdown-with-referenced-images as a lighter alternative; the visual experience makes plans dramatically easier to understand. The trade-off he flags honestly: image generation costs roughly $1 to $3 per plan, and his observability tool currently misses that cost entirely.
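The blind spot in the last sentence can be closed by recording image spend alongside the agent cost the dashboard already tracks. The sketch below is a hypothetical bookkeeping helper, not part of Dan's tool; the PlanCost name and its fields are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PlanCost:
    """Cost of one plan: tracked agent spend plus out-of-band image generation."""

    agent_cost_usd: float  # what the dashboard already reports
    image_costs_usd: list = field(default_factory=list)  # logged separately, per image

    @property
    def untracked_usd(self):
        """Image-generation spend the dashboard does not see."""
        return sum(self.image_costs_usd)

    @property
    def total_usd(self):
        """True cost of the plan once image generation is included."""
        return self.agent_cost_usd + self.untracked_usd

    @property
    def untracked_share(self):
        """Fraction of the true cost that is invisible to the dashboard."""
        return self.untracked_usd / self.total_usd if self.total_usd else 0.0
```

With illustrative figures, a plan whose agent run cost $0.50 and whose images cost $2.00 in total has a true cost of $2.50, of which 80% never appears on the dashboard.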

He closes by naming the two constraints of agentic engineering, planning and reviewing, and arguing visual/HTML specs help most on the planning side. The bigger frame is tokenomics and the agent value chain: it's not enough to run agents (token-maxing at the bottom); you move up the chain by generating value from tokens and ultimately arbitraging that value into revenue. Observability is what lets you understand the internals, traits, and true cost of how an agent reached its result, which is the foundation for making the right trade-offs. He notes the whole thing is delivered as a single composable Pi coding agent extension that plugs observability into every agent, part of a stack of weekly-built, composable customizations (including a flat Pi-to-Pi agent communication network).

Takeaway: Plan with visual/image specs for clarity and multimodal leverage, but track the hidden costs (image generation) your tooling may miss, and treat observability as the prerequisite for climbing the tokenomics value chain from using tokens to capturing revenue.

Practical Lessons

Don't argue spec formats; instrument agents and measure performance, speed, and cost.
Stream every event, turn, and tool call to a central, persistent store so nothing is lost on refresh.
Inspect the actual rendered system prompt and the real set of loaded tools; templating and skills inflate prompts invisibly.
A single run may be nothing more than variance; turn comparisons into evals and look at averages.
Use counter-positional ("steel man" / bear) agents to fight model sycophancy and force the opposite view.
Evaluate models on cost per intelligence, not just cost per token; in Dan's view, Gemini 3.5 Flash earns its price for product agents.
Visual (V) specs with GPT Image 2 improve planning clarity, but budget for image-generation cost your dashboard may not capture.