One template change, 28/40 to 39/40 — across five independent runs

Blueprint lost to native plan mode on quality despite being 2.75× cheaper. The diagnosis wasn’t missing information — it was a template that allowed vagueness. One template change, tested across five independent runs, took quality from 28/40 to 39/40 while staying within 2,500 tokens of the original.

39/40

Quality (from 28)

Independent runs

76,558

Tokens (v5)

2.7×

Cheaper than plan mode

The graph had read 20 files directly into context — more information than the agent’s summaries — but the old “Approach” section let it skim. Adding a per-file Implementation Spec, explicit Preservation Boundaries, and concrete Verification scenarios forced the agent to reason through each file individually. The template is the reasoning scaffold.

Five independent runs, zero shared context: v1 scored 28, v2 33, v3 34, then at v4 the agent hit a real wall mid-spec — the email-keyed design couldn’t match payments to users because the token response had no email — and pivoted to a session-ID approach on its own, scoring 37. By v5 it committed to that design from the start and scored 39, surpassing native plan mode’s 33 at 2.7× lower cost.

The fix was a template change, not a new tool — the information was already there
Structured output forces per-file reasoning the way a surgical checklist does
v4 discovered a design flaw mid-spec and pivoted on its own — structure forcing thought

Read the full technical breakdown — timeline, token counts, and cost math

This is what Pro delivers.

Not features for their own sake — measurable leverage on every session.

All results