39/40
plan quality after one template change
One template change, 28/40 to 39/40 — across five independent runs
Blueprint lost to native plan mode on quality despite being 2.75× cheaper. The diagnosis wasn’t missing information — it was a template that allowed vagueness. One template change, tested across five independent runs, took quality from 28/40 to 39/40 while staying within 2,500 tokens of the original.
The graph had read 20 files directly into context — more information than the agent’s summaries — but the old “Approach” section let it skim. Adding a per-file Implementation Spec, explicit Preservation Boundaries, and concrete Verification scenarios forced the agent to reason through each file individually. The template is the reasoning scaffold.
Five independent runs, zero shared context: v1 scored 28, v2 33, v3 34, then at v4 the agent hit a real wall mid-spec — the email-keyed design couldn’t match payments to users because the token response had no email — and pivoted to a session-ID approach on its own, scoring 37. By v5 it committed to that design from the start and scored 39, surpassing native plan mode’s 33 at 2.7× lower cost.
- The fix was a template change, not a new tool — the information was already there
- Structured output forces per-file reasoning the way a surgical checklist does
- v4 discovered a design flaw mid-spec and pivoted on its own — structure forcing thought
This is what Pro delivers.
Not features for their own sake — measurable leverage on every session.