A coding agent finishes a task. The tests pass, the build is green, the diff reads clean. You merge it. How often is that code actually doing what you asked?
I stopped guessing and measured it. The answer is worse than it feels, and the fix is boring in the best way.
The setup
I took a real Next.js and TypeScript app, froze it at a single commit, and wrote five small features. Each one is a one-line request plus five acceptance criteria. I handed the request to four coding agents (Claude Sonnet, Claude Haiku, Codex, and Kimi), ran every combination three times, and scored the output by running the code, not by reading the diff.
A separate gate does the scoring. It re-applies each diff to a clean copy of the repo, executes the code, and checks every acceptance criterion. The agent never grades its own work. A green build counts for nothing.
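To make the gate concrete, here is a minimal sketch of the idea in TypeScript. It is not the gate from the repo: the `GateCriterion` and `GateVerdict` shapes, the `runGate` function, and the toy `formatPrice` feature are all hypothetical names chosen for illustration. The real gate also re-applies the diff to a clean checkout first, which this sketch leaves out.

```typescript
// Hypothetical shapes: a criterion is a check that executes the feature.
type GateCriterion<T> = {
  id: string;
  description: string;
  check: (subject: T) => boolean;
};

type GateVerdict = { id: string; passed: boolean; error?: string };

// Run every criterion against the subject. A thrown error counts as a failure,
// so one crashing check cannot hide the results of the others.
function runGate<T>(subject: T, criteria: GateCriterion<T>[]): GateVerdict[] {
  return criteria.map((c) => {
    try {
      return { id: c.id, passed: c.check(subject) };
    } catch (e) {
      return { id: c.id, passed: false, error: e instanceof Error ? e.message : String(e) };
    }
  });
}

// Toy stand-in for an agent's output: it compiles and looks reasonable.
const formatPrice = (cents: number): string => `$${(cents / 100).toFixed(2)}`;

const priceCriteria: GateCriterion<(cents: number) => string>[] = [
  { id: "AC1", description: "whole dollars show two decimals", check: (f) => f(500) === "$5.00" },
  { id: "AC2", description: "the minus sign comes before the symbol", check: (f) => f(-250) === "-$2.50" },
];

const priceVerdicts = runGate(formatPrice, priceCriteria);
console.log(priceVerdicts);
```

In this toy case AC1 passes and AC2 fails, because the function returns `$-2.50`. Nothing about the build would have flagged that; only executing the criterion does.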
A green build was never the bar
Somewhere we agreed that “it compiles and the tests pass” means done. It doesn’t. The agent wrote the code and, in the same pass, wrote the tests around it. Green means the code agrees with itself. It says nothing about whether it matches what you wanted.
So I split the two apart. The agent produces a diff. An independent gate decides if the diff is correct by executing it against the criteria. Different jobs, different owners.
With nothing but the one-line request, 40% of runs shipped a real correctness bug. Build green, tests passing, and still wrong.
What spec-driven development actually changes
The idea is small: write the acceptance criteria before the agent starts, and verify against them by execution after it finishes. That’s it.
I ran every task two ways. Same model, same repo, same effort. One difference:
- Raw: the one-line request. What most people type.
- Spec: the same request plus the five acceptance criteria as a checklist.
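The difference between the two arms can be sketched as a single function. This is an illustration only: `buildPrompt`, the `PromptArm` type, the checklist formatting, and the example request are assumptions, and the actual prompts are the ones published in the repo.

```typescript
type PromptArm = "raw" | "spec";

// Raw: the request alone. Spec: the same request plus the criteria as a checklist.
function buildPrompt(request: string, acceptanceCriteria: string[], arm: PromptArm): string {
  if (arm === "raw") return request;
  const checklist = acceptanceCriteria.map((c) => `- [ ] ${c}`).join("\n");
  return `${request}\n\nAcceptance criteria:\n${checklist}`;
}

// Hypothetical request and criteria, not taken from the benchmark.
const exampleRequest = "Add a CSV export button to the orders page.";
const exampleCriteria = [
  "The file has a header row.",
  "Fields containing commas are quoted.",
];

const rawPrompt = buildPrompt(exampleRequest, exampleCriteria, "raw");
const specPrompt = buildPrompt(exampleRequest, exampleCriteria, "spec");
console.log(specPrompt);
```

Everything else (model, repo, effort) stays fixed, so any change in outcome is attributable to the checklist.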
Pooled across the four models, raw shipped a genuine bug in 40% of runs. With the criteria up front, that dropped to 0%. The 95% confidence intervals don’t overlap. Every model went to zero genuine bugs once it knew what “done” meant.
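For readers who want to check the interval claim, here is a sketch using the Wilson score interval. Two assumptions are mine, not the protocol's: that each arm has 60 runs (half of the 120 published diffs), which makes 40% equal to 24 runs with a genuine bug, and that Wilson is an acceptable method. The protocol in the repo states which interval was actually used.

```typescript
type Interval = { lower: number; upper: number };

// Wilson score interval for a binomial proportion; z = 1.96 gives roughly 95%.
function wilsonInterval(failures: number, runs: number, z = 1.96): Interval {
  if (runs <= 0) throw new Error("runs must be positive");
  const p = failures / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Assumed counts: 24 of 60 raw runs with a genuine bug, 0 of 60 spec runs.
const rawArmInterval = wilsonInterval(24, 60);
const specArmInterval = wilsonInterval(0, 60);

const intervalsOverlap =
  rawArmInterval.lower <= specArmInterval.upper && specArmInterval.lower <= rawArmInterval.upper;

console.log({ rawArmInterval, specArmInterval, intervalsOverlap });
```

Under those assumptions the raw interval runs from roughly 0.29 to 0.53 and the spec interval from 0 to roughly 0.06, so the two do not overlap.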
Tell the agent what correct means. Check it by running the code. That closes the gap.
The result I didn’t expect
I assumed the expensive model would always come out ahead. It didn’t.
Haiku is a cheap model. Run raw, it produced a fully correct feature 40% of the time, about the same as Claude Sonnet. Give Haiku the acceptance criteria and it hits 100%, level with the frontier model. The cheap model with a spec beat the frontier model without one.
If you’re paying for the big model to paper over a vague request, you might be buying the wrong upgrade. On this benchmark the spec moved the needle more than the model did. It’s the same lesson behind the dangerous illusion of AI productivity: the speed is real, but the leverage comes from the structure around it.
Why it isn’t a rigged win
The fair objection: I gave the spec arm the exact criteria the gate later checks, so of course it passes.
Two things keep it honest. The gate runs the same checks on both arms, so the raw arm isn’t held to an easier standard. And every criterion is labeled before any run as either genuine (any competent version should satisfy it) or contract (an arbitrary interface choice the agent couldn’t guess, like a parameter name). The 40% counts genuine bugs only. The raw arm never gets punished for failing to read my mind.
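The genuine versus contract split can be sketched as a label on each criterion result, with the headline rate counting only genuine failures. The type and function names below are hypothetical, and the sample runs are made up to show the counting rule, not drawn from the benchmark.

```typescript
type CriterionKind = "genuine" | "contract";
type LabeledResult = { criterionId: string; kind: CriterionKind; passed: boolean };

// A run has a genuine bug if any criterion labeled "genuine" failed.
// Contract failures (for example, a parameter name the agent could not guess) are ignored here.
function hasGenuineBug(results: LabeledResult[]): boolean {
  return results.some((r) => r.kind === "genuine" && !r.passed);
}

function genuineBugRate(runs: LabeledResult[][]): number {
  if (runs.length === 0) return 0;
  return runs.filter(hasGenuineBug).length / runs.length;
}

// Made-up runs: one genuine failure, one contract-only failure, one clean run.
const sampleRuns: LabeledResult[][] = [
  [
    { criterionId: "AC1", kind: "genuine", passed: false },
    { criterionId: "AC2", kind: "contract", passed: true },
  ],
  [
    { criterionId: "AC1", kind: "genuine", passed: true },
    { criterionId: "AC2", kind: "contract", passed: false },
  ],
  [
    { criterionId: "AC1", kind: "genuine", passed: true },
    { criterionId: "AC2", kind: "contract", passed: true },
  ],
];

const sampleRate = genuineBugRate(sampleRuns);
console.log(sampleRate);
```

Only the first run counts toward the rate. The second fails a contract criterion, which would lower a "fully correct" score but does not count as a genuine bug.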
It’s five features, one repo, one stack. Directional, not a law. That’s the whole reason it’s public instead of a screenshot.
Try to break it
The repo has the protocol (written before the runs, so the analysis can’t be fit to the result), the features, the prompts, the execution gate, and all 120 diffs with their verdicts. You can re-score every run without paying for a single agent call. If you find where the method falls apart, the forum thread is open and I’ll answer.
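The 120 figure follows from the design: five features, four agents, two arms, three repetitions. A short sketch of that run matrix, with placeholder feature names since the real ones live in the repo:

```typescript
const benchAgents = ["Claude Sonnet", "Claude Haiku", "Codex", "Kimi"] as const;
// Placeholder identifiers; the actual feature names are in the repo.
const benchFeatures = ["feature-1", "feature-2", "feature-3", "feature-4", "feature-5"] as const;
const benchArms = ["raw", "spec"] as const;
const benchRepeats = 3;

type RunKey = { feature: string; agent: string; arm: string; repeat: number };

// Enumerate every feature x agent x arm x repeat combination.
function enumerateRuns(): RunKey[] {
  const runs: RunKey[] = [];
  for (const feature of benchFeatures) {
    for (const agent of benchAgents) {
      for (const arm of benchArms) {
        for (let repeat = 1; repeat <= benchRepeats; repeat++) {
          runs.push({ feature, agent, arm, repeat });
        }
      }
    }
  }
  return runs;
}

const allRuns = enumerateRuns();
console.log(allRuns.length); // 5 * 4 * 2 * 3 = 120
```

Each entry corresponds to one stored diff and verdict, which is what makes re-scoring possible without calling any agent.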
Where this leaves you
Writing exact acceptance criteria for every task, and gating every merge on execution, is real work. Most teams skip it because doing it by hand on each task is tedious. That tedious part is what PaellaDoc automates: it turns your intent into criteria and gates on execution, so spec-driven development stops being a discipline you have to remember and starts being the default. The principle holds without the tool, which is exactly why there’s no product anywhere in the benchmark.