Spec-driven development: a green build is not a correct feature

I gave four coding agents the same tasks, with and without acceptance criteria, and scored the result by running the code. Raw one-line requests shipped a real bug 40% of the time. With a spec, zero.

JL @jlcases PaellaDoc creator · València
A green build is not a correct feature. Across 120 runs, raw requests shipped a genuine bug in 40% of cases with the build green; with the acceptance criteria up front, 0%.

A coding agent finishes a task. The tests pass, the build is green, the diff reads clean. You merge it. How often is that code actually doing what you asked?

I stopped guessing and measured it. The answer is worse than it feels, and the fix is boring in the best way.

The setup

I took a real Next.js and TypeScript app, froze it at a single commit, and wrote five small features. Each one is a one-line request plus five acceptance criteria. I handed the request to four coding agents (Claude Sonnet, Claude Haiku, Codex, and Kimi), ran every combination three times, and scored the output by running the code, not by reading the diff.
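The run count follows from the setup. Here is a minimal sketch of the run matrix, assuming each model and feature pair is run in both prompt variants (the raw and spec arms introduced further down) three times each. The identifiers `Run`, `buildRunMatrix`, and the placeholder feature names are hypothetical; only the counts come from the post.

```typescript
// Hypothetical names throughout; only the counts (4 models, 5 features,
// 2 arms, 3 repetitions) come from the post.
type Arm = "raw" | "spec";

interface Run {
  model: string;
  feature: string;
  arm: Arm;
  repetition: number;
}

const models: string[] = ["claude-sonnet", "claude-haiku", "codex", "kimi"];
const features: string[] = ["feature-1", "feature-2", "feature-3", "feature-4", "feature-5"];
const arms: Arm[] = ["raw", "spec"];
const repetitions = 3;

// Enumerate every (model, feature, arm, repetition) combination.
function buildRunMatrix(): Run[] {
  const runs: Run[] = [];
  for (const model of models) {
    for (const feature of features) {
      for (const arm of arms) {
        for (let repetition = 1; repetition <= repetitions; repetition++) {
          runs.push({ model, feature, arm, repetition });
        }
      }
    }
  }
  return runs;
}

const runMatrix = buildRunMatrix();
// 4 models x 5 features x 2 arms x 3 repetitions = 120 runs, 60 per arm.
console.log(runMatrix.length);
```

Under that assumption each model contributes 15 runs per arm, which is the denominator behind the per-model percentages later in the post.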

A separate gate does the scoring. It re-applies each diff to a clean copy of the repo, executes the code, and checks every acceptance criterion. The agent never grades its own work. A green build counts for nothing.
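To make the idea concrete, here is a toy version of such a gate. It is a sketch, not the benchmark's gate: the real one applies diffs to a repo and executes the app, while this one takes a function as the candidate and runs each criterion as an executable check. `Criterion`, `runGate`, and the title-truncation feature are all hypothetical and chosen only for illustration.

```typescript
// A hypothetical acceptance criterion: an id plus an executable check.
interface Criterion<T> {
  id: string;
  description: string;
  check: (candidate: T) => boolean;
}

interface Verdict {
  passed: boolean;
  failed: string[]; // ids of the criteria that did not hold
}

// The gate executes every check. A check that throws counts as a failure.
function runGate<T>(candidate: T, criteria: Criterion<T>[]): Verdict {
  const failed: string[] = [];
  for (const criterion of criteria) {
    let ok = false;
    try {
      ok = criterion.check(candidate);
    } catch {
      ok = false;
    }
    if (!ok) failed.push(criterion.id);
  }
  return { passed: failed.length === 0, failed };
}

// Illustrative feature (not from the benchmark): truncate a title to a limit.
type Truncate = (title: string, max: number) => string;

const truncateCriteria: Criterion<Truncate>[] = [
  { id: "AC1", description: "short titles are returned unchanged", check: (f) => f("hello", 10) === "hello" },
  { id: "AC2", description: "result never exceeds the limit", check: (f) => f("a long title here", 8).length <= 8 },
  { id: "AC3", description: "truncated titles end with an ellipsis", check: (f) => f("a long title here", 8).endsWith("…") },
];

// A plausible agent output: it compiles and reads fine, but the appended
// ellipsis pushes the result one character past the limit.
const agentTruncate: Truncate = (title, max) =>
  title.length <= max ? title : title.slice(0, max) + "…";

const gateVerdict = runGate(agentTruncate, truncateCriteria);
console.log(gateVerdict);
```

The candidate here passes AC1 and AC3 and fails AC2, the kind of defect that a clean diff and a green build would not surface.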

A green build was never the bar

Somewhere we agreed that “it compiles and the tests pass” means done. It doesn’t. The agent wrote the code and, in the same pass, wrote the tests around it. Green means the code agrees with itself. It says nothing about whether it matches what you wanted.

So I split the two apart. The agent produces a diff. An independent gate decides if the diff is correct by executing it against the criteria. Different jobs, different owners.

With nothing but the one-line request, 40% of runs shipped a real correctness bug. Build green, tests passing, and still wrong.

What spec-driven development actually changes

The idea is small: write the acceptance criteria before the agent starts, and verify against them by execution after it finishes. That’s it.

I ran every task two ways. Same model, same repo, same effort. One difference:

  • Raw: the one-line request. What most people type.
  • Spec: the same request plus the five acceptance criteria as a checklist.
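
The two arms differ only in the prompt text. A sketch of how the two prompts could be assembled, assuming the criteria are appended as a plain checklist; `buildPrompt`, the example request, and the example criteria are hypothetical, and the benchmark's actual prompt wording lives in the repo.

```typescript
type PromptArm = "raw" | "spec";

// Raw: the request alone. Spec: the same request plus the criteria as a checklist.
function buildPrompt(request: string, criteria: string[], arm: PromptArm): string {
  if (arm === "raw") return request;
  const checklist = criteria.map((c) => `- [ ] ${c}`).join("\n");
  return `${request}\n\nAcceptance criteria:\n${checklist}`;
}

// Illustrative request and criteria, not taken from the benchmark.
const exampleRequest = "Truncate long titles in the sidebar.";
const exampleCriteria: string[] = [
  "Titles at or under the limit are returned unchanged",
  "The result never exceeds the limit",
  "Truncated titles end with an ellipsis",
  "An empty title returns an empty string",
  "The limit is a required numeric parameter",
];

const rawPrompt = buildPrompt(exampleRequest, exampleCriteria, "raw");
const specPrompt = buildPrompt(exampleRequest, exampleCriteria, "spec");
console.log(specPrompt);
```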

Figure: Genuine correctness bug rate per model, raw request versus the same request with acceptance criteria up front. Claude Sonnet 33% to 0%, Claude Haiku 53% to 0%, Codex 33% to 0%, Kimi 40% to 0%.

Pooled across the four models, raw shipped a genuine bug in 40% of runs. With the criteria up front, that dropped to 0%. The 95% confidence intervals don’t overlap. Every model went to zero genuine bugs once it knew what “done” meant.
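The interval claim can be checked by hand. The sketch below uses the Wilson score interval, which is an assumption: the post does not say which interval method the analysis uses. The counts are also derived rather than quoted: 120 runs split over two arms gives 60 per arm, and 40% of 60 is 24 raw runs with a genuine bug against 0 of 60 with a spec.

```typescript
interface Interval {
  low: number;
  high: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(failures: number, runs: number, z: number = 1.96): Interval {
  if (runs <= 0) throw new Error("runs must be positive");
  const p = failures / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

// Derived counts (see the assumptions above): 24 of 60 raw, 0 of 60 spec.
const rawInterval = wilsonInterval(24, 60);
const specInterval = wilsonInterval(0, 60);

// The intervals overlap only if the spec upper bound reaches the raw lower bound.
const intervalsOverlap = specInterval.high >= rawInterval.low;
console.log(rawInterval, specInterval, intervalsOverlap);
```

With these inputs the raw interval runs from roughly 29% to 53% and the spec interval from 0% to roughly 6%, so the two do not overlap under this method either.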

Tell the agent what correct means. Check it by running the code. That closes the gap.

The result I didn’t expect

I assumed the expensive model would always come out ahead. It didn’t.

Figure: All-criteria pass rate. Haiku and Claude Sonnet both score 40% without a spec and 100% with the acceptance criteria. A cheap model with a spec beats the frontier model without one.

Haiku is a cheap model. Run raw, it produced a fully correct feature 40% of the time, about the same as Claude Sonnet. Give Haiku the acceptance criteria and it hits 100%, level with the frontier model given the same spec. The cheap model with a spec beat the frontier model without one.

If you’re paying for the big model to paper over a vague request, you might be buying the wrong upgrade. On this benchmark the spec moved the needle more than the model did. It’s the same lesson behind the dangerous illusion of AI productivity: the speed is real, the leverage comes from the structure around it.

Why it isn’t a rigged win

The fair objection: I gave the spec arm the exact criteria the gate later checks, so of course it passes.

Two things keep it honest. The gate runs the same checks on both arms, so the raw arm isn’t held to an easier standard. And every criterion is labeled before any run as either genuine (any competent version should satisfy it) or contract (an arbitrary interface choice the agent couldn’t guess, like a parameter name). The 40% counts genuine bugs only. The raw arm never gets punished for failing to read my mind.
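The genuine versus contract split can be expressed as a small scoring rule. This is a sketch under the labeling described above, not the repo's scoring code; `CriterionResult`, `scoreRun`, and the sample results are hypothetical.

```typescript
type CriterionKind = "genuine" | "contract";

interface CriterionResult {
  id: string;
  kind: CriterionKind; // labeled before any run
  passed: boolean;
}

interface RunScore {
  genuineBug: boolean;      // at least one genuine criterion failed
  allCriteriaPass: boolean; // every criterion, genuine or contract, passed
}

function scoreRun(results: CriterionResult[]): RunScore {
  const genuineBug = results.some((r) => r.kind === "genuine" && !r.passed);
  const allCriteriaPass = results.every((r) => r.passed);
  return { genuineBug, allCriteriaPass };
}

// Illustrative run: the only miss is an unguessable parameter name.
const contractOnlyMiss: CriterionResult[] = [
  { id: "AC1", kind: "genuine", passed: true },
  { id: "AC2", kind: "genuine", passed: true },
  { id: "AC3", kind: "genuine", passed: true },
  { id: "AC4", kind: "genuine", passed: true },
  { id: "AC5", kind: "contract", passed: false },
];

const contractOnlyScore = scoreRun(contractOnlyMiss);
console.log(contractOnlyScore);
```

A run like this one misses the all-criteria bar without counting toward the 40%, which is how the genuine bug rate and the all-criteria pass rate can both be reported for the same runs.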

It’s five features, one repo, one stack. Directional, not a law. That’s the whole reason it’s public instead of a screenshot.

Try to break it

The repo has the protocol (written before the runs, so the analysis can’t be fit to the result), the features, the prompts, the execution gate, and all 120 diffs with their verdicts. You can re-score every run without paying for a single agent call. If you find where the method falls apart, the forum thread is open and I’ll answer.

Where this leaves you

Writing exact acceptance criteria for every task, and gating every merge on execution, is real work. Most teams skip it because doing it by hand on each task is tedious. That tedious part is what PaellaDoc automates: it turns your intent into criteria and gates on execution, so spec-driven development stops being a discipline you have to remember and starts being the default. The principle holds without the tool, which is exactly why there’s no product anywhere in the benchmark.