we measure, in the open.

Claims about AI coding agents are cheap. These aren't. Each study here ships with its data, its harness, and its claim boundary, caveats first. Some are rigorous benchmarks. Some are exploratory experiments. Each one says which it is.

Two rules hold for everything on this page. The data and the code are public, so you can check the claim instead of trusting it. And the caveat comes before the headline, never after, even when it cuts against us. That is the whole point of doing this in the open.

The studies.

benchmark · n=120/210

A green build is not a correct feature

Across 210 runs, agent output passed the build but was genuinely wrong ~40% of the time. With the criteria gated first, that dropped toward zero. The verification benchmark.

read →

experiment · exploratory

Agents need a runtime, not a bigger model

A four-episode recovery marathon. With prompts and skills alone, Claude Code plateaued at 66%; a bigger model did not close the gap; a governed recovery runtime reached 100%. The catch: a custom retry harness also recovered.

read →

How we label it.

Benchmark: a study with real sample size and a controlled comparison. The claim is meant to hold.
Experiment: exploratory, often with n=1 controls. Directional, not proof. Labeled so you read it as such.
Every study links its open repository: data, harness, and the exact list of what can and cannot be said.