research · measured, in the open
we measure, in the open.
Two rules hold for everything on this page. The data and the code are public, so you can check the claim instead of trusting it. And the caveat comes before the headline, never after, even when it cuts against us. That is the whole point of doing this in the open.
The studies.
benchmark · n=120/210
A green build is not a correct feature
Across 210 runs, agent output passed the build but was genuinely wrong ~40% of the time. With the criteria gated first, that dropped toward zero. The verification benchmark.
read →
experiment · exploratory
Agents need a runtime, not a bigger model
A four-episode recovery marathon. Claude Code with prompts and skills alone plateaued at 66%; a bigger model did not close the gap; a governed recovery runtime reached 100%. And the catch: a custom retry harness also recovered.
read →
How we label it.
- Benchmark: a study with real sample size and a controlled comparison. The claim is meant to hold.
- Experiment: exploratory, often with n=1 controls. Directional, not proof. Labeled so you read it as such.
- Every study links its open repository: data, harness, and the exact list of what can and cannot be said.