we measure, in the open.

Claims about AI coding agents are cheap. These aren't. Each study here ships with its data, its harness, and its claim boundary, caveats first. Some are rigorous benchmarks. Some are exploratory experiments. Each one says which it is.

Two rules hold for everything on this page. The data and the code are public, so you can check the claim instead of trusting it. And the caveat comes before the headline, never after, even when it cuts against us. That is the whole point of doing this in the open.

The studies.

How we label it.

  • Benchmark: a study with a real sample size and a controlled comparison. The claim is meant to hold.
  • Experiment: exploratory, often with n=1 controls. Directional, not proof. Labeled so you read it as such.
  • Every study links its open repository: data, harness, and the exact list of what can and cannot be said.