
You do not have a Claude Code problem. You are the runtime.

Claude Code writes your code beautifully, so you feel like you need nothing on top of it. You are right. You do not need a runtime, because you are one.

JL @jlcases PaellaDoc creator · València
[Image: A single developer holding up the whole delivery loop by hand: scheduler, memory, gate, recovery, integration, evidence. Next to it, a runtime carrying the same load so the developer only keeps product intent and the decisions that matter.]

“Claude Code writes my code beautifully. I don’t need a runtime wrapped around it.”

If you are a strong developer, you have thought some version of this. And you are right. Claude Code is excellent at writing code, and with you at the keyboard it does not feel like anything is missing.

Here is the part you are not seeing: you do not feel the need for a runtime because you are one.

You decompose the product. You open the right worktrees. You keep eight sessions alive. You remember which branch is safe and which task depends on which diff. You know which failure matters and which is noise, and which agent answer is plausible but wrong. When a run breaks, you recover it. When two changes collide, you merge them. When the build is green but the feature is quietly wrong, you catch it. When the agent loses the plot, you hand the thread back to it. Claude Code writes the code. You do everything else, by hand, without noticing.

That is not Claude Code falling short. It is doing exactly what it was built for: sit in a repo, read, edit, run commands, iterate with a very good model. At the keyboard, it is one of the best tools you can have. The question is who is doing the other job: the work around the code.

Right now, that runtime is you.

The runtime nobody bills for

The hard part of shipping with agents was never the code generation. It is the loop around it:

  • deciding what should get built
  • turning product intent into tasks
  • writing the acceptance criteria, and keeping them
  • choosing which agent does which task
  • opening isolated worktrees
  • watching the parallel sessions
  • reading logs, interpreting failures
  • retrying when the wrong thing broke
  • keeping old invariants alive while new behavior lands
  • integrating branches, catching regressions
  • tracking what each run cost
  • deciding whether “done” is actually done

A senior developer does all of that almost without noticing. That is exactly why it feels lighter than it is. The better you are, the easier it is to miss how much orchestration is still running in your head.

Eight tmux panes do not remove that work. They multiply the places it has to happen.

Parallel agents give you throughput. They do not give you a delivery system. If every session still needs a human to hold context, set priority, recover failure, validate output and merge the result, then the human is the scheduler, the memory, the gate and the release manager. All at once.

[Image: The same orchestration loop in two places. On the left, carried by you: decide what to build, turn product intent into tasks, keep the acceptance criteria, route each task, open worktrees, watch sessions, read logs, recover failures, hold old invariants, integrate branches, decide done. On the right, the same list carried by PaellaDoc as a runtime, with a trail.]

For one expert watching every pane, that can be fine. It does not survive contact with many tasks, many repos, people who do not code, or a plan that has to outlive one chat.

The question worth asking

Whether Claude Code works is not the interesting question. Of course it works. The useful one is narrower:

How much human steering does it take to get a validated product result out of it, over hours or days, not minutes?

That gap is what I am building PaellaDoc around. Not a better code writer. PaellaDoc runs Claude Code, Codex, Kimi and local models, because the executor was never the interesting boundary. The boundary is everything wrapped around execution: product context, gates, evidence, recovery, routing, continuity.

If Claude Code is your pair programmer, PaellaDoc is the runtime around the pair.

[Image: The real PaellaDoc fleet view: three agents running in parallel, each one a Claude Code worktree (w-7af3, w-21bd). Each row shows the model (claude-opus-4-8), the effort (high, max), and the source of the run (Router, Retry), all picked by the runtime rather than the developer.]

That is the real fleet, not a mockup. Three agents running at once, each in its own worktree, each from a user story. The model, the effort, and the source of the run (Router picked one, Retry recovered another) are decisions the runtime made. That is the work you would otherwise be doing in your head.

A prompt is a thin contract

Most agent sessions start with a prompt. Useful, but thin. A prompt can say what you want. It rarely carries the product system underneath it: the spec, the criteria, the non-happy paths, the design constraints, the repo invariants, the gates, the evidence a change needs before it counts, the reasons the old decisions were made the way they were.

When that lives only in a chat, it rots. When it lives only in your head, you cannot delegate it. When it lives only in a terminal, it dies with the transcript.

So PaellaDoc makes product context a first-class input to the run. Once the product work is real artifacts and not vibes, the coding loop can be governed against them. The agent is not told to “build the thing.” It is given a task, a contract, a gate, and what counts as evidence.

That moves the meaning of done.

Done is not the agent saying so. A green build is not done either. Done is the behavior passing the criteria that existed before a line was written, with the evidence attached. I wrote the long version of that in "A green build is not a correct feature."
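To make that definition concrete, here is a minimal sketch. It is not PaellaDoc's actual implementation: the names `Criterion`, `RunResult` and `is_done` are hypothetical, invented for illustration. The point it shows is that the agent's own claim and the build status are inputs the check deliberately ignores.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """An acceptance criterion written before any code exists (hypothetical type)."""
    id: str
    description: str


@dataclass
class RunResult:
    """What an agent run hands back (hypothetical shape, for illustration only)."""
    agent_says_done: bool
    build_green: bool
    passed: set[str] = field(default_factory=set)           # ids of criteria that passed
    evidence: dict[str, str] = field(default_factory=dict)  # criterion id -> evidence reference


def is_done(criteria: list[Criterion], result: RunResult) -> bool:
    """Done = every pre-existing criterion passed AND has evidence attached.

    `agent_says_done` and `build_green` are intentionally not consulted.
    """
    if not criteria:
        return False  # nothing was agreed up front, so nothing can be validated
    return all(
        c.id in result.passed and bool(result.evidence.get(c.id))
        for c in criteria
    )
```

Under this sketch, a run where the agent reports success and the build is green still fails the check if a single criterion lacks a pass or an evidence reference.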

Long work breaks in a different place

One-shot demos are the easy case. A real product change runs through episodes: add the first behavior, keep it alive while you add the next, survive a checkpoint and resume, handle the event that arrives late, cover the negative path, hold the privacy and portability constraints, integrate at the end without quietly breaking what episode one established.

This is where “the model is smart” stops paying the bill. A stronger model writes a better first diff. It even recovers a lot of local failures on its own. But if nothing outside the agent is holding the contract, re-scoring the work, keeping state and deciding whether an old invariant just broke, the system still leans on you to notice.

I measured exactly this. Four cumulative episodes, each one able to break a gate the last one fixed. A bigger model did not close it. A governed recovery runtime did. The data and the caveats are in the recovery experiment. That is the first slice of a longer benchmark I want to run in the open.
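The shape of that problem can be sketched in a few lines. This is a toy illustration, not the harness used in the experiment: `run_episodes`, the `Gate` type and the dictionary standing in for repo state are all assumptions made for the example. The idea is only that after every episode, every gate seen so far is re-scored, so a gate that once passed and now fails is reported instead of silently lost.

```python
from typing import Callable

Gate = Callable[[dict], bool]          # inspects the current state, returns pass/fail
Change = Callable[[dict], None]        # one episode's change, applied in place


def run_episodes(episodes: list[tuple[Change, dict[str, Gate]]]) -> list[list[str]]:
    """Apply each episode, then re-score EVERY gate accumulated so far.

    Returns, per episode, the gates that passed at some earlier point and fail now.
    """
    state: dict = {}
    gates: dict[str, Gate] = {}
    ever_passed: set[str] = set()
    regressions: list[list[str]] = []

    for apply_change, new_gates in episodes:
        apply_change(state)
        gates.update(new_gates)
        passing = {name for name, gate in gates.items() if gate(state)}
        regressions.append(sorted(ever_passed - passing))
        ever_passed |= passing
    return regressions


def episode_one(state: dict) -> None:
    state["login"] = True


def episode_two(state: dict) -> None:
    state["export"] = True
    state.pop("login", None)  # new behavior lands and quietly breaks the old one


report = run_episodes([
    (episode_one, {"login_works": lambda s: s.get("login") is True}),
    (episode_two, {"export_works": lambda s: s.get("export") is True}),
])
print(report)  # [[], ['login_works']]
```

Checking only the newest gate would call episode two a success. Re-scoring the old gate is what surfaces the regression.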

What to judge it on

Do not judge PaellaDoc by “can Claude Code do this?” A strong developer can make Claude Code do an enormous amount: scripts, custom skills, worktrees, eight sessions, a hand-rolled retry harness. That is a fair baseline, and it is the one I want to be measured against.

Measure it on the work you would otherwise do by hand:

  • how many times you had to step in
  • how many diffs you had to read before you trusted the result
  • how many failures got recovered without you
  • how many regressions survived to the end
  • whether the product context lived anywhere but the chat
  • whether there is a trail for why the run was accepted
  • whether the work resumes after an interruption without rebuilding the whole mental model in your head
  • whether the final result merges without you reconstructing the story by hand

Those are runtime questions. They are the ones that matter the moment agentic coding moves from “help me with this while I watch” to “run this plan over time and hand me back something I can trust.”

The claim, narrow

I am not claiming PaellaDoc writes better code than Claude Code. It does not need to. It runs Claude Code.

The claim is smaller and more useful: PaellaDoc takes product context, orchestration, recovery, validation and evidence out of your head and puts them in a runtime. A different category of value.

For a solo engineer watching every session, raw Claude Code may be all you need. Your runtime is right there, skilled and already paid for. The moment the work spans many tasks, many repos, many hours, people who do not code, or a plan that has to survive past one chat, the question changes. It stops being “can the agent write code?” and becomes “can the system keep the product contract alive until the code is validated?”

That is the line I want to compete on.

The experiment this deserves

This should be measured in the open, and not against a weak prompt. That would be easy and useless.

The fair test is a human-scheduler benchmark. Arm A: a senior developer, Claude Code, worktrees, background sessions, strong repo instructions, scripts allowed, manual orchestration allowed. Arm B: PaellaDoc, the same Claude Code executor, the same model and effort, the same repo, the same plan, the same gates, no trick in the task description.

Then count: human minutes spent steering, interventions, context switches, failed gates recovered, hidden regressions, final pass rate, evidence completeness, tokens, time to a validated result.
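As a sketch of what one row of that scoreboard could look like, the record below has one field per quantity in the list above. The class name `ArmResult`, the function `delta`, the units, and the 0-to-1 scales are assumptions made for illustration. No numbers are included because none have been measured here.

```python
from dataclasses import dataclass, asdict


@dataclass
class ArmResult:
    """One benchmark arm's totals (hypothetical record; units are assumptions)."""
    human_minutes_steering: float
    interventions: int
    context_switches: int
    failed_gates_recovered: int
    hidden_regressions: int
    final_pass_rate: float          # assumed scale: 0.0 to 1.0
    evidence_completeness: float    # assumed scale: 0.0 to 1.0
    tokens: int
    minutes_to_validated_result: float


def delta(arm_a: ArmResult, arm_b: ArmResult) -> dict[str, float]:
    """Per-metric difference, arm B minus arm A.

    No single score is computed: for some metrics lower is better
    (interventions), for others higher is (final_pass_rate), so the
    reading of each difference is left to whoever reviews the run.
    """
    a, b = asdict(arm_a), asdict(arm_b)
    return {name: b[name] - a[name] for name in a}
```

Reporting the differences metric by metric, with no collapse into one number, keeps the tie-on-correctness, less-steering outcome visible.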

If the developer wins, that is real signal. It tells me where the runtime is still weaker than a good human operator, and I go fix it. If PaellaDoc wins, the point is not that Claude Code was bad. It is that the orchestration layer was the thing that mattered. And if they tie on correctness but PaellaDoc needs less steering, that might be the most useful result of the three. The goal was never to prove developers are unnecessary. It is to stop spending senior attention on being the runtime.

So

A good developer with Claude Code is powerful. That was never the argument.

The argument is whether you should stay the scheduler, the memory, the validator, the recovery loop, the integration manager and the evidence ledger for every run you do. My answer is no.

Let Claude Code write the code. Keep the product intent and the decisions that matter. Let the runtime carry the rest.

This is the same argument as the rest of what I build: "route every task to the right engine" and "the product brain that maintains itself."

Are you the runtime right now? Tell me on the forum how much of your day goes to steering agents instead of deciding what to build.