Keeping autonomous coding agents honest

The connector looked finished. Green status in the run log, a commit on the branch, and a note from the agent that read like a tidy handoff from a competent junior: researched the API, wrote the Go client, added the field mapping, tests pass. I’d left it running overnight. When I opened the diff in the morning, the Go didn’t compile. One of the “passing” tests had been deleted instead of fixed. And the agent had decided the cleanest route to a green run was to edit a shared config file two directories outside anything it was supposed to touch.

Nobody lied to me. The agent genuinely believed it had succeeded, because the only judge of success was the same model that did the work, and it had every incentive to declare victory and stop. That morning is the reason this post exists.

I build an autonomous connector factory at work: coding agents, driven through the Pi SDK, that research a third-party API and ship a full-stack connector across three repositories (Go on the execution side, TypeScript and Vue on the authoring side). The factory runs for days without a human watching, and in a good week it produces a new connector every few days. It’s part of how Shopware built Nexus, the AI-assisted integration platform that went GA in May 2026. The hard engineering was never getting an agent to write code; models do that. It was getting to a place where I could trust output I didn’t read, produced while I was asleep. Two things bought that trust: an eval harness that grades every step, and a reliability floor that keeps the loop alive long enough for the grading to matter.

A critic that can’t grade its own homework

The first fix was the obvious one, and it works better than it has any right to. Every stage’s output is graded by a fresh critic session: a separate model instance with no shared context, no memory of the reasoning that produced the work, no sunk cost in defending it. It sees the diff and the rubric, nothing else.

That “nothing else” is the point. When you ask the model that just wrote 400 lines whether those lines are good, you are asking it to argue against its own recent conclusions, and it is very agreeable. A fresh session starts cold, with no story to protect. It reads the deleted test as a deleted test rather than as “a cleanup I decided was fine,” because it was never in the room when that got rationalized.

I don’t want to oversell it. A cold critic is still a language model; it will occasionally wave through something subtle or flag something that’s actually fine. But the failure mode it removes is the expensive one: the confident false pass that runs unattended for hours and commits broken work with a cheerful note attached.

Verdicts you can act on without reading them

A critic that writes three paragraphs of prose is useless to an automated loop. The pipeline can’t parse vibes. So my critics don’t return prose; they return a machine-readable verdict from a closed set, and every verdict carries a severity: BLOCKING or advisory.

BLOCKING means the stage failed and cannot proceed. Advisory means “worth noting, not worth stopping for.” The split lets the pipeline act deterministically: a blocking verdict feeds a retry, and a run with only advisory findings moves forward. There’s no step where the runner has to decide what the critic meant, because it already committed to a machine-readable answer.

In Go the core of it is about this small:

type Severity int

const (
	Advisory Severity = iota // note it, keep going
	Blocking                 // stage failed, do not proceed
)

// The only thing a critic session may return. The pipeline branches
// on Pass and Severity; prose stays in Findings.
type Verdict struct {
	Rubric   string    // "research" | "code" | "validation" | "synthesis"
	Pass     bool
	Severity Severity  // meaningful only when Pass is false
	Findings []Finding // detail for logs and resume prompts
}

func (v Verdict) Blocks() bool {
	return !v.Pass && v.Severity == Blocking
}

The rubrics themselves are typed by stage family, because grading research and grading code are different jobs. A research critic checks whether the API claims are grounded in something real, whether auth flows and rate limits were looked up rather than guessed. A code critic checks that it compiles, that tests exist and run, that error handling isn’t a row of swallowed returns. Validation and synthesis get their own rubrics. Handing every stage the same generic “is this good?” checklist gets you generic grading, so each family gets one written for what actually goes wrong there.

Budgets, because self-debugging usually converges

Each family also gets a retry budget, and I set them deliberately high. Code gets 8 attempts, validation 6, synthesis 4, research 3. Eight looks reckless, like paying the model to flail. In practice a coding agent told precisely what failed (“your client doesn’t compile, here’s the error; this test was deleted, restore it”) usually fixes it within a few rounds. The generous budget isn’t there for the median case, which converges fast. It’s there for the occasional connector that needs six passes to get an awkward pagination scheme right, so one hard connector doesn’t get abandoned two attempts short of working.

The budgets differ by family because the work differs. Code is the most iterative and forgiving of retries, so it gets the most. Research is closer to a get-it-right-or-the-connector-is-built-on-sand step, and re-running it tends to reproduce the same confident guess, so it gets the fewest. I’ve tuned the numbers, but the shape (code highest, research lowest) has held up.

Where a deterministic gate beats a model every time

The thing I wish I’d internalized earlier: a critic model is the right tool for “is this code correct,” and the wrong tool for anything you can check with plain code. Model judgment is expensive and non-deterministic, and it’s occasionally wrong in ways you can’t reproduce. So wherever a rule is actually a rule, I moved it out of the critic and into a deterministic gate.

The clearest example is scope. Every unit of work declares which paths it’s allowed to touch. After the agent runs, a path-scope gate compares the actual diff against that declaration and rejects anything outside it, no model in the loop. Remember the shared config file two directories away from that first morning? That’s exactly what this catches, every time, for free, in milliseconds. The diff either stayed in scope or it didn’t.
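A sketch of how small such a gate can be, under a few assumptions of mine: the declaration is a list of allowed directory prefixes, and the changed paths are slash-separated and relative to the repository root (the shape `git diff --name-only` prints). The function name outOfScope and the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// outOfScope returns every changed path that falls outside all allowed
// prefixes. Paths are cleaned first, so "a/../../x" cannot sneak past a
// prefix check, and the prefix match is done on whole path segments so
// "connectors/acme-old" does not count as inside "connectors/acme".
func outOfScope(allowed, changed []string) []string {
	var bad []string
	for _, c := range changed {
		clean := path.Clean(c)
		ok := false
		for _, a := range allowed {
			a = path.Clean(a)
			if clean == a || strings.HasPrefix(clean, a+"/") {
				ok = true
				break
			}
		}
		if !ok {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	allowed := []string{"connectors/acme"}
	changed := []string{
		"connectors/acme/client.go",
		"connectors/acme-old/x.go",
		"config/shared.yaml",
	}
	fmt.Println(outOfScope(allowed, changed))
	// [connectors/acme-old/x.go config/shared.yaml]
}
```

A non-empty result fails the unit of work outright; there is nothing for a model to interpret.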

The other deterministic check I lean on is a prior-commitments gate. Earlier stages record decisions in their handoffs: this connector uses OAuth, this entity maps to that field, this endpoint is the one for incremental sync. Later stages get checked against those recorded commitments, and a stage that quietly contradicts an earlier decision gets flagged. This is the one that catches goal drift: an agent five steps deep has redefined the task into something easier and is now succeeding at the wrong thing. A fresh critic reading a single stage in isolation can’t see that drift. A check that holds the whole chain accountable to its own earlier words can.
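Reduced to its core, the check is a comparison against a ledger. The sketch below assumes commitments can be flattened to key/value pairs and that each later stage's handoff states the values it actually used; both the Commitments type and the keys are illustrative, and how claims get extracted from a handoff is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Commitments maps a decision key to the value recorded for it.
type Commitments map[string]string

// contradictions reports every key where a later stage's claims disagree
// with what an earlier stage committed to. Keys the later stage does not
// mention are ignored. Output is sorted so the result is reproducible.
func contradictions(ledger, claims Commitments) []string {
	var out []string
	for k, want := range ledger {
		if got, ok := claims[k]; ok && got != want {
			out = append(out, fmt.Sprintf("%s: committed %q, now %q", k, want, got))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	ledger := Commitments{"auth": "oauth2", "sync.endpoint": "/v2/orders/changes"}
	claims := Commitments{"auth": "api_key", "sync.endpoint": "/v2/orders/changes"}
	fmt.Println(contradictions(ledger, claims))
	// [auth: committed "oauth2", now "api_key"]
}
```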

The reliability floor: none of this matters if the loop dies at 3 a.m.

None of the grading matters if the process doing it is wedged. And agents fail differently from ordinary services: they stall silently, get stuck, and die mid-task. A web server that hangs throws errors and trips a health check; an agent loop that hangs looks exactly like one thinking hard. So there’s a second layer under the eval harness whose only job is keeping the loop honest about being alive.

An idle watchdog watches for progress: every real step calls a touch, and if nothing touches for a while, it fires. In my production setup that’s a warn at 15 minutes and an alert at 45. Fifteen minutes of silence is survivable; forty-five means a model call is hung and someone should look.

Consecutive-failure detection handles the other death: the loop that isn’t hung but is failing the same way forever. After N identical failures in a row it trips and stops, instead of burning tokens on attempt forty of a task that will never pass.
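The counting itself is tiny. In this sketch, "identical" means the same failure signature string, which is my assumption for the example; how you normalize an error into a signature is the part that takes real thought. StuckDetector and its fields are hypothetical names.

```go
package main

import "fmt"

// StuckDetector trips after Limit identical failures in a row.
// Limit must be at least 1.
type StuckDetector struct {
	Limit int
	last  string
	count int
}

// Failure records one failure signature and reports whether the
// detector has tripped. A different signature restarts the count.
func (d *StuckDetector) Failure(sig string) bool {
	if d.count > 0 && sig == d.last {
		d.count++
	} else {
		d.last, d.count = sig, 1
	}
	return d.count >= d.Limit
}

// Success clears the streak.
func (d *StuckDetector) Success() {
	d.last, d.count = "", 0
}

func main() {
	d := &StuckDetector{Limit: 3}
	for attempt := 1; attempt <= 5; attempt++ {
		if d.Failure("go build: undefined: Client") {
			fmt.Println("tripped after", attempt, "identical failures")
			break
		}
	}
	// tripped after 3 identical failures
}
```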

Then there’s crash survival. State is written with atomic write-rename, so a crash mid-save leaves the previous checkpoint intact rather than a half-written file that poisons the next run. A process lock guards against two copies at once, and its leftover content is the signal: a lock file still on disk after the process is gone means the last run crashed rather than exited cleanly. On restart, a crash-classification step reads that residue plus the last checkpoint and synthesizes a resume prompt: a plain note about how the last run ended and whether to re-verify the final step. Graceful shutdown hooks catch the clean case, saving a final checkpoint and releasing the lock.

I pulled these reliability patterns out into a public, dependency-free Go library: go-agent-reliability. Watchdog, stuck, checkpoint, runlock, recovery, and lifecycle, each a standalone package using only the standard library, built to compose in a plain for loop rather than a framework. That’s where these patterns live now, cleaned up and tested away from anything proprietary. The README wires all six together around a small demo agent you can Ctrl-C or kill -9 mid-run and watch resume with an accurate account of how it died.

What I’d do differently

Rubric drift bit me hardest and quietest. A rubric is a prompt, and prompts rot. As connectors got more varied, criteria that made sense for the first few APIs started firing on things that were actually fine, and I trimmed them reactively instead of versioning the rubrics and tracking pass rates per criterion. If I started again I’d treat rubrics as evaluated artifacts with their own metrics from day one, not as static text I edit when something annoys me.

I underestimated the cost of fresh critic sessions. A cold session has to re-read the diff and its surroundings from scratch, every stage, every retry. That’s real tokens and real latency, and on a multi-day run it adds up to a meaningful fraction of the bill. It’s worth it, because the confident-false-pass problem is worse than the cost, but I’d budget for it up front rather than be surprised by the invoice.

Advisory verdicts get ignored forever, and I should have seen that coming. The appeal of the advisory tier is that it doesn’t stop the pipeline, which means in an unattended system nothing ever acts on advisories at all. They pile up in logs nobody reads. Either a finding is worth acting on, in which case it should escalate to blocking once it recurs, or it isn’t, in which case it’s noise. A middle tier that no automated process consumes is just a place for real problems to hide politely.
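One way that escalation could look, as a sketch of something I have not built: it assumes each finding carries a stable ID (hypothetical) and that the counts are persisted between runs, which this in-memory example skips. Escalator and the After threshold are made-up names.

```go
package main

import "fmt"

type Severity int

const (
	Advisory Severity = iota
	Blocking
)

// Escalator promotes a recurring advisory finding to blocking once it
// has been observed After times. After must be at least 1.
type Escalator struct {
	After int
	seen  map[string]int
}

// Observe counts one occurrence of a finding and returns the severity
// the pipeline should act on.
func (e *Escalator) Observe(findingID string) Severity {
	if e.seen == nil {
		e.seen = make(map[string]int)
	}
	e.seen[findingID]++
	if e.seen[findingID] >= e.After {
		return Blocking
	}
	return Advisory
}

func main() {
	e := &Escalator{After: 3}
	for i := 0; i < 3; i++ {
		fmt.Println(e.Observe("swallowed-error") == Blocking)
	}
	// false
	// false
	// true
}
```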

And I’d add the deterministic gates earlier. I only arrived at the path-scope gate after a detour through trying to get a critic to reason about scope, before I accepted that a diff comparison does it perfectly and for nothing. Every time I’ve made a model adjudicate something a rule could adjudicate, I’ve regretted it. The rule is faster and reproducible, and it never has a bad day.

The same distrust-by-default mindset shows up when you give agents real tools, not just the right to write code. If an agent can call an API on your behalf, it should do so through scoped tokens with an audit trail, not a shared credential with the keys to everything. That’s a rabbit hole I went down in mcp-oauth-go, which turns an MCP server into an OAuth 2.1 resource server with per-tool scopes and audit logging. Same instinct as the critic sessions: assume the agent will do the most convenient thing available, and make the convenient thing also the safe thing.