What we learned shipping OttoInvites through an autonomous multi-agent AI harness
OttoInvites shipped as a by-product. The thing we actually set out to build was a harness that lets AI agents run a real codebase safely, and we needed a real project to stress-test it. A party invitation app for kids is what came out the other side. The OttoInvites case study covers the product. This post is about the harness: the agent team, the guardrails, the routines that keep it improving, and the 20% where fully autonomous development stopped working.
The shape of the harness
Four agents, a handful of guardrails, and a preview pipeline. Everything runs inside an isolated Docker container, orchestrated by Paperclip AI on a kanban-style task board.
Each piece of the pipeline is one of three things: a user action, an agent doing scoped work, or infrastructure reacting. The invariant: no code is written until a strategy is approved, and no code reaches the user without passing every gate in between.
Everything inside a Docker container
The first decision was isolation. Every agent runs inside a Docker container, not on the host, with credentials scoped to the single repository it needs. If an agent does something unexpected, the blast radius stops at the container. That matters because the whole point of the exercise was to stress-test the harness, which meant pushing the agents harder than you would in production. A tight blast radius is what makes that safe.
The agent team
Four Claude Code agents, each with a narrow job. Narrow is the important word. A single agent that “does everything” gets lazy. Four agents that check each other’s work don’t.
Product Agent
Picks up new issues from the To Do column. Reads the issue, explores the codebase, and writes back three implementation strategies, each with tradeoffs, cost, and risks named explicitly. The user chooses one. No code is written before a strategy is approved.
Engineer Agent
Takes the approved strategy and writes the code. Runs the full gate suite locally before pushing: lint, types, tests, visual verification. Opens a PR only when everything is green.
Code Review Agent
Reviews the PR the moment it opens. Iterates with the Engineer until the code meets the quality bar. Most style, structure, and naming feedback closes without a human ever seeing the thread.
QA Agent
Checks out the PR, spins up the app in the preview environment, and clicks through the feature the way a user would. Probes edge cases, writes UX feedback, iterates with the Engineer until the experience is clean. The QA Agent is the last line before a human gets pinged.
Guardrails
The agents are only as good as what they’re checked against. Four gates run on every change:
- Chrome visual verification. After a UI change, the Engineer must screenshot the result. If it doesn’t match what the plan described, the gate fails.
- Playwright E2E tests. Every new feature gets a test. Every bug fix gets a regression test. TDD isn’t a stylistic choice here, it’s what makes autonomous work safe. If the test suite stays green, the agent knows the change didn’t break anything upstream.
- ESLint. Automatic style enforcement. No agent arguments about formatting.
- Knip. Catches orphaned files, unused exports, and dead dependencies. Agents tend to leave debris; Knip sweeps it up.
Gates are non-negotiable. An agent that can’t make them pass can’t open the PR. A PR that stops passing them mid-review gets kicked back to the Engineer.
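The gate sequence can be sketched as a small runner that shells out to each tool in order and stops at the first failure. This is a minimal illustration, not the harness's actual wiring; the gate commands are assumptions based on the tools named above:

```typescript
import { spawnSync } from "node:child_process";

// One quality gate: a name plus the shell command that must exit 0.
interface Gate {
  name: string;
  command: string;
}

// Hypothetical gate list mirroring the post; real invocations may differ.
const defaultGates: Gate[] = [
  { name: "lint", command: "npx eslint ." },
  { name: "types", command: "npx tsc --noEmit" },
  { name: "e2e", command: "npx playwright test" },
  { name: "dead-code", command: "npx knip" },
];

// Runs gates in order; returns the first failing gate's name,
// or null if everything is green.
export function runGates(list: Gate[]): string | null {
  for (const gate of list) {
    const result = spawnSync(gate.command, { shell: true, stdio: "inherit" });
    if (result.status !== 0) return gate.name; // gate failed: no PR
  }
  return null; // all green: safe to open the PR
}
```

An Engineer Agent would call something like `runGates(defaultGates)` before pushing; a non-null return means the PR never opens.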
Deployment: Supabase + Vercel
When the Code Review and QA agents are done, the preview pipeline takes over:
- Supabase branches the database automatically and pipes the new credentials into the deployment.
- Vercel spins up a preview environment from the PR.
- The preview URL gets posted back to the PR, and the user is notified to review.
From the user’s perspective, everything between approving the strategy and getting a preview URL is hands-off.
Routines: what agents do when nobody’s asking
The workflow above runs when there’s work to do. But the harness has a second life in the background. Four scheduled routines run without user input:
- Daily, Engineer: audits E2E test coverage. Looks for untested code paths and proposes tests.
- Weekly, Engineer: cleans up orphaned branches and dead files in the dev environment.
- Daily, Product: reviews recent user feedback and updates its own memory so the next planning round starts with better context.
- Daily, Orchestrator: runs evaluations across every agent. This is the eval layer of the harness. Each agent’s work from the previous day gets scored against its own rules, inefficiencies and documentation gaps are surfaced, and the Orchestrator proposes edits to agent instructions and project docs where the patterns show up. No agent escapes the eval; it’s how the harness keeps tuning itself.
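The Orchestrator's eval pass could be sketched as scoring each agent's daily log against that agent's own rules and surfacing the violations to turn into doc edits. The interfaces and rule shapes below are illustrative assumptions, not Paperclip's actual API:

```typescript
// One eval rule: a description plus a predicate over an agent's work log.
interface EvalRule {
  description: string;
  passes: (log: string[]) => boolean;
}

interface EvalReport {
  agent: string;
  score: number;        // fraction of rules satisfied
  violations: string[]; // rule descriptions to feed back into agent docs
}

// Scores one agent's previous day against its own rules, surfacing the
// violations the Orchestrator can turn into instruction or doc edits.
export function evaluateAgent(
  agent: string,
  log: string[],
  rules: EvalRule[]
): EvalReport {
  const violations = rules.filter((r) => !r.passes(log)).map((r) => r.description);
  return {
    agent,
    score: rules.length === 0 ? 1 : (rules.length - violations.length) / rules.length,
    violations,
  };
}
```

A rule like "every bug fix gets a regression test" becomes a predicate over the log, and a consistently failing rule is a signal to edit that agent's instructions.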
The routines are what make the harness improve itself. Without them, you’re running the same harness on day 30 that you shipped on day 1. With them, the skills and docs sharpen a little every day in response to where the agents actually struggled.
Where it works, where it doesn’t
We ran the harness on OttoInvites for the whole build. Honest read: it works well for a project of this size, and we had a lot of fun setting it up and playing with Paperclip and the Claude Code agents. But the approach has a ceiling: as project complexity grows, reliability drops and the amount of human intervention required rises.
Roughly 80% of the time, the end result with all the skills, hooks, guardrails, and multiple autonomous review stages was satisfactory. The other 20% wasn’t helpful and sometimes misleading, so a final human review was still required.
Two failure patterns came up often enough to name.
Wrong-root-cause PRs
An agent assumes a root cause that isn’t the actual problem and opens a PR fixing the imagined cause. The case we hit repeatedly: an issue that would have been resolved by restarting the Docker container. The agent never checked the environment and instead opened an unnecessary PR against the code.
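One mitigation we’d consider is forcing cheap environment remediation before any code diagnosis. A hypothetical sketch, with the step names and `reproduce` hook as illustrative assumptions:

```typescript
// Ordered triage: try environment-level remedies before assuming a code bug.
interface TriageStep {
  name: string;
  apply: () => void; // e.g. restart the Docker container, clear caches
}

// Returns the name of the step that made the issue disappear, or
// "code-change-needed" if no environment remedy worked.
// `reproduce` re-runs the failing check and returns true if it still fails.
export function triage(steps: TriageStep[], reproduce: () => boolean): string {
  if (!reproduce()) return "not-reproducible";
  for (const step of steps) {
    step.apply();
    if (!reproduce()) return step.name; // issue gone: no PR needed
  }
  return "code-change-needed"; // only now is a code fix justified
}
```

An agent that had to walk this list first would have restarted the container before touching the codebase.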
Governance drift
Paperclip doesn’t provide strong guardrails on agent governance: agents get full access to manipulate issues on the task board. The process we wired up worked about 90% of the time. The other 10% was an agent going off-process, doing something unapproved or unneeded and wasting tokens. A concrete example: an agent started work on a task before the user had approved a strategy, even though that approval requirement was an explicit rule in its AGENTS.md.
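The gap could be narrowed by enforcing the board's state machine in code rather than in prose the agent is supposed to read. A hypothetical sketch, with the status names and approval flag as assumptions about how such a guard might look:

```typescript
// Task board states, in the order the process allows.
type Status = "todo" | "strategy-proposed" | "approved" | "in-progress" | "review" | "done";

interface Task {
  id: string;
  status: Status;
  strategyApprovedByUser: boolean;
}

// Allowed transitions; "in-progress" is reachable only from "approved".
const allowed: Record<Status, Status[]> = {
  "todo": ["strategy-proposed"],
  "strategy-proposed": ["approved"],
  "approved": ["in-progress"],
  "in-progress": ["review"],
  "review": ["done", "in-progress"],
  "done": [],
};

// Rejects any move that skips user approval, enforcing in code
// what AGENTS.md could only state as a rule.
export function transition(task: Task, next: Status): Task {
  if (!allowed[task.status].includes(next)) {
    throw new Error(`${task.id}: illegal move ${task.status} -> ${next}`);
  }
  if (next === "in-progress" && !task.strategyApprovedByUser) {
    throw new Error(`${task.id}: strategy not approved by user`);
  }
  return { ...task, status: next };
}
```

With a guard like this between the agent and the board, starting unapproved work fails loudly instead of silently burning tokens.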
What we’re keeping
For a project of this size, we’re keeping the approach. Fully autonomous development is achievable here and it was a lot of fun to set up. The interesting question isn’t whether agents can build software. They can. The question is how much human oversight you trade for autonomy, and whether the 20% where they get it wrong costs you more than the 80% saves you.
For OttoInvites, it was worth it. For anything larger or more complex, the ceiling starts to move against you, and the governance gap in Paperclip starts to matter more. But it’s exciting to watch how fast projects like Paperclip are moving and how fast the models are advancing overall. We can only expect the ceiling to keep getting higher.
💡 Read the product side of this story: OttoInvites: party e-vite app built in 60 hours.