I spend most of my time vibe-coding on small projects, and it works great there. Drop in some context, get code back, tweak it, ship it. But at some point I got curious: how does this hold up on a genuinely large codebase with serious legacy baggage?

So I picked one. Multiple languages, multiple databases, zero documentation, an architecture that had clearly accumulated over years without any coherent plan. The kind of project where most people turn around and walk away at the “how do I even run this” stage. Over a weekend, Claude Code and I rewrote it from the ground up and got a working prototype running.

This isn’t a post about how fast things are now. It’s about the specific technique that made working at that scale possible at all.

The problem with large codebases

Vibe-coding works on small projects because the context fits in your head and in the context window at the same time. On a large legacy codebase, that breaks down fast: the agent doesn’t know what’s already there, repeats the same mistakes, and loses the thread between tasks.

The obvious answer is more context in the prompt. But that scales poorly and gets expensive. The right answer is to structure what the agent needs to know — and automatically update that knowledge between tasks.

The architecture: three phases per ticket

The key insight: each logical task needs its own session with a properly prepared scaffold. Not one giant context for the entire project, but targeted preparation for the specific ticket at hand.
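Before the phase-by-phase walkthrough, here is the shape of the loop as a minimal Python sketch. Every name in it is invented for illustration, and the stubs stand in for the supervisor agent and the interactive session described below:

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Harness:
    # anti-pattern name -> rule text ("Never X, because Y")
    rules: dict[str, str] = field(default_factory=dict)

def assemble_harness(ticket: str, harness: Harness) -> Harness:
    """Phase 1 (stub): read the project, deep-research the stack,
    extend the existing scaffold with what this ticket needs."""
    return harness

def work_session(ticket: str, harness: Harness) -> Path:
    """Phase 2 (stub): a normal interactive Claude Code session;
    returns the path to its JSONL transcript."""
    return Path("session.jsonl")

def analyze_transcript(log: Path) -> dict[str, str]:
    """Phase 3 (stub): mine the transcript for repeated failures."""
    return {}

def run_ticket(ticket: str, harness: Harness) -> Harness:
    harness = assemble_harness(ticket, harness)
    log = work_session(ticket, harness)
    harness.rules.update(analyze_transcript(log))  # the harness only grows
    return harness
```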

Phase 1 — Harness assembly. Before each task, a supervisor agent reads the project (manifests, schemas, API surface, existing scaffold) and does something like a deep research pass: how do other people solve this kind of problem in this stack, what MCP servers exist for the tools we need, what anti-patterns are typical. The output is a harness tailored to the ticket:

  • CLAUDE.md with project context and critical rules
  • Skills with step-by-step instructions for the type of work (migration, refactor, external API integration — each has its own structure)
  • Rules — one file per anti-pattern: “Never X, because Y”
  • MCP servers with documentation for the relevant libraries

The harness isn’t rebuilt from scratch each time: the supervisor reads what already exists and adds only what’s missing for the new task.
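Concretely, a harness for one ticket might look like this on disk. The file names are invented examples; the layout follows Claude Code’s usual conventions (CLAUDE.md and .mcp.json at the repo root, skills and rules under .claude/):

```
repo/
├── CLAUDE.md                    # project context and critical rules
├── .mcp.json                    # MCP servers for the relevant libraries
└── .claude/
    ├── skills/
    │   └── db-migration/
    │       └── SKILL.md         # step-by-step playbook for this type of work
    └── rules/
        └── no-custom-http.md    # one anti-pattern per file: “Never X, because Y”
```

The MCP part is plain configuration. Assuming the standard .mcp.json shape, wiring in a documentation server (Context7 here, purely as an example) looks like:

```json
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
```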

Phase 2 — Work session. A normal interactive Claude Code session. Claude works within the prepared context, I steer and course-correct. Nothing magical.

Phase 3 — Analysis and update. After the session, the supervisor reads the JSONL log (Claude Code writes a full transcript of every session) and looks for signals:

  • Tool results with is_error: true
  • User corrections — messages containing “no”, “wrong”, “instead”
  • Retry patterns — the same tool called with near-identical arguments across several consecutive exchanges
  • Rule violations — actions that contradict rules already defined in the harness

A single incident gets documented as inherent complexity. A repeating pattern (×2 or more) becomes a concrete harness patch: a new rule file, a strengthened existing rule, an MCP addition for missing docs, or an entry in the project Wiki.
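As a sketch of what that analysis can look like, the script below tallies error results and correction keywords from one transcript and flags anything that repeats. The field names (message.content, tool_result, is_error) match what Claude Code’s JSONL logs contained at the time of writing; treat the exact schema, and the crude keyword heuristic, as assumptions:

```python
import json
from collections import Counter
from pathlib import Path

CORRECTION_WORDS = {"no", "wrong", "instead"}  # crude heuristic for user pushback

def mine_signals(transcript: Path) -> Counter:
    counts: Counter = Counter()
    for line in transcript.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        content = entry.get("message", {}).get("content") or []
        if isinstance(content, str):  # plain-text turns arrive as bare strings
            content = [{"type": "text", "text": content}]
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_result" and block.get("is_error"):
                # Key on the first line of the error so repeats collapse together.
                first = (str(block.get("content", "")).splitlines() or [""])[0][:80]
                counts[("tool_error", first)] += 1
            elif entry.get("type") == "user" and block.get("type") == "text":
                if set(block.get("text", "").lower().split()) & CORRECTION_WORDS:
                    counts[("user_correction", "")] += 1
    return counts

# One hit is documented as inherent complexity; two or more become a patch.
for signal, n in mine_signals(Path("session.jsonl")).items():
    print("PATCH CANDIDATE" if n >= 2 else "note:", n, signal)
```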

Wiki as compiled memory

Between tasks the harness doesn’t reset — it accumulates. The key artifact is project-wiki.md, a living document of compiled project knowledge. Four sections:

Reuse Map — what’s already implemented and where. Did the agent try to write its own version of something that already existed? The path goes here, and the next task reads it first, before any research.

Anti-Patterns — not generic “don’t use raw SQL”, but project-specific: “this codebase already has a client for X right here — don’t reinvent it.” Rules in .claude/rules/ hold general principles; the Wiki holds what’s specific to this project.

Tips — approaches that worked in non-obvious ways.

Gotchas — non-linear dependencies, initialization order, hidden contracts between components.
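Put together, a minimal project-wiki.md skeleton might look like the following; every entry here is an invented placeholder rather than something from the actual project:

```markdown
# Project Wiki

## Reuse Map
- HTTP with retries lives in src/lib/http.py; use it instead of wrapping requests yourself

## Anti-Patterns
- This codebase already has a billing client (src/billing/client.py); never hand-roll calls to that API

## Tips
- Run migrations through scripts/migrate.sh; the ORM CLI chokes on the legacy schema

## Gotchas
- ConfigLoader must be initialized before any DB module is imported
```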

The idea comes directly from Karpathy’s LLM Wiki: don’t store raw observations and RAG over them — compile them into a structured knowledge base that updates incrementally. The Wiki is read before each new task and before each iteration inside loop skills. The effect is noticeable by the third or fourth task: the agent stops hitting the same walls and navigates the codebase faster — not because it got smarter, but because the accumulated context won’t let it forget what’s already been figured out.
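“Updates incrementally” can be as simple as an append-with-dedupe pass over the file. A minimal sketch with an invented helper name (in practice the supervisor agent makes this edit, not a script):

```python
from pathlib import Path

def add_wiki_entry(wiki: Path, section: str, entry: str) -> None:
    """Append an entry under its section in project-wiki.md unless an
    equivalent line is already there: the wiki accumulates, never resets."""
    lines = wiki.read_text(encoding="utf-8").splitlines() if wiki.exists() else []
    bullet = f"- {entry}"
    if bullet in lines:  # dedupe: compiled knowledge shouldn't repeat itself
        return
    header = f"## {section}"
    if header not in lines:
        lines += ["", header]
    idx = lines.index(header) + 1
    while idx < len(lines) and lines[idx].startswith("- "):
        idx += 1  # walk past the section's existing bullets
    lines.insert(idx, bullet)
    wiki.write_text("\n".join(lines) + "\n", encoding="utf-8")

add_wiki_entry(Path("project-wiki.md"), "Reuse Map",
               "HTTP with retries lives in src/lib/http.py")
```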

From Karpathy’s AutoResearch comes another idea: each successful iteration becomes the new baseline. Here that’s the harness — it never rolls back, only grows. By task N the agent operates as if it’s done this in the project many times before.

The upshot

By the end of the weekend it genuinely felt like working with someone who was gradually learning the project, not with a tool that starts from zero every time.

Building a harness for each task feels like overhead, but in practice it pays for itself by the second or third task within a project. And the most surprising part: harness assembly itself is fully automatable. You don’t need domain expertise in every stack — you just need to describe the scope of work and let the agent do the deep research. It finds what you need more accurately than you could in the same amount of time.

This direction — automatic context adaptation per task, accumulating project memory across sessions — feels like where the next meaningful jump in AI-assisted productivity is going to come from.


Code: github.com/bes-dev/autoharness