Late at night. The room is dark. A phone simulator loads on my laptop screen — running a habit tracker. With bar charts for daily stats, pie charts for categories, a calendar with colored dots for each habit, and streak counting. An hour earlier none of this existed. Just a terminal with a blinking cursor and a single line: /build habtrack "Habit tracker with daily check-ins, streak counting, weekly bar chart...".

52 minutes. No cloud providers. No API keys. Just a 27-billion-parameter model running right on my laptop.

I didn’t babysit the process. Didn’t fix bugs in a chat window. Didn’t yell at the model in all caps for building the wrong thing. Fired off one command, switched to another task, checked back an hour later — the app was ready to test. Before that, an RSS reader: one hour. Before that, a calculator: 37 minutes.

Three apps in one evening. Zero hands-on. Not a single dollar spent on tokens. And the model is the least interesting part of this story.

The Hired Hand

For the past few months I’ve been experimenting with an autonomous AI publisher for mobile apps. Etnamute is one of the intermediate iterations, released as an open source agent built on Claude Code — 4,500 lines of instructions on how to build mobile applications. A single run takes anywhere from thirty minutes to two hours — a full cycle from idea to a finished cross-platform React Native app.

It works. But it’s a hired hand. We pay Anthropic for access to the models that serve as the agent’s brain. The company can raise prices at any moment, decide that my use case violates its ToS, or simply go down under load at the exact moment I need to run the agent. And my employee vanishes. Can’t do the job. Not because it was poorly built. Because I don’t own it.

When Qwen3.6-27B dropped (a fresh open source model from Alibaba) and Anthropic’s servers went down under load yet again, I started wondering: can I move this entire machinery onto hardware sitting right on my desk? A model whose weights live on my hard drive. Local inference on my machine. Zero dependencies on external providers or APIs.

No, you can’t move it. But you can reinvent it.

Clean Slate

The very first idea — just take the existing Etnamute and feed it to Qwen3.6-27B. Fired up the model in opencode. Typed /build-app feedwise "RSS reader with support for major feed formats. Dark theme.". Nothing. The model went into deep meditation. Not metaphorically — it literally stopped responding, lost in reasoning chains that burned through the entire 128K token context before writing a single line of code. The problem isn’t that the model couldn’t understand the instructions (a 27B model is perfectly capable of understanding what’s in Etnamute) — it’s the cascade effect: many rules spawn complex reasoning chains, and reasoning chains devour context and available compute.

The brute-force approach doesn’t work. But even if it did, there’s the controllability problem. OpenCode with oh-my-opencode plugins is a big system with its own logic. MCP servers, background sub-agents, hooks that modify model behavior on top of my prompts, a long system prompt. I had no control over what exactly happens between “I asked” and “the model did.” A black box inside a black box.

I didn’t need a Swiss Army knife with a million modes. I needed a system where I control every layer. I switched to Pi — a minimalist CLI agent. Four tools: read, write, edit, bash. System prompt — under a thousand tokens. Nothing else. A thin layer between my instructions and the LLM that lets me understand and control everything from prompt input to final output.

Strip Away Everything Unnecessary

Michelangelo said the sculpture is already inside the stone — you just need to remove the excess marble. Turned out the same was true for Etnamute’s instructions. Not adapting 4,500 lines, but finding within them the rules without which a correct app cannot be built. Each rule faces one question: what breaks if you remove it?

User interview: cut. The model can generate a spec directly from the user’s prompt. Market research: cut. RevenueCat integration: cut. None of that is needed for an MVP. E2E testing with visual analysis: painful, but cut too. Replaced with a shell script running grep checks and a smoke test for runtime errors. Instead of NativeWind, which gave the 27B model serious trouble, we use React Native Paper — Material Design out of the box. Three-level feature priorities — MUST, SHOULD, WON’T: cut the middle tier. The gray zone for a small model means stubs, not implementations.

Total: 935 lines. Five thousand tokens. Five percent of the context window instead of “context exhausted.”

With a big model you think about what else to add. With a small one: what else to cut.

The compressed prompt worked. First test — 13 hours generating an RSS reader, ending with a white screen on launch. NativeWind, which I hadn’t cut yet, was silently crashing the renderer. DOMParser, which doesn’t exist in Hermes, was crashing the app the moment a feed was added. Second run, after accounting for both — 2 hours, works immediately. Third — a working pomodoro timer, a different class of app with its own set of problems. Then a calculator, then a habit tracker with charts.

Each run isn’t just a test of the model. It’s a test of the environment. An app breaks not because the model is dumb, but because the environment didn’t warn about potential problems. White screen from NativeWind — a line “use React Native Paper” in the prompt. DOMParser — have the model use fast-xml-parser instead. Buttons in a column — a flex grid layout template. Each bug doesn’t become an app fix or a long debugging session with ALL CAPS in the chat. No. We fix the harness. Apps are disposable. The harness is cumulative.

But this isn’t vibe coding in reverse — not “got an error, shoved a fix into the prompt, repeat until convergence.” Each rule must earn its place in the instructions. If a bug looks specific, it means we missed something important in the overall methodology. DOMParser crashes Hermes — that’s not a rule saying “don’t use DOMParser.” It’s a runtime pitfalls table with ten entries: APIs that compile but don’t exist at runtime. Buttons in a column — that’s not a rule saying “lay the buttons out in a grid.” It’s a section explaining how layouts work and the general principles behind them.
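For a sense of what that looks like, here is a hypothetical excerpt of such a table. The DOMParser row comes from this very story; the localStorage row is a generic React Native example; the remaining entries are omitted.

```markdown
| Compiles fine | At runtime in Hermes             | Use instead     |
| ------------- | -------------------------------- | --------------- |
| DOMParser     | not defined, throws on first use | fast-xml-parser |
| localStorage  | does not exist in React Native   | AsyncStorage    |
| ...           | ...                              | ...             |
```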

The harness isn’t a cookbook of errors and fixes. It’s a description of a methodology that sidesteps entire classes of problems. A specific fix helps with one specific problem. Well-formulated principles close hundreds.

After several runs the prompts stabilized. New generations didn’t produce unique problems — all the pitfalls had been covered not by fixes, but by rules. But for generating complex apps, our approach wasn’t enough.

When Memory Runs Out

All this time our harness was a single instruction describing how to properly generate mobile apps. A monolithic instruction that the Pi agent followed during generation. One single agent. Everything lived in one context: the spec, reasoning about the plan, package installation logs, TypeScript errors it had already fixed, old file versions it had already rewritten. In one run the agent burned through 12 million tokens — most of them on repeated npm install attempts with wrong versions.

The context window is a budget. Not a buffer you can dump into endlessly, but a budget that runs out. And in a monolithic architecture it runs out at the most interesting part.

To solve this, I did something the author of Pi considers an anti-pattern: I split the monolith into sub-agents.

Each pipeline step is a separate Pi process with a clean context. The PRD agent sees only the app idea and nothing else. The skeleton agent — only the spec. The logic agent — TypeScript interfaces. The UI agent — types and stores. A few dozen lines of prompt per agent. Each agent sees exactly what it needs to do its job. Nothing more.

Mario, Pi’s author, is right: parallel agents working in the same layer can generate garbage. But my agents aren’t parallel. They’re sequential and layered. Each has its own zone of responsibility: Logic doesn’t touch UI, UI doesn’t touch stores. The contract between agents is the set of TS interfaces created first by the skeleton agent. If the UI agent calls a method that doesn’t exist in the store — tsc catches it in a second.

Context passes through the file system. One agent writes a file. The next one reads it. No variables, no shared memory, no intermediate formats. Unix way. Simple debugging.
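A minimal sketch of that chain, under two assumptions: the `pi` command-line flags shown here are illustrative, not Pi’s documented interface, and the file names are placeholders. The point is the shape: one fresh process per step, and nothing but files connecting them.

```typescript
// Sequential sub-agents with clean contexts; all state lives in the file system.
// The `pi` flags below are an assumption for illustration, not Pi's documented CLI.
import { execFileSync } from "node:child_process";
import { existsSync } from "node:fs";

const steps = [
  { name: "prd",      prompt: "prompts/prd.md",      expects: "docs/spec.md" },
  { name: "skeleton", prompt: "prompts/skeleton.md", expects: "src/types/contracts.ts" },
  { name: "logic",    prompt: "prompts/logic.md",    expects: "src/stores" },
  { name: "ui",       prompt: "prompts/ui.md",       expects: "app" },
];

for (const step of steps) {
  // A fresh Pi process per step: no memory of earlier reasoning, logs, or rewrites.
  execFileSync("pi", ["--prompt-file", step.prompt], { stdio: "inherit" });

  // The only contract between steps is what landed on disk.
  if (!existsSync(step.expects)) {
    throw new Error(`step "${step.name}" did not produce ${step.expects}`);
  }
}
```

If a step fails to leave its artifact on disk, the chain stops right there instead of letting the next agent improvise.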

Sub-agents solved the context problem. But created a control problem.

Rails Instead of a Steering Wheel

Launching sub-agents in Pi is done through extensions, and above them sits a parent agent — the main Pi process from which everything is launched. Inside it — an orchestrating model that was supposed to manage the process. In practice, it hijacked it. It would see a sub-agent’s problem and try to solve it itself, through long trial and error. Often it burned millions of tokens on something a shell script could handle in seconds — it just never ran the script.

A parent agent is an LLM that makes decisions about the process. Every such decision costs tokens. And every such decision can be wrong. The model decides “let’s try a different approach” — and gets stuck in an infinite loop. The model can skip pipeline steps, and does. The model refuses to delegate to a sub-agent, does the work itself, and breaks it.

I solved this with a finite state machine: deterministic, with transitions defined in advance, formalized conditions, and inevitable terminal states.

The model can’t skip verification — there’s no transition bypassing that state. Can’t jump straight to UI development when the app skeleton isn’t ready — that transition is invalid. Can’t endlessly fix code — after several failed attempts, the machine automatically transitions to an error state.

The model doesn’t decide what happens next. The machine decides. The model just executes the current step’s task.

For this I built a small framework — reharness. An FSM engine for orchestrating Pi agents, deterministic functions, and the like.
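The core of the idea fits in a few dozen lines. What follows is a hypothetical sketch of the mechanism, not reharness’s actual API:

```typescript
// Hypothetical sketch of the state machine, not reharness's actual API.
type State =
  | "scaffold" | "prd" | "skeleton" | "logic" | "ui"
  | "verify" | "fix" | "complete" | "error";

type Ctx = { verifyPassed: boolean; fixAttempts: number };

// Transitions are fixed up front; the model never chooses what runs next.
const transitions: Record<State, (ctx: Ctx) => State> = {
  scaffold: () => "prd",
  prd:      () => "skeleton",
  skeleton: () => "logic",
  logic:    () => "ui",
  ui:       () => "verify",
  verify:   (ctx) => (ctx.verifyPassed ? "complete" : "fix"),
  fix:      (ctx) => (ctx.fixAttempts >= 3 ? "error" : "verify"), // bounded retries
  complete: () => "complete",
  error:    () => "error",
};

// Placeholder: in the real pipeline this launches a Pi sub-agent or a shell script.
function runStep(state: State, ctx: Ctx): void {
  if (state === "fix") ctx.fixAttempts += 1;
  if (state === "verify") ctx.verifyPassed = ctx.fixAttempts > 0; // stand-in for real checks
}

let state: State = "scaffold";
const ctx: Ctx = { verifyPassed: false, fixAttempts: 0 };

while (state !== "complete" && state !== "error") {
  runStep(state, ctx);
  state = transitions[state](ctx); // the machine decides; the model only executes the step
}
```

Skipping verification or looping on fixes forever simply isn’t representable here: no such transition exists.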

Not Vibe Coding

It’s worth noting that this approach is fundamentally different from what many people have come to know as vibe coding.

Vibe coding is when you chat with an agent, ask it to build something, get a result you don’t like, ask it to redo it, it’s still wrong, you get frustrated and start writing in ALL CAPS — and every unstructured comment of yours pollutes the context and degrades the quality of the next response. You pay for every iteration. The provider is happy.

Our pipelines are different. You don’t chat with the agent in real time. You design the environment: which prompts, in what order, with what checks, with what transition conditions. Then you set the task. The machine attempts to solve it, operating within the designed environment. You go about your business in the meantime. If something breaks — you fix the environment, not the app. Add a check. Refine a prompt. Run again.

This isn’t “vibes.” This is engineering. Building a vertical that predictably produces results. And the fundamental difference is in scalability. Vibe coding scales linearly: more apps = more of your time in the chat. Pipelines scale differently: every improvement to the environment benefits all future app generations.

The Skeleton Inside the Stone

Pipeline: scaffold → prd → skeleton → logic → ui → verify ↔ fix → complete. Each step is a pure function of the file system: files in, files out. No information is passed any other way, which makes debugging straightforward.

Before anyone writes a single line of implementation, the skeleton agent creates all necessary TypeScript interfaces with JSDoc documentation. These aren’t TODO stubs — they’re contracts. What each method is expected to do, what its edge cases are, what happens under concurrent calls. The logic agent implements these contracts, and the UI agent builds screens on top of them. If anyone violates the contract — a simple tsc check catches it immediately.

Why does this matter? The skeleton constrains the solution space. If the model sees an interface with three methods — it will implement those three methods. Give it a blank slate — it implements whatever it sees fit, forgetting half of it along the way and making up the rest. Types are rails for each subsequent agent. This is the core principle of our entire approach: the narrower the corridor, the more precise the movement.
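As an illustration of the kind of contract the skeleton agent produces (the names below are hypothetical, not copied from the generated habit tracker):

```typescript
/**
 * Hypothetical example of a skeleton-agent contract for a habit tracker;
 * the names are illustrative, not taken from the generated app.
 */
export interface Habit {
  id: string;
  title: string;
  /** ISO dates (YYYY-MM-DD) on which the habit was checked in. */
  checkIns: string[];
}

export interface HabitStore {
  /** All habits in creation order. Never throws. */
  getHabits(): Habit[];
  /** Marks `date` as done for `habitId`; a second call for the same date is a no-op. */
  checkIn(habitId: string, date: string): void;
  /** Current streak in days ending today; 0 if today is not checked in. */
  currentStreak(habitId: string): number;
}
```

With this in place, the logic agent’s whole job is to fill in three method bodies, the UI agent can only call what the interface exposes, and tsc enforces both sides of the deal.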

The second pillar is scoping. Each agent sees only what it needs to do its job. The UI agent doesn’t care what problems the logic agent ran into — that’s not its zone of responsibility.

Trust, but Verify

Generating code is only half the job. The other half — figuring out whether it actually works.

In the full Etnamute, a dedicated QA agent handles this — it can launch the app, analyze screenshots, tap buttons simulating user actions, and compare results against expectations. Heavy artillery that even in Etnamute runs slowly but catches bugs with high confidence. On a local 27B model it’s an unaffordable luxury — we don’t have that much compute.

Here we make do with smoke testing: launch the app in a headless simulator and check the logs. If it crashed — we see the stack trace. Didn’t crash — we call it working. Five mechanical checks with zero LLM involvement: tsc, bundle, runtime smoke, stub detection, anti-pattern scanning.
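A sketch of what those mechanical checks can look like, assuming an Expo project; the commands and grep patterns are illustrative rather than the harness’s actual scripts, and the runtime smoke check (headless simulator launch plus log scan) is left out for brevity:

```typescript
// Sketch of the mechanical verify step: checks with zero LLM involvement.
// Commands and grep patterns are illustrative, not the harness's real scripts.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

const failures: string[] = [];

function check(name: string, cmd: string): void {
  try {
    execSync(cmd, { stdio: "pipe" });
  } catch {
    failures.push(`- ${name}: \`${cmd}\` failed`);
  }
}

check("types",  "npx tsc --noEmit");                                // contracts between agents still hold
check("bundle", "npx expo export");                                 // the app actually bundles
check("stubs",  "! grep -rn 'TODO\\|not implemented' src app");     // no placeholder code shipped
check("hermes", "! grep -rn 'DOMParser\\|localStorage' src app");   // known runtime pitfalls

// verify-report.md is the contract the fix agent reads next.
writeFileSync(
  "verify-report.md",
  failures.length ? `# Verify failed\n\n${failures.join("\n")}` : "# All checks passed\n",
);
process.exit(failures.length ? 1 : 0);
```

The report it writes on failure becomes the hand-off to the fix step described next.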

But catching bugs isn’t enough — you need to fix them. And here another contract turned out to be critical — the file verify-report.md. During testing, the verify step writes a detailed report when it finds an error — which file, which line, what type. The fix agent reads it and makes surgical fixes, without wasting time exploring the codebase or guessing at the problem.

Before this contract existed, the fix agent roamed the codebase freely and could run longer than all other generation steps combined. Adding this single file cut the cost of error correction by an order of magnitude.

The Environment Remembers

Between the first and last run, the model didn’t get any smarter. We used the same Qwen3.6-27B throughout. What changed was the environment.

DOMParser crashes Hermes — that’s not an app bug, it’s a gap in the prompt that didn’t mention fast-xml-parser. Calculator buttons in a column instead of the familiar grid — our UI agent never received instructions on how to design that type of interface. app.json conflicting with Expo Router — that’s not an app bug either, it’s a scaffold bug: nothing prohibited creating that file.

Every bug becomes a line in a prompt, a check in the verify scripts, or code in the scaffold. The model doesn’t learn between sessions — it has no memory, and we don’t have the resources to fine-tune. But the environment remembers the minefield and won’t let anyone step on the same mine twice. Shell scripts don’t forget checks. Grep doesn’t miss patterns. The FSM won’t let you skip steps or call them out of order.

I truly saw this the moment when, after the first 13-hour run that produced a broken app, the agent spent just 37 minutes generating a fully working calculator. The pipeline ran, verify caught one error (SafeAreaView from the wrong package), the fix agent surgically replaced one import line, verify passed, smoke passed, the app works. 37 minutes from command to working application. The model didn’t do anything remarkable — it simply fulfilled the contracts. The environment didn’t let it fail.

And this is the key insight. Improving the environment closes the gap between open and closed models more effectively than improving the model itself. Scaffold installs the right packages — the model doesn’t need to guess versions through trial and error. Verify catches errors deterministically — the model won’t waste time on diagnostics. The model doesn’t need to remember the pipeline step order — the FSM remembers for it.

Large models make us lazy. Massive context — no need to think about prompt structure, and you can fix problems by piling on even more instructions. No need to design the environment, because the big model “will figure it out anyway.” Small models don’t forgive that laziness. Every token is expensive. Every rule must earn its place. Every check must pay for itself — if its false positive rate is higher than the frequency of real bugs, it hurts more than it helps.

And the most interesting part: systems designed under these constraints work better for any model. Minimalism doesn’t hinder a strong model, but it saves a weak one. This isn’t a limitation. It’s a design principle.

Six Files

Three apps, all built in one evening:

| App | What’s inside | Time |
| --- | --- | --- |
| Feedwise | RSS + Atom, bookmarks, full article text | ~1 hour |
| Calculus | iOS-style calculator | 37 min |
| HabTrack | Bar chart, pie chart, calendar, streaks | 52 min |

The trend is obvious. Models are getting cheaper. Hardware is getting cheaper. Quality is going up. Qwen3.6-27B, released just days ago, can generate mobile apps in reasonable time on an ordinary laptop. What will local models be capable of in a year? Two?

Last night I looked at the habit tracker and thought not about which model built it. I thought about the six markdown files that told it how. A few dozen lines each. Not a single line wasted.

Code: github.com/bes-dev/reharness