From App Factories to a Reasoning Compiler

I didn’t plan to build a compiler. I just wanted to get the most out of the AI agents I had.

What is an AI agent today? It’s actually quite simple. There is a language model — the brain and the center of decision making. And there is a harness around the model: the environment where the model works — the thing that makes the model an agent. Without the harness the model is just a text generator, sometimes quite a smart one.

Most of the resources of the labs around the world go into improving the models, which we use as is — and thank god, it’s not us who pay for their training. The harness gets much less attention from the research community. So I have good news for you: the harness is exactly the place where an indie researcher can make a contribution, without having the resources of the frontier labs.

Wishes, Not Guarantees

Today the harness mostly means two things: MCP and Skills. Skills explain to the agent in free text how to approach a task; MCP gives it the tools for that. The idea is quite elegant: give the model a minimal kit — and let it assemble the solution on its own.

But all this elegance rests on one assumption: that the model will follow the instructions it receives. And a skill is just text in the context window, which the model is not obliged to follow. Skills today are basically prompt engineering, one turn further up the spiral: they raise the odds, but they give no guarantees. And the problem is not the quality of the skills: you can write an excellent instruction that covers all the nuances. But if the model can simply ignore the instruction, it’s a bad foundation for a reliable system. A good foundation needs something else: a structure that the model doesn’t read but executes, with no way around it.

The Splinter

I got here not by theorizing but by hitting walls, project after project: I built an AI mobile developer, taught an agent to actually test things, made a harness repair itself between tasks, and finally moved everything onto a local 27B model, replacing the LLM orchestrator with a finite state machine. I wanted my agents to solve complex tasks autonomously, for many hours, without my intervention. And ideally — to do all of it with lightweight local models that can run right on my laptop. And in every new project, while raising the bar for autonomy and quality, I had to make the harness a bit less advisory and a bit more structural, until it stopped being instructions and became code. Why orchestration with deterministic state machines beats orchestration through reasoning is a separate post; the one-line version: control flow lives in code, and the model is only responsible for judgment at the leaves.

But when the harness of my agents had collapsed mostly into code, I was still writing it by hand. For every new task I decided what the topology of the state machine would be, what happens in each state, what the contracts between the agents are, and so on. The essence of this process is reasoning about the task.

An attentive reader will say: but this is how we solved agentic tasks a few years ago, and then we dropped it in favor of orchestration through reasoning once the models became smart enough. Am I trying to sell an old idea? And they will be absolutely right: deterministic orchestration gives control back, but it takes away the main achievement of the reasoning era, the model’s ability to derive the solution on its own. To fix this, today we will go up to the level of meta-agents and bring the model back into the loop of orchestration decisions. But we will do it in a clever way: we will build an agentic system that spends reasoning once to build a deterministic machine, which can then run as many times as you want without spending any additional reasoning on orchestration.

This is what reharness does — a reasoning compiler. It is open source, published on npm and installs with one command — at the end of the post I will show how to try everything in a couple of minutes, but first let’s see how it works.

A Trick from 1971

Let me defend my right to use the word “compiler” for reharness. For this I will use two ideas.

First, a compiler takes something expensive to interpret and makes it cheap to execute. A JIT analyzes the program execution, finds the hot paths and compiles them into machine code — so it stops re-interpreting them on every pass. reharness does essentially the same. The hot path is the model’s reasoning, which the agent repeats every time it faces the same class of tasks. The machine code is a finite state machine. Here the analogy works on the level of intuition.

To explain the second idea, we need to recall a rather beautiful trick from the distant 1971 — the first Futamura projection. Let’s start with partial evaluation: if a part of the program’s input is fixed, you can hardcode it inside. For example, for pow(x, n) with fixed n=3, we can do an in-place substitution pow(x, 3) = x * x * x. Now take an interpreter — it’s also a program, it just receives another program as input. And if we specialize the interpreter with respect to one concrete program, we get a standalone artifact that does the same as the original program, but without the interpretation overhead. Futamura’s observation here was that the essence of specializing on a program is exactly compilation.
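To make the pow example concrete, here is a minimal TypeScript sketch of partial evaluation. The names pow, residualSource and specializePow are illustrative and have nothing to do with the reharness code base; the toy specializer simply unrolls the loop into source text once, so the resulting function no longer loops and no longer looks at n.

```typescript
// General program: both x and n arrive at run time.
function pow(x: number, n: number): number {
  let result = 1;
  for (let i = 0; i < n; i++) result *= x;
  return result;
}

// Toy specializer, step 1: with n fixed, unroll the loop into source text.
function residualSource(n: number): string {
  const factors: string[] = [];
  for (let i = 0; i < n; i++) factors.push("x");
  return factors.length > 0 ? factors.join(" * ") : "1";
}

// Step 2: turn that text into a standalone function of x alone.
function specializePow(n: number): (x: number) => number {
  return new Function("x", "return " + residualSource(n) + ";") as (x: number) => number;
}

const cube = specializePow(3); // residual program: return x * x * x;
```

Here cube(2) gives 8, the same as pow(2, 3), but without the loop. Apply the same move to an interpreter and one fixed program, and the residual artifact is the compiled program.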

Now let’s draw the analogies between classic compiler theory and modern agentic systems. The agentic loop — think, call a function, look at the result, think again — is an interpreter. A concrete request, or a trace of a task solved once, is a program. And if you clean out everything that depends on the concrete run, what remains is a deterministic pipeline. So “compiling reasoning” is the same formal move, just one level higher up the stack. This way, I allow myself the word.

Now let’s look at the economics of compiling reasoning. The compilation itself costs real money, because you need to spend reasoning to build the deterministic pipeline. As for the runtime, let’s look at the limit cases. If the task is fully mechanical, the compiler pulls everything into code and the pipeline doesn’t call the model at all. If the task is purely creative, the compiler will build for it a machine with one agent state, and the number of spent tokens will be the same as before compilation. Everything between these two cases is the most interesting part — tasks that are hard to turn into pure code, but where the amount of reasoning can be reduced a lot. As we can see, by construction, after compilation the cost of every new run at least doesn’t grow, and often becomes lower.
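The amortization argument is plain arithmetic. A small sketch of it follows; breakEvenRuns is a hypothetical helper, and the dollar figures in the comments are made up for illustration, not measurements.

```typescript
// How many runs until the one-off compilation cost is paid back.
//   compileCost:  one-off cost of compiling the pipeline
//   liveCost:     per-run cost of solving the task with a live agent
//   compiledCost: per-run cost of the compiled pipeline (0 if fully mechanical)
function breakEvenRuns(compileCost: number, liveCost: number, compiledCost: number): number {
  const savedPerRun = liveCost - compiledCost;
  // Purely creative task: the compiled machine spends the same tokens,
  // so there is nothing to amortize with.
  if (savedPerRun <= 0) return Infinity;
  return Math.ceil(compileCost / savedPerRun);
}

// Fully mechanical task: $2 to compile, $0.50 per live run, $0 per compiled run.
const mechanical = breakEvenRuns(2, 0.5, 0); // 4
// Purely creative task: per-run cost does not change after compilation.
const creative = breakEvenRuns(2, 0.5, 0.5); // Infinity
```

The two calls are the two limit cases from the paragraph above; every real task lands somewhere between them.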

And the nicest thing here is that compilers are a well-researched area with a known anatomy, which we can now reuse. There is a target language (the deterministic FSM runtime). There is an intermediate representation (skeleton.xml, the single artifact that all the passes work on). There is a static analyzer that checks the correctness of the built machine. There is a pipeline from the source language to the executable code, with a multi-language frontend, exactly one creative pass, and everything else deterministic. And there is a profile-guided optimization loop fed by real runs, which in gcc is an opt-in extra build step, but which every grown-up JIT runs continuously.

Let’s walk the whole system in the order it is actually designed — from the target up. Because the promises of a compiler are based on the guarantees of the target.

The Mouthful

The runtime itself appeared before any attempt to formalize it strictly, and it was naive. Since then it has grown up and become formal, and its formal name is a mouthful: a deterministic hierarchical Moore-action transducer with run-to-completion semantics. Sounds scary, but the name is a specification, where every word corresponds to some design decision. So let’s unpack it word by word, and for each one see which problems it saves us from.

…transducer is simply an automaton with output: not a textbook recognizer that answers “accept / reject”, but a machine that does work while it moves. Formally the pipeline is a six-tuple (Q, q₀, F, Σ, δ, λ): states, the start one, the terminal ones (each marked success or error), the alphabet of events, the transition function δ and the action function λ. All the useful work — agent calls, code, artifacts — lives in λ. And all the other words of the name are restrictions on how δ and λ are allowed to behave.

Moore-action means that the action is tied to the state — and only to it. Each state runs its action to the end and emits exactly one event — the result of its own computation, not a reaction to something arriving from outside. Thanks to this we can answer the question “who did that?”, because all the computations happen only in the nodes and never on the transitions between them. Want to understand what a state does — read it, there is nowhere else to look.

…with run-to-completion semantics means that the machine fully finishes the state’s action and only after that thinks about transitions. No event queues, no preemption, strictly one event per step. There is only one allowed source of external signals, the wait state (timer, file, shell, webhook), and it is an explicit state type, not a side door. This makes it always possible to tell which stage of the computation the machine is at right now, because a run is a sequence of finished computations, not a pile of half-done actions.
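The three words unpacked so far fit into a very small interpreter. The sketch below is my own illustration, not the reharness runtime: the types Scalars, MooreState and Machine, the function runMachine and the demo machine are all hypothetical. It shows the shape only: the action belongs to the state, it runs to the end, it emits exactly one event, and only then is a transition looked up.

```typescript
type Scalars = Record<string, number | boolean | string>;

interface MooreState {
  // The action is tied to the state; it runs to completion and returns one event.
  action: (bus: Scalars) => string;
  // Transitions of this state: emitted event -> next state name.
  on: Record<string, string>;
}

interface Machine {
  initial: string;
  finals: Record<string, "success" | "error">;
  states: Record<string, MooreState>;
}

function runMachine(m: Machine, bus: Scalars): { status: "success" | "error"; path: string[] } {
  let current = m.initial;
  const path: string[] = [current];
  for (;;) {
    const verdict = m.finals[current];
    if (verdict !== undefined) return { status: verdict, path: path };
    const st = m.states[current];
    if (st === undefined) throw new Error("unknown state: " + current);
    const emitted = st.action(bus); // finish the whole action first...
    const next = st.on[emitted];    // ...and only then look at transitions
    if (next === undefined) {
      throw new Error("unhandled event " + emitted + " in state " + current);
    }
    current = next;
    path.push(current);
  }
}

// A two-step demo machine (illustrative only).
const demo: Machine = {
  initial: "ingest",
  finals: { ok: "success", bad: "error" },
  states: {
    ingest: {
      action: (bus) => { bus["blocking"] = 0; return "DONE"; },
      on: { DONE: "gate" },
    },
    gate: {
      action: (bus) => (bus["blocking"] === 0 ? "PASS" : "FAIL"),
      on: { PASS: "ok", FAIL: "bad" },
    },
  },
};
```

Because nothing happens on the transitions, the recorded path is a complete account of who did what. The sketch has no loop bound of its own; it relies on the graph reaching a final state.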

Deterministic is the most loaded word: it saves us from several problems at once.

First, δ resolves an event by taking the first transition with a true guard, in written order. UML, as we know, doesn’t define the order of guards — we nail it down. So the situation “this time it somehow went differently” is impossible for the same scalars — the route will always be the same.

Second, we forbid the machine to hang silently. For any reachable pair (state, event), either a transition is defined, or the machine stops with an explicit error pointing at the problem — an unhandled event, no guard matched, a switch with no suitable branch. And the fail path first persists the machine state and then dies: even a failure leaves a resumable run behind. This, by the way, forbids skipping the verification steps — paths around the checks just don’t exist in the graph.
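These first two rules can be sketched in a few lines. This is an illustration under my own assumptions, not the actual reharness code: ScalarBus, GuardedTransition, resolveTransition and routeTable are hypothetical names, guards are modeled as plain predicates, and the step "persist the machine state before failing" is left out.

```typescript
type ScalarBus = Record<string, number | boolean | string>;

interface GuardedTransition {
  guard?: (bus: ScalarBus) => boolean; // a missing guard counts as always true
  target: string;
}

// Resolve one (state, event) pair: the first transition whose guard holds,
// in written order. Anything else is an explicit error, never a silent stall.
function resolveTransition(
  stateName: string,
  eventName: string,
  table: Record<string, GuardedTransition[]>,
  bus: ScalarBus
): string {
  const candidates = table[eventName];
  if (candidates === undefined) {
    throw new Error("unhandled event " + eventName + " in state " + stateName);
  }
  for (const t of candidates) {
    if (t.guard === undefined || t.guard(bus)) return t.target;
  }
  throw new Error("no guard matched for event " + eventName + " in state " + stateName);
}

// The route state from the review example, written as a transition table.
const routeTable: Record<string, GuardedTransition[]> = {
  DONE: [
    { guard: (bus) => Number(bus["blocking"]) > 0, target: "bad" },
    { target: "ok" },
  ],
};
```

With the same scalars the answer is always the same: a bus with blocking = 2 resolves to bad, and a bus with blocking = 0 falls through to ok.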

Third, guards execute at the transition points, when no stage is active, and they can read only the scalar bus — small values in memory: flags, counters. Try to do something with the file workspace inside a guard — and the machine will fail loudly. Thanks to this, the whole configuration of the machine is a pair (current state, dictionary of scalars). So restoring an interrupted machine is trivial — you saved this pair, and then you restored it. This also simplifies the static analysis: to trace the transitions, the analyzer needs to model a handful of scalars, not a file system.
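Since the whole configuration is a pair, a checkpoint is just that pair serialized. A minimal sketch, assuming JSON as the storage format; MachineConfig, saveConfig and restoreConfig are hypothetical names, and the real on-disk layout of reharness may differ.

```typescript
type BusValue = number | boolean | string;

interface MachineConfig {
  current: string;                   // the state to re-enter on resume
  scalars: Record<string, BusValue>; // the entire scalar bus
}

function saveConfig(cfg: MachineConfig): string {
  return JSON.stringify(cfg);
}

function restoreConfig(snapshot: string): MachineConfig {
  const raw = JSON.parse(snapshot) as MachineConfig;
  // Minimal shape check so that a damaged snapshot fails loudly.
  if (typeof raw.current !== "string" || typeof raw.scalars !== "object" || raw.scalars === null) {
    throw new Error("malformed snapshot");
  }
  return raw;
}
```

Nothing about files needs to be captured here, precisely because guards are not allowed to look at them.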

It’s worth noting that while the machine itself is deterministic, the agent state stays stochastic inside, because it’s an actual LLM call. With the outputs of the agent states fixed, however, the full run of the machine is reproducible. Splitting the task into a deterministic skeleton and stochastic reasoning only where you can’t avoid it: this is basically the product.

Hierarchical means that a state can be not only a leaf but also a composite: parallel (fork/join over an array), loop (bounded iteration), call (invoking another compiled pipeline). A composite executes recursively as a single run-to-completion computation, so it doesn’t break any of our machine’s restrictions.

Infinite loops are forbidden by design: for loop the maximum number of iterations is mandatory, though an early exit by a corresponding predicate is also allowed. The reason is simple: if the exit predicate is written by a stochastic LLM, you can easily end up in a situation where the loop diverges — we strictly forbid this.
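The rule is easy to state in code. A sketch under my own naming (boundedLoop, maxIterations and done are hypothetical, not the reharness API): the bound is a required argument, the exit predicate is optional, and the result says which of the two ended the loop.

```typescript
// Runs `body` at most `maxIterations` times; `done` is the optional early exit.
function boundedLoop<T>(
  maxIterations: number,
  body: (iteration: number) => T,
  done: (result: T) => boolean = () => false
): { results: T[]; exhausted: boolean } {
  if (Math.floor(maxIterations) !== maxIterations || maxIterations < 1) {
    throw new Error("a loop requires a positive integer bound");
  }
  const results: T[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const r = body(i);
    results.push(r);
    if (done(r)) return { results: results, exhausted: false };
  }
  // The predicate never fired: the bound stopped the loop.
  return { results: results, exhausted: true };
}
```

Even if the predicate is wrong and never becomes true, the worst case is maxIterations runs of the body, which is the whole point of making the bound mandatory.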

As for parallel, the parallelism is real only where it actually matters: for agent states the parallelism is implemented at the level of OS processes. Each branch gets its own copy of the bus, which excludes the race for shared data; the branches communicate through isolated output directories, which the join then reads as a list.

And now if we return to the name of our runtime, it stops being scary and turns out simple and logical:

Transducer: the machine does work.
Moore-action: the work happens in the nodes.
Run-to-completion: the steps don’t interrupt each other.
Deterministic: one route for the same scalars, no silent stalls, with stochasticity only in the leaves.
Hierarchical: composites nest by recursion and loops are always bounded.

The poverty of the target is not a bug, it’s a feature. We deliberately restricted the machine to the necessary minimum to get checkable guarantees.

One File of Truth

Our runtime is a processor that executes programs. The programming language for this processor is the intermediate representation of the machine in XML format. The DSL consists of 12 state types (agent, interactive, code, set, switch, check, parallel, loop, wait, call, approval, final), which the runtime collapses to eight constructs — the first four are one active state with different actions, check is sugar over switch.

Here is an example program describing a multi-model code review:

<skeleton id="multi-review" initial="ingest">
  <usage>multi-review &lt;repo&gt;</usage>
  <inputs>
    <arg name="repo" positional="true" required="true"/>
    <arg name="models" type="list" default="claude-sonnet,claude-opus"/>
  </inputs>

  <state name="ingest" type="code">
    <contract><![CDATA[Collect the diff of config.repo into c.out()/diff.patch;
      set c.data.empty = true if there are no changes.]]></contract>
    <on event="DONE" target="has_changes"/>
  </state>

  <state name="has_changes" type="check" expr="data.empty">
    <on event="TRUE" target="ok"/>
    <on event="FALSE" target="review"/>
  </state>

  <state name="review" type="parallel" over="config.models" branch="reviewer" concurrency="4">
    <on event="DONE" target="gate"/>
  </state>

  <state name="reviewer" type="agent">
    <contract><![CDATA[Read the diff from the input directory, write findings.md
      with a list of findings (severity: blocking | advice).]]></contract>
  </state>

  <state name="gate" type="code">
    <contract><![CDATA[Read c.dirs('reviewer') - one directory per model - merge
      findings into c.out()/report.md, count blocking ones into c.data.blocking.]]></contract>
    <on event="DONE" target="route"/>
  </state>

  <state name="route" type="switch">
    <go guard="expr:data.blocking > 0" target="bad"/>
    <go target="ok"/>
  </state>

  <state name="ok" type="final" status="success"/>
  <state name="bad" type="final" status="error"/>
</skeleton>

Every working state has a short <contract>, and this is the only place that stores the intent of the node. The review state is an example of the parallel composite: its branch, reviewer, runs once for each model, and then control returns to the parent. The gate → route pair exists because of our machine’s restriction that agents can’t touch the scalars, so there is always a small code bridge for choosing the direction, which reads the files and puts one number on the bus.

If you look at the code carefully, you won’t find a single file path there. The reason is not that I simplified the example — the language has no such constructs. In the whole IR there is only one declaration — <inputs>, which defines the external arguments that are impossible to derive from the graph. From <inputs> the codegen generates the argument parser, and a static check requires that every argument the pipeline reads is declared. Implicit declaration of global variables is forbidden. Everything else is derived from the graph.

One more deliberate poverty of the target is the guard language, which is not Turing-complete on purpose. It allows only identifiers over config.* / data.* / retries.*, comparisons, boolean logic, arithmetic and literals — no function calls, no assignments, no ternary. A guard must be a cheap, total, statically checkable expression over scalars, and the grammar is simply not able to express anything else. Routing inside the machine is a function of scalars, and the guard grammar makes violating this requirement impossible.
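To give a feeling for how little such a language needs, here is a token-level whitelist check. It is a sketch under my own assumptions, not the reharness grammar: GUARD_TOKEN and checkGuard are hypothetical names, the exact operator spellings and the use of parentheses for grouping are guesses, and a real implementation would be a proper parser that also rejects ill-formed token sequences.

```typescript
// One allowed token, optionally preceded by whitespace.
const GUARD_TOKEN = new RegExp(
  "\\s*(?:" +
    "((?:config|data|retries)\\.[A-Za-z_][A-Za-z0-9_]*)" + // 1: scalar reference
    "|(\\d+(?:\\.\\d+)?)" +                                // 2: number literal
    "|('[^']*'|\"[^\"]*\")" +                              // 3: string literal
    "|((?:true|false)\\b)" +                               // 4: boolean literal
    "|(===|!==|==|!=|<=|>=|&&|\\|\\||[<>!+\\-*/%()])" +     // 5: operator or paren
  ")",
  "g"
);

// Returns null if every token is on the whitelist,
// otherwise a description of the first problem.
function checkGuard(source: string): string | null {
  const text = source.trim();
  if (text.length === 0) return "empty guard";
  let pos = 0;
  let prevWasRef = false;
  while (pos < text.length) {
    GUARD_TOKEN.lastIndex = pos;
    const m = GUARD_TOKEN.exec(text);
    // A match that starts later than pos means something was skipped over.
    if (m === null || m.index !== pos) {
      return "token not allowed at offset " + pos + ": " + text.slice(pos, pos + 12);
    }
    // A reference directly followed by "(" would be a call.
    if (m[5] === "(" && prevWasRef) return "function call at offset " + pos;
    prevWasRef = m[1] !== undefined;
    pos = GUARD_TOKEN.lastIndex;
  }
  return null;
}
```

With this whitelist, data.blocking > 0 passes, while an assignment such as data.x = 1, a call such as len(data.items) > 0 and a ternary are all rejected simply because their tokens do not exist in the language.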

The Wiring Nobody Wrote

You may object: the stages obviously pass data to each other (diffs, reports, build artifacts, etc.). Who declares what flows where? Nobody. And this is not a hole in the language — this is my favorite part of the whole system.

Working on the compiler, at first I tried the obvious ideas. And the most obvious one here is to make the model somehow understand the data flow inside the graph and annotate it on its own. All the attempts failed for a simple reason: the model’s opinion on this question is a second source of truth about the graph, and there is no guarantee it won’t diverge from the real graph. To be honest, in my experiments it diverged almost always, which led to generating a broken machine. Things were especially bad with loops and parallel branches.

The important insight for solving this problem was that a graph edge already is a contract by itself. If stage B is reachable from stage A, then the output of A is by definition available to B. So the most reliable solution turned out to be deriving the visibility from the topology: the producers visible to a node are its ancestors, and there is nothing more to invent here.

One important question remains, and it is easy to miss and then run into bugs: how many instances of the producer’s output does the reader see? In compiler theory there is an instance-wise rule for this, in the spirit of the polyhedral model, the same classic that loop optimizers use for resolving array accesses in loop nests. If you have ever written results[i][j] inside two nested loops, you have all the necessary theory: a thing that executes inside loops is addressed by its loop indices. For us it’s all the same: every state has an iteration space, the chain of composites enclosing it, from the outer to the inner. An instance of a state is addressed by an instance vector: one index per enclosing composite (which parallel branch, which loop iteration). For a top-level stage this vector is empty, so there is exactly one instance of it.

The cardinality rule then fits in one sentence, verbatim from the code:

Producer P is visible to reader N as a collection ⇔ P’s chain is longer than the common prefix of the two chains. Otherwise — as a single instance.

All of this is easy to see on our review example. As long as reviewer works in its own process, it sees its own copy of the bus and its own working directory, and it doesn’t see the neighbors. The question “which instance?” is resolved trivially for it: for example, there is exactly one ingest for it. But gate runs outside the parallel, and from its point of view reviewer is not one result, but one per parallel branch. Hence the simple rule: from inside we read one branch (c.dir), from outside we read a list (c.dirs). For loops it unrolls the same way. Even the tricky case “the actor reads the critic from the previous iteration” resolves by itself: the runtime gives the latest existing instance. In compiler theory this analysis is called instance-wise dataflow, the classic theory of array accesses, which fits the agentic pipeline as if it were made for it.

And physically it’s all almost disappointingly simple: every output directory is named by the full instance vector — work/<stage>/<i0>/<i1>/…. The producer’s write, the branch bookkeeping and the consumer’s read are computed from the same pair (stage, vector). They can’t diverge — there is nothing for them to diverge about.
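The cardinality rule and the directory naming are small enough to write down. The sketch below is an illustration, not the reharness source: Chain, commonPrefixLength, visibility and outputDir are hypothetical names, and a chain is modeled as the list of enclosing composite names from outer to inner.

```typescript
// Iteration space of a state: names of the enclosing composites, outer to inner.
type Chain = string[];

function commonPrefixLength(a: Chain, b: Chain): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// The producer is a collection for the reader iff its chain is longer
// than the common prefix of the two chains; otherwise a single instance.
function visibility(producer: Chain, reader: Chain): "collection" | "single" {
  return producer.length > commonPrefixLength(producer, reader) ? "collection" : "single";
}

// Output directory named by the full instance vector: work/<stage>/<i0>/<i1>/...
function outputDir(stage: string, instance: number[]): string {
  return ["work", stage].concat(instance.map(String)).join("/");
}

// Chains from the review example: ingest and gate are top-level,
// reviewer lives inside the review parallel.
const ingestChain: Chain = [];
const gateChain: Chain = [];
const reviewerChain: Chain = ["review"];
```

For these chains, visibility(ingestChain, reviewerChain) is "single" and visibility(reviewerChain, gateChain) is "collection", which is exactly the c.dir versus c.dirs split. The second branch of the parallel would write to outputDir("reviewer", [1]), that is work/reviewer/1.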

So in total the data travels over three channels:

The scalar bus — used for routing.
Per-stage directories — artifacts, with the wiring derived from the graph.
External targets — the only channel of communication with the outside world, declared through <inputs>.

Other People’s Theorems

We have the runtime and we have the IR to program it. The last important step before building the generator is static analysis: we must be able to tell AI slop from correctly working programs. For this we will again turn to compiler theory instead of reinventing the wheel with homegrown checks whose robustness I really don’t want to prove myself. In reharness there are just two lightweight engines for static analysis; together they take 78 lines of code.

The first engine checks the reachability of states in the graph using breadth-first search, nothing much to discuss here.
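For completeness, a sketch of what such an engine can look like. This is my own illustration and not the reharness code: Graph, reachableFrom, reverseGraph and checkReachability are hypothetical names, the graph is a plain successor map that must list every state as a key, and composite branches are left out. The same search run on the reversed graph from the final states finds dead ends.

```typescript
type Graph = Record<string, string[]>; // state -> successors

// Breadth-first search from a set of start states.
function reachableFrom(graph: Graph, starts: string[]): Record<string, true> {
  const seen: Record<string, true> = Object.create(null);
  const queue: string[] = [];
  for (const s of starts) {
    if (!seen[s]) { seen[s] = true; queue.push(s); }
  }
  while (queue.length > 0) {
    const cur = queue.shift() as string;
    for (const next of graph[cur] || []) {
      if (!seen[next]) { seen[next] = true; queue.push(next); }
    }
  }
  return seen;
}

function reverseGraph(graph: Graph): Graph {
  const rev: Graph = Object.create(null);
  for (const src of Object.keys(graph)) {
    if (!rev[src]) rev[src] = [];
    for (const dst of graph[src] || []) {
      const list = rev[dst] || (rev[dst] = []);
      list.push(src);
    }
  }
  return rev;
}

// Forward check: every state is reachable from the start.
// Backward check: every state can reach some final state.
function checkReachability(
  graph: Graph,
  initial: string,
  finals: string[]
): { unreachable: string[]; deadEnds: string[] } {
  const forward = reachableFrom(graph, [initial]);
  const backward = reachableFrom(reverseGraph(graph), finals);
  const all = Object.keys(graph);
  return {
    unreachable: all.filter((s) => !forward[s]),
    deadEnds: all.filter((s) => !backward[s]),
  };
}
```

A clean machine returns two empty lists; anything else names the offending states directly.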

The second engine is more tricky, and it’s easiest to understand on our same review example. gate writes the scalar data.blocking, and route reads it. What we want to know here: is this scalar actually written on all the paths leading into route? Because the graph was drawn by a model — and if it created a path around gate, then on that path route will read something that doesn’t exist.

We could do a full enumeration over all the paths for this, but computationally it’s hard, and besides, a more beautiful solution was already invented for this. The trick is this: instead of paths, we compute for every node the set “what is guaranteed to be written by the moment we arrived here”. There are just two rules: a node that writes a scalar adds it to its set, and where several branches merge, we take the intersection of what arrived along them — only what is guaranteed on every branch stays guaranteed. If the question is not “is it guaranteed” but “is it possible on at least some path”, everything works the same but with the union — this is exactly how “whose outputs are visible to a node” from the previous section is computed. Loops remain, but they are also simple: we walk the graph again and again until the sets stop changing, and they can’t grow forever — there is a finite number of scalars, so the stop is guaranteed.

That’s the whole engine: in the textbooks it’s called the Kam–Ullman monotone dataflow framework, where the version with intersection is the MUST analysis, the version with union is the MAY analysis, and “the sets can’t grow forever” is its convergence theorem. The whole machinery is two lines:

IN[n]  = merge of OUT[p] over all predecessors p      (entry: IN = ∅)
OUT[n] = IN[n] ∪ gen(n)
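Here is how those two lines can look as running code for the MUST variant. It is a sketch under my own assumptions, not the reharness implementation: Edges, Writes, mustBeWritten and readsBeforeWrite are hypothetical names, every state must appear as a key of the successor map, and non-entry nodes start from the full set of scalars so that intersection can only shrink them (a common initialization for MUST problems, chosen here for illustration).

```typescript
type Edges = Record<string, string[]>;  // state -> successors
type Writes = Record<string, string[]>; // state -> scalars it writes (gen)

// For every node: the scalars guaranteed to be written on every path from entry.
function mustBeWritten(edges: Edges, writes: Writes, entry: string): Record<string, string[]> {
  const nodes = Object.keys(edges);
  // All scalars, in a fixed order, so that sets can be compared as strings.
  const universe: string[] = [];
  for (const n of nodes) {
    for (const v of writes[n] || []) {
      if (universe.indexOf(v) < 0) universe.push(v);
    }
  }
  const preds: Record<string, string[]> = Object.create(null);
  for (const n of nodes) {
    for (const s of edges[n] || []) {
      const list = preds[s] || (preds[s] = []);
      list.push(n);
    }
  }
  const inSets: Record<string, string[]> = Object.create(null);
  const outSets: Record<string, string[]> = Object.create(null);
  for (const n of nodes) outSets[n] = universe.slice();

  let changed = true;
  while (changed) {
    changed = false;
    for (const n of nodes) {
      // IN[n] = intersection of OUT[p] over predecessors; the entry starts empty.
      let inN: string[] = n === entry ? [] : universe.slice();
      if (n !== entry) {
        for (const p of preds[n] || []) {
          const outP = outSets[p] || universe;
          inN = inN.filter((v) => outP.indexOf(v) >= 0);
        }
      }
      // OUT[n] = IN[n] ∪ gen(n)
      const gen = writes[n] || [];
      const outN = universe.filter((v) => inN.indexOf(v) >= 0 || gen.indexOf(v) >= 0);
      if (outN.join(",") !== (outSets[n] || []).join(",")) changed = true;
      inSets[n] = inN;
      outSets[n] = outN;
    }
  }
  return inSets;
}

// Definite assignment: report every read that is not covered by IN.
function readsBeforeWrite(edges: Edges, writes: Writes, reads: Record<string, string[]>, entry: string): string[] {
  const inSets = mustBeWritten(edges, writes, entry);
  const problems: string[] = [];
  for (const n of Object.keys(reads)) {
    for (const v of reads[n] || []) {
      if ((inSets[n] || []).indexOf(v) < 0) {
        problems.push(n + " reads " + v + " before it is guaranteed to be written");
      }
    }
  }
  return problems;
}
```

On the review graph the check comes back empty. Add a single edge from has_changes straight to route, and it reports that route reads blocking on a path that never passed through gate.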

Every check in reharness is a thin instance on top of these two engines:

check	what it is, formally
every state is reachable from start	forward reachability
every state can reach a final (no dead ends)	backward reachability from the finals
no scalar is read before it’s written	forward MUST dataflow = definite assignment
visible producers and their cardinality	MAY reachability + the common-prefix rule

Look at the bottom row of the table: deriving the data wiring and checking it is one and the same analysis, not two pieces of code that have to negotiate with each other. And a second trick of the same kind: the question “where does a state lead?” is answered in the whole system by one function — successors, and it is used by the reachability check, the data analysis and the codegen. Three consumers look at the graph with the same eyes — and the whole class of bugs “the validator assumed one thing, the codegen did another” disappears by construction.

The Rite of Passage

So, piece by piece, our compiler is almost ready; what’s left is to tie everything together. And here is a fun thing: the compiler is itself a reharness pipeline, a machine running on its own runtime. This shows that the target language is expressive enough for real work.

The machine itself looks like this:

research → prd → [review_prd: approval] → design → construct → fill → check_dataflow → polish → verify → done
                                                              verify FAIL → fix_verify (≤2) → verify
                                                              polish escalate → redesign (rare)

The frontend is multi-language — like gcc, which compiles C, C++ and Fortran into one IR. There are three source languages: a request in natural language; an amendment (amend) — folding a new feature into the existing PRD as an intent delta, keeping the already filled leaves; and a recorded trace — the log of a task solved once. All three converge on the PRD: a short document about what should be built.

Of the three source languages, the trace deserves a separate pause — it’s the most unusual one. A recorded session is a demonstration, and turning a demonstration into a reusable program is programming by demonstration: generalization from one example. The generalization here is explanation-based, and the recipe is exactly what it sounds like: explain why the trajectory reached the goal, keep the weakest preconditions that this explanation needs, and throw away everything specific to the run. The concrete repository becomes a parameter, dead ends and wandering are discarded, and what survives is the structure — and it is exactly the thing worth compiling. Two properties make this robust in practice. First, the trace is grounded before generalization, and the priority order matters: what the trace actually shows (the real call, the real response) beats the documentation, the documentation beats the web, and the web beats the model’s memory. Second, the whole construction is format-agnostic by design: the session is read raw — JSONL, markdown export, a pasted chat, doesn’t matter — because the universal parser here is the model itself, and you need no per-vendor adapters and no lock-in to anyone’s log format.

Next comes the only human gate, placed exactly where it’s cheap. The human approves the PRD — the intent, never the structure. It’s worth spelling out the asymmetry here: validating intent is cheap, and people are good at it; reviewing a generated state graph for semantic correctness is expensive, and people are, let’s be honest, bad at it. So the gate stands on the only truly catastrophic failure — “built the wrong thing entirely” — and everything downstream belongs to the compiler. A demonstration, by the way, buys no special trust: traces go through the same gate.

There is exactly one creative pass — design. It emits the only thing that really requires a model: the topology graph plus a behavioral contract per node — that same IR. And it self-corrects in a hot context: design runs under RPC, and after every turn the runtime re-prompts it with the errors from the analysis library, until the IR comes out clean. Essentially these are semantic compile errors returned to the author — it’s just that the author here is a model, and it fixes them without leaving the room.

Everything below design is deterministic. construct is the lowering: codegen from IR into TypeScript, stubs for the agent prompts, the argument parser from <inputs>. fill writes the leaf implementations against their contracts — the only LLM work below design, and strictly local: a leaf sees its contract, not the graph. check_dataflow runs the definite assignment. polish is one review pass of the whole pipeline against the PRD, editing only the leaves; deliberately not a review→fix→re-review loop — the loop was expensive and bought nothing. verify is the objective backstop: type compilation plus structural checks; a failure goes to a bounded fix_verify (two rounds maximum), which edits only the implementations.

And the final check is done by execution, not by faith. The result of compilation is an executable graph, which means it can be dry-run with the model leaves stubbed out: zero tokens, while the control flow, the guards and the derived wiring actually walk all the way to a terminal. Don’t trust that it “compiled” — watch how it runs. And on real runs the cost is observed, not estimated: every agent leaf reports its actual spend, the runtime sums it into the verdict. A fully amortized pipeline prints 0 agent runs · 0 tokens · $0.0000. The central claim of this post is falsifiable with one line of output — any day.

The Ring Closes

One organ remains, the one that AOT compilers offer only as an opt-in build step, but that grown-up JITs run all the time: profile-guided optimization. A compiled pipeline keeps improving from its own runs; it’s just that the profile is not branch counters, but agent traces. The precise name for this is speedup learning, a relative of explanation-based generalization, at the granularity of a sub-routine.

Every run leaves a verdict behind, and on failure the repair step reads the trace, finds the root cause and fixes the offending leaf without touching the node’s contract. It escalates to a graph change only if the fix doesn’t fit into a leaf. After the repair the pipeline is re-verified and re-run on the original arguments: “fixed” is confirmed by execution, not by faith. This is the same principle that verify lives by, so the system is consistent in its epistemology.

On success the system walks the agent leaves: it refines the attached skills, if the trace showed that a skill lied, and it hunts for a repeating deterministic sub-routine that the agent reinvents by hand on every run — to freeze it into a callable tool.

And here the eighties have a warning prepared for us, called the utility problem: if you cache everything you learned, you become slower — every cached item itself costs context and selection time. The field burned itself on this decades before LLMs made it expensive also in dollars. So a frozen tool passes two gates: acquisition — it is correct by construction (parses, passes its self-test, respects the sandbox) — and retention — it keeps its place only while the call frequency justifies it. If nobody calls it — the tool gets unbound and goes to the archive, and this way the system learns what pays off and unlearns what didn’t.

If you step back, you can see the ring that closes here: the runtime executes what the pipeline built, and produces traces. The traces feed the frontend — compiling a demonstration is, after all, compiling someone’s trace — and they feed the optimizer, which improves the pipeline from its own runs. The same artifact ends up both on the input and on the output, and for a reasoning compiler this is not a coincidence: reasoning is its raw material.

Four Rungs

So, the compiler exists — but what to compile with it? The ladder of applications starts on your own desk and ends, surprisingly, in the enterprise.

Rung one: your own routine. Everything you do with an agent regularly is a candidate for compilation. Mail digests, log triage, weekly reports, dependency checks, changelogs: a recurring task compiles once for a couple of dollars and then runs on cron for months. The sweet spot is exactly the repetition: a one-off task doesn’t justify the compilation, a live agent is cheaper; but everything you run at least weekly amortizes the compilation within a couple of runs. And if the task is mechanical, the pipeline doesn’t call the model at all — I have compiled commands that haven’t spent a single token since the compilation day, and they are doing fine. The cheapest agent is the one that never calls the model.

Rung two: compiling traces, including other people’s. This is where it gets interesting, because the source language stops being your head. Solve a task with a live agent once, take the trace, compile it, and nobody solves this task from scratch anymore. In a team it looks like this: the most experienced engineer shows once how to deploy, review or migrate properly, and their session becomes a tool that everyone runs. A demo instead of a spec, and no game of telephone. And since HuggingFace started putting agent traces right on the Hub, you can compile other people’s demonstrations too: I ran reharness on open trace datasets, and someone else’s session turns into your working pipeline. The more open traces exist, the more ready-made programs lie in the commons, for now in unassembled form.

Rung three: a backend for assistants. Conversational harnesses of the OpenClaw/Hermes class are great at dialogue and judgment — and they pay tokens for every run of the routine they grind daily. The industry already considers it normal that an agentic task burns 10-100x more tokens than a direct model call — and the lion’s share of this multiplier goes not to judgment, but to re-interpreting the same structure: what’s next, did I check, what was there last time. Plug the compiler in as a backend — and the assistant keeps the conversation and the judgment, while the recognized routine drives off into a compiled pipeline at ~$0. This hits hardest for small local models: their token budget is tight as it is, and burning it on orchestration is exactly the waste this whole series started from.

Rung four: the enterprise, surprisingly. The main brake on agents in serious organizations is not the price, it’s the nondeterminism. The auditor needs “which rule fired and why”; the agent reasons differently on every run; compliance reaches for the cigarettes. A compiled pipeline cuts this knot: the agent’s flexibility stays at compile time, and the execution gets the auditability of the good old RPA — deterministic transitions, a full trace of every run, observed cost, byte-for-byte reproducibility on the mechanical parts, and exactly one labeled node where the judgment happened. The classic automation cases — invoice processing, ticket triage — are exactly high volume plus one recurring decision: one agent leaf in a deterministic skeleton, straight from the textbook.

Four rungs, one product: a recurring task stops paying for reasoning on every run. Big models made us lazy — “the model will figure it out” — and we pay for this figuring-out again and again. A compiler is a refusal to pay twice.

And the part that still makes me smile: the foundation for all of this was found not in fresh papers about agents, but in compiler theory that is decades old. Futamura, Moore machines, polyhedral analysis, Kam–Ullman fixpoints — it all was lying on the shelf, waiting for a domain that didn’t exist when it was written. The machine I used to write by hand for every task, the compiler now derives on its own.

Try it

npm install -g reharness

# compile a pipeline from a request
reharness compile "code review FSM for this project"

# or from a recorded session
reharness compile --from-session ./session.jsonl

# dry-run: the whole graph, stubs instead of agents, zero tokens
reharness <command> --dry-run

# run it for real
reharness <command> args

For the agent leaves you need at least one backend on PATH: Pi (the default) or Claude Code (--provider claude — the agents run on a subscription instead of per-token billing). A compiled pipeline is just a regular reharness/ directory in your project: read it, version it, carry it with you.

Apache 2.0: github.com/bes-dev/reharness · npm: npmjs.com/package/reharness