The Spec Is the Oracle, Not the Generator
An inversion buried in the shen-derive pivot that generalizes to almost every verified-code project.
Shen-Backpressure — the project this series is about — contains a tool called shen-derive. There’s a note in its design doc that I think contains one of the most important small ideas in the repository. It describes why the tool was rewritten. The v1 version was a code generator — a Bird-Meertens rewrite engine that parsed a typed lambda calculus, applied algebraic laws like foldr-fusion and map-fusion, and lowered the result into idiomatic Go loops. It worked. It hit a wall. The v2 version threw the code generator away and kept the evaluator. The spec, which had been the input to the code generator, became the oracle that judges a hand-written implementation.
The note frames this as an engineering pivot. I think it’s a deeper point that quietly generalizes to almost every formal-methods project in history, and I want to unpack why.
Two ways to use a precise specification
Any precise, executable specification admits two different uses. You can generate implementations from it, or you can judge implementations against it. These look similar. They are not.
When you generate, you start with the spec and mechanically derive a target-language implementation. The rewrite engine walks the spec, applies transformations, produces code. The code is correct by construction, because every step in the derivation is sound. This is the Bird-Meertens dream; it’s also the Coq-extraction dream, the Lean-to-binary dream, the “compile my proof to executable code” dream.
When you judge, you start with the implementation — written however you like, by whoever you like — and run the spec against it. You pick sample inputs. You evaluate the spec. You evaluate the implementation. If they disagree anywhere, the implementation fails. The spec never becomes the code; it becomes the thing the code has to answer to.
Both approaches use the same spec. Both end up with verified code. The difference is which direction the verification arrow points.
Why generation hits a ceiling
Generation from a spec sounds like the right move. It has proof-by-construction; it has no gap between the verified object and the shipping object; it gives you the fantasy of a language where “correct” and “efficient” are the same file. The people who try it are not fools. It’s the reasonable thing to try.
It hits a ceiling for a reason that has nothing to do with any particular project. The ceiling is that the set of programs you can generate from a spec is bounded by the rewrite laws and lowering patterns you have implemented. Adding a new shape of computation means adding a new law. And a new lowering pattern. And proving the law sound. And testing the lowering. Every domain-specific wrinkle — guard types, constrained values, field accessors, performance idioms — fights the type system of the rewrite engine. The catalog grows, the engine gets more complex, and the set of specs it accepts gets more fragile rather than more expressive. You end up with a very clever thing that handles exactly the cases its authors thought of.
This is not a bug in any one rewrite engine. It is the shape of the problem. Generation is constructive — you are building the output from parts — and the parts have to cover the whole space of outputs you want. If your spec can express a shape of computation your engine cannot lower, you are stuck. Either you extend the engine (at cost) or you constrain the spec (at cost). Neither path scales.
Why judgment does not
The judgment approach — spec as oracle, implementation written freely — has the opposite shape. The set of programs you can verify is not bounded by the transformations you can perform. It is bounded only by what your spec can express. If Shen can say it, shen-derive can judge something against it. The implementation can use any idiom, any data structure, any optimization the author (or LLM) wants, as long as the outputs match the spec’s outputs on the sampled inputs.
The ceiling has moved. Instead of being the expressiveness of your lowering patterns, it is the expressiveness of your spec language. For any reasonable spec language, that ceiling is much higher. You trade constructive correctness (the output came from the spec, so it’s right by derivation) for pointwise correctness (the output matches the spec on the inputs we checked). You lose a theoretical guarantee and gain an enormous amount of practical reach.
This is the same trade that let property-based testing eat formal methods’ lunch in industry. QuickCheck didn’t win because generators of correct-by-construction programs were inadequate; it won because you could hand it a predicate about any program you had already written and get immediate feedback. It met code where code lived. The predicate-as-oracle approach is small, late-binding, and scales to whatever program you can describe.
The deeper point: precise specifications are more useful as judges than as constructors. The constructive path always has a ceiling that is the sum of the transformations you’ve implemented. The judgmental path has a ceiling that is only the expressiveness of the spec. In almost every engineering context, expressiveness is cheaper to add than transformations are.
The AI-coding twist
This inversion becomes sharper in the AI-coding setting, because the AI changes which end of the pipeline is cheap.
In the pre-AI world, the argument for generation was partly an argument about human effort. Humans are slow to write implementations. Humans make transcription errors. A code generator — even one with a low ceiling — saves labor in the region where it works. The fact that it doesn’t scale to the full expressiveness of the spec language is a real cost but a bounded one; the generator does the 80% and humans handle the long tail.
In the AI-coding world, implementations are cheap. The LLM can write Go fast, and it can write variations of Go fast, and it can rewrite Go after feedback. What is expensive is not producing code — it is producing code that respects an invariant. The bottleneck moves from writing to judging.
This is precisely where the oracle model wins and the generator model loses. The oracle model treats the implementation as disposable — write it, judge it, rewrite it, judge again — and concentrates the precise work on the spec. The generator model treats the spec as disposable — write it, compile it, ship the output — and concentrates the precise work on the lowering engine. In an environment where implementations are nearly free and invariants are not, you want the precise work concentrated on the invariants.
The shen-derive pivot is really the repo noticing this and adjusting. The spec still exists, still matters, and still has to be right. But the role of the spec has moved. It is no longer the thing the implementation is derived from. It is the thing the implementation is judged against. The implementation is produced freely — by the LLM, by the human, by whatever process can put bytes on disk — and the spec’s job is to say yes or no.
The generalization
I think this is the right way to see almost every formal-verification project.
If your spec language can both generate and judge, judge. The set of programs you can verify is larger, the toolchain is simpler, the spec can stay the same, and the thing being verified is the thing that ships. The hypothetical benefit of proof-by-construction is usually not worth the ceiling it imposes on what you can express.
If your spec language can only generate — if it’s a code generator posing as a verification tool — the ceiling is real and will be hit. You can extend the catalog forever and always discover the next case you haven’t covered. The path out is not more laws; it’s a richer spec and a separate, simpler judgment harness.
If your spec language can only judge — if it’s a predicate language without a code-generation story — you have most of what you want. The thing you are missing (generating implementations) was mostly a luxury anyway, and the LLM can fill it in cheaper than the generator could.
The spec is the oracle. Anything else you do with it is optional. The repo’s pivot from rewrite engine to verification gate is not a retreat; it’s an alignment with how precise specifications actually deliver value in programs people ship.