The LLM Does Not Know It Is Being Formally Verified
Behavioral constraints degrade under autonomy; structural ones do not. You do not need smarter models — you need smarter types.
The multi-tenant API demo that emerged from an autonomous LLM loop took a while to fully register with me. Eight implementation steps, zero gate failures, a formally verified authorization proof chain — and the LLM never saw the Shen specification. It never reasoned about sequent calculus. It never thought about authorization invariants at all. It wrote ordinary Go code, the code compiled, and the types happened to enforce a formal proof chain. This distinction — between an AI that understands formal verification and an AI that is structurally constrained into producing formally verified output — matters more than any other technical detail in the system, and I think the implications extend considerably further than most people working in this space have yet recognized.
The current wave of formal-verification-meets-AI research assumes the LLM should be the proof engineer. Lean Copilot automates roughly 74% of proof steps in Lean 4, requiring about two manual steps per proof on average. DeepSeek-Prover-V2 achieves what its authors describe as state-of-the-art theorem proving performance. ATLAS synthesizes Dafny programs with specifications and proofs, with fine-tuned models gaining significant improvements on DafnyBench. Self-Spec has the LLM generate its own pre- and postconditions before generating code. All of these approaches ask essentially the same question: can the LLM reason about formal logic? There is an implicit assumption in this line of research that I suspect is often not articulated clearly — namely, that the path to formally verified AI-generated code must run through the AI’s own capacity for logical reasoning. This seems natural enough, but it is worth examining whether the assumption is actually necessary, or whether it reflects a kind of intellectual path dependence inherited from how humans do formal verification.
Shen-Backpressure asks a different question entirely: does the LLM need to reason about proofs at all? The five-gate pipeline does not ask the LLM to write proofs. It does not even show the LLM the formal specification. A human writes a Shen spec, roughly fifty lines of sequent calculus. Shengen generates guard types with unexported fields and validated constructors. The LLM writes implementation code against those types. Five gates run in sequence — shengen sync, tests, build, shen tc+, tcb audit — and gate failures are injected into the next LLM prompt. The cycle repeats until all gates pass. The LLM’s job, then, is simply to make the code compile and the tests pass. But because the types were generated from a formal spec, making the code compile turns out to be equivalent to satisfying the formal invariants. The compiler is the proof checker; the LLM is, for lack of a better term, the code monkey.
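The post never shows Shengen's output, but the shape it describes (unexported fields, validated constructors) is easy to sketch. Here is a minimal, hypothetical example of what one generated guard type might look like in Go; the names and the validation rule are invented stand-ins, not the real generated code:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TenantID is a hypothetical guard type. Its field is unexported, so
// code outside this package cannot construct a TenantID except through
// the validated constructor below.
type TenantID struct {
	raw string
}

// NewTenantID is the only way to obtain a TenantID. The check here is
// a stand-in for whatever invariant the Shen spec would encode.
func NewTenantID(raw string) (TenantID, error) {
	if raw == "" || strings.ContainsAny(raw, " \t\n") {
		return TenantID{}, errors.New("invalid tenant id")
	}
	return TenantID{raw: raw}, nil
}

// String is the accessor that extracts the raw value for SQL, JSON,
// or templates.
func (t TenantID) String() string { return t.raw }

func main() {
	id, err := NewTenantID("acme-corp")
	fmt.Println(id.String(), err)

	_, err = NewTenantID("")
	fmt.Println(err)
}
```

Holding a `TenantID` is then itself the evidence of validation: any function that accepts one can trust it without re-checking.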
I think this works better than one might initially expect, for reasons that become clearer when you consider the specific failure modes of each approach. Current models are genuinely excellent at writing code that compiles. They understand type errors. They know how to respond to messages like “cannot use X as type Y” or “unexported field.” This is the easiest kind of feedback for an LLM to act on, because the error message tells it more or less exactly what is wrong. Contrast this with asking an LLM to write a Lean proof: the error messages from proof assistants are notoriously opaque, referring to proof states, tactic failures, and unification problems that even experienced developers struggle with. There is an asymmetry here that I suspect is underappreciated. The information content of a type error, from the perspective of guiding correction, is dramatically higher than the information content of a proof assistant error, even though the proof assistant error is nominally about a “higher-level” concern.
This connects to a more general observation about the difference between type errors and test failures as feedback mechanisms. When a test fails, the LLM receives something like “Expected 200, got 403.” It then has to figure out why. Maybe the auth is wrong. Maybe the data setup is wrong. Maybe the test itself is wrong. The causal distance between the symptom and the fix can be enormous. When a type check fails, the LLM gets “cannot use string as type TenantAccess.” The fix is structural — it needs to construct a TenantAccess value, which requires an AuthenticatedPrincipal and a membership proof. The type system guides the LLM through the proof chain without the LLM knowing it is being guided. I find this particularly interesting because it suggests that the feedback hierarchy in AI coding loops is not simply a matter of strictness but of informational density. Not all forms of backpressure are created equal, and the distinction between them illuminates something about the nature of the constraints we are imposing.
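The proof chain described here can be sketched directly in Go. This is a hypothetical reconstruction, not the system's generated code; the point is only that each constructor takes the previous guard type as a parameter, so skipping a step fails to type-check:

```go
package main

import (
	"errors"
	"fmt"
)

// AuthenticatedPrincipal can only come from Authenticate.
type AuthenticatedPrincipal struct{ userID string }

func Authenticate(token string) (AuthenticatedPrincipal, error) {
	if token == "" {
		return AuthenticatedPrincipal{}, errors.New("missing token")
	}
	return AuthenticatedPrincipal{userID: "user-for-" + token}, nil
}

// TenantAccess can only come from NewTenantAccess, which demands an
// AuthenticatedPrincipal plus a membership check. Passing a raw string
// in place of the principal would fail to type-check.
type TenantAccess struct {
	principal AuthenticatedPrincipal
	tenant    string
}

func NewTenantAccess(p AuthenticatedPrincipal, tenant string, members map[string]bool) (TenantAccess, error) {
	if !members[p.userID+"/"+tenant] {
		return TenantAccess{}, errors.New("not a member of tenant")
	}
	return TenantAccess{principal: p, tenant: tenant}, nil
}

func main() {
	p, err := Authenticate("tok-1")
	if err != nil {
		panic(err)
	}
	members := map[string]bool{"user-for-tok-1/acme": true}
	ta, err := NewTenantAccess(p, "acme", members)
	fmt.Println(ta.tenant, err)
	// NewTenantAccess("alice", "acme", members) would not compile:
	// a string is not an AuthenticatedPrincipal.
}
```

An LLM hitting the resulting type error has exactly one structural fix available: call `Authenticate` first. That is the guidance-without-awareness the paragraph describes.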
Guard types have a further property that seems well-matched to how LLMs actually write code: they constrain construction, not usage. Once you have a ResourceAccess value, you can pass it around, store it, return it — no restrictions. The constraint is narrow, requiring you to prove the invariants to create the value, and the freedom is broad, allowing you to use the value however you want. LLMs are good at following patterns once they have the right types, and they are good at fixing constructor errors. They are notably bad at maintaining invisible invariants across large codebases. Guard types make the invariants visible and the violations loud. This seems to me a subtle but important point about the design of constraints for autonomous systems generally: the most effective constraints are those that are narrow at the point of entry and permissive everywhere else, because this matches the error profile of the agents being constrained.
The implications for AI safety seem worth considering carefully. As AI coding agents become more autonomous — running in loops, making architectural decisions, committing code — the question of how to constrain them becomes pressing. The dominant approach is behavioral: train the model to be careful, add review steps, use tests as gates. Shen-Backpressure offers a structural alternative: make the correct behavior the only behavior that compiles. The model does not need to be careful. It does not need to understand the invariants. It needs to produce code that type-checks, and the invariants are enforced by the compiler, not by the model’s judgment. The difference is between something like “please remember to check tenant access,” which is a behavioral constraint that the model might forget, and “this function requires a TenantAccess parameter,” which is a structural constraint where forgetting produces a compile error. I suspect this distinction will become increasingly important as models become more autonomous. Behavioral constraints degrade under autonomy in ways that structural constraints do not, and this asymmetry is not, as far as I can tell, widely appreciated in the current discourse around AI coding safety.
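To make the contrast concrete, here is a hedged sketch of the structural version, with invented names and a trivial stand-in for the real authorization step:

```go
package main

import "fmt"

// TenantAccess has an unexported field, so a handler cannot forge one;
// it must come from an authorization step elsewhere.
type TenantAccess struct{ tenant string }

// AuthorizeTenant stands in for the real validated constructor.
func AuthorizeTenant(tenant string) TenantAccess {
	return TenantAccess{tenant: tenant}
}

// listResources requires a TenantAccess parameter. A handler that
// forgot authorization has nothing to pass here, so the omission is
// a compile error rather than a runtime leak.
func listResources(access TenantAccess) string {
	return fmt.Sprintf("SELECT * FROM resources WHERE tenant = %q", access.tenant)
}

func main() {
	access := AuthorizeTenant("acme")
	fmt.Println(listResources(access))
	// listResources("acme") would fail to compile:
	// a string is not a TenantAccess.
}
```

The behavioral version of the same rule lives only in the prompt; this version lives in the signature, where no amount of model drift can erase it.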
Every AI coding loop provides some form of backpressure — error feedback that pushes the LLM toward correct code. But arranging these forms into a hierarchy is instructive:
| Level | Type | What It Catches | Example |
|---|---|---|---|
| 0 | Syntax | Unparseable code | `func main( {` |
| 1 | Type | Structural violations | `cannot use string as Amount` |
| 2 | Build | Compile errors | `undefined: NewTenantAccess` |
| 3 | Test | Behavioral errors | `Expected 200, got 403` |
| 4 | Proof chain | Missing invariant proofs | Can't construct `ResourceAccess` without `TenantAccess` |
| 5 | Deductive | Spec inconsistencies | Contradictory Shen rules |
Most AI coding loops operate at levels zero through three. Shen-Backpressure adds levels four and five. What strikes me about this hierarchy is that the higher levels catch errors that lower levels structurally cannot express: a proof-chain violation is neither a syntax error nor a test failure, but a structural impossibility that only the type system can articulate. This is not merely a quantitative improvement in error coverage; it is a qualitatively different kind of feedback, one that encodes domain-specific invariants into the compilation process itself. The analogy that comes to mind, imperfect as it may be, is the difference between teaching someone the rules of chess and giving them a board where the pieces physically cannot move to illegal squares. Both produce rule-following behavior, but through fundamentally different mechanisms, and with very different failure profiles under stress.
The Ralph loop for the multi-tenant API received a prompt with four rules: wrap raw values at the boundary using guard type constructors; trust guard types internally without re-validating; follow the proof chain by constructing A before B when B requires A; and extract raw values with accessors for SQL, JSON, and templates. That is all. Four rules, plus a plan with eight implementation items. The LLM did not know about sequent calculus. It did not know the types were generated from a formal spec. It followed the rules, wrote code that compiled, and in doing so produced a formally verified authorization system. The AI was constrained into correctness by the type system — not taught, not prompted, constrained.
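Three of the four rules can be sketched in a few lines of Go; the proof-chain rule is simply the observation, shown earlier, that some constructors take other guard types as parameters. The names below are hypothetical, not the demo's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// UserID is an illustrative guard type with an unexported field.
type UserID struct{ raw string }

// Rule 1: wrap raw values at the boundary using the validated constructor.
func NewUserID(raw string) (UserID, error) {
	if raw == "" {
		return UserID{}, errors.New("empty user id")
	}
	return UserID{raw: raw}, nil
}

// Rule 4: extract raw values with accessors for SQL, JSON, and templates.
func (u UserID) String() string { return u.raw }

// Rule 2: internal code trusts the guard type and never re-validates.
func fetchUser(id UserID) string {
	return "SELECT * FROM users WHERE id = '" + id.String() + "'"
}

func handler(rawID string) (string, error) {
	id, err := NewUserID(rawID) // boundary: wrap or reject
	if err != nil {
		return "", err
	}
	return fetchUser(id), nil // interior: pass the guard type along
}

func main() {
	q, err := handler("u42")
	fmt.Println(q, err)
}
```

Nothing in these rules mentions proofs or invariants; they read like ordinary style guidance, which is presumably why four of them were enough.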
Martin Kleppmann’s thesis, as I understand it, is that AI will make formal verification go mainstream because LLMs can write proof scripts. I think the counter-thesis is at least as plausible, and perhaps more so: AI makes formal verification go mainstream because LLMs can write code that compiles, and compilation can be made equivalent to proof checking through spec-derived guard types. The distinction maps onto a broader question that I find myself returning to in many contexts — whether it is more effective to make agents smarter about constraints or to make constraints smarter about agents. In the case of formal verification for AI-generated code, I suspect the latter approach will prove more robust, more scalable, and ultimately more consequential. You do not need smarter models. You need smarter types.