<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Writing</title><description>Essays and notes.</description><link>https://reubenbrooks.dev/</link><language>en-us</language><item><title>The LLM Does Not Know It Is Being Formally Verified</title><link>https://reubenbrooks.dev/blog/deductive-backpressure-why-test-feedback-isn-t-enough-for-ai-coding-loops/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/deductive-backpressure-why-test-feedback-isn-t-enough-for-ai-coding-loops/</guid><description>Behavioral constraints degrade under autonomy; structural ones do not. You do not need smarter models — you need smarter types.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The multi-tenant API demo that emerged from an autonomous LLM loop took a while to fully register with me. Eight implementation steps, zero gate failures, a formally verified authorization proof chain -- and the LLM never saw the Shen specification. It never reasoned about sequent calculus. It never thought about authorization invariants at all. It wrote ordinary Go code, the code compiled, and the types happened to enforce a formal proof chain. This distinction -- between an AI that understands formal verification and an AI that is structurally constrained into producing formally verified output -- matters more than any other technical detail in the system, and I think the implications extend considerably further than most people working in this space have yet recognized.&lt;/p&gt;
&lt;p&gt;The current wave of formal-verification-meets-AI research assumes the LLM should be the proof engineer. Lean Copilot automates roughly 74% of proof steps in Lean 4, requiring about two manual steps per proof on average. DeepSeek-Prover-V2 achieves what its authors describe as state-of-the-art theorem proving performance. ATLAS synthesizes Dafny programs with specifications and proofs, with fine-tuned models gaining significant improvements on DafnyBench. Self-Spec has the LLM generate its own pre- and postconditions before generating code. All of these approaches ask essentially the same question: can the LLM reason about formal logic? There is an implicit assumption in this line of research that I suspect is often not articulated clearly -- namely, that the path to formally verified AI-generated code must run through the AI&apos;s own capacity for logical reasoning. This seems natural enough, but it is worth examining whether the assumption is actually necessary, or whether it reflects a kind of intellectual path dependence inherited from how humans do formal verification.&lt;/p&gt;
&lt;p&gt;Shen-Backpressure asks a different question entirely: does the LLM need to reason about proofs at all? The five-gate pipeline does not ask the LLM to write proofs. It does not even show the LLM the formal specification. A human writes a Shen spec, roughly fifty lines of sequent calculus. Shengen generates guard types with unexported fields and validated constructors. The LLM writes implementation code against those types. Five gates run in sequence -- shengen sync, tests, build, shen tc+, tcb audit -- and gate failures are injected into the next LLM prompt. The cycle repeats until all gates pass. The LLM&apos;s job, then, is simply to make the code compile and the tests pass. But because the types were generated from a formal spec, making the code compile turns out to be equivalent to satisfying the formal invariants. The compiler is the proof checker; the LLM is, for lack of a better term, the code monkey.&lt;/p&gt;
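&lt;p&gt;The shape of that loop is easy to sketch. What follows is illustrative Go, not the actual pipeline tooling: the five gate names come from the description above, but the &lt;code&gt;gate&lt;/code&gt; struct and the stubbed results are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// gate models one pipeline stage; run returns &quot;&quot; on pass,
// or the failure text to inject into the next LLM prompt.
type gate struct {
    name string
    run  func() string
}

// firstFailure returns the name and message of the first failing gate,
// or &quot;&quot;, &quot;&quot; if every gate passes.
func firstFailure(gates []gate) (string, string) {
    for _, g := range gates {
        if msg := g.run(); msg != &quot;&quot; {
            return g.name, msg
        }
    }
    return &quot;&quot;, &quot;&quot;
}

func main() {
    // Stubs standing in for: shengen sync, tests, build, shen tc+, tcb audit.
    gates := []gate{
        {&quot;shengen sync&quot;, func() string { return &quot;&quot; }},
        {&quot;tests&quot;, func() string { return &quot;&quot; }},
        {&quot;build&quot;, func() string { return &quot;cannot use string as type TenantAccess&quot; }},
        {&quot;shen tc+&quot;, func() string { return &quot;&quot; }},
        {&quot;tcb audit&quot;, func() string { return &quot;&quot; }},
    }
    if name, msg := firstFailure(gates); msg != &quot;&quot; {
        fmt.Printf(&quot;gate %q failed: %s\n&quot;, name, msg)
        return
    }
    fmt.Println(&quot;all gates passed&quot;)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The point of the sketch is the control flow: the loop never asks the model to reason about the spec; it replays whichever gate&apos;s error text comes back until none does.&lt;/p&gt;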
&lt;p&gt;I think this works better than one might initially expect, for reasons that become clearer when you consider the specific failure modes of each approach. Current models are genuinely excellent at writing code that compiles. They understand type errors. They know how to respond to messages like &quot;cannot use X as type Y&quot; or &quot;unexported field.&quot; This is the easiest kind of feedback for an LLM to act on, because the error message tells it more or less exactly what is wrong. Contrast this with asking an LLM to write a Lean proof: the error messages from proof assistants are notoriously opaque, referring to proof states, tactic failures, and unification problems that even experienced developers struggle with. There is an asymmetry here that I suspect is underappreciated. The information content of a type error, from the perspective of guiding correction, is dramatically higher than the information content of a proof assistant error, even though the proof assistant error is nominally about a &quot;higher-level&quot; concern.&lt;/p&gt;
&lt;p&gt;This connects to a more general observation about the difference between type errors and test failures as feedback mechanisms. When a test fails, the LLM receives something like &quot;Expected 200, got 403.&quot; It then has to figure out why. Maybe the auth is wrong. Maybe the data setup is wrong. Maybe the test itself is wrong. The causal distance between the symptom and the fix can be enormous. When a type check fails, the LLM gets &quot;cannot use string as type TenantAccess.&quot; The fix is structural -- it needs to construct a TenantAccess value, which requires an AuthenticatedPrincipal and a membership proof. The type system guides the LLM through the proof chain without the LLM knowing it is being guided. I find this particularly interesting because it suggests that the feedback hierarchy in AI coding loops is not simply a matter of strictness but of informational density. Not all forms of backpressure are created equal, and the distinction between them illuminates something about the nature of the constraints we are imposing.&lt;/p&gt;
&lt;p&gt;Guard types have a further property that seems well-matched to how LLMs actually write code: they constrain construction, not usage. Once you have a ResourceAccess value, you can pass it around, store it, return it -- no restrictions. The constraint is narrow, requiring you to prove the invariants to create the value, and the freedom is broad, allowing you to use the value however you want. LLMs are good at following patterns once they have the right types, and they are good at fixing constructor errors. They are notably bad at maintaining invisible invariants across large codebases. Guard types make the invariants visible and the violations loud. This seems to me a subtle but important point about the design of constraints for autonomous systems generally: the most effective constraints are those that are narrow at the point of entry and permissive everywhere else, because this matches the error profile of the agents being constrained.&lt;/p&gt;
&lt;p&gt;The implications for AI safety seem worth considering carefully. As AI coding agents become more autonomous -- running in loops, making architectural decisions, committing code -- the question of how to constrain them becomes pressing. The dominant approach is behavioral: train the model to be careful, add review steps, use tests as gates. Shen-Backpressure offers a structural alternative: make the correct behavior the only behavior that compiles. The model does not need to be careful. It does not need to understand the invariants. It needs to produce code that type-checks, and the invariants are enforced by the compiler, not by the model&apos;s judgment. The difference is between something like &quot;please remember to check tenant access,&quot; which is a behavioral constraint that the model might forget, and &quot;this function requires a TenantAccess parameter,&quot; which is a structural constraint where forgetting produces a compile error. I suspect this distinction will become increasingly important as models become more autonomous. Behavioral constraints degrade under autonomy in ways that structural constraints do not, and this asymmetry is not, as far as I can tell, widely appreciated in the current discourse around AI coding safety.&lt;/p&gt;
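&lt;p&gt;The contrast is easy to make concrete. Here is a minimal sketch with a hypothetical &lt;code&gt;TenantAccess&lt;/code&gt; guard type; the constructor signature is invented for illustration, since the post does not show the real generated types:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

// TenantAccess is a guard type: the unexported field means the validating
// constructor is the only way to obtain a value.
type TenantAccess struct{ tenantID string }

func NewTenantAccess(tenantID string, isMember bool) (TenantAccess, error) {
    if !isMember {
        return TenantAccess{}, errors.New(&quot;principal is not a member of tenant&quot;)
    }
    return TenantAccess{tenantID: tenantID}, nil
}

// The structural constraint: listProjects cannot be called without a
// TenantAccess value. Forgetting the check is a compile error, not a bug.
func listProjects(access TenantAccess) string {
    return &quot;projects for &quot; + access.tenantID
}

func main() {
    access, err := NewTenantAccess(&quot;acme&quot;, true)
    if err != nil {
        panic(err)
    }
    fmt.Println(listProjects(access))
    // listProjects(&quot;acme&quot;) would not compile: cannot use string as TenantAccess.
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The behavioral version of the same rule lives in a comment or a review checklist; the structural version lives in the function signature, where an autonomous loop cannot lose it.&lt;/p&gt;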
&lt;p&gt;Every AI coding loop provides some form of backpressure -- error feedback that pushes the LLM toward correct code. But arranging these forms into a hierarchy is instructive:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What It Catches&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Syntax&lt;/td&gt;
&lt;td&gt;Unparseable code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;func main( {&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;Structural violations&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cannot use string as Amount&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;Compile errors&lt;/td&gt;
&lt;td&gt;&lt;code&gt;undefined: NewTenantAccess&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Test&lt;/td&gt;
&lt;td&gt;Behavioral errors&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Expected 200, got 403&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Proof chain&lt;/td&gt;
&lt;td&gt;Missing invariant proofs&lt;/td&gt;
&lt;td&gt;Can&apos;t construct &lt;code&gt;ResourceAccess&lt;/code&gt; without &lt;code&gt;TenantAccess&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Deductive&lt;/td&gt;
&lt;td&gt;Spec inconsistencies&lt;/td&gt;
&lt;td&gt;Contradictory Shen rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most AI coding loops operate at levels zero through three. Shen-Backpressure adds levels four and five. What strikes me about this hierarchy is that the higher levels catch errors that lower levels structurally cannot express -- a proof chain violation is not a syntax error or a test failure, it is a structural impossibility that only the type system can articulate. This is not merely a quantitative improvement in error coverage; it represents a qualitatively different kind of feedback, one that encodes domain-specific invariants into the compilation process itself. The analogy that comes to mind, imperfect as it may be, is the difference between teaching someone the rules of chess and giving them a board where the pieces physically cannot move to illegal squares. Both produce rule-following behavior, but through fundamentally different mechanisms, and with very different failure profiles under stress.&lt;/p&gt;
&lt;p&gt;The Ralph loop for the multi-tenant API received a prompt with four rules: wrap raw values at the boundary using guard type constructors; trust guard types internally without re-validating; follow the proof chain by constructing A before B when B requires A; and extract raw values with accessors for SQL, JSON, and templates. That is all. Four rules, plus a plan with eight implementation items. The LLM did not know about sequent calculus. It did not know the types were generated from a formal spec. It followed the rules, wrote code that compiled, and in doing so produced a formally verified authorization system. The AI was constrained into correctness by the type system -- not taught, not prompted, constrained.&lt;/p&gt;
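&lt;p&gt;Those four rules translate into a very regular code shape. A sketch, with hypothetical &lt;code&gt;TenantID&lt;/code&gt; and &lt;code&gt;TenantAccess&lt;/code&gt; guard types standing in for the generated ones:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

// Hypothetical guard types standing in for the generated ones.
type TenantID struct{ v string }

func NewTenantID(x string) (TenantID, error) {
    if x == &quot;&quot; {
        return TenantID{}, errors.New(&quot;empty tenant id&quot;)
    }
    return TenantID{v: x}, nil
}
func (t TenantID) Val() string { return t.v }

type TenantAccess struct{ tenant TenantID }

func NewTenantAccess(tenant TenantID, isMember bool) (TenantAccess, error) {
    if !isMember {
        return TenantAccess{}, errors.New(&quot;not a member&quot;)
    }
    return TenantAccess{tenant: tenant}, nil
}
func (t TenantAccess) Tenant() TenantID { return t.tenant }

func handle(rawTenant string) (string, error) {
    // Rule 1: wrap the raw value at the boundary.
    tid, err := NewTenantID(rawTenant)
    if err != nil {
        return &quot;&quot;, err
    }
    // Rule 3: construct A before B when B requires A.
    access, err := NewTenantAccess(tid, true)
    if err != nil {
        return &quot;&quot;, err
    }
    // Rule 2: trust the guard type internally; no re-validation.
    // Rule 4: extract the raw value with an accessor at the next boundary.
    return &quot;tenant_id = &quot; + access.Tenant().Val(), nil
}

func main() {
    q, err := handle(&quot;acme&quot;)
    if err != nil {
        panic(err)
    }
    fmt.Println(q)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing in &lt;code&gt;handle&lt;/code&gt; requires understanding why the rules exist; an LLM that merely follows them reproduces the proof chain.&lt;/p&gt;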
&lt;p&gt;Martin Kleppmann&apos;s thesis, as I understand it, is that AI will make formal verification go mainstream because LLMs can write proof scripts. I think the counter-thesis is at least as plausible, and perhaps more so: AI makes formal verification go mainstream because LLMs can write code that compiles, and compilation can be made equivalent to proof checking through spec-derived guard types. The distinction maps onto a broader question that I find myself returning to in many contexts -- whether it is more effective to make agents smarter about constraints or to make constraints smarter about agents. In the case of formal verification for AI-generated code, I suspect the latter approach will prove more robust, more scalable, and ultimately more consequential. You do not need smarter models. You need smarter types.&lt;/p&gt;
</content:encoded></item><item><title>From Sequent Calculus to Guard Types: How Shengen Works</title><link>https://reubenbrooks.dev/blog/from-sequent-calculus-to-guard-types-how-shengen-works/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/from-sequent-calculus-to-guard-types-how-shengen-works/</guid><description>Six patterns classify every Shen datatype rule, and each generates different code. A tour of the codegen taxonomy and the hidden structure of the validation problem.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There is something striking about the ratios. The payment processor spec is 48 lines of Shen. The generated Go code is 120 lines. The email campaign spec is 93 lines; its generated code, 380 lines. The multi-tenant API spec runs to about 100 lines and produces 290 lines of Go. In every case, the spec is shorter than the generated code, easier to review, and encodes invariants that would otherwise require hundreds of lines of hand-written validation -- validation that, in my experience, someone will eventually forget to call. I suspect most working programmers have an intuitive sense of this problem. We know that validation logic sprawls, that it becomes inconsistent, that the gap between what we intended and what we enforce widens quietly over time. What interests me about Shengen is that it takes a specific position on where that gap comes from and offers a surprisingly compact mechanism for closing it.&lt;/p&gt;
&lt;p&gt;The core notation is sequent calculus, the same formalism used in programming language theory papers. Premises sit above a horizontal line, the conclusion below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype amount
  X : number;
  (&amp;gt;= X 0) : verified;
  ====================
  X : amount;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reads: if X is a number, and X &amp;gt;= 0 is verified, then X inhabits the type &lt;code&gt;amount&lt;/code&gt;. It is not a type annotation or a schema; it is a proof rule, a statement about what must be true for a value to carry this type. If you have read PL theory papers, you have seen this notation before. If you have not, I think it is more accessible than it looks at first glance, because each block is self-contained and there are only a few recurring patterns. You can learn to read it in an hour, which is more than I can say for most formal specification languages I have encountered.&lt;/p&gt;
&lt;p&gt;What Shengen does with these blocks is classify them. Every &lt;code&gt;datatype&lt;/code&gt; block falls into one of six patterns, and each pattern generates different code. I want to walk through all six, because I think the taxonomy itself reveals something interesting about the hidden structure of the validation problem -- that what we usually treat as a single undifferentiated task (checking that data is good) actually decomposes into qualitatively distinct operations, each with different implications for code generation.&lt;/p&gt;
&lt;p&gt;The first pattern is what I would call a wrapper: no validation, just naming.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype account-id
  X : string;
  ==============
  X : account-id;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This generates the most minimal Go imaginable:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type AccountId struct{ v string }
func NewAccountId(x string) AccountId { return AccountId{v: x} }
func (t AccountId) Val() string { return t.v }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No error return, no validation. This is pure naming -- distinguishing an account ID from any other string at the type level. The compiler then prevents &lt;code&gt;NewTransaction(amount, accountId, tenantId)&lt;/code&gt; when you meant &lt;code&gt;NewTransaction(amount, fromAccountId, toAccountId)&lt;/code&gt;. You would be amazed, or perhaps you would not be, at how many production bugs amount to &quot;passed the right type of string to the wrong parameter.&quot; This pattern addresses a class of error that exists entirely in the semantic gap between what the programmer meant and what the type system can see. It is, in a sense, the simplest possible proof: this string is not just any string, it is this particular kind of string.&lt;/p&gt;
&lt;p&gt;The second pattern introduces runtime validation at the boundary -- what I think of as the constrained type:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype amount
  X : number;
  (&amp;gt;= X 0) : verified;
  ====================
  X : amount;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type Amount struct{ v float64 }

func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) {
        return Amount{}, fmt.Errorf(&quot;x must be &amp;gt;= 0: %v&quot;, x)
    }
    return Amount{v: x}, nil
}

func (t Amount) Val() float64 { return t.v }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The constructor now returns &lt;code&gt;(Amount, error)&lt;/code&gt;. The validation runs once, at construction time. After that, any function receiving an &lt;code&gt;Amount&lt;/code&gt; knows it is non-negative without checking again. This is the principle sometimes called &quot;parse, don&apos;t validate&quot; -- the parsed representation carries the proof with it. I find this idea compelling in a way that goes beyond mere convenience. There is a hidden premise in the conventional approach to validation that I think deserves scrutiny: the assumption that checking a value&apos;s properties is something you do at the point of use, repeatedly, scattered across a codebase. This seems natural, but it implicitly treats the knowledge that &quot;this number is non-negative&quot; as something that cannot be captured in the type system, that must be re-established each time. The constrained pattern rejects that premise. Multiple constraints compose straightforwardly; the dosage calculator&apos;s &lt;code&gt;patient-weight&lt;/code&gt; type, for instance, requires both &lt;code&gt;(&amp;gt; X 0)&lt;/code&gt; and &lt;code&gt;(&amp;lt;= X 500)&lt;/code&gt;, and the generated constructor checks both.&lt;/p&gt;
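&lt;p&gt;For the two-premise case, here is a sketch of the constructor Shengen would plausibly generate for &lt;code&gt;patient-weight&lt;/code&gt;; the exact error wording is my assumption, but the two checks mirror the two &lt;code&gt;verified&lt;/code&gt; premises:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// PatientWeight carries two premises: (&amp;gt; X 0) and (&amp;lt;= X 500).
type PatientWeight struct{ v float64 }

func NewPatientWeight(x float64) (PatientWeight, error) {
    if !(x &amp;gt; 0) {
        return PatientWeight{}, fmt.Errorf(&quot;x must be &amp;gt; 0: %v&quot;, x)
    }
    if !(x &amp;lt;= 500) {
        return PatientWeight{}, fmt.Errorf(&quot;x must be &amp;lt;= 500: %v&quot;, x)
    }
    return PatientWeight{v: x}, nil
}

func (t PatientWeight) Val() float64 { return t.v }

func main() {
    if _, err := NewPatientWeight(-3); err != nil {
        fmt.Println(&quot;rejected:&quot;, err)
    }
    w, _ := NewPatientWeight(72.5)
    fmt.Println(&quot;accepted:&quot;, w.Val())
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Order matters only for which error the caller sees first; either failing premise is enough to keep the value from ever existing.&lt;/p&gt;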
&lt;p&gt;The third pattern is the composite, where multiple fields are typed by guard types rather than primitives:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype transaction
  Amount : amount;
  From : account-id;
  To : account-id;
  ===================================
  [Amount From To] : transaction;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type Transaction struct {
    amount Amount
    from   AccountId
    to     AccountId
}

func NewTransaction(amount Amount, from AccountId, to AccountId) Transaction {
    return Transaction{amount: amount, from: from, to: to}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The constructor takes guard types, not primitives. You cannot create a &lt;code&gt;Transaction&lt;/code&gt; with a raw &lt;code&gt;float64&lt;/code&gt;; you need an &lt;code&gt;Amount&lt;/code&gt;, which means passing through &lt;code&gt;NewAmount&lt;/code&gt;, which validates non-negativity. The type system chains the proofs. This is where I think the approach starts to exhibit something genuinely interesting from a theoretical standpoint. Each composite type is not merely a container of validated data; it is a node in a proof graph, and the edges of that graph are enforced by the compiler. A composite type is like a document that can only be assembled from notarized components -- the notarization has already occurred and is attested by the type itself.&lt;/p&gt;
&lt;p&gt;The fourth pattern is what Shengen calls guarded -- a composite that also includes validation involving its fields:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype balance-invariant
  Bal : number;
  Tx : transaction;
  (&amp;gt;= Bal (head Tx)) : verified;
  =======================================
  [Bal Tx] : balance-checked;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type BalanceChecked struct {
    bal float64
    tx  Transaction
}

func NewBalanceChecked(bal float64, tx Transaction) (BalanceChecked, error) {
    if !(bal &amp;gt;= tx.Amount().Val()) {
        return BalanceChecked{}, fmt.Errorf(&quot;bal must be &amp;gt;= tx.amount&quot;)
    }
    return BalanceChecked{bal: bal, tx: tx}, nil
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note how &lt;code&gt;(head Tx)&lt;/code&gt; becomes &lt;code&gt;tx.Amount().Val()&lt;/code&gt;. Shengen resolves the Shen accessor chain -- &lt;code&gt;head&lt;/code&gt; of a three-element list means the first element, which maps to the &lt;code&gt;amount&lt;/code&gt; field -- into Go accessor calls. The premise &lt;code&gt;(&amp;gt;= Bal (head Tx))&lt;/code&gt; becomes &lt;code&gt;bal &amp;gt;= tx.Amount().Val()&lt;/code&gt;. This is where the code generation bridge does real work, translating between the mathematical notation and idiomatic Go. I find this translation step particularly interesting because it sits at exactly the boundary where formal specification languages usually break down. The gap between &quot;what the math says&quot; and &quot;what the code does&quot; is precisely where subtle bugs live, and Shengen&apos;s pattern-classification approach gives it a structured way to cross that gap rather than leaving it to ad-hoc interpretation.&lt;/p&gt;
&lt;p&gt;The fifth pattern is perhaps the most powerful: the proof chain, which requires previous proofs as input.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype safe-transfer
  Tx : transaction;
  Check : balance-checked;
  =============================
  [Tx Check] : safe-transfer;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type SafeTransfer struct {
    tx    Transaction
    check BalanceChecked
}

func NewSafeTransfer(tx Transaction, check BalanceChecked) SafeTransfer {
    return SafeTransfer{tx: tx, check: check}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is no validation logic in the constructor, but it requires a &lt;code&gt;BalanceChecked&lt;/code&gt; value, which can only exist if the balance check passed. The proof is transitive: &lt;code&gt;SafeTransfer&lt;/code&gt; implies &lt;code&gt;BalanceChecked&lt;/code&gt;, which implies &lt;code&gt;bal &amp;gt;= tx.amount&lt;/code&gt;, which implies &lt;code&gt;amount &amp;gt;= 0&lt;/code&gt;. Any function accepting a &lt;code&gt;SafeTransfer&lt;/code&gt; gets all four guarantees for free. In the multi-tenant demo, this chain runs five links deep. At the end, a &lt;code&gt;ResourceAccess&lt;/code&gt; type carries proofs of JWT validity, token freshness, authentication, tenant membership, and resource ownership -- and the handler code simply accepts the type. I suspect this is where the real leverage lies. In conventional code, these kinds of transitive guarantees exist only as comments or as implicit contracts maintained by programmer discipline. The proof chain pattern makes them structural, enforced by the compiler, and -- crucially -- visible in the spec.&lt;/p&gt;
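&lt;p&gt;Gathering the generated snippets above into one file makes the transitivity visible as ordinary Go. The &lt;code&gt;Amount()&lt;/code&gt; accessor on &lt;code&gt;Transaction&lt;/code&gt; is implied by the guarded example rather than shown, so treat this as a sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

type Amount struct{ v float64 }

func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) {
        return Amount{}, fmt.Errorf(&quot;x must be &amp;gt;= 0: %v&quot;, x)
    }
    return Amount{v: x}, nil
}
func (t Amount) Val() float64 { return t.v }

type AccountId struct{ v string }

func NewAccountId(x string) AccountId { return AccountId{v: x} }

type Transaction struct {
    amount Amount
    from   AccountId
    to     AccountId
}

func NewTransaction(amount Amount, from AccountId, to AccountId) Transaction {
    return Transaction{amount: amount, from: from, to: to}
}
func (t Transaction) Amount() Amount { return t.amount }

type BalanceChecked struct {
    bal float64
    tx  Transaction
}

func NewBalanceChecked(bal float64, tx Transaction) (BalanceChecked, error) {
    if !(bal &amp;gt;= tx.Amount().Val()) {
        return BalanceChecked{}, fmt.Errorf(&quot;bal must be &amp;gt;= tx.amount&quot;)
    }
    return BalanceChecked{bal: bal, tx: tx}, nil
}

type SafeTransfer struct {
    tx    Transaction
    check BalanceChecked
}

func NewSafeTransfer(tx Transaction, check BalanceChecked) SafeTransfer {
    return SafeTransfer{tx: tx, check: check}
}

// execute accepts a SafeTransfer and with it every upstream guarantee:
// balance checked, amount non-negative, accounts distinguished by type.
func execute(st SafeTransfer) string {
    return fmt.Sprintf(&quot;transferring %v&quot;, st.tx.Amount().Val())
}

func main() {
    amt, err := NewAmount(50)
    if err != nil {
        panic(err)
    }
    tx := NewTransaction(amt, NewAccountId(&quot;a&quot;), NewAccountId(&quot;b&quot;))
    check, err := NewBalanceChecked(100, tx)
    if err != nil {
        panic(err)
    }
    fmt.Println(execute(NewSafeTransfer(tx, check)))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every &lt;code&gt;err&lt;/code&gt; check sits at a construction site; past that point, the value itself is the evidence.&lt;/p&gt;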
&lt;p&gt;The sixth and final pattern handles sum types, where multiple paths lead to the same conclusion type:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype human-principal
  Auth : authenticated-user;
  ===========================
  Auth : authenticated-principal;)

(datatype service-principal
  Cred : service-credential;
  ============================
  Cred : authenticated-principal;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two blocks produce the same conclusion type. Shengen generates a Go interface with a private marker method:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type AuthenticatedPrincipal interface {
    isAuthenticatedPrincipal()
}

type HumanPrincipal struct { auth AuthenticatedUser }
func (t HumanPrincipal) isAuthenticatedPrincipal() {}

type ServicePrincipal struct { cred ServiceCredential }
func (t ServicePrincipal) isAuthenticatedPrincipal() {}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The private marker method seals the interface -- no external code can implement it. This gives you sum types in Go: an &lt;code&gt;AuthenticatedPrincipal&lt;/code&gt; is either a &lt;code&gt;HumanPrincipal&lt;/code&gt; or a &lt;code&gt;ServicePrincipal&lt;/code&gt;, and nothing else. Cron jobs and API callers can flow through the same proof chain via different entry points. I think this pattern is worth pausing on because Go famously lacks algebraic data types, and the usual workarounds -- empty interface values, stringly-typed discriminators -- sacrifice precisely the kind of compile-time safety that the rest of the system is trying to establish. The sealed interface approach is a well-known Go idiom, but generating it from a formal spec closes the loop in a way that manual coding does not.&lt;/p&gt;
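&lt;p&gt;What the sealed interface buys in practice is a closed type switch. A sketch, with minimal stand-ins for the wrapped guard types:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// Minimal stand-ins for the wrapped guard types.
type AuthenticatedUser struct{ name string }
type ServiceCredential struct{ id string }

type AuthenticatedPrincipal interface{ isAuthenticatedPrincipal() }

type HumanPrincipal struct{ auth AuthenticatedUser }

func (t HumanPrincipal) isAuthenticatedPrincipal() {}

type ServicePrincipal struct{ cred ServiceCredential }

func (t ServicePrincipal) isAuthenticatedPrincipal() {}

// describe branches on the sealed interface; because no external package can
// implement the marker method, these two cases are the only possibilities.
func describe(p AuthenticatedPrincipal) string {
    switch v := p.(type) {
    case HumanPrincipal:
        return &quot;human: &quot; + v.auth.name
    case ServicePrincipal:
        return &quot;service: &quot; + v.cred.id
    }
    return &quot;unreachable&quot;
}

func main() {
    fmt.Println(describe(HumanPrincipal{auth: AuthenticatedUser{name: &quot;reuben&quot;}}))
    fmt.Println(describe(ServicePrincipal{cred: ServiceCredential{id: &quot;cron-1&quot;}}))
}
&lt;/code&gt;&lt;/pre&gt;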
&lt;p&gt;Having walked through the six patterns, I want to return to the question of the spec-to-code ratio, because I think there is something deeper going on than mere brevity. The payment processor spec is 48 lines; the hand-written equivalent, including tests for each validation, would be roughly 200 lines. The email campaign spec is 93 lines versus perhaps 600 hand-written. The multi-tenant API is 100 lines versus perhaps 500. The dosage calculator, 100 lines versus perhaps 700. The spec is reviewable in a way that the generated code is not, and does not need to be. A domain expert can read 48 lines of sequent calculus and verify that the payment invariants are correct. The generated code is correct by construction, modulo bugs in Shengen itself, a caveat I want to flag honestly because it matters: the trust model shifts from &quot;did every developer correctly implement every validation&quot; to &quot;is the code generator correct.&quot; This is a meaningful trade, not an elimination of risk. But I suspect it is a favorable one, because the generator is a single artifact that can be tested and audited, whereas hand-written validation is distributed across an entire codebase and maintained by an ever-changing cast of developers.&lt;/p&gt;
&lt;p&gt;There is a meta-level to this system that I find intriguing. The &lt;code&gt;create-shengen&lt;/code&gt; command is an 875-line specification that teaches an LLM to build a Shengen for any target language. It specifies the input grammar (Shen &lt;code&gt;datatype&lt;/code&gt; blocks), the classification algorithm, the per-language enforcement mechanism (unexported fields in Go, &lt;code&gt;pub(crate)&lt;/code&gt; in Rust, private constructors in TypeScript, closures in Python), and the code generation patterns for each of the six type categories. This is, in effect, a compiler-generating compiler. One LLM invocation produces a complete Shengen for Rust or Python or C#. The formal verification supply chain then becomes: one human writes a spec, one tool -- potentially LLM-generated -- produces guard types, and the target language&apos;s compiler enforces them. I am not entirely sure what to make of this. On one hand, it is an elegant demonstration of how formal specifications can propagate through layers of tooling. On the other hand, it introduces a dependency on LLM correctness at the tooling layer, which raises questions I do not think have been fully answered yet about how one audits and trusts such a chain. It seems to me that the trust problem does not disappear; it shifts shape.&lt;/p&gt;
&lt;p&gt;Shengen&apos;s internals, roughly 1600 lines of Go, work in four stages. First, parsing: read &lt;code&gt;.shen&lt;/code&gt; files, tokenize, parse &lt;code&gt;(datatype ...)&lt;/code&gt; blocks as S-expression trees. Second, classification: build a symbol table mapping Shen type names to Go names and categories, using two-pass resolution to handle cases where the block name differs from the conclusion type name (for example, block &lt;code&gt;balance-invariant&lt;/code&gt; concludes type &lt;code&gt;balance-checked&lt;/code&gt;). Third, resolution: map &lt;code&gt;verified&lt;/code&gt; premises to Go conditions, translate accessor chains like &lt;code&gt;(head X)&lt;/code&gt;, &lt;code&gt;(tail X)&lt;/code&gt;, &lt;code&gt;(head (tail X))&lt;/code&gt; to Go method calls, and handle structural equality premises where accessor resolution fails. Fourth, emission: generate the Go file with package declaration, imports, and one type block per datatype including struct, constructor, accessors, and a String method. The generated file always carries a &lt;code&gt;DO NOT EDIT&lt;/code&gt; header, and Gate 5 of the audit process verifies that the committed file matches what Shengen would generate. Nobody can hand-edit the guards.&lt;/p&gt;
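&lt;p&gt;The classification stage is the one most worth making concrete. The following is a deliberately simplified reconstruction, not Shengen&apos;s real code: the &lt;code&gt;block&lt;/code&gt; fields and the decision order are my assumptions about how the six patterns could be distinguished.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// block is a simplified parse result; the fields are assumptions, not
// Shengen&apos;s real representation.
type block struct {
    premiseTypes   []string // types of the non-verified premises
    hasVerified    bool     // any &quot;(...) : verified&quot; premise
    conclusionList bool     // conclusion is a list like [A B]
    siblings       int      // how many blocks share this conclusion type
}

// classify maps a block to one of the six codegen patterns. Composite and
// proof chain share a shape here; the post distinguishes them by whether
// the premises are themselves proof-carrying types.
func classify(b block, guards map[string]bool) string {
    allGuards := len(b.premiseTypes) &amp;gt; 0
    for _, t := range b.premiseTypes {
        if !guards[t] {
            allGuards = false
        }
    }
    switch {
    case b.siblings &amp;gt; 1:
        return &quot;sum&quot;
    case b.conclusionList &amp;amp;&amp;amp; b.hasVerified:
        return &quot;guarded&quot;
    case b.conclusionList &amp;amp;&amp;amp; allGuards:
        return &quot;composite&quot;
    case b.hasVerified:
        return &quot;constrained&quot;
    default:
        return &quot;wrapper&quot;
    }
}

func main() {
    guards := map[string]bool{&quot;amount&quot;: true, &quot;account-id&quot;: true}
    fmt.Println(classify(block{premiseTypes: []string{&quot;string&quot;}}, guards))
    fmt.Println(classify(block{premiseTypes: []string{&quot;number&quot;}, hasVerified: true}, guards))
    fmt.Println(classify(block{premiseTypes: []string{&quot;amount&quot;, &quot;account-id&quot;}, conclusionList: true}, guards))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Even this toy version shows why a symbol table has to come first: deciding whether a premise type is a guard type requires having already resolved every block&apos;s conclusion.&lt;/p&gt;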
&lt;p&gt;I want to close with an open question that I have been turning over. The six-pattern taxonomy feels right to me -- it captures the validation structures I encounter in practice -- but I wonder whether it is complete. Is there a seventh pattern that would naturally arise in, say, concurrent systems where proofs need to be invalidated and re-established? What about temporal properties, where a proof holds only for a bounded duration? The token freshness check in the multi-tenant demo gestures in this direction, but it treats time as just another constraint, not as a fundamentally different kind of premise. It seems to me that there might be a meaningful distinction between proofs that hold permanently once established and proofs that decay, and that this distinction could warrant its own pattern. I do not have a clear answer here, but I suspect that the boundary of the six-pattern taxonomy marks the boundary of the class of systems for which this approach is most naturally suited, and that understanding where it breaks down would teach us something about the deeper structure of the validation problem.&lt;/p&gt;
</content:encoded></item><item><title>One Rule, Three Layers: How a Shen Type Becomes Both a Compiler Check and a Runtime Guard</title><link>https://reubenbrooks.dev/blog/one-rule-three-layers-how-shen-types-become-compiler-and-runtime-checks/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/one-rule-three-layers-how-shen-types-become-compiler-and-runtime-checks/</guid><description>A single sequent-calculus rule produces deductive, compile-time, and runtime enforcement — not by bolting layers together, but by exploiting a structural coincidence.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most approaches to correctness give you one thing. Tests give you runtime checks against specific cases. TypeScript gives you compile-time structural constraints. Coq gives you deductive proofs. You pick your tool, you get your layer, and you accept that the others are someone else&apos;s problem.&lt;/p&gt;
&lt;p&gt;What I find genuinely interesting about the Shen-Backpressure approach is that a single declaration, one rule written in Shen&apos;s sequent calculus, produces enforcement at all three layers simultaneously. The deductive layer, the compile-time layer, and the runtime layer all emerge from the same source. This is not because someone built an elaborate framework to connect them. It is because the approach exploits a structural coincidence: one piece of information serves simultaneously as a proof rule, a type definition, and a validation check. I want to trace through exactly how this works, show what it looks like across the most interesting domains I have found, and be honest about where the approach has sharp edges.&lt;/p&gt;
&lt;h2&gt;The Three Layers from One Rule&lt;/h2&gt;
&lt;p&gt;Consider a simple rule from a payment system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype amount
  X : number;
  (&amp;gt;= X 0) : verified;
  ====================
  X : amount;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reads: if X is a number, and X &amp;gt;= 0 is verified, then X is an amount. Three things happen to this declaration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: Deductive verification.&lt;/strong&gt; Shen&apos;s own type checker (&lt;code&gt;shen tc+&lt;/code&gt;) verifies that this rule is internally consistent as a sequent calculus judgment. If I write contradictory rules, or rules that cannot compose, Shen catches this before any code is generated. This layer exists in the space of pure logic. It does not know about Go or TypeScript or any target language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Compile-time structural enforcement.&lt;/strong&gt; Shengen, the code generator, reads this rule and emits a Go struct with an unexported field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type Amount struct{ v float64 }

func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) {
        return Amount{}, fmt.Errorf(&quot;x must be &amp;gt;= 0: %v&quot;, x)
    }
    return Amount{v: x}, nil
}

func (t Amount) Val() float64 { return t.v }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The lowercase &lt;code&gt;v&lt;/code&gt; is unexported. Code outside the &lt;code&gt;shenguard&lt;/code&gt; package cannot write &lt;code&gt;Amount{v: -5}&lt;/code&gt;. The Go compiler rejects it. This is not a runtime check and not a lint warning. It is a hard refusal to produce a binary. The constructor is the only path to the type, and the constructor&apos;s return type &lt;code&gt;(Amount, error)&lt;/code&gt; forces the caller to handle the failure case. The compile-time layer comes from Go&apos;s own type system, recruited into service of the Shen spec.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 3: Runtime validation.&lt;/strong&gt; Inside the constructor, &lt;code&gt;if !(x &amp;gt;= 0)&lt;/code&gt; is a runtime check. When someone calls &lt;code&gt;NewAmount(-5)&lt;/code&gt;, they get an error value back. This is the familiar validation pattern, but it is not hand-written. It is generated from the &lt;code&gt;(&amp;gt;= X 0) : verified&lt;/code&gt; premise in the Shen rule. The runtime check is the same information as the deductive premise, expressed in a different medium.&lt;/p&gt;
&lt;p&gt;One rule. Three enforcement mechanisms. No other approach I am aware of gives you all three from a single declaration. Coq and Lean give you layer 1 but not layers 2-3 in mainstream languages. Tests give you layer 3 but not layers 1-2. TypeScript branded types give you a weak version of layer 2 but not layers 1 or 3. The structural coincidence that makes this possible is that Shen&apos;s sequent-calculus rules contain exactly the information needed for all three: the premises define the type shape (layer 2), the side conditions define the validation logic (layer 3), and the rule structure defines the proof (layer 1).&lt;/p&gt;
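&lt;p&gt;To make the three layers concrete, here is a minimal hand-written sketch of what calling the guard looks like from application code. This is illustrative only: the real type is emitted by shengen into the &lt;code&gt;shenguard&lt;/code&gt; package, and the stand-in below inlines it so the example is self-contained.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// Hand-written stand-in for the shengen-generated guard type.
// Layer 2: the unexported field v makes Amount opaque to other packages.
type Amount struct{ v float64 }

// Layer 3: the runtime check generated from the (&amp;gt;= X 0) : verified premise.
func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) {
        return Amount{}, fmt.Errorf(&quot;x must be &amp;gt;= 0: %v&quot;, x)
    }
    return Amount{v: x}, nil
}

func (t Amount) Val() float64 { return t.v }

func main() {
    // Happy path: the premise holds, so the value exists.
    a, err := NewAmount(42.5)
    fmt.Println(a.Val(), err == nil) // 42.5 true

    // Runtime layer: a violated premise comes back as an error value.
    _, err = NewAmount(-5)
    fmt.Println(err != nil) // true

    // Compile-time layer: outside the generating package, the line below
    // would not compile, because v is unexported.
    // bad := Amount{v: -5}
}
&lt;/code&gt;&lt;/pre&gt;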
&lt;h2&gt;Where It Gets Interesting: Composition&lt;/h2&gt;
&lt;p&gt;Simple wrappers like &lt;code&gt;amount&lt;/code&gt; are useful but not especially surprising. The approach becomes genuinely powerful when rules compose, because composition creates enforcement that is purely structural, with no runtime check at all.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype safe-transfer
  Tx : transaction;
  Check : balance-checked;
  =============================
  [Tx Check] : safe-transfer;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This rule has no &lt;code&gt;verified&lt;/code&gt; premises. There is no arithmetic check, no string comparison, no validation logic whatsoever. The generated constructor is infallible:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;func NewSafeTransfer(tx Transaction, check BalanceChecked) SafeTransfer {
    return SafeTransfer{tx: tx, check: check}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Yet this type is profoundly safe. You cannot construct a &lt;code&gt;SafeTransfer&lt;/code&gt; without a &lt;code&gt;BalanceChecked&lt;/code&gt;, and you cannot construct a &lt;code&gt;BalanceChecked&lt;/code&gt; without proving &lt;code&gt;bal &amp;gt;= tx.amount&lt;/code&gt;. The safety is entirely compile-time. The runtime checks happened earlier in the chain, at the leaf types. The composition is free, both in performance and in enforcement effort.&lt;/p&gt;
&lt;p&gt;This is the pattern that I think deserves the most attention: proof chains where the interior nodes carry no runtime cost. The runtime validation happens once, at construction of the leaf types. Everything above that is pure structural enforcement by the compiler. You get the safety of dependent types with the runtime cost of a few comparisons at the boundary.&lt;/p&gt;
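&lt;p&gt;A minimal sketch of the whole chain makes the cost profile visible. The field layouts below are assumptions made so the example is self-contained; the point is that only the two leaf constructors can fail, while the composite at the top cannot.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// Constrained leaf: one runtime comparison.
type Amount struct{ v float64 }

func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) {
        return Amount{}, fmt.Errorf(&quot;amount must be &amp;gt;= 0: %v&quot;, x)
    }
    return Amount{v: x}, nil
}

// Composite: no guard premises, infallible constructor.
type Transaction struct{ amount Amount }

func NewTransaction(a Amount) Transaction { return Transaction{amount: a} }

// Guarded leaf: the other runtime comparison in the chain.
type BalanceChecked struct {
    tx  Transaction
    bal float64
}

func NewBalanceChecked(tx Transaction, bal float64) (BalanceChecked, error) {
    if !(bal &amp;gt;= tx.amount.v) {
        return BalanceChecked{}, fmt.Errorf(&quot;insufficient balance: %v &amp;lt; %v&quot;, bal, tx.amount.v)
    }
    return BalanceChecked{tx: tx, bal: bal}, nil
}

// Pure composite: by construction its premises are already proved,
// so there is nothing left to check at runtime.
type SafeTransfer struct {
    tx    Transaction
    check BalanceChecked
}

func NewSafeTransfer(tx Transaction, check BalanceChecked) SafeTransfer {
    return SafeTransfer{tx: tx, check: check}
}

func main() {
    a, _ := NewAmount(25)
    tx := NewTransaction(a)
    check, err := NewBalanceChecked(tx, 100)
    if err != nil {
        panic(err)
    }
    st := NewSafeTransfer(tx, check)
    fmt.Printf(&quot;%T constructed with no runtime check\n&quot;, st)
}
&lt;/code&gt;&lt;/pre&gt;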
&lt;h2&gt;The Demo Landscape&lt;/h2&gt;
&lt;p&gt;I have been building demos to stress-test this approach across different domains, and the results are instructive because different domains exercise different aspects of the enforcement mechanism. Some of these are fully implemented. Some are specifications waiting for implementations. I will walk through the most interesting ones and flag what still needs to be built.&lt;/p&gt;
&lt;h3&gt;The Grounded Research Pipeline&lt;/h3&gt;
&lt;p&gt;The shen-web-tools demo builds a research pipeline where an AI searches the web, fetches pages, and generates summaries. The key invariant is grounding: the AI cannot cite a source it did not actually retrieve.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype grounded-source
  Page : fetched-page;
  Hit : search-hit;
  (= (head Page) (head (tail Hit))) : verified;
  ===============================================
  [Page Hit] : grounded-source;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;(= (head Page) (head (tail Hit)))&lt;/code&gt; premise is doing something subtle. &lt;code&gt;head Page&lt;/code&gt; extracts the URL from a fetched page (which is &lt;code&gt;[Url Content Timestamp]&lt;/code&gt;). &lt;code&gt;head (tail Hit)&lt;/code&gt; extracts the URL from a search hit (which is &lt;code&gt;[Title Url Snippet]&lt;/code&gt;). The verified premise asserts these are the same URL. You cannot construct a &lt;code&gt;grounded-source&lt;/code&gt; by pairing a fetched page with an unrelated search hit. The URL match is checked at runtime in the constructor. The structural requirement that you &lt;em&gt;have&lt;/em&gt; a &lt;code&gt;grounded-source&lt;/code&gt; before you can produce a &lt;code&gt;research-summary&lt;/code&gt; is checked at compile time.&lt;/p&gt;
&lt;p&gt;This is the pattern in its purest form: the runtime check (URL equality) happens at one specific point, and then the compile-time guarantee (cannot produce ungrounded output) propagates through the rest of the system for free.&lt;/p&gt;
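&lt;p&gt;In Go, the constructor for this rule might look like the sketch below. The field names are assumptions; the shapes follow the &lt;code&gt;[Url Content Timestamp]&lt;/code&gt; and &lt;code&gt;[Title Url Snippet]&lt;/code&gt; layouts from the spec.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

type FetchedPage struct{ url, content string }
type SearchHit struct{ title, url, snippet string }

type GroundedSource struct {
    page FetchedPage
    hit  SearchHit
}

// The (= (head Page) (head (tail Hit))) premise becomes a single
// runtime equality check on the two URLs.
func NewGroundedSource(page FetchedPage, hit SearchHit) (GroundedSource, error) {
    if page.url != hit.url {
        return GroundedSource{}, errors.New(&quot;page URL does not match search hit URL&quot;)
    }
    return GroundedSource{page: page, hit: hit}, nil
}

func main() {
    page := FetchedPage{url: &quot;https://example.com/a&quot;, content: &quot;...&quot;}
    hit := SearchHit{title: &quot;A&quot;, url: &quot;https://example.com/a&quot;}
    if _, err := NewGroundedSource(page, hit); err != nil {
        panic(err)
    }
    // Pairing the page with an unrelated hit is refused at runtime;
    // everything downstream that requires a GroundedSource is then
    // enforced purely by the compiler.
    _, err := NewGroundedSource(page, SearchHit{title: &quot;B&quot;, url: &quot;https://example.com/b&quot;})
    fmt.Println(err != nil) // true
}
&lt;/code&gt;&lt;/pre&gt;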
&lt;p&gt;The same demo includes a pipeline state machine with five stages:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype pipeline-idle     ...)
(datatype pipeline-searching ...)
(datatype pipeline-fetching  ...)
(datatype pipeline-generating ...)
(datatype pipeline-complete  ...)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each stage type carries the outputs of its predecessor. You cannot skip from &lt;code&gt;idle&lt;/code&gt; to &lt;code&gt;generating&lt;/code&gt; because &lt;code&gt;pipeline-generating&lt;/code&gt; requires a &lt;code&gt;(list fetched-page)&lt;/code&gt; that can only come from &lt;code&gt;pipeline-fetching&lt;/code&gt;. The state machine is enforced by the type system. There is no enum with a match statement. There is no &quot;invalid state transition&quot; runtime error. The invalid transition is a type error.&lt;/p&gt;
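&lt;p&gt;A stripped-down sketch of the staged types, with hypothetical payloads, shows why skipping is impossible: the evidence each stage needs exists only inside its predecessor&apos;s type.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

type SearchHit struct{ url string }
type FetchedPage struct{ url, content string }

// Each stage type carries the outputs produced by its predecessor.
type PipelineSearching struct{ query string }
type PipelineFetching struct{ hits []SearchHit }
type PipelineGenerating struct{ pages []FetchedPage }

func StartSearch(query string) PipelineSearching {
    return PipelineSearching{query: query}
}

func FinishSearch(s PipelineSearching, hits []SearchHit) PipelineFetching {
    return PipelineFetching{hits: hits}
}

func FinishFetch(f PipelineFetching, pages []FetchedPage) PipelineGenerating {
    return PipelineGenerating{pages: pages}
}

// Generate accepts only the final stage type: a []FetchedPage exists
// only inside a PipelineGenerating, so the fetch stage cannot be skipped.
func Generate(g PipelineGenerating) string {
    return fmt.Sprintf(&quot;summary of %d pages&quot;, len(g.pages))
}

func main() {
    s := StartSearch(&quot;shen sequent calculus&quot;)
    f := FinishSearch(s, []SearchHit{{url: &quot;https://example.com&quot;}})
    g := FinishFetch(f, []FetchedPage{{url: &quot;https://example.com&quot;, content: &quot;...&quot;}})
    fmt.Println(Generate(g))
    // Generate(s) or Generate(f) is a type error, not a runtime error.
}
&lt;/code&gt;&lt;/pre&gt;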
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build a minimal standalone demo of the pipeline state machine.
Show a simplified 3-stage pipeline (search -&amp;gt; fetch -&amp;gt; summarize) in Go where:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Each stage is a separate type carrying the previous stage&apos;s output&lt;/li&gt;
&lt;li&gt;Attempting to skip a stage is a compile error (show the error message)&lt;/li&gt;
&lt;li&gt;The correct path compiles and runs&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This should be a self-contained main.go + specs/core.shen + generated shenguard/ that someone can clone and run.
Target: ~50 lines of spec, ~100 lines of application code, clear compile error demonstration. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;The Dosage Calculator: Recursive Type-Level Computation&lt;/h3&gt;
&lt;p&gt;Most Shen type rules use simple arithmetic: &lt;code&gt;&amp;gt;=&lt;/code&gt;, &lt;code&gt;&amp;lt;=&lt;/code&gt;, &lt;code&gt;=&lt;/code&gt;. The dosage calculator demo pushes further. Its &lt;code&gt;interaction-clearance&lt;/code&gt; type invokes Shen functions defined within the spec itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(define pair-in-list?
  _ _ [] -&amp;gt; false
  D1 D2 [[D1 D2] | _] -&amp;gt; true
  D1 D2 [[D2 D1] | _] -&amp;gt; true
  D1 D2 [_ | Rest] -&amp;gt; (pair-in-list? D1 D2 Rest))

(define drug-clear-of-list?
  _ [] _ -&amp;gt; true
  Drug [Med | Rest] Pairs -&amp;gt;
    (and (not (pair-in-list? Drug Med Pairs))
         (drug-clear-of-list? Drug Rest Pairs)))

(datatype interaction-clearance
  Drug : drug-name;
  Meds : (list drug-name);
  Pairs : (list (list drug-name));
  (drug-clear-of-list? Drug Meds Pairs) : verified;
  ====================================================
  [Drug Meds Pairs] : interaction-clearance;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is striking because the type-level proof includes recursive list-walking. It is not just &quot;is X &amp;gt;= 0?&quot; It is &quot;walk the entire contraindication list and verify that no forbidden pair exists.&quot; Shen&apos;s type checker, which is itself a Prolog program, executes this computation as part of type checking. The generated Go constructor translates the recursive walk into an iterative helper function.&lt;/p&gt;
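&lt;p&gt;A plausible Go translation of the two recursive helpers is sketched below. I have written this by hand to show the shape of the translation; the actual shengen output may differ in detail.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// pairInList mirrors the three non-empty cases of pair-in-list?:
// the pair [d1 d2] matches in either order.
func pairInList(d1, d2 string, pairs [][2]string) bool {
    for _, p := range pairs {
        if (p[0] == d1 &amp;amp;&amp;amp; p[1] == d2) || (p[0] == d2 &amp;amp;&amp;amp; p[1] == d1) {
            return true
        }
    }
    return false
}

// drugClearOfList mirrors drug-clear-of-list?: the drug is clear only
// if it forms no forbidden pair with any current medication.
func drugClearOfList(drug string, meds []string, pairs [][2]string) bool {
    for _, med := range meds {
        if pairInList(drug, med, pairs) {
            return false
        }
    }
    return true
}

func main() {
    contraindicated := [][2]string{{&quot;warfarin&quot;, &quot;aspirin&quot;}}
    fmt.Println(drugClearOfList(&quot;aspirin&quot;, []string{&quot;warfarin&quot;}, contraindicated))  // false
    fmt.Println(drugClearOfList(&quot;aspirin&quot;, []string{&quot;metformin&quot;}, contraindicated)) // true
}
&lt;/code&gt;&lt;/pre&gt;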
&lt;p&gt;The final &lt;code&gt;safe-administration&lt;/code&gt; type requires both &lt;code&gt;dose-in-range&lt;/code&gt; (the dosage falls within the therapeutic range for this patient&apos;s weight) and &lt;code&gt;interaction-clearance&lt;/code&gt; (no contraindicated drug interactions). Both proofs must exist before you can administer. The composition is compile-time. The individual checks are runtime.&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build the dosage calculator as a complete working demo.
The spec exists at demo/dosage-calculator/specs/core.shen. What&apos;s needed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run shengen to generate the guard types (guards_gen.go)&lt;/li&gt;
&lt;li&gt;Build out cmd/server/main.go with HTTP handlers for:
&lt;ul&gt;
&lt;li&gt;POST /check-dose: accepts {drug, weight_kg, dose_mg, current_medications}&lt;/li&gt;
&lt;li&gt;Returns either a safe-administration proof summary or a structured error&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Write tests demonstrating:
&lt;ul&gt;
&lt;li&gt;Valid administration (correct dose, no interactions) succeeds&lt;/li&gt;
&lt;li&gt;Overdose attempt fails at dose-in-range construction&lt;/li&gt;
&lt;li&gt;Drug interaction fails at interaction-clearance construction&lt;/li&gt;
&lt;li&gt;Non-positive patient weight (weight &amp;lt;= 0) fails at patient-weight construction&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The demo should be runnable with &lt;code&gt;go run ./cmd/server&lt;/code&gt; and testable with &lt;code&gt;go test ./...&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Key insight to highlight: the recursive drug-clear-of-list? function in Shen becomes a Go helper function. Show both side by side. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;Multi-Tenant Authorization: The Proof Chain as Security Architecture&lt;/h3&gt;
&lt;p&gt;The multi-tenant demo has the longest proof chain: seven types, each requiring its predecessor. I have written about this one in detail elsewhere, but the key observation for this piece is how different links in the chain exercise different enforcement mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;JwtToken&lt;/code&gt; and &lt;code&gt;TokenExpiry&lt;/code&gt; are &lt;strong&gt;constrained wrappers&lt;/strong&gt; (runtime validation of non-empty string, timestamp comparison)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AuthenticatedUser&lt;/code&gt; is a &lt;strong&gt;pure composite&lt;/strong&gt; (no runtime check, purely structural, requires the two proofs above)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AuthenticatedPrincipal&lt;/code&gt; is a &lt;strong&gt;sum type&lt;/strong&gt; (Go interface with private marker method, compile-time closed set)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TenantAccess&lt;/code&gt; and &lt;code&gt;ResourceAccess&lt;/code&gt; are &lt;strong&gt;guarded composites&lt;/strong&gt; (runtime boolean check + structural requirement for the proofs above)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A single proof chain exercises four of the six shengen categories. The runtime checks happen at the leaves and at the boolean gates. The structural composition in between is free. The sum type is enforced by Go&apos;s interface mechanism. This is the demo that best illustrates how the three layers interact in a real system.&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Create a &quot;proof chain anatomy&quot; diagram/document.
For the multi-tenant demo, trace through a single request and annotate each step:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;JWT string arrives as raw input&lt;/li&gt;
&lt;li&gt;NewJwtToken() — RUNTIME CHECK (non-empty string)&lt;/li&gt;
&lt;li&gt;NewTokenExpiry() — RUNTIME CHECK (exp &amp;gt; now)&lt;/li&gt;
&lt;li&gt;NewAuthenticatedUser() — NO RUNTIME CHECK (pure composition)&lt;/li&gt;
&lt;li&gt;NewHumanPrincipal() — NO RUNTIME CHECK (sum type variant construction)&lt;/li&gt;
&lt;li&gt;NewTenantAccess() — RUNTIME CHECK (isMember == true)&lt;/li&gt;
&lt;li&gt;NewResourceAccess() — RUNTIME CHECK (isOwned == true)&lt;/li&gt;
&lt;li&gt;Handler receives ResourceAccess — COMPILE-TIME ONLY (type signature)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Show the Go code at each step, mark which layer is doing the enforcement,
and calculate: out of 7 construction steps, only 4 have runtime cost.
The other 3 are pure compile-time structural enforcement.&lt;/p&gt;
&lt;p&gt;This should be a code walkthrough with annotations, not prose.
Format as a markdown document with code blocks and callout boxes. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;Email Campaigns: Relational Invariants Across Types&lt;/h3&gt;
&lt;p&gt;The email demo has the most unusual arithmetic constraint I have encountered in any of the specs: &lt;code&gt;(= 0 (shen.mod X 10))&lt;/code&gt; on &lt;code&gt;age-decade&lt;/code&gt;. This does not just require a number in a range. It requires the number to be a valid decade (10, 20, 30, ..., 100). The modular arithmetic is checked at runtime in the constructor.&lt;/p&gt;
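&lt;p&gt;The generated guard for &lt;code&gt;age-decade&lt;/code&gt; plausibly reduces to a modulus check; the sketch below is my reconstruction, with the 10..100 bounds assumed from the list of valid decades above, not taken from shengen output.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

type AgeDecade struct{ v int }

// The modulus check comes from (= 0 (shen.mod X 10)); the range
// bounds are an assumption based on the valid decades 10..100.
func NewAgeDecade(x int) (AgeDecade, error) {
    if x &amp;lt; 10 || x &amp;gt; 100 || x%10 != 0 {
        return AgeDecade{}, fmt.Errorf(&quot;not a valid decade: %d&quot;, x)
    }
    return AgeDecade{v: x}, nil
}

func main() {
    _, err := NewAgeDecade(35)
    fmt.Println(err != nil) // true: 35 is not a decade
    d, _ := NewAgeDecade(40)
    fmt.Println(d.v) // 40
}
&lt;/code&gt;&lt;/pre&gt;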
&lt;p&gt;But the more interesting invariant is relational. The &lt;code&gt;copy-delivery&lt;/code&gt; type asserts that the demographics embedded in a user profile match the demographics the email copy was written for:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype copy-delivery
  Profile : known-profile;
  Copy : copy-content;
  (= (tail (tail (head Profile))) (tail Copy)) : verified;
  =========================================================
  [Profile Copy] : copy-delivery;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This navigates the internal structure of two different composite types via &lt;code&gt;head&lt;/code&gt; and &lt;code&gt;tail&lt;/code&gt;, extracts specific fields, and asserts equality between them. It is the only demo where a verified premise compares subfields of two different composite types rather than comparing a field to a constant or to another field of the same type. This matters because it shows the approach can enforce relational constraints, properties that span multiple objects, not just local constraints on a single object.&lt;/p&gt;
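&lt;p&gt;The generated constructor for this rule would compare subfields of the two composites. The sketch below flattens the demographics into a comparable struct; the real field layout is defined by the spec, so treat the names here as assumptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

type Demographics struct {
    gender    string
    ageDecade int
}

type KnownProfile struct {
    email string
    demo  Demographics
}

type CopyContent struct {
    body string
    demo Demographics
}

type CopyDelivery struct {
    profile KnownProfile
    copy    CopyContent
}

// The (= (tail (tail (head Profile))) (tail Copy)) premise becomes a
// runtime equality check between subfields of two different composites.
func NewCopyDelivery(p KnownProfile, c CopyContent) (CopyDelivery, error) {
    if p.demo != c.demo {
        return CopyDelivery{}, errors.New(&quot;copy was written for different demographics&quot;)
    }
    return CopyDelivery{profile: p, copy: c}, nil
}

func main() {
    d := Demographics{gender: &quot;f&quot;, ageDecade: 30}
    p := KnownProfile{email: &quot;a@example.com&quot;, demo: d}
    _, err1 := NewCopyDelivery(p, CopyContent{body: &quot;...&quot;, demo: d})
    _, err2 := NewCopyDelivery(p, CopyContent{body: &quot;...&quot;, demo: Demographics{gender: &quot;m&quot;, ageDecade: 50}})
    fmt.Println(err1 == nil, err2 != nil) // true true
}
&lt;/code&gt;&lt;/pre&gt;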
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build a standalone demo of a relational cross-type constraint.
Use the email_crud spec as inspiration but simplify to the core pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype region-config
  Region : string;
  Currency : string;
  TaxRate : number;
  =============================
  [Region Currency TaxRate] : region-config;)

(datatype order-pricing
  Config : region-config;
  Price : number;
  Currency : string;
  (= (head (tail Config)) Currency) : verified;  \\ currency must match
  (&amp;gt;= Price 0) : verified;
  ============================================
  [Config Price Currency] : order-pricing;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The demo should show:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Constructing a region-config for &quot;US&quot; with &quot;USD&quot; and 0.08 tax rate&lt;/li&gt;
&lt;li&gt;Constructing an order-pricing that matches the region&apos;s currency (succeeds)&lt;/li&gt;
&lt;li&gt;Attempting to construct an order-pricing with a mismatched currency (fails)&lt;/li&gt;
&lt;li&gt;Show how the generated Go code translates the &lt;code&gt;head (tail Config)&lt;/code&gt; accessor chain&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This illustrates relational invariants in ~20 lines of spec. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;Closed Enumerations: Rejecting LLM Hallucination&lt;/h3&gt;
&lt;p&gt;The Medicare subdomain of shen-web-tools contains what I think is the most directly practical pattern for AI-assisted development: closed enumerations.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype medicare-plan-type
  X : string;
  (element? X [&quot;original&quot; &quot;advantage&quot; &quot;part-d&quot; &quot;supplement&quot; &quot;part-a&quot; &quot;part-b&quot;]) : verified;
  ========================================================================================
  X : medicare-plan-type;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;element?&lt;/code&gt; check compiles to &lt;code&gt;map[string]bool{...}[val]&lt;/code&gt; in Go, &lt;code&gt;new Set([...]).has(val)&lt;/code&gt; in TypeScript, &lt;code&gt;[...].contains(&amp;amp;val)&lt;/code&gt; in Rust. If an LLM generates a plan type of &lt;code&gt;&quot;premium&quot;&lt;/code&gt; or &lt;code&gt;&quot;gold tier&quot;&lt;/code&gt; or any other hallucinated string, the constructor rejects it. The closed set is enforced at runtime, and the opaque type prevents bypass at compile time.&lt;/p&gt;
&lt;p&gt;The same spec has a &lt;code&gt;panel-kind&lt;/code&gt; enumeration with 15 valid values for dashboard panel types. When an LLM generates a UI layout, every panel must have a valid kind. Hallucinated panel types are rejected at the guard boundary. This is backpressure applied directly to LLM output quality.&lt;/p&gt;
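&lt;p&gt;A hand-written sketch of the generated Go guard for &lt;code&gt;medicare-plan-type&lt;/code&gt;: the &lt;code&gt;element?&lt;/code&gt; premise becomes a set-membership test over exactly the strings in the spec.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// The closed set, taken verbatim from the element? premise.
var validPlanTypes = map[string]bool{
    &quot;original&quot;: true, &quot;advantage&quot;: true, &quot;part-d&quot;: true,
    &quot;supplement&quot;: true, &quot;part-a&quot;: true, &quot;part-b&quot;: true,
}

type MedicarePlanType struct{ v string }

func NewMedicarePlanType(x string) (MedicarePlanType, error) {
    if !validPlanTypes[x] {
        return MedicarePlanType{}, fmt.Errorf(&quot;not a valid plan type: %q&quot;, x)
    }
    return MedicarePlanType{v: x}, nil
}

func main() {
    // An LLM-hallucinated value is refused at the guard boundary.
    _, err := NewMedicarePlanType(&quot;gold tier&quot;)
    fmt.Println(err != nil) // true
    p, _ := NewMedicarePlanType(&quot;advantage&quot;)
    fmt.Println(p.v) // advantage
}
&lt;/code&gt;&lt;/pre&gt;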
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build a demo of closed enumeration as LLM hallucination prevention.
Scenario: An LLM generates structured JSON describing a dashboard layout.
Each panel has a &quot;kind&quot; field. The spec defines exactly which kinds are valid.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype panel-kind
  X : string;
  (element? X [&quot;bar-chart&quot; &quot;line-chart&quot; &quot;pie-chart&quot; &quot;table&quot; &quot;metric-card&quot;
               &quot;scatter-plot&quot; &quot;heatmap&quot; &quot;timeline&quot; &quot;map&quot; &quot;text-block&quot;]) : verified;
  =================================================================================
  X : panel-kind;)

(datatype dashboard-panel
  Kind : panel-kind;
  Title : string;
  DataSource : string;
  (not (= Title &quot;&quot;)) : verified;
  (not (= DataSource &quot;&quot;)) : verified;
  ====================================
  [Kind Title DataSource] : dashboard-panel;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Build a Go service that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Accepts JSON from an LLM (simulated) describing a dashboard layout&lt;/li&gt;
&lt;li&gt;Validates each panel through the guard types&lt;/li&gt;
&lt;li&gt;Demonstrates: valid panels pass, hallucinated panel kinds are rejected,
empty titles are rejected&lt;/li&gt;
&lt;li&gt;Show the error messages that would be fed back to the LLM in a Ralph loop&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is the &quot;backpressure on LLM output&quot; story made concrete.
Target: ~30 lines of spec, ~80 lines of Go, clear before/after of valid vs invalid LLM output. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;Order State Machine: Deadlock Freedom as a Type Error&lt;/h3&gt;
&lt;p&gt;Of the demos that do not yet exist, this is the one I am most excited about. The concept: encode a state machine&apos;s valid transitions in the Shen type system such that invalid transitions are compile errors and deadlock freedom (every non-terminal state has at least one outward transition) is a property of the spec itself.&lt;/p&gt;
&lt;p&gt;The idea is that each state is a type, and each transition is a function whose type signature encodes the source and destination states:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype order-created   ...)
(datatype order-paid      ...)
(datatype order-shipped   ...)
(datatype order-delivered  ...)  \\ terminal
(datatype order-cancelled  ...)  \\ terminal

(datatype transition-pay
  Order : order-created;
  Payment : payment-info;
  (&amp;gt; (head Payment) 0) : verified;
  =================================
  [Order Payment] : order-paid;)

(datatype transition-ship
  Order : order-paid;
  Tracking : tracking-number;
  ============================
  [Order Tracking] : order-shipped;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;transition-pay&lt;/code&gt; function requires an &lt;code&gt;order-created&lt;/code&gt; and produces an &lt;code&gt;order-paid&lt;/code&gt;. You cannot ship an order that has not been paid because &lt;code&gt;transition-ship&lt;/code&gt; requires &lt;code&gt;order-paid&lt;/code&gt;, not &lt;code&gt;order-created&lt;/code&gt;. The state machine is the type system. Invalid transitions are not runtime errors; they are compile errors.&lt;/p&gt;
&lt;p&gt;Deadlock freedom comes from the spec structure: if every non-terminal state appears as a premise in at least one transition rule, then every non-terminal state has at least one valid outward transition. Shen&apos;s type checker can verify this property.&lt;/p&gt;
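&lt;p&gt;Since this demo does not yet exist, the Go below is a hand sketch of what the generated transition constructors might look like; every name and payload is an assumption based on the spec fragment above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
    &quot;errors&quot;
    &quot;fmt&quot;
)

type OrderCreated struct{ id string }

type OrderPaid struct {
    order   OrderCreated
    payment float64
}

type OrderShipped struct {
    order    OrderPaid
    tracking string
}

// transition-pay: requires order-created, has a guard, yields order-paid.
func TransitionPay(o OrderCreated, payment float64) (OrderPaid, error) {
    if !(payment &amp;gt; 0) {
        return OrderPaid{}, errors.New(&quot;payment must be positive&quot;)
    }
    return OrderPaid{order: o, payment: payment}, nil
}

// transition-ship: requires order-paid. Passing an OrderCreated here is
// a compile error, so &quot;ship before pay&quot; cannot even be written.
func TransitionShip(o OrderPaid, tracking string) OrderShipped {
    return OrderShipped{order: o, tracking: tracking}
}

func main() {
    created := OrderCreated{id: &quot;ord-1&quot;}
    paid, err := TransitionPay(created, 19.99)
    if err != nil {
        panic(err)
    }
    shipped := TransitionShip(paid, &quot;TRK123&quot;)
    fmt.Println(shipped.tracking) // TRK123
    // TransitionShip(created, &quot;TRK123&quot;) // does not compile: wrong state type
}
&lt;/code&gt;&lt;/pre&gt;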
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build the order state machine demo from scratch.
This is the most ambitious demo and does not yet exist.&lt;/p&gt;
&lt;p&gt;Phase 1 — The spec (specs/core.shen):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;8 states: created, paid, processing, shipped, delivered (terminal), cancelled (terminal), refund-requested, refunded (terminal)&lt;/li&gt;
&lt;li&gt;Valid transitions encoded as datatype rules where the source state is a premise and the destination state is the conclusion&lt;/li&gt;
&lt;li&gt;Each transition may carry additional data (payment info, tracking number, refund reason)&lt;/li&gt;
&lt;li&gt;Some transitions have guards (payment amount &amp;gt; 0, tracking number non-empty)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Phase 2 — Generated guard types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run shengen to generate the Go types&lt;/li&gt;
&lt;li&gt;Each state is a struct with unexported fields&lt;/li&gt;
&lt;li&gt;Each transition is a constructor function: NewOrderPaid(created OrderCreated, payment PaymentInfo) (OrderPaid, error)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Phase 3 — Application code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A simple order management API with handlers for each transition&lt;/li&gt;
&lt;li&gt;GET /order/:id returns current state&lt;/li&gt;
&lt;li&gt;POST /order/:id/pay, /ship, /cancel, /refund-request, /refund&lt;/li&gt;
&lt;li&gt;Each handler takes the current state type and attempts the transition&lt;/li&gt;
&lt;li&gt;Invalid transitions are compile errors (the handler literally cannot accept the wrong state type)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Phase 4 — The deadlock freedom test:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write a test or verification script that checks: for every non-terminal state type,
there exists at least one transition rule that accepts it as a premise&lt;/li&gt;
&lt;li&gt;This is a static property of the spec, verifiable by Shen or by inspecting the AST&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Phase 5 — The compelling demonstration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Show an LLM attempting to add a &quot;ship from created&quot; shortcut&lt;/li&gt;
&lt;li&gt;The code won&apos;t compile because transition-ship requires order-paid&lt;/li&gt;
&lt;li&gt;Show the compiler error message&lt;/li&gt;
&lt;li&gt;Show the LLM correcting itself by adding the payment step&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This should be a full working demo in demo/order-state-machine/.
It is the strongest demonstration of &quot;impossible by construction&quot; applied to business logic. --&amp;gt;&lt;/p&gt;
&lt;h3&gt;Sum Types: Multiple Valid Paths to the Same Proof&lt;/h3&gt;
&lt;p&gt;The multi-tenant demo uses sum types for &lt;code&gt;AuthenticatedPrincipal&lt;/code&gt;: both &lt;code&gt;HumanPrincipal&lt;/code&gt; and &lt;code&gt;ServicePrincipal&lt;/code&gt; produce the same interface type. In Go, this becomes an interface with a private marker method. In Rust, a sealed trait. In TypeScript, a union type.&lt;/p&gt;
&lt;p&gt;What makes this interesting for the three-layer story is that the sum type adds a &lt;em&gt;fourth&lt;/em&gt; enforcement mechanism: the closed variant set. In Go, the private marker method &lt;code&gt;isAuthenticatedPrincipal()&lt;/code&gt; means external packages cannot add new variants. The compiler enforces the closed set at compile time. You cannot introduce a &lt;code&gt;BotPrincipal&lt;/code&gt; without modifying the &lt;code&gt;shenguard&lt;/code&gt; package, which is generated and audited by Gate 5.&lt;/p&gt;
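&lt;p&gt;The Go side of the pattern is small enough to sketch in full. The marker method below is unexported, which is what closes the variant set; the names are mine, not the generated ones.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

// The unexported marker method means only types declared in this
// package can satisfy the interface: the variant set is closed.
type AuthenticatedPrincipal interface {
    isAuthenticatedPrincipal()
}

type HumanPrincipal struct{ userID string }
type ServicePrincipal struct{ serviceID string }

func (HumanPrincipal) isAuthenticatedPrincipal()   {}
func (ServicePrincipal) isAuthenticatedPrincipal() {}

// Callers can switch exhaustively over the closed variant set.
func describe(p AuthenticatedPrincipal) string {
    switch v := p.(type) {
    case HumanPrincipal:
        return &quot;human:&quot; + v.userID
    case ServicePrincipal:
        return &quot;service:&quot; + v.serviceID
    default:
        return &quot;unreachable for in-package variants&quot;
    }
}

func main() {
    fmt.Println(describe(HumanPrincipal{userID: &quot;u1&quot;}))      // human:u1
    fmt.Println(describe(ServicePrincipal{serviceID: &quot;s1&quot;})) // service:s1
    // A BotPrincipal defined in another package cannot implement
    // isAuthenticatedPrincipal, so it can never be a variant.
}
&lt;/code&gt;&lt;/pre&gt;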
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build a standalone demo of sum types with closed variant enforcement.
Use a simpler domain than multi-tenant to isolate the pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype shape-circle
  Radius : number;
  (&amp;gt; Radius 0) : verified;
  =========================
  Radius : shape;)

(datatype shape-rectangle
  Width : number;
  Height : number;
  (&amp;gt; Width 0) : verified;
  (&amp;gt; Height 0) : verified;
  ==========================
  [Width Height] : shape;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Generate guard types in Go showing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The Shape interface with private marker method&lt;/li&gt;
&lt;li&gt;Circle and Rectangle as the only implementors&lt;/li&gt;
&lt;li&gt;A function &lt;code&gt;area(s Shape) float64&lt;/code&gt; that uses a type switch&lt;/li&gt;
&lt;li&gt;Attempting to add a new variant outside the package fails to compile&lt;/li&gt;
&lt;li&gt;Show the compile error message&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Then generate the same spec in TypeScript showing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;type Shape = Circle | Rectangle&lt;/code&gt; union&lt;/li&gt;
&lt;li&gt;Exhaustive pattern matching with never-type check&lt;/li&gt;
&lt;li&gt;Adding a new variant causes a TS compile error in the match&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This demonstrates the same Shen sum-type rule enforced via different
language mechanisms. --&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Category System: Six Patterns, Six Enforcement Profiles&lt;/h2&gt;
&lt;p&gt;Shengen classifies every Shen datatype rule into one of six categories, and each category has a distinct enforcement profile. Understanding these categories is the key to understanding what you get from a given rule.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Shen Shape&lt;/th&gt;
&lt;th&gt;Compile-Time&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;wrapper&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single premise, primitive type, no guards&lt;/td&gt;
&lt;td&gt;Opaque field, forced constructor&lt;/td&gt;
&lt;td&gt;None (infallible)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;account-id&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;constrained&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single premise, primitive type, with guards&lt;/td&gt;
&lt;td&gt;Opaque field, forced constructor, error return&lt;/td&gt;
&lt;td&gt;Validated in constructor&lt;/td&gt;
&lt;td&gt;&lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;email-address&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;alias&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single premise, non-primitive type, no guards&lt;/td&gt;
&lt;td&gt;Type synonym (transparent)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;&lt;code&gt;prompt-required = unknown-profile&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;composite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bracketed conclusion, no guards&lt;/td&gt;
&lt;td&gt;Opaque fields, forced constructor&lt;/td&gt;
&lt;td&gt;None (infallible)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;transaction&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;guarded&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bracketed conclusion, with guards&lt;/td&gt;
&lt;td&gt;Opaque fields, forced constructor, error return&lt;/td&gt;
&lt;td&gt;Validated in constructor&lt;/td&gt;
&lt;td&gt;&lt;code&gt;balance-checked&lt;/code&gt;, &lt;code&gt;tenant-access&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sumtype&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple blocks, same conclusion&lt;/td&gt;
&lt;td&gt;Closed interface/trait/union&lt;/td&gt;
&lt;td&gt;Per-variant (varies)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;authenticated-principal&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The pattern I want to highlight: &lt;strong&gt;composite&lt;/strong&gt; types have zero runtime cost. They are pure structural enforcement. This means that in a proof chain, the intermediate composition steps are free. Only the leaf types (constrained, guarded) carry runtime validation cost. The deeper your proof chain, the better the ratio of compile-time to runtime enforcement.&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build a &quot;category showcase&quot; that demonstrates all six categories in one spec.
Design a small domain (perhaps a document management system) where one spec file
exercises all six shengen categories:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;wrapper: DocumentId (string wrapper, no validation)&lt;/li&gt;
&lt;li&gt;constrained: PageCount (number, must be &amp;gt; 0)&lt;/li&gt;
&lt;li&gt;alias: DraftDocument = Document (type synonym)&lt;/li&gt;
&lt;li&gt;composite: Document [DocumentId Title PageCount] (no cross-field guards)&lt;/li&gt;
&lt;li&gt;guarded: PublishedDocument [Document ReviewerSignoff] where (= Approved true) : verified&lt;/li&gt;
&lt;li&gt;sumtype: AccessLevel = ReadOnly | ReadWrite (two blocks, same conclusion)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Generate Go guard types and write a short program that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creates each type, showing the constructor signature&lt;/li&gt;
&lt;li&gt;Demonstrates which constructors return errors (constrained, guarded) vs which are infallible (wrapper, composite)&lt;/li&gt;
&lt;li&gt;Shows the compile error when trying to bypass the opaque field&lt;/li&gt;
&lt;li&gt;Annotates each type with its category and enforcement profile&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is the &quot;Rosetta Stone&quot; for understanding shengen output.
Target: ~40 lines of spec, ~60 lines of generated Go, ~50 lines of demonstration code. --&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Killer Insight: Proof Chains Have Diminishing Runtime Cost&lt;/h2&gt;
&lt;p&gt;Here is what I think is the most underappreciated aspect of this approach. Consider a proof chain of depth N:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;leaf-type-1 (runtime check)
  └─&amp;gt; intermediate-1 (compile-time only)
       └─&amp;gt; intermediate-2 (compile-time only)
            └─&amp;gt; ... (compile-time only)
                 └─&amp;gt; final-proof (compile-time only)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The runtime cost is O(leaves), not O(depth). Every intermediate composition step is a &lt;code&gt;composite&lt;/code&gt; with an infallible constructor. The compiler enforces the chain; the runtime validates the inputs. As your invariants get more sophisticated and your proof chains get deeper, the &lt;em&gt;proportion&lt;/em&gt; of enforcement that is compile-time increases. You pay for validation at the boundary and get structural guarantees throughout the interior for free.&lt;/p&gt;
&lt;p&gt;This is the opposite of how runtime validation typically scales. In a conventional system, deeper validation logic means more runtime checks, more error handling, more test cases. Here, deeper proof chains mean more compile-time enforcement and proportionally less runtime work. The formal properties scale by recruiting the compiler, not by adding code.&lt;/p&gt;
&lt;h2&gt;What Still Needs to Be Built&lt;/h2&gt;
&lt;p&gt;I want to be explicit about the gap between what exists as working code and what exists only as specifications or concepts, because I think intellectual honesty about the state of a project is more valuable than presenting everything as equally complete.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fully implemented and working:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Payment processor (Go) - simple but complete proof chain&lt;/li&gt;
&lt;li&gt;Multi-tenant API (Go) - full seven-type chain, tests, middleware, admin dashboard&lt;/li&gt;
&lt;li&gt;Email campaigns (Go) - relational cross-type constraints, working handlers and templates&lt;/li&gt;
&lt;li&gt;Research pipeline (Common Lisp/Shen) - the most architecturally complete, runs on SBCL&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Spec complete, implementation needed:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dosage calculator - the spec with recursive Shen functions exists, the Go server is a stub&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Concept only:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Order state machine - still only a concept document, the most ambitious planned demo&lt;/li&gt;
&lt;li&gt;Shen Prolog as active constraint solver for generative UI&lt;/li&gt;
&lt;li&gt;Linear logic / graded modalities for provable concurrency in Go&lt;/li&gt;
&lt;li&gt;Polyglot comparison: same spec, four frameworks (Hono, Axum, FastAPI, net/http)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Build the polyglot comparison demo.
Take the payment spec (the simplest complete spec) and generate guard types in four target languages, five configurations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go (shengen) — unexported fields, (T, error) return&lt;/li&gt;
&lt;li&gt;TypeScript (shengen-ts) — private constructor, static create(), throws&lt;/li&gt;
&lt;li&gt;Rust (shengen-rs) — private fields, Result&amp;lt;Self, GuardError&amp;gt;&lt;/li&gt;
&lt;li&gt;Python standard (shengen-py) — frozen dataclass, &lt;code&gt;__post_init__&lt;/code&gt; raises&lt;/li&gt;
&lt;li&gt;Python hardened (shengen-py --mode hardened) — HMAC provenance chain&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each, write a small program that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Creates a valid Amount (succeeds)&lt;/li&gt;
&lt;li&gt;Attempts to create an invalid Amount (fails, show the error)&lt;/li&gt;
&lt;li&gt;Attempts to bypass the constructor (show the compile/runtime error)&lt;/li&gt;
&lt;li&gt;Creates a full safe-transfer proof chain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Put all five side by side in a single document or demo directory.
This is the &quot;one spec, five guarantees&quot; story made concrete and runnable.
It directly complements the &quot;enforcement spectrum&quot; blog post. --&amp;gt;&lt;/p&gt;
&lt;h2&gt;The Deeper Question&lt;/h2&gt;
&lt;p&gt;I have been circling around a question that I think is more important than any specific demo: why does this work? Why can a single declaration in a proof calculus produce enforcement at three different layers of the software stack?&lt;/p&gt;
&lt;p&gt;I think the answer has to do with the structure of sequent calculus itself. A sequent calculus rule is simultaneously:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;logical judgment&lt;/strong&gt; (the premises entail the conclusion)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;type definition&lt;/strong&gt; (the premises are the fields, the conclusion is the type)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;validation specification&lt;/strong&gt; (the side conditions are the checks)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These are not three different interpretations bolted together by clever engineering. They are three facets of the same mathematical object. The Curry-Howard correspondence tells us that proofs are programs and types are propositions. Shen&apos;s sequent calculus sits at the nexus where this correspondence becomes practically useful: the proof is the type is the validation. Shengen just mechanically separates the facets into their target-language representations.&lt;/p&gt;
&lt;p&gt;This is why I think the approach has legs beyond the specific demos I have built. Any domain where you can express invariants as sequent calculus rules gets all three layers of enforcement automatically. The question is not &quot;can we make this work for domain X&quot; but &quot;can we express domain X&apos;s invariants in this notation?&quot; And Shen&apos;s type system, being Turing-complete and embedding Prolog, turns out to be more expressive than you might expect from a system that looks, at first glance, like simple type declarations.&lt;/p&gt;
&lt;p&gt;&amp;lt;!-- EXAMPLE PROMPT: Write a &quot;how to think in sequent calculus&quot; tutorial.
Target audience: developers who are comfortable with TypeScript/Go type systems
but have never seen formal logic notation.&lt;/p&gt;
&lt;p&gt;Cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The basic anatomy: premises above the line, conclusion below&lt;/li&gt;
&lt;li&gt;How to read (&amp;gt;= X 0) : verified as a side condition&lt;/li&gt;
&lt;li&gt;How composition works: using one type as a premise in another rule&lt;/li&gt;
&lt;li&gt;The six shengen categories and how to recognize which one you&apos;re writing&lt;/li&gt;
&lt;li&gt;Common patterns:
&lt;ul&gt;
&lt;li&gt;Wrapping a primitive to give it a name&lt;/li&gt;
&lt;li&gt;Adding a constraint to a primitive&lt;/li&gt;
&lt;li&gt;Composing multiple proofs into a product&lt;/li&gt;
&lt;li&gt;Adding cross-field guards to a product&lt;/li&gt;
&lt;li&gt;Creating sum types (multiple rules, same conclusion)&lt;/li&gt;
&lt;li&gt;Using Shen functions as verification conditions (advanced)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Three worked examples from simple to complex:
&lt;ul&gt;
&lt;li&gt;A validated email address (constrained wrapper)&lt;/li&gt;
&lt;li&gt;A date range where start &amp;lt; end (guarded composite)&lt;/li&gt;
&lt;li&gt;A safe database query that requires both auth and tenant proofs (proof chain)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This could be a blog post or a section of documentation.
It is the missing onboarding material for the entire project. --&amp;gt;&lt;/p&gt;
</content:encoded></item><item><title>The Abstraction Ladder Has Always Been There</title><link>https://reubenbrooks.dev/blog/the-abstraction-ladder-has-always-been-there/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/the-abstraction-ladder-has-always-been-there/</guid><description>A spec isn&apos;t an alternative to code — it&apos;s code at a different level of abstraction. The decisions you don&apos;t make get made for you.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There&apos;s a debate happening right now about whether writing formal specifications for AI agents is &quot;real programming&quot; or some kind of ceremonial overhead. The framing is wrong. A spec isn&apos;t an alternative to code — it&apos;s code at a different level of abstraction. And the choice of which level to work at has been the central question of software engineering since 1954.&lt;/p&gt;
&lt;h2&gt;What abstraction actually is&lt;/h2&gt;
&lt;p&gt;Every layer of a programming stack is a contract with the same shape: I will state explicitly the things I care about. I will let the layer below decide everything else.&lt;/p&gt;
&lt;p&gt;When you write C instead of assembly, you&apos;ve decided you care about control flow and memory layout but not register allocation. The compiler picks registers. When you write Python instead of C, you&apos;ve decided you care about logic but not memory management. The runtime picks allocations. When you write SQL instead of a loop, you&apos;ve decided you care about the shape of the result but not the join algorithm. The query planner picks.&lt;/p&gt;
&lt;p&gt;At every step, two things are true simultaneously. You are doing less — fewer decisions, less code, less to get wrong. And you are doing more — more precisely naming the thing you actually want, because the layer below now needs a clear contract to fulfill.&lt;/p&gt;
&lt;p&gt;This is the deal abstraction has always offered. It doesn&apos;t remove the work of knowing what you want. It just changes which parts of &quot;what you want&quot; you have to say out loud.&lt;/p&gt;
&lt;h2&gt;Prompts and specs are rungs on the same ladder&lt;/h2&gt;
&lt;p&gt;When someone writes a prompt like &quot;build me a payment processor that handles refunds,&quot; they are programming. They&apos;re working at a very high level of abstraction — higher than Python, higher than SQL — where the contract is expressed in English and the layer below (the LLM) fills in essentially everything: language, architecture, data model, error handling, invariants.&lt;/p&gt;
&lt;p&gt;When the same person writes a Shen type rule saying &lt;code&gt;balance&lt;/code&gt; must be ≥ &lt;code&gt;transaction amount&lt;/code&gt; before the transfer is marked verified, they&apos;re also programming. Same ladder. Different rung. This time the contract is expressed in sequent calculus, and the layer below (the LLM plus the type checker plus the compiler) fills in everything except that invariant.&lt;/p&gt;
&lt;p&gt;Neither is more or less &quot;real&quot; than the other. Both require the same underlying skill: knowing what you want precisely enough to say it. The English prompt and the formal spec are two points on a continuum that also includes Python, C, and assembly. The only real question is which decisions you want to make explicitly and which you&apos;re willing to delegate.&lt;/p&gt;
&lt;h2&gt;The decisions you don&apos;t make get made for you&lt;/h2&gt;
&lt;p&gt;Here&apos;s the part that matters. Every decision you don&apos;t make explicitly, something else makes for you. This is not optional. It is a property of how computation works. Code has to run; the CPU has to do something; some decision gets made at every level whether you named it or not.&lt;/p&gt;
&lt;p&gt;When you don&apos;t specify register allocation, the compiler picks — usually well. When you don&apos;t specify an allocation strategy, the runtime picks — usually well. When you don&apos;t specify the invariant that balances can&apos;t go negative, the LLM picks — and here, &quot;well&quot; is doing a lot of work. The LLM will produce code. That code will make an implicit choice about whether negative balances are possible. If you didn&apos;t state the invariant, you delegated the choice, and you won&apos;t know which way it went until something breaks.&lt;/p&gt;
&lt;p&gt;The trap in modern AI coding is believing that not writing a spec is the same as not making a decision. It isn&apos;t. It&apos;s delegation to a system with opaque judgment. Sometimes that&apos;s fine — for a throwaway script, delegating everything to the model is exactly right. For a payment processor, it&apos;s catastrophic. The question is never whether to delegate. The question is which things are cheap to delegate and which things you need to nail down yourself.&lt;/p&gt;
&lt;h2&gt;Why formal specs are just another rung, not a different kind of thing&lt;/h2&gt;
&lt;p&gt;Formal methods have a reputation for being exotic — a separate discipline practiced by people in academic buildings. The approach this series is about treats them as something much more mundane: another place on the abstraction ladder you were already climbing.&lt;/p&gt;
&lt;p&gt;A Shen type rule is a way of saying &quot;here is an invariant I care about; everything else, figure out.&quot; That is structurally identical to writing a Python function signature that says &quot;here is the shape of my input and output; everything else, figure out.&quot; The Python type system is just weaker — it lets more invariants stay implicit. Sequent calculus is stronger — it lets fewer invariants stay implicit. They&apos;re doing the same job with different amounts of precision.&lt;/p&gt;
&lt;p&gt;Once you see this, the question &quot;should I write formal specs for my AI coding loop?&quot; stops being philosophical. It becomes the same question you&apos;ve been answering your whole career: which level of abstraction matches the cost of getting this wrong? For a weekend project, prompt-level is fine. For code that moves money, you want a rung where invariants are checked, not hoped for.&lt;/p&gt;
&lt;h2&gt;This is not new and that&apos;s the point&lt;/h2&gt;
&lt;p&gt;Every generation of programmers has had a version of this fight. Assembly programmers thought C hid too much. C programmers thought Java hid too much. Java programmers thought Rails hid too much. Rails programmers think prompting hides too much. In every case, the complaint has the same structure: the new layer delegates decisions that the old guard considered essential.&lt;/p&gt;
&lt;p&gt;And in every case, both sides were partly right. The new layer genuinely does hide things that sometimes matter. And the new layer genuinely does let you express intent more directly when those things don&apos;t matter. The resolution has always been the same: learn to move up and down the ladder. Pick the rung that matches the stakes. Don&apos;t mistake &quot;I didn&apos;t write it&quot; for &quot;it isn&apos;t there.&quot;&lt;/p&gt;
&lt;p&gt;AI coding is the newest rung, and formal specs are a tool for dropping back down a rung when you need to. That&apos;s all. It&apos;s not a paradigm shift. It&apos;s the same abstraction ladder you&apos;ve been on the whole time, with one new step at the top and a useful reminder that the lower steps are still available when you need them.&lt;/p&gt;
&lt;p&gt;The rest of this series is about what lives on those lower steps — deductive gates, oracles, defense in depth, and what &quot;knowing what you want&quot; actually looks like when the thing below you can write a thousand lines a minute.&lt;/p&gt;
</content:encoded></item><item><title>One Spec, Five Languages, Five Different Guarantees</title><link>https://reubenbrooks.dev/blog/the-enforcement-spectrum-guard-types-across-go-typescript-and-python/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/the-enforcement-spectrum-guard-types-across-go-typescript-and-python/</guid><description>The strength of a compile-time invariant is bounded by the weakest link in the enforcement chain — and different languages provide very different enforcement substrates.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Here is a question that I think deserves more attention than it has received: if you have a formal invariant, say &quot;amounts must be non-negative,&quot; and you generate enforcement code in five different languages from that single specification, do you actually get the same guarantee in each? The answer, it turns out, is no, and not even close. I suspect most developers have an intuitive sense that languages differ in their enforcement capabilities, but the nature of the divergence is more interesting than a simple ranking of &quot;strong&quot; versus &quot;weak&quot; might suggest. Understanding where and why the guarantees differ is, I think, the key to making this approach practical rather than merely elegant.&lt;/p&gt;
&lt;h2&gt;The Spec&lt;/h2&gt;
&lt;p&gt;Consider the following specification, written in Shen&apos;s type system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shen&quot;&gt;(datatype amount
  X : number;
  (&amp;gt;= X 0) : verified;
  ====================
  X : amount;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the same regardless of target language. Shen does not care about Go or TypeScript. It is pure math: if X is a number and X is greater than or equal to zero, then X is an amount. The specification exists in a space that is, in a meaningful sense, prior to any particular language&apos;s type system. This reminds me of how mathematical proofs exist independently of the notation used to express them, though the notation can constrain what is easy or hard to express. The interesting question, then, is what happens when this abstract specification becomes concrete code in a particular language. The answer reveals something about the hidden assumptions we carry when we think about type safety.&lt;/p&gt;
&lt;h2&gt;Tier 1: The Compiler as Structural Enforcer&lt;/h2&gt;
&lt;p&gt;In languages like Go, Rust, Java, Swift, Kotlin, and C#, the compiler itself becomes the enforcement mechanism. Consider Go:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;type Amount struct{ v float64 }  // lowercase v = unexported

func NewAmount(x float64) (Amount, error) {
    if !(x &amp;gt;= 0) { return Amount{}, fmt.Errorf(&quot;x must be &amp;gt;= 0: %v&quot;, x) }
    return Amount{v: x}, nil
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If code outside the package attempts &lt;code&gt;Amount{v: -5}&lt;/code&gt;, this is a compile error. Not a warning, not a lint suggestion, but a hard refusal to produce a binary. The Go compiler is the enforcer. Now, Rust takes this further in an interesting way:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;pub struct Amount { v: f64 }  // v is private by default

impl Amount {
    pub fn new(x: f64) -&amp;gt; Result&amp;lt;Self, String&amp;gt; {
        if !(x &amp;gt;= 0.0) { return Err(format!(&quot;x must be &amp;gt;= 0: {}&quot;, x)); }
        Ok(Amount { v: x })
    }
    pub fn val(&amp;amp;self) -&amp;gt; f64 { self.v }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not only is &lt;code&gt;v&lt;/code&gt; private, but shengen can omit the &lt;code&gt;Clone&lt;/code&gt; derive on guarded types, preventing bypass via cloning a partially-constructed value. Rust&apos;s ownership system adds an enforcement layer that Go simply does not have. I find this instructive because it illustrates how the same formal specification, when projected onto different type systems, picks up additional guarantees almost as a byproduct of the target language&apos;s existing machinery. The ownership model was not designed for this purpose, but it serves it nonetheless.&lt;/p&gt;
&lt;p&gt;What you get in Tier 1 is that structural bypass is a compile error. The constructor is the only path to the type. The compiler enforces the specification with zero runtime overhead beyond the validation check itself. If the code compiles, the invariant holds at every construction site. There is something deeply satisfying about this, and I think it is related to why formal methods have historically found their strongest foothold in languages with rich compile-time guarantees. The enforcement is not merely conventional; it is structural.&lt;/p&gt;
&lt;h2&gt;Tier 2: Compile-Time Guarantees with an Escape Hatch&lt;/h2&gt;
&lt;p&gt;TypeScript presents a more nuanced case, and I think the nuance is worth dwelling on because it reveals a hidden premise in how we typically think about type safety.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;class Amount {
    private constructor(private readonly v: number) {}
    
    static create(x: number): Amount {
        if (!(x &amp;gt;= 0)) throw new Error(`x must be &amp;gt;= 0: ${x}`);
        return new Amount(x);
    }
    
    val(): number { return this.v; }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Writing &lt;code&gt;new Amount(-5)&lt;/code&gt; produces a TypeScript compile error because the constructor is private. If you are writing TypeScript and running &lt;code&gt;tsc&lt;/code&gt;, you get what appears to be a Tier 1-equivalent guarantee in your source code. But TypeScript transpiles to JavaScript, and after transpilation, &lt;code&gt;private&lt;/code&gt; disappears. In the emitted JavaScript, &lt;code&gt;new Amount(-5)&lt;/code&gt; works perfectly well. A determined developer, or an LLM that has been asked to write JavaScript rather than TypeScript, can bypass the constructor entirely. The runtime validation in &lt;code&gt;create()&lt;/code&gt; still catches &lt;code&gt;Amount.create(-5)&lt;/code&gt; because the &lt;code&gt;throw&lt;/code&gt; survives transpilation, so you get runtime enforcement even in JavaScript, but the compile-time wall has a door in it.&lt;/p&gt;
&lt;p&gt;I find this situation analogous to a concept from institutional economics: the difference between a rule that is enforced by physical constraint versus one enforced by social convention backed by occasional auditing. The TypeScript compiler provides genuine enforcement within its domain, but the transpilation boundary is a trust boundary. Code on the other side of that boundary operates under different rules entirely. Shengen for TypeScript also supports ECMAScript &lt;code&gt;#private&lt;/code&gt; fields, which do survive transpilation, though browser support and tooling are still catching up. It seems to me that the TypeScript case is particularly revealing because it forces us to ask: what exactly do we mean when we say a type system &quot;enforces&quot; something? The enforcement is real, but it is bounded by the compilation context in a way that Tier 1 languages&apos; enforcement is not.&lt;/p&gt;
&lt;h2&gt;Tier 3: Convention and Runtime&lt;/h2&gt;
&lt;p&gt;Python, Ruby, Lua, and plain JavaScript occupy a different position on this spectrum, and I think it is important to be precise about what is and is not lost.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class Amount:
    def __init__(self, x: float):
        if not (x &amp;gt;= 0):
            raise ValueError(f&quot;x must be &amp;gt;= 0: {x}&quot;)
        self._v = x
    
    @property
    def val(self) -&amp;gt; float:
        return self._v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Python has no compile-time privacy. Writing &lt;code&gt;amount._v = -5&lt;/code&gt; works. The underscore is a naming convention, not a language guarantee. Mypy catches type mismatches but not privacy violations. There is no compilation step that catches structural bypass. &lt;code&gt;Amount(-5)&lt;/code&gt; raises a runtime ValueError, so the validation does run, but nothing prevents post-construction mutation or direct field access. For stronger enforcement in Tier 3 languages, shengen can use closures:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def new_amount(x: float):
    if not (x &amp;gt;= 0):
        raise ValueError(f&quot;x must be &amp;gt;= 0: {x}&quot;)
    def val():
        return x
    return val
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The captured &lt;code&gt;x&lt;/code&gt; has no externally accessible address. You cannot mutate it. You cannot inspect it except through &lt;code&gt;val()&lt;/code&gt;. This is genuinely stronger than the &lt;code&gt;_v&lt;/code&gt; convention, but ergonomically worse since every access becomes a function call rather than a property access. I suspect that in practice, the choice between class-based convention and closure-based enforcement in Python depends heavily on the specific context: how critical is the invariant, and how much ergonomic cost is the team willing to bear? This is not a purely technical question; it is a question about organizational trust, which depends on team norms and the likelihood of accidental violation.&lt;/p&gt;
&lt;h2&gt;Three Layers of Protection&lt;/h2&gt;
&lt;p&gt;Every language, regardless of tier, receives three layers of protection from this approach. The first layer is runtime validation: the constructor checks the invariant, and this works identically everywhere. The second layer is compiler enforcement, which is where languages diverge dramatically. Go and Rust produce hard compile errors for structural bypass. TypeScript produces compile errors within its own context but loses them after transpilation. Python relies on naming convention alone. The third layer is deductive verification: Shen&apos;s own type checker (&lt;code&gt;shen tc+&lt;/code&gt;) proves the consistency of the specification as a subprocess, and this too is universal because it does not depend on the target language&apos;s type system at all.&lt;/p&gt;
&lt;p&gt;Layers one and three are universal precisely because they do not depend on the target language&apos;s type system. Layer one runs in the constructor. Layer three runs in Shen. It is layer two where the interesting divergence occurs, and I think it is the most important layer for AI-assisted coding loops specifically because it is the layer that turns invariant violations into compiler errors, which are the kind of feedback that large language models are demonstrably best at acting on. There is a hidden premise here that I want to make explicit: the value of compile-time enforcement is not merely about catching errors early in some abstract sense. It is about the structure of the feedback signal. A compile error is unambiguous, localized, and actionable. A runtime error may be any of those things, or it may be none of them, depending on test coverage and the distance between the violation and its observable consequences.&lt;/p&gt;
&lt;h2&gt;Implications for AI-Assisted Development&lt;/h2&gt;
&lt;p&gt;In a Go coding loop, when an LLM writes &lt;code&gt;Amount{v: -5}&lt;/code&gt;, it receives a clear, specific error message: &quot;cannot refer to unexported field v in struct literal of type shenguard.Amount.&quot; The LLM knows exactly what to do with this. It uses &lt;code&gt;NewAmount(-5)&lt;/code&gt; instead, which returns an error, which it handles. The type system guided it to the correct path through what amounts to a structured conversation between the model and the compiler.&lt;/p&gt;
&lt;p&gt;In a Python coding loop, the same conceptual mistake does not produce a compile error. The LLM writes &lt;code&gt;Amount(-5)&lt;/code&gt;, gets a runtime ValueError, and fixes it. But it could also write &lt;code&gt;amount._v = -5&lt;/code&gt; and no tool would catch this until a test happens to exercise that particular path. This does not mean Python is useless for specification-driven development. It means the backpressure is softer: runtime errors instead of compile errors, convention instead of enforcement. The audit gate becomes correspondingly more important in Tier 3 languages. If someone edits the guard file to skip validation, the audit catches it even when the compiler cannot.&lt;/p&gt;
&lt;h2&gt;Choosing a Language for Formal Invariants&lt;/h2&gt;
&lt;p&gt;If you are choosing a language for a new project that will use AI coding with formal invariants, the enforcement spectrum matters in concrete ways. For safety-critical domains like healthcare, finance, or infrastructure, Tier 1 languages like Go or Rust make the compiler itself the enforcer, and this is probably where you want to be. For web applications with a TypeScript team, the &lt;code&gt;tsc&lt;/code&gt; guarantees are real within your codebase, and the JavaScript escape hatch matters less if you control the build pipeline. For an existing Python codebase, Tier 3 still provides genuine value: the specification documents invariants, the runtime validates them, and &lt;code&gt;shen tc+&lt;/code&gt; proves consistency. You get perhaps sixty percent of the benefit with zero language migration cost, which is a tradeoff that may well be worth making depending on the circumstances.&lt;/p&gt;
&lt;p&gt;The point, as I see it, is not that every language is equal. They manifestly are not. The point is that every language benefits from formal specification, and the specification itself is portable across all of them. Write the specification once, generate guard types for each language in your stack, and get the strongest guarantee each language can offer. This is a pragmatic position, not an idealistic one, and I think it is more honest about the messy reality of polyglot codebases than an approach that insists on a single &quot;correct&quot; language for formal methods.&lt;/p&gt;
&lt;h2&gt;The Self-Replicating Specification Chain&lt;/h2&gt;
&lt;p&gt;One final point that I think deserves attention: what about languages that shengen does not yet support? The &lt;code&gt;create-shengen&lt;/code&gt; meta-specification is an 875-line document that teaches an LLM to build a shengen for any target language. It parameterizes the enforcement mechanism (unexported fields, private constructors, closures), error handling patterns (error returns, exceptions, Result types), naming conventions, value accessor syntax, sum type implementation, and set membership syntax. A single LLM invocation with this specification produces a working shengen for a new language. The formal verification supply chain is, in a meaningful sense, self-replicating. I am not entirely sure what to make of this. It is both powerful and somewhat unsettling, in the way that any self-referential system tends to be. The specification for generating specification-enforcers is itself a specification, and the question of who verifies the verifier is one that I suspect will become increasingly important as these systems proliferate.&lt;/p&gt;
</content:encoded></item><item><title>The Spec Is Not a Gate, It Is a Substrate</title><link>https://reubenbrooks.dev/blog/the-spec-is-not-a-gate-it-is-a-substrate/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/the-spec-is-not-a-gate-it-is-a-substrate/</guid><description>Four places one .shen file can enforce itself — and why that changes what a specification is for.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This series is about Shen-Backpressure — a project that uses a small, formally-defined specification language (Shen) to constrain and guide an AI coding loop. The natural reading of a tool like this is that it adds a new kind of gate to your CI. You write a spec. The spec gets checked. If the check passes, the code advances. If it fails, the code stops. Gate in, gate out. Specifications as checkpoints.&lt;/p&gt;
&lt;p&gt;That reading is correct, and it is small. The more useful way to see what&apos;s happening here is to notice that the same &lt;code&gt;specs/core.shen&lt;/code&gt; file — the same handful of sequent-calculus rules about balances and transactions and tenants — is being enforced in four different places, by four different mechanisms, and all four are projections of one source. The spec isn&apos;t a gate. It&apos;s a substrate. Gates are just one of the surfaces it can render onto.&lt;/p&gt;
&lt;p&gt;Once you hold that frame, the engineering problem reshapes itself. You stop asking &quot;where in the pipeline should the check go?&quot; and start asking &quot;which surfaces does this invariant need to reach, and does my spec language let me project there?&quot;&lt;/p&gt;
&lt;h2&gt;The four surfaces, made concrete&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Surface one: the target language&apos;s type system, at build time.&lt;/strong&gt; The &lt;code&gt;shengen&lt;/code&gt; tool parses the spec and emits Go (or Rust, or TypeScript, or Python, or Java) where every domain type has private fields and a validating constructor. &lt;code&gt;Amount{v: -5}&lt;/code&gt; does not compile. &lt;code&gt;NewAmount(-5)&lt;/code&gt; returns an error. The Go compiler — not Shen, not the developer, not the LLM — enforces that you cannot hold an &lt;code&gt;Amount&lt;/code&gt; whose value is negative. The spec projected into the type system of the target language; the language&apos;s own compiler became your proof checker.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surface two: the spec&apos;s internal logic, at build time.&lt;/strong&gt; &lt;code&gt;shen tc+&lt;/code&gt; runs the sequent calculus over the spec itself. This does not check your code. It checks your rules. If you wrote two premises that can never be simultaneously satisfied, or a conclusion the premises don&apos;t support, &lt;code&gt;tc+&lt;/code&gt; fails. This is the layer nobody in the empirical-validation school has an answer for, because no amount of running programs tells you whether your specification is internally consistent. You need a deductive check, and &lt;code&gt;tc+&lt;/code&gt; is one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surface three: the spec as a runtime oracle, at test time.&lt;/strong&gt; This is where &lt;code&gt;shen-derive&lt;/code&gt; lives, and it&apos;s the move I find most interesting. The same &lt;code&gt;.shen&lt;/code&gt; file that drives &lt;code&gt;shengen&lt;/code&gt; also contains &lt;code&gt;(define ...)&lt;/code&gt; blocks — the obvious-correct, slow, functional definition of what your code should compute. &lt;code&gt;shen-derive&lt;/code&gt; embeds an s-expression evaluator, runs the spec on sampled inputs, and emits a table-driven test that compares the hand-written implementation against the spec pointwise. The spec literally executes. Not as the thing that ships, but as the oracle that judges the thing that ships. If the efficient Go implementation diverges from the spec on any sampled input, the test fails.&lt;/p&gt;
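&lt;p&gt;The shape of such an oracle test can be sketched as a plain program (the rule and all names here are hypothetical; &lt;code&gt;shen-derive&lt;/code&gt;&apos;s real output is a Go test file):&lt;/p&gt;

```go
package main

import "fmt"

// specApplyTx transliterates the spec's (define ...) block: the slow,
// obviously-correct reference. Hypothetical rule: a transaction
// applies only when the balance covers the amount.
func specApplyTx(balance, amount int64) (int64, bool) {
	if balance >= amount {
		return balance - amount, true
	}
	return balance, false
}

// applyTx stands in for the efficient hand-written (or LLM-written)
// implementation under judgment.
func applyTx(balance, amount int64) (int64, bool) {
	next := balance - amount
	if next >= 0 {
		return next, true
	}
	return balance, false
}

func main() {
	// Sampled inputs; a real harness derives these from the spec's types.
	cases := []struct{ balance, amount int64 }{
		{100, 30}, {30, 100}, {0, 0}, {5, 5},
	}
	diverged := false
	for _, c := range cases {
		gotB, gotOK := applyTx(c.balance, c.amount)
		wantB, wantOK := specApplyTx(c.balance, c.amount)
		if gotB != wantB || gotOK != wantOK {
			diverged = true
			fmt.Println("divergence at", c)
		}
	}
	fmt.Println("diverged:", diverged)
}
```

&lt;p&gt;If the implementation cut a corner, say by treating an exactly-equal balance as insufficient, the pointwise comparison would catch it on the &lt;code&gt;{5, 5}&lt;/code&gt; sample.&lt;/p&gt;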
&lt;p&gt;&lt;strong&gt;Surface four: boundary enforcement at I/O.&lt;/strong&gt; &lt;code&gt;shengen&lt;/code&gt; can emit scoped DB wrappers. A &lt;code&gt;TenantAccess&lt;/code&gt; proof produces a &lt;code&gt;TenantAccessDB&lt;/code&gt; struct where the tenant ID is captured and cannot be changed. Every query through that wrapper is automatically scoped to the verified tenant. The proof that was erected at the HTTP boundary travels all the way down to the SQL query without anyone having to remember to pass the right argument. The spec reached past compilation, past testing, into the runtime call to the database.&lt;/p&gt;
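&lt;p&gt;A sketch of the scoped-wrapper shape (type and method names are assumptions, and a plain interface stands in for &lt;code&gt;database/sql&lt;/code&gt; to keep it self-contained):&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// Querier stands in for the real database handle.
type Querier interface {
	Query(sql string, args ...any) string
}

// TenantAccess is the proof value minted at the HTTP boundary. Its
// unexported field means other packages cannot forge one.
type TenantAccess struct{ tenantID string }

// VerifyTenant stands in for whatever authentication actually runs
// at the boundary.
func VerifyTenant(id string) (TenantAccess, error) {
	if id == "" {
		return TenantAccess{}, errors.New("unauthenticated")
	}
	return TenantAccess{tenantID: id}, nil
}

// TenantAccessDB captures the verified tenant ID at construction;
// nothing can change it afterwards, so every query goes out scoped.
type TenantAccessDB struct {
	db       Querier
	tenantID string
}

func NewTenantAccessDB(db Querier, proof TenantAccess) TenantAccessDB {
	return TenantAccessDB{db: db, tenantID: proof.tenantID}
}

// QueryScoped appends the tenant predicate itself; the caller cannot
// forget it and cannot pass a different tenant.
func (t TenantAccessDB) QueryScoped(sql string, args ...any) string {
	return t.db.Query(sql+" AND tenant_id = ?", append(args, t.tenantID)...)
}

// echoDB just shows what would reach the database.
type echoDB struct{}

func (echoDB) Query(sql string, args ...any) string {
	return fmt.Sprintf("%s %v", sql, args)
}

func main() {
	proof, err := VerifyTenant("acme")
	if err != nil {
		return
	}
	db := NewTenantAccessDB(echoDB{}, proof)
	fmt.Println(db.QueryScoped("SELECT balance FROM accounts WHERE id = ?", 7))
}
```

&lt;p&gt;The proof value travels by type, not by convention: any code path that reaches the database through &lt;code&gt;TenantAccessDB&lt;/code&gt; is scoped whether or not its author remembered the rule.&lt;/p&gt;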
&lt;p&gt;Four surfaces. One file. Edit the spec, every surface updates on the next run. This is the thing the tool explainers don&apos;t say out loud: the architecture has collapsed what used to be four separate artifacts — type definitions, consistency checks, oracle tests, and authorization middleware — into a single source that projects into all of them.&lt;/p&gt;
&lt;h2&gt;Why this matters more than it sounds&lt;/h2&gt;
&lt;p&gt;In most production codebases, a business rule like &quot;balance must cover transaction&quot; lives in at least four places. There is a Confluence page. There is a validation function somewhere near the HTTP handler. There is a guard in the SQL layer. There are a handful of unit tests. Each of these is an independent restatement of the same rule, maintained by different people at different times with different incentives. They drift. Not because anyone is careless — because keeping four representations of the same idea synchronized is a task no process ever fully solves.&lt;/p&gt;
&lt;p&gt;The substrate architecture collapses this by construction. The rule exists in one place. All enforcement surfaces are generated from that place. Drift is impossible not because you have good discipline but because there is no second copy to drift from. This is the same bet that schema-first tooling (Protobuf, OpenAPI, database migrations from models) makes at the data-shape layer, extended from shapes to logic.&lt;/p&gt;
&lt;p&gt;The reason this newly matters in the AI-coding context is specific and important. LLMs are good at writing implementations from specs. They are bad at keeping specs and implementations synchronized across files by hand. If you generate everything downstream from one source, you don&apos;t need the model to remember the rule — you need it to produce code that the generated enforcement surfaces accept. The model can forget the invariant entirely and still not violate it, because the compiler, the test, the database wrapper, and the type checker have all been handed a projection of it.&lt;/p&gt;
&lt;p&gt;This is the deeper meaning of backpressure. Backpressure in the Ralph sense is the signal that travels back upstream when a gate rejects work. Backpressure in the substrate sense is something larger: every surface the spec reaches is a place where bad work cannot accumulate, because the surface itself is a projection of the rule. The rule doesn&apos;t need to be checked against the surface — the surface &lt;em&gt;is&lt;/em&gt; the rule, embodied in the local language of that enforcement context.&lt;/p&gt;
&lt;h2&gt;The substrate property and what it requires&lt;/h2&gt;
&lt;p&gt;For a specification language to work as a substrate, it has to have three properties. Not all spec languages have them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It has to be logic you can hand to a program, not logic you can only hand to a human.&lt;/strong&gt; A Confluence page is not a substrate. A Word document of requirements is not a substrate. An RFC is not a substrate. These are all artifacts where the rule is encoded in a form only humans can project from, which is why the four-places-drift problem is universal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It has to be cheap to parse and manipulate.&lt;/strong&gt; This is where Shen&apos;s Lisp syntax earns its keep. The spec is already an AST. Generating Go validators from it is a tree walk, not a parser project. Embedding its runtime is &quot;load the file and call eval.&quot; The substrate property is theoretically available to any formally-defined spec language, but in practice the ones with clean s-expression or similar representations are the ones where the projection tooling is cheap enough to actually build.&lt;/p&gt;
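&lt;p&gt;To make &quot;the spec is already an AST&quot; concrete, here is roughly what the entire parsing story amounts to for an s-expression language (the rule syntax is invented for illustration, not taken from Shen):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize pads parentheses with spaces and splits: the whole lexer.
func tokenize(src string) []string {
	src = strings.ReplaceAll(src, "(", " ( ")
	src = strings.ReplaceAll(src, ")", " ) ")
	return strings.Fields(src)
}

// parse builds nested lists from the tokens: the whole parser.
// Projection tooling (generators, evaluators) is then a walk over []any.
func parse(tokens []string, pos int) (any, int) {
	tok := tokens[pos]
	pos++
	if tok == "(" {
		list := []any{}
		for tokens[pos] != ")" {
			var node any
			node, pos = parse(tokens, pos)
			list = append(list, node)
		}
		return list, pos + 1 // consume the closing ")"
	}
	return tok, pos
}

func main() {
	spec := "(rule covered (premise (gte balance amount)) (conclusion (ok tx)))"
	ast, _ := parse(tokenize(spec), 0)
	fmt.Println(ast)
}
```

&lt;p&gt;Compare that to extracting a comparable tree from a custom-grammar spec format; the difference is most of why the projection tooling stays cheap enough to build.&lt;/p&gt;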
&lt;p&gt;&lt;strong&gt;It has to be expressive enough to span the surfaces you care about.&lt;/strong&gt; A schema language like Protobuf handles data shapes beautifully but can&apos;t express &quot;balance must cover transaction&quot; as a relation across fields. A type system like Rust&apos;s handles static structural constraints but can&apos;t run at test time as an oracle. Shen&apos;s sequent calculus plus its executable &lt;code&gt;(define ...)&lt;/code&gt; blocks sits in a sweet spot: it can express predicates, relations, proofs, and computations, and all of them are parsable by the same tools.&lt;/p&gt;
&lt;p&gt;The interesting question for anyone building in this space is not &quot;which spec language is most rigorous?&quot; but &quot;which spec language projects cleanly into the most enforcement surfaces I need?&quot; Rigor that can only enforce at the build-time boundary is a gate. Rigor that enforces at four boundaries is a substrate. The value of a specification scales with the number of surfaces it can reach, and the ceiling of that scaling is set by the language&apos;s portability, not its theoretical strength.&lt;/p&gt;
&lt;h2&gt;The shape this implies&lt;/h2&gt;
&lt;p&gt;If you take the substrate view seriously, a few consequences follow.&lt;/p&gt;
&lt;p&gt;The spec file becomes the single most important artifact in the repository. Everything else is derived, regenerable, disposable. Editing generated code is a category error — you&apos;ve stopped editing the substrate and started editing one of its projections, which will be erased on the next run. The discipline the repo describes (&quot;never edit &lt;code&gt;guards_gen.go&lt;/code&gt;&quot;) is not arbitrary; it&apos;s the logical consequence of treating the spec as source.&lt;/p&gt;
&lt;p&gt;The number of enforcement surfaces should grow over time. If the spec is a substrate and you&apos;ve only projected it into two places, you&apos;ve left value on the table. The repo&apos;s trajectory — from &lt;code&gt;shengen&lt;/code&gt;&apos;s type generation to scoped DB wrappers to &lt;code&gt;shen-derive&lt;/code&gt;&apos;s runtime oracle — is the natural shape of a project whose authors have started asking &quot;where else can this rule reach?&quot; That question has no fixed answer. Every new surface is a new place where drift becomes impossible.&lt;/p&gt;
&lt;p&gt;Specification skills become more valuable, and implementation skills become less so. If one spec file can drive four projections, the leverage on the spec is four times the leverage on any individual projection. The person who can write a precise, minimal, well-factored spec is doing something structurally different from the person who can write good Go. This is not a new idea — it&apos;s been true since the first DSL compiler — but the AI-coding context sharpens it. The LLM can write the Go. What it cannot do for you is decide which invariants the Go must respect. That decision lives in the spec, and the spec is now the thing every enforcement surface is answering to.&lt;/p&gt;
&lt;p&gt;The right mental model for a tool like Shen-Backpressure is not &quot;a type checker with a codegen step.&quot; It&apos;s &quot;a system for projecting a single logical artifact into every context where it can do useful work.&quot; The gates are one projection. The runtime oracle is another. The DB wrappers are a third. The consistency check is a fourth. There will be more. That&apos;s the point.&lt;/p&gt;
</content:encoded></item><item><title>The Spec Is the Oracle, Not the Generator</title><link>https://reubenbrooks.dev/blog/the-spec-is-the-oracle-not-the-generator/</link><guid isPermaLink="true">https://reubenbrooks.dev/blog/the-spec-is-the-oracle-not-the-generator/</guid><description>An inversion buried in the shen-derive pivot that generalizes to almost every verified-code project.</description><pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Shen-Backpressure — the project this series is about — contains a tool called &lt;code&gt;shen-derive&lt;/code&gt;. There&apos;s a note in its design doc that I think contains one of the most important small ideas in the repository. It describes why the tool was rewritten. The v1 version was a code generator — a Bird-Meertens rewrite engine that parsed a typed lambda calculus, applied algebraic laws like foldr-fusion and map-fusion, and lowered the result into idiomatic Go loops. It worked. It hit a wall. The v2 version threw the code generator away and kept the evaluator. The spec, which had been the input to the code generator, became the oracle that judges a hand-written implementation.&lt;/p&gt;
&lt;p&gt;The note frames this as an engineering pivot. I think it&apos;s a deeper point that quietly generalizes to almost every formal-methods project in history, and I want to unpack why.&lt;/p&gt;
&lt;h2&gt;Two ways to use a precise specification&lt;/h2&gt;
&lt;p&gt;Any precise, executable specification admits two different uses. You can generate implementations from it, or you can judge implementations against it. These look similar. They are not.&lt;/p&gt;
&lt;p&gt;When you &lt;strong&gt;generate&lt;/strong&gt;, you start with the spec and mechanically derive a target-language implementation. The rewrite engine walks the spec, applies transformations, produces code. The code is correct by construction, because every step in the derivation is sound. This is the Bird-Meertens dream; it&apos;s also the Coq-extraction dream, the Lean-to-binary dream, the &quot;compile my proof to executable code&quot; dream.&lt;/p&gt;
&lt;p&gt;When you &lt;strong&gt;judge&lt;/strong&gt;, you start with the implementation — written however you like, by whoever you like — and run the spec against it. You pick sample inputs. You evaluate the spec. You evaluate the implementation. If they disagree anywhere, the implementation fails. The spec never becomes the code; it becomes the thing the code has to answer to.&lt;/p&gt;
&lt;p&gt;Both approaches use the same spec. Both end up with verified code. The difference is which direction the verification arrow points.&lt;/p&gt;
&lt;h2&gt;Why generation hits a ceiling&lt;/h2&gt;
&lt;p&gt;Generation from a spec sounds like the right move. It has proof-by-construction; it has no gap between the verified object and the shipping object; it gives you the fantasy of a language where &quot;correct&quot; and &quot;efficient&quot; are the same file. The people who try it are not fools. It&apos;s the reasonable thing to try.&lt;/p&gt;
&lt;p&gt;It hits a ceiling for a reason that has nothing to do with any particular project. The ceiling is that the set of programs you can generate from a spec is bounded by the rewrite laws and lowering patterns you have implemented. Adding a new shape of computation means adding a new law. And a new lowering pattern. And proving the law sound. And testing the lowering. Every domain-specific wrinkle — guard types, constrained values, field accessors, performance idioms — fights the type system of the rewrite engine. The catalog grows, the engine gets more complex, and the set of specs it accepts gets more fragile rather than more expressive. You end up with a very clever thing that handles exactly the cases its authors thought of.&lt;/p&gt;
&lt;p&gt;This is not a bug in any one rewrite engine. It is the shape of the problem. Generation is constructive — you are building the output from parts — and the parts have to cover the whole space of outputs you want. If your spec can express a shape of computation your engine cannot lower, you are stuck. Either you extend the engine (at cost) or you constrain the spec (at cost). Neither path scales.&lt;/p&gt;
&lt;h2&gt;Why judgment does not&lt;/h2&gt;
&lt;p&gt;The judgment approach — spec as oracle, implementation written freely — has the opposite shape. The set of programs you can verify is not bounded by the transformations you can perform. It is bounded only by what your spec can express. If Shen can say it, &lt;code&gt;shen-derive&lt;/code&gt; can judge something against it. The implementation can use any idiom, any data structure, any optimization the author (or LLM) wants, as long as the outputs match the spec&apos;s outputs on the sampled inputs.&lt;/p&gt;
&lt;p&gt;The ceiling has moved. Instead of being the expressiveness of your lowering patterns, it is the expressiveness of your spec language. For any reasonable spec language, that ceiling is much higher. You trade constructive correctness (the output came from the spec, so it&apos;s right by derivation) for pointwise correctness (the output matches the spec on the inputs we checked). You lose a theoretical guarantee and gain an enormous amount of practical reach.&lt;/p&gt;
&lt;p&gt;This is the same trade that made property-based testing eat formal methods in industry. QuickCheck didn&apos;t win because generators of correct-by-construction programs were inadequate; it won because you could hand it a predicate about any program you had already written and get immediate feedback. It met code where code lived. The predicate-as-oracle approach is small, late-binding, and scales to whatever program you can describe.&lt;/p&gt;
&lt;p&gt;The deeper point: precise specifications are more useful as judges than as constructors. The constructive path always has a ceiling that is the sum of the transformations you&apos;ve implemented. The judgmental path has a ceiling that is only the expressiveness of the spec. In almost every engineering context, expressiveness is cheaper to add than transformations are.&lt;/p&gt;
&lt;h2&gt;The AI-coding twist&lt;/h2&gt;
&lt;p&gt;This inversion becomes sharper in the AI-coding setting, because the AI changes which end of the pipeline is cheap.&lt;/p&gt;
&lt;p&gt;In the pre-AI world, the argument for generation was partly an argument about human effort. Humans are slow to write implementations. Humans make transcription errors. A code generator — even one with a low ceiling — saves labor in the region where it works. The fact that it doesn&apos;t scale to the full expressiveness of the spec language is a real cost but a bounded one; the generator does the 80% and humans handle the long tail.&lt;/p&gt;
&lt;p&gt;In the AI-coding world, implementations are cheap. The LLM can write Go fast, and it can write variations of Go fast, and it can rewrite Go after feedback. What is expensive is not producing code — it is producing code that respects an invariant. The bottleneck moves from writing to judging.&lt;/p&gt;
&lt;p&gt;This is precisely where the oracle model wins and the generator model loses. The oracle model treats the implementation as disposable — write it, judge it, rewrite it, judge again — and concentrates the precise work on the spec. The generator model treats the spec as disposable — write it, compile it, ship the output — and concentrates the precise work on the lowering engine. In an environment where implementations are nearly free and invariants are not, you want the precise work concentrated on the invariants.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;shen-derive&lt;/code&gt; pivot is really the repo noticing this and adjusting. The spec still exists, still matters, and still has to be right. But the role of the spec has moved. It is no longer the thing the implementation is derived from. It is the thing the implementation is judged against. The implementation is produced freely — by the LLM, by the human, by whatever process can put bytes on disk — and the spec&apos;s job is to say yes or no.&lt;/p&gt;
&lt;h2&gt;The generalization&lt;/h2&gt;
&lt;p&gt;I think this is the right way to see almost every formal-verification project.&lt;/p&gt;
&lt;p&gt;If your spec language can both generate and judge, judge. The set of programs you can verify is larger, the toolchain is simpler, the spec can stay the same, and the thing being verified is the thing that ships. The hypothetical benefit of proof-by-construction is usually not worth the ceiling it imposes on what you can express.&lt;/p&gt;
&lt;p&gt;If your spec language can only generate — if it&apos;s a code generator posing as a verification tool — the ceiling is real and will be hit. You can extend the catalog forever and always discover the next case you haven&apos;t covered. The path out is not more laws; it&apos;s a richer spec and a separate, simpler judgment harness.&lt;/p&gt;
&lt;p&gt;If your spec language can only judge — if it&apos;s a predicate language without a code-generation story — you have most of what you want. The thing you are missing (generating implementations) was mostly a luxury anyway, and the LLM can fill it in cheaper than the generator could.&lt;/p&gt;
&lt;p&gt;The spec is the oracle. Anything else you do with it is optional. The repo&apos;s pivot from rewrite engine to verification gate is not a retreat; it&apos;s an alignment with how precise specifications actually deliver value in programs people ship.&lt;/p&gt;
</content:encoded></item></channel></rss>