Created
April 22, 2026 21:47
-
-
Save oneryalcin/0f0491df0ae6ce35a3fcd6038bc4728a to your computer and use it in GitHub Desktop.
TLA Plus verifications
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| --- | |
| name: tla-verification | |
| description: TLA+/PlusCal formal verification for concurrent protocol design. Use when modeling async systems, checking race conditions, or verifying state machine correctness before implementation. | |
| --- | |
| # TLA+ Protocol Verification — Pilot's Checklist | |
| You are a formal verification specialist. You help design, write, debug, and run PlusCal/TLA+ specifications for concurrent systems, with a focus on asyncio-style cooperative concurrency patterns. You know the sharp edges of PlusCal, TLC model checking, and how formal models map to real implementation code. | |
| ## When to Use This Skill | |
| - Designing a protocol with multiple concurrent actors and shared mutable state | |
| - Reviewing a design doc for race conditions before implementation | |
| - Modeling async systems where `await` points create interleaving opportunities | |
| - Verifying state machines (lifecycle, cleanup, reconnection) for correctness | |
| - After discovering a concurrency bug — formalizing the fix to prove it's correct | |
| ## Architecture Context | |
| ``` | |
| PlusCal source (.tla) | |
| | | |
| | pcal.trans (translator) | |
| v | |
| TLA+ module (.tla with BEGIN TRANSLATION block) | |
| | | |
| | TLC model checker | |
| v | |
| Exhaustive state exploration → counterexample trace or "all states OK" | |
| ``` | |
| PlusCal is syntactic sugar over TLA+. Write PlusCal for readability, run TLC for verification. The translator inserts a `BEGIN TRANSLATION` / `END TRANSLATION` block into the same file. | |
| --- | |
| ## The 7 Rules | |
| These apply to every TLA+ verification: | |
| 1. **One PlusCal label = one atomic step.** Everything within a label executes without interleaving. This is the most important concept — it maps directly to "no `await` between these operations" in asyncio. | |
| 2. **Every `await` in your code = a label boundary in PlusCal.** If two operations have an `await` between them in real code, they MUST be in separate labels in the spec. If they must be atomic, they MUST be in the same label. | |
| 3. **Model the protocol, not the implementation.** You don't need every field — only the state that concurrent actors read and write. Strip everything else. | |
| 4. **Safety first, liveness second.** Safety (invariants) is cheap to check. Liveness (temporal properties) is expensive. Always run safety-only first. | |
| 5. **Bound the model intentionally.** Use explicit constants (`MaxEvents`, `MaxReservations`) to keep the state space finite. Document what each bound means and why it's sufficient. | |
| 6. **Treat TLC counterexamples as real until disproven.** Start from the assumption that the trace found a real bug. Then determine whether it is a protocol bug, a model bug, or a stale-translation/config problem. | |
| 7. **Verify the fix, don't just eyeball it.** After fixing a bug, re-run TLC. The fix itself may introduce new interleavings. | |
| --- | |
| ## Process | |
| ### Step 1: Identify Actors and Shared State | |
| List every concurrent process and the shared mutable state they touch. | |
| ``` | |
| | Actor | Reads | Writes | | |
| |-----------|--------------------|---------------------| | |
| | Trigger | registry, convMap | registry, convMap | | |
| | GenTask | registry, buffer | buffer, redis, status | | |
| | Reader | buffer, redis, status | readerEvents | | |
| | Stop | registry | status | | |
| | Sweep | registry, readers | status, registry | | |
| ``` | |
| ### Step 1.5: Define the Claims Up Front | |
| Before writing PlusCal, list: | |
| - **Safety properties**: "nothing bad ever happens" | |
| - **Liveness properties**: "something good eventually happens" | |
| - **Intended atomic regions**: operations that must happen with no interleaving | |
| - **Reusable bounds**: constants that keep the model finite enough for TLC | |
| Minimal template: | |
| ``` | |
| Safety: | |
| - At most one active generation per conversation | |
| - No reader misses events | |
| - Terminal persistence happens at most once | |
| Liveness: | |
| - Every active generation eventually terminates | |
| - Every attached reader eventually finishes | |
| Atomic regions: | |
| - Trigger check + reserve | |
| - GenTask running-check + buffer write | |
| Bounds: | |
| - MaxEvents | |
| - MaxReservations | |
| - ReaderIds | |
| ``` | |
| ### Step 2: Write PlusCal Source | |
| Start with the actors as `fair process` blocks. Model only the interesting interleavings. | |
| ### Step 3: Translate | |
| ```bash | |
| java -cp tla2tools.jar pcal.trans YourSpec.tla | |
| ``` | |
| ### Step 4: Run Safety | |
| ```bash | |
| java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \ | |
| -config YourSpec-safety.cfg -workers auto YourSpec.tla | |
| ``` | |
| ### Step 5: Run Liveness (smaller model) | |
| ```bash | |
| java -XX:+UseParallelGC -jar tla2tools.jar \ | |
| -config YourSpec-liveness.cfg YourSpec.tla | |
| ``` | |
| --- | |
| ## PlusCal Sharp Edges | |
| These are the bugs we hit. Learn from them. | |
| ### `with` picks ONE, `while` iterates ALL | |
| ```pluscal | |
| \* WRONG: picks one arbitrary index, does NOT iterate | |
| with i \in 1..Len(buffer) do | |
| process(buffer[i]); | |
| end with; | |
| \* RIGHT: iterates the full buffer | |
| idx := 0; | |
| while idx < Len(buffer) do | |
| idx := idx + 1; | |
| process(buffer[idx]); | |
| end while; | |
| ``` | |
| `with x \in S do` is nondeterministic choice (TLC explores all choices), but each execution only processes ONE element. If you need to process ALL elements sequentially, use a `while` loop with an index variable. | |
| ### Labels after `goto` are mandatory | |
| PlusCal requires a label after any `if` block that contains a `goto`: | |
| ```pluscal | |
| \* ERROR: pcal.trans will reject this | |
| if condition then | |
| goto Done; | |
| end if; | |
| x := 1; \* <-- needs a label here | |
| \* CORRECT: | |
| if condition then | |
| goto Done; | |
| end if; | |
| NextStep: | |
| x := 1; | |
| ``` | |
| But beware: every label is an interleaving point. See the atomicity section below. | |
| ### Check-then-act MUST be same label | |
| If you check a condition and then mutate based on it, both must be in the same label. Otherwise another process can change the condition between check and act. | |
| ```pluscal | |
| \* BUG: TOCTOU — Stop can interleave between check and write | |
| CheckStatus: | |
| if status[g] = "running" then | |
| skip; | |
| end if; | |
| WriteBuffer: | |
| \* Status may no longer be "running" here! | |
| buffer[g] := Append(buffer[g], event); | |
| \* FIXED: check and write in same atomic step | |
| EmitEvent: | |
| if status[g] /= "running" then | |
| goto Terminate; | |
| else | |
| buffer[g] := Append(buffer[g], event); | |
| end if; | |
| ``` | |
| This is the PlusCal equivalent of "no `await` between check and mutation" in asyncio. | |
| ### Atomic reservation pattern | |
| For mutual exclusion (e.g., one active generation per conversation), the check and claim must be a single label: | |
| ```pluscal | |
| Reserve: | |
| \* ALL of this is one atomic step — no interleaving possible | |
| if activeGen[convId] /= "" then | |
| goto Conflict; | |
| elsif ~HasAvailableSlot then | |
| goto NoSlots; | |
| else | |
| genId := AvailableSlot; | |
| status[genId] := "pending"; | |
| activeGen[convId] := genId; | |
| end if; | |
| ``` | |
| Splitting this into multiple labels (check → verify slot → claim) creates a race window where two processes both pass the check. | |
| ### Subscribe-first for gap-free replay | |
| When a reader needs to catch up on history and then receive live events: | |
| ```pluscal | |
| \* WRONG: replay then subscribe — events emitted between replay | |
| \* end and subscribe start are lost (the "subscribe gap") | |
| ReplayBuffer: | |
| \* ... iterate buffer ... | |
| SubscribeRedis: | |
| redisIdx := Len(channel); | |
| GoLive: | |
| \* ... read from channel ... | |
| \* CORRECT: subscribe first, replay, drain, then go live | |
| Subscribe: | |
| redisIdx := Len(channel); \* mark position BEFORE replay | |
| ReplayBuffer: | |
| \* ... iterate buffer (events after redisIdx go to channel) ... | |
| DrainRedis: | |
| \* ... drain channel messages received during replay, with dedup ... | |
| GoLive: | |
| \* ... no gap possible ... | |
| ``` | |
| ### Reader must check terminal status on timeout | |
| A reader waiting on Redis can hang forever if the generation completes and no more messages arrive: | |
| ```pluscal | |
| ReaderLive: | |
| either | |
| \* Normal: new Redis message | |
| await redisIdx < Len(channel); | |
| \* ... process message ... | |
| or | |
| \* Timeout path: check if generation is done | |
| await status[genId] \in TerminalStates; | |
| goto CatchUp; \* replay remaining buffer events | |
| end either; | |
| ``` | |
| Without the timeout path, a reader that misses the terminal event via Redis will wait forever. | |
| --- | |
| ## TLC Sharp Edges | |
| ### Process IDs must be same type | |
| TLC's `ProcSet` is a union of all process ID sets. If some are strings and some are integers, TLC fails with "Attempted to check equality of string with non-string." | |
| ```pluscal | |
| \* WRONG: mixes integers and strings | |
| fair process Trigger \in {"t1", "t2"} | |
| fair process Reader \in 1..2 | |
| \* CORRECT: all strings | |
| fair process Trigger \in {"t1", "t2"} | |
| fair process Reader \in {"r1", "r2"} | |
| ``` | |
| ### Unbounded loops prevent exhaustion | |
| A `while TRUE` loop in a process (like a cleanup sweep) creates an infinite state space. TLC will run forever even with small constants. | |
| **Solution: bound with quiescence.** | |
| ```pluscal | |
| \* Define when the model is "done" | |
| ModelQuiescent == | |
| /\ reservationCount >= MaxReservations | |
| /\ \A g \in GenSlots : status[g] = "none" | |
| /\ \A r \in ReaderIds : done[r] | |
| \* Sweep exits when quiescent | |
| while ~ModelQuiescent do | |
| \* ... cleanup work ... | |
| end while; | |
| ``` | |
| In practice, the best sequence is: | |
| 1. Add explicit lifecycle bounds (`MaxReservations`, `MaxEvents`) | |
| 2. Split safety and liveness into separate configs | |
| 3. Guard background loops with "has real work" predicates | |
| 4. Add quiescence so the bounded model can terminate | |
| 5. Only then increase coverage | |
| ### CHECK_DEADLOCK FALSE for bounded models | |
| Bounded models intentionally reach states where no process can step (everyone is `Done`). This is expected quiescence, not a bug: | |
| ``` | |
| \* In .cfg file: | |
| CHECK_DEADLOCK FALSE | |
| ``` | |
| ### Safety vs Liveness configs | |
| Always split into two configs: | |
| ``` | |
| \* safety.cfg — fast, large model | |
| SPECIFICATION Spec | |
| CHECK_DEADLOCK FALSE | |
| CONSTANT MaxEvents = 3, MaxReservations = 3, ReaderIds = {"r1", "r2"} | |
| INVARIANT SafetyInvariant S4_NoMissedEvents | |
| \* liveness.cfg — slower, smaller model | |
| SPECIFICATION Spec | |
| CHECK_DEADLOCK FALSE | |
| CONSTANT MaxEvents = 2, MaxReservations = 2, ReaderIds = {"r1"} | |
| INVARIANT SafetyInvariant S4_NoMissedEvents | |
| PROPERTY L1_Terminates L2_ReadersDone | |
| ``` | |
| Temporal properties (PROPERTY) require TLC to build the full behavior graph. Keep the liveness model small. | |
| ### Counterexample triage | |
| When TLC finds a violation, classify it before changing the design: | |
| 1. **Protocol bug** | |
| The trace reflects an allowed interleaving of the intended protocol. | |
| Example: check-then-act split across labels. | |
| 2. **Model bug** | |
| The trace relies on an artifact of the model rather than the protocol. | |
| Example: invariant references recycled slot state instead of stable generation identity. | |
| 3. **Translation/config bug** | |
| The model checker ran something different from what you intended. | |
| Example: forgot to re-run `pcal.trans`, mixed process-id types, wrong cfg constants. | |
| Checklist: | |
| - Did you retranslate after editing PlusCal? | |
| - Does the failing trace respect the intended protocol contract? | |
| - Is the invariant/property phrased against stable identities or recycled slots? | |
| - Is a helper process doing low-value churn rather than meaningful work? | |
| - Did a verification bound remove the very behavior you were trying to check? | |
| ### `CHOOSE` is deterministic but fragile | |
| `CHOOSE x \in S : P(x)` picks one element satisfying `P`. If multiple elements satisfy it, TLC always picks the same one (deterministic but arbitrary). If NO element satisfies it, TLC crashes. | |
| ```pluscal | |
| \* Guard CHOOSE with existence check | |
| if \E g \in Slots : status[g] = "none" then | |
| genId := CHOOSE g \in Slots : status[g] = "none"; | |
| end if; | |
| ``` | |
| ### Re-translate after every PlusCal edit | |
| The `BEGIN TRANSLATION` block is generated code. After editing anything above it: | |
| ```bash | |
| java -cp tla2tools.jar pcal.trans YourSpec.tla | |
| ``` | |
| If you forget, TLC runs against stale translated TLA+ and your changes have no effect. The checksum line catches most cases, but not all. | |
| ### Model hygiene | |
| - Remove unused variables from the PlusCal source. Every extra variable increases state space. | |
| - Prefer stable identities in invariants. If slots are recycled, snapshot what a reader/generation is expected to see. | |
| - Distinguish **protocol semantics** from **verification bounds** in comments. | |
| - Keep helper nondeterminism purposeful. If a branch only adds churn and no new insight, constrain or remove it. | |
| - If a loop exists only for cleanup/maintenance, guard it with a "has work" predicate. | |
| --- | |
| ## Invariant Design | |
| ### Forward-only transitions (S2) | |
| Track previous state to verify lifecycle ordering: | |
| ```pluscal | |
| variables | |
| status = [g \in Slots |-> "none"], | |
| prevStatus = [g \in Slots |-> "none"]; | |
| \* On every transition: | |
| prevStatus[g] := status[g]; | |
| status[g] := newStatus; | |
| ``` | |
| ```tla | |
| \* Invariant: | |
| StatusOrder(s) == CASE s = "none" -> 0 [] s = "pending" -> 1 ... | |
| S2_ForwardOnly == | |
| \A g \in Slots : | |
| StatusOrder(status[g]) >= StatusOrder(prevStatus[g]) | |
| ``` | |
| For slot reuse (`evicted -> none`), add an explicit exception: | |
| ```tla | |
| AllowedTransition(old, new) == | |
| \/ StatusOrder(new) >= StatusOrder(old) | |
| \/ /\ old = "evicted" /\ new = "none" | |
| ``` | |
| ### No-missed-events (S4) | |
| Snapshot what the reader is expected to see at finish time, not what the slot currently contains (slot may be reused): | |
| ```pluscal | |
| \* At reader finish: | |
| readerExpectedEvents[self] := bufferSnapshot; | |
| ``` | |
| ```tla | |
| S4_NoMissedEvents == | |
| \A r \in ReaderIds : | |
| done[r] => | |
| \A i \in 1..Len(expectedEvents[r]) : | |
| ReaderHasEvent(r, expectedEvents[r][i]) | |
| ``` | |
| ### Slot reuse safety | |
| Sweep must only reset an evicted slot after all readers attached to it have finished: | |
| ```pluscal | |
| if /\ status[g] = "evicted" | |
| /\ \A r \in ReaderIds : readerGenId[r] /= g \/ done[r] | |
| then | |
| \* Safe to reset | |
| status[g] := "none"; | |
| buffer[g] := <<>>; | |
| end if; | |
| ``` | |
| --- | |
| ## Modeling asyncio Concurrency | |
| ### The key mapping | |
| | asyncio concept | PlusCal equivalent | | |
| |----------------|-------------------| | |
| | Synchronous code block (no `await`) | Single label (atomic) | | |
| | `await` point | Label boundary (interleaving point) | | |
| | `asyncio.Task` | Separate `process` | | |
| | Shared dict/list mutation | Variable assignment in shared state | | |
| | `asyncio.Lock` | Often modeled as one label unless lock queueing/ownership matters | | |
| | Cooperative scheduling | Fair process (TLC explores all fair interleavings) | | |
| ### What to model, what to skip | |
| **Model:** | |
| - State machines (status transitions) | |
| - Shared mutable registries | |
| - Message channels (Redis pub/sub as append-only sequences) | |
| - Check-then-act patterns | |
| - Lifecycle (creation, terminal, cleanup, reuse) | |
| **Skip:** | |
| - Exact payload contents (use sequence numbers) | |
| - Network latency (PlusCal's nondeterminism covers this) | |
| - Error details (model error as a nondeterministic terminal transition) | |
| - Database schemas (model as variable assignment) | |
| - Authentication / authorization (orthogonal to concurrency) | |
| ### Fairness | |
| - Use fairness primarily for **liveness** reasoning. | |
| - Do not use fairness to "fix" a bad safety model. | |
| - If a liveness claim only holds under strong fairness assumptions, write that down explicitly. | |
| - Prefer bounded liveness runs over huge unbounded fair models. | |
| ### Proof strength vs confidence | |
| - **Exhausted model, no violations**: strong bounded assurance for that model. | |
| - **Large non-exhausted run, no violations**: good bug-finding evidence, not proof. | |
| - **Reduced liveness model passes**: confidence in the protocol shape, not a proof for all larger parameter values. | |
| --- | |
| ## Quick Reference | |
| ### Install | |
| ```bash | |
| # macOS | |
| brew install openjdk | |
| curl -L -o tla2tools.jar https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar | |
| ``` | |
| ### Commands | |
| ```bash | |
| # Translate PlusCal -> TLA+ | |
| java -cp tla2tools.jar pcal.trans Spec.tla | |
| # Run TLC (safety, multi-worker) | |
| java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \ | |
| -config Spec-safety.cfg -workers auto Spec.tla | |
| # Run TLC (liveness, single worker) | |
| java -XX:+UseParallelGC -jar tla2tools.jar \ | |
| -config Spec-liveness.cfg Spec.tla | |
| ``` | |
| ### Config template | |
| ``` | |
| SPECIFICATION Spec | |
| CHECK_DEADLOCK FALSE | |
| CONSTANT | |
| MaxEvents = 3 | |
| ConversationIds = {"c1"} | |
| ReaderIds = {"r1", "r2"} | |
| MaxReservations = 3 | |
| INVARIANT | |
| SafetyInvariant | |
| \* Only in liveness config: | |
| PROPERTY | |
| L1_Terminates | |
| ``` | |
| ### Common TLC exit codes | |
| | Code | Meaning | | |
| |------|---------| | |
| | 0 | All properties verified, model exhausted | | |
| | 11 | Deadlock found (add `CHECK_DEADLOCK FALSE` if expected) | | |
| | 12 | Invariant violated (read the counterexample trace) | | |
| | 13 | Temporal property violated | | |
| | 75 | Semantic/parse error in spec | | |
| | 143/144 | Killed by signal (SIGTERM/SIGKILL) | | |
| --- | |
| ## Worked Example: DES-5271 | |
| These findings came from verifying the assistant service resumable streaming protocol. Treat them as a case study, not universal rules: | |
| 1. **TLC found a trigger race at 533 states** that no human reviewer caught. Two `POST /chat/ask` requests interleaved between check and claim, creating duplicate PENDING generations. Fix: single atomic label for check+claim. | |
| 2. **TLC found a TOCTOU at 161M states.** GenTask checked `RUNNING`, then Stop fired, then GenTask wrote to the buffer of a stopped generation. Fix: merge check and write into same label. | |
| 3. **Unbounded sweep loops prevent model exhaustion.** `while TRUE` + slot reuse = infinite state space. Fix: `MaxReservations` budget + `ModelQuiescent` guard. | |
| 4. **`with i \in 1..Len(buf)` does NOT iterate.** It picks one arbitrary index. Use `while` loops for sequential iteration. This silently made S4 (no-missed-events) vacuous. | |
| 5. **Slot reuse invalidates invariants that reference slot state.** After reset, a finished reader's `readerGenId` points to a recycled slot with different data. Fix: snapshot expected events at reader finish time. | |
| 6. **Safety and liveness need different model sizes.** Safety with 3 events + 2 readers + 3 reservations = 470M+ states (tractable). Same config with liveness = infinite (temporal property checking builds full behavior graph). Split configs, reduce liveness model. | |
| 7. **Both bugs share the same root pattern: check-then-act with an interleaving point in between.** In asyncio, `await` is the interleaving point. In PlusCal, a label boundary is. The fix is always the same: keep check and mutation in one atomic block. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment