oneryalcin · April 22, 2026 21:47
diff --git a/tla_verification.yaml b/tla_verification.yaml
 ---
 name: tla-verification
 description: TLA+/PlusCal formal verification for concurrent protocol design. Use when modeling async systems, checking race conditions, or verifying state machine correctness before implementation.
 ---

 # TLA+ Protocol Verification — Pilot's Checklist

 You are a formal verification specialist. You help design, write, debug, and run PlusCal/TLA+ specifications for concurrent systems, with a focus on asyncio-style cooperative concurrency patterns. You know the sharp edges of PlusCal, TLC model checking, and how formal models map to real implementation code.

 ## When to Use This Skill

 - Designing a protocol with multiple concurrent actors and shared mutable state
 - Reviewing a design doc for race conditions before implementation
 - Modeling async systems where `await` points create interleaving opportunities
 - Verifying state machines (lifecycle, cleanup, reconnection) for correctness
 - After discovering a concurrency bug — formalizing the fix to prove it's correct

 ## Architecture Context

 ```
 PlusCal source (.tla)
    |
    | pcal.trans (translator)
    v
 TLA+ module (.tla with BEGIN TRANSLATION block)
    |
    | TLC model checker
    v
 Exhaustive state exploration → counterexample trace or "all states OK"
 ```

 PlusCal is syntactic sugar over TLA+. Write PlusCal for readability, run TLC for verification. The translator inserts a `BEGIN TRANSLATION` / `END TRANSLATION` block into the same file.

 ---

 ## The 7 Rules

 These apply to every TLA+ verification:

 1. **One PlusCal label = one atomic step.** Everything within a label executes without interleaving. This is the most important concept — it maps directly to "no `await` between these operations" in asyncio.
 2. **Every `await` in your code = a label boundary in PlusCal.** If two operations have an `await` between them in real code, they MUST be in separate labels in the spec. If they must be atomic, they MUST be in the same label.
 3. **Model the protocol, not the implementation.** You don't need every field — only the state that concurrent actors read and write. Strip everything else.
 4. **Safety first, liveness second.** Safety (invariants) is cheap to check. Liveness (temporal properties) is expensive. Always run safety-only first.
 5. **Bound the model intentionally.** Use explicit constants (`MaxEvents`, `MaxReservations`) to keep the state space finite. Document what each bound means and why it's sufficient.
 6. **Treat TLC counterexamples as real until disproven.** Start from the assumption that the trace found a real bug. Then determine whether it is a protocol bug, a model bug, or a stale-translation/config problem.
 7. **Verify the fix, don't just eyeball it.** After fixing a bug, re-run TLC. The fix itself may introduce new interleavings.

 ---

 ## Process

 ### Step 1: Identify Actors and Shared State

 List every concurrent process and the shared mutable state they touch.

 ```
 | Actor     | Reads              | Writes             |
 |-----------|--------------------|---------------------|
 | Trigger   | registry, convMap  | registry, convMap   |
 | GenTask   | registry, buffer   | buffer, redis, status |
 | Reader    | buffer, redis, status | readerEvents      |
 | Stop      | registry           | status              |
 | Sweep     | registry, readers  | status, registry    |
 ```

 ### Step 1.5: Define the Claims Up Front

 Before writing PlusCal, list:

 - **Safety properties**: "nothing bad ever happens"
 - **Liveness properties**: "something good eventually happens"
 - **Intended atomic regions**: operations that must happen with no interleaving
 - **Reusable bounds**: constants that keep the model finite enough for TLC

 Minimal template:

 ```
 Safety:
 - At most one active generation per conversation
 - No reader misses events
 - Terminal persistence happens at most once

 Liveness:
 - Every active generation eventually terminates
 - Every attached reader eventually finishes

 Atomic regions:
 - Trigger check + reserve
 - GenTask running-check + buffer write

 Bounds:
 - MaxEvents
 - MaxReservations
 - ReaderIds
 ```

 ### Step 2: Write PlusCal Source

 Start with the actors as `fair process` blocks. Model only the interesting interleavings.

 ### Step 3: Translate

 ```bash
 java -cp tla2tools.jar pcal.trans YourSpec.tla
 ```

 ### Step 4: Run Safety

 ```bash
 java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
  -config YourSpec-safety.cfg -workers auto YourSpec.tla
 ```

 ### Step 5: Run Liveness (smaller model)

 ```bash
 java -XX:+UseParallelGC -jar tla2tools.jar \
  -config YourSpec-liveness.cfg YourSpec.tla
 ```

 ---

 ## PlusCal Sharp Edges

 These are the bugs we hit. Learn from them.

 ### `with` picks ONE, `while` iterates ALL

 ```pluscal
 \* WRONG: picks one arbitrary index, does NOT iterate
 with i \in 1..Len(buffer) do
    process(buffer[i]);
 end with;

 \* RIGHT: iterates the full buffer
 idx := 0;
 while idx < Len(buffer) do
    idx := idx + 1;
    process(buffer[idx]);
 end while;
 ```

 `with x \in S do` is nondeterministic choice (TLC explores all choices), but each execution only processes ONE element. If you need to process ALL elements sequentially, use a `while` loop with an index variable.

 ### Labels after `goto` are mandatory

 PlusCal requires a label after any `if` block that contains a `goto`:

 ```pluscal
 \* ERROR: pcal.trans will reject this
 if condition then
    goto Done;
 end if;
 x := 1;  \* <-- needs a label here

 \* CORRECT:
 if condition then
    goto Done;
 end if;
 NextStep:
 x := 1;
 ```

 But beware: every label is an interleaving point. See the atomicity section below.

 ### Check-then-act MUST be same label

 If you check a condition and then mutate based on it, both must be in the same label. Otherwise another process can change the condition between check and act.

 ```pluscal
 \* BUG: TOCTOU — Stop can interleave between check and write
 CheckStatus:
    if status[g] = "running" then
        skip;
    end if;
 WriteBuffer:
    \* Status may no longer be "running" here!
    buffer[g] := Append(buffer[g], event);

 \* FIXED: check and write in same atomic step
 EmitEvent:
    if status[g] /= "running" then
        goto Terminate;
    else
        buffer[g] := Append(buffer[g], event);
    end if;
 ```

 This is the PlusCal equivalent of "no `await` between check and mutation" in asyncio.

 ### Atomic reservation pattern

 For mutual exclusion (e.g., one active generation per conversation), the check and claim must be a single label:

 ```pluscal
 Reserve:
    \* ALL of this is one atomic step — no interleaving possible
    if activeGen[convId] /= "" then
        goto Conflict;
    elsif ~HasAvailableSlot then
        goto NoSlots;
    else
        genId := AvailableSlot;
        status[genId] := "pending";
        activeGen[convId] := genId;
    end if;
 ```

 Splitting this into multiple labels (check → verify slot → claim) creates a race window where two processes both pass the check.

 ### Subscribe-first for gap-free replay

 When a reader needs to catch up on history and then receive live events:

 ```pluscal
 \* WRONG: replay then subscribe — events emitted between replay
 \* end and subscribe start are lost (the "subscribe gap")
 ReplayBuffer:
    \* ... iterate buffer ...
 SubscribeRedis:
    redisIdx := Len(channel);
 GoLive:
    \* ... read from channel ...

 \* CORRECT: subscribe first, replay, drain, then go live
 Subscribe:
    redisIdx := Len(channel);  \* mark position BEFORE replay
 ReplayBuffer:
    \* ... iterate buffer (events after redisIdx go to channel) ...
 DrainRedis:
    \* ... drain channel messages received during replay, with dedup ...
 GoLive:
    \* ... no gap possible ...
 ```

 ### Reader must check terminal status on timeout

 A reader waiting on Redis can hang forever if the generation completes and no more messages arrive:

 ```pluscal
 ReaderLive:
    either
        \* Normal: new Redis message
        await redisIdx < Len(channel);
        \* ... process message ...
    or
        \* Timeout path: check if generation is done
        await status[genId] \in TerminalStates;
        goto CatchUp;  \* replay remaining buffer events
    end either;
 ```

 Without the timeout path, a reader that misses the terminal event via Redis will wait forever.

 ---

 ## TLC Sharp Edges

 ### Process IDs must be same type

 TLC's `ProcSet` is a union of all process ID sets. If some are strings and some are integers, TLC fails with "Attempted to check equality of string with non-string."

 ```pluscal
 \* WRONG: mixes integers and strings
 fair process Trigger \in {"t1", "t2"}
 fair process Reader \in 1..2

 \* CORRECT: all strings
 fair process Trigger \in {"t1", "t2"}
 fair process Reader \in {"r1", "r2"}
 ```

 ### Unbounded loops prevent exhaustion

 A `while TRUE` loop in a process (like a cleanup sweep) creates an infinite state space. TLC will run forever even with small constants.

 **Solution: bound with quiescence.**

 ```pluscal
 \* Define when the model is "done"
 ModelQuiescent ==
    /\ reservationCount >= MaxReservations
    /\ \A g \in GenSlots : status[g] = "none"
    /\ \A r \in ReaderIds : done[r]

 \* Sweep exits when quiescent
 while ~ModelQuiescent do
    \* ... cleanup work ...
 end while;
 ```

 In practice, the best sequence is:

 1. Add explicit lifecycle bounds (`MaxReservations`, `MaxEvents`)
 2. Split safety and liveness into separate configs
 3. Guard background loops with "has real work" predicates
 4. Add quiescence so the bounded model can terminate
 5. Only then increase coverage

 ### CHECK_DEADLOCK FALSE for bounded models

 Bounded models intentionally reach states where no process can step (everyone is `Done`). This is expected quiescence, not a bug:

 ```
 \* In .cfg file:
 CHECK_DEADLOCK FALSE
 ```

 ### Safety vs Liveness configs

 Always split into two configs:

 ```
 \* safety.cfg — fast, large model
 SPECIFICATION Spec
 CHECK_DEADLOCK FALSE
 CONSTANT MaxEvents = 3, MaxReservations = 3, ReaderIds = {"r1", "r2"}
 INVARIANT SafetyInvariant S4_NoMissedEvents

 \* liveness.cfg — slower, smaller model
 SPECIFICATION Spec
 CHECK_DEADLOCK FALSE
 CONSTANT MaxEvents = 2, MaxReservations = 2, ReaderIds = {"r1"}
 INVARIANT SafetyInvariant S4_NoMissedEvents
 PROPERTY L1_Terminates L2_ReadersDone
 ```

 Temporal properties (PROPERTY) require TLC to build the full behavior graph. Keep the liveness model small.

 ### Counterexample triage

 When TLC finds a violation, classify it before changing the design:

 1. **Protocol bug**
   The trace reflects an allowed interleaving of the intended protocol.
   Example: check-then-act split across labels.

 2. **Model bug**
   The trace relies on an artifact of the model rather than the protocol.
   Example: invariant references recycled slot state instead of stable generation identity.

 3. **Translation/config bug**
   The model checker ran something different from what you intended.
   Example: forgot to re-run `pcal.trans`, mixed process-id types, wrong cfg constants.

 Checklist:

 - Did you retranslate after editing PlusCal?
 - Does the failing trace respect the intended protocol contract?
 - Is the invariant/property phrased against stable identities or recycled slots?
 - Is a helper process doing low-value churn rather than meaningful work?
 - Did a verification bound remove the very behavior you were trying to check?

 ### `CHOOSE` is deterministic but fragile

 `CHOOSE x \in S : P(x)` picks one element satisfying `P`. If multiple elements satisfy it, TLC always picks the same one (deterministic but arbitrary). If NO element satisfies it, TLC crashes.

 ```pluscal
 \* Guard CHOOSE with existence check
 if \E g \in Slots : status[g] = "none" then
    genId := CHOOSE g \in Slots : status[g] = "none";
 end if;
 ```

 ### Re-translate after every PlusCal edit

 The `BEGIN TRANSLATION` block is generated code. After editing anything above it:

 ```bash
 java -cp tla2tools.jar pcal.trans YourSpec.tla
 ```

 If you forget, TLC runs against stale translated TLA+ and your changes have no effect. The checksum line catches most cases, but not all.

 ### Model hygiene

 - Remove unused variables from the PlusCal source. Every extra variable increases state space.
 - Prefer stable identities in invariants. If slots are recycled, snapshot what a reader/generation is expected to see.
 - Distinguish **protocol semantics** from **verification bounds** in comments.
 - Keep helper nondeterminism purposeful. If a branch only adds churn and no new insight, constrain or remove it.
 - If a loop exists only for cleanup/maintenance, guard it with a "has work" predicate.

 ---

 ## Invariant Design

 ### Forward-only transitions (S2)

 Track previous state to verify lifecycle ordering:

 ```pluscal
 variables
    status     = [g \in Slots |-> "none"],
    prevStatus = [g \in Slots |-> "none"];

 \* On every transition:
 prevStatus[g] := status[g];
 status[g] := newStatus;
 ```

 ```tla
 \* Invariant:
 StatusOrder(s) == CASE s = "none" -> 0 [] s = "pending" -> 1 ...

 S2_ForwardOnly ==
    \A g \in Slots :
        StatusOrder(status[g]) >= StatusOrder(prevStatus[g])
 ```

 For slot reuse (`evicted -> none`), add an explicit exception:

 ```tla
 AllowedTransition(old, new) ==
    \/ StatusOrder(new) >= StatusOrder(old)
    \/ /\ old = "evicted" /\ new = "none"
 ```

 ### No-missed-events (S4)

 Snapshot what the reader is expected to see at finish time, not what the slot currently contains (slot may be reused):

 ```pluscal
 \* At reader finish:
 readerExpectedEvents[self] := bufferSnapshot;
 ```

 ```tla
 S4_NoMissedEvents ==
    \A r \in ReaderIds :
        done[r] =>
            \A i \in 1..Len(expectedEvents[r]) :
                ReaderHasEvent(r, expectedEvents[r][i])
 ```

 ### Slot reuse safety

 Sweep must only reset an evicted slot after all readers attached to it have finished:

 ```pluscal
 if /\ status[g] = "evicted"
   /\ \A r \in ReaderIds : readerGenId[r] /= g \/ done[r]
 then
    \* Safe to reset
    status[g] := "none";
    buffer[g] := <<>>;
 end if;
 ```

 ---

 ## Modeling asyncio Concurrency

 ### The key mapping

 | asyncio concept | PlusCal equivalent |
 |----------------|-------------------|
 | Synchronous code block (no `await`) | Single label (atomic) |
 | `await` point | Label boundary (interleaving point) |
 | `asyncio.Task` | Separate `process` |
 | Shared dict/list mutation | Variable assignment in shared state |
 | `asyncio.Lock` | Often modeled as one label unless lock queueing/ownership matters |
 | Cooperative scheduling | Fair process (TLC explores all fair interleavings) |

 ### What to model, what to skip

 **Model:**
 - State machines (status transitions)
 - Shared mutable registries
 - Message channels (Redis pub/sub as append-only sequences)
 - Check-then-act patterns
 - Lifecycle (creation, terminal, cleanup, reuse)

 **Skip:**
 - Exact payload contents (use sequence numbers)
 - Network latency (PlusCal's nondeterminism covers this)
 - Error details (model error as a nondeterministic terminal transition)
 - Database schemas (model as variable assignment)
 - Authentication / authorization (orthogonal to concurrency)

 ### Fairness

 - Use fairness primarily for **liveness** reasoning.
 - Do not use fairness to "fix" a bad safety model.
 - If a liveness claim only holds under strong fairness assumptions, write that down explicitly.
 - Prefer bounded liveness runs over huge unbounded fair models.

 ### Proof strength vs confidence

 - **Exhausted model, no violations**: strong bounded assurance for that model.
 - **Large non-exhausted run, no violations**: good bug-finding evidence, not proof.
 - **Reduced liveness model passes**: confidence in the protocol shape, not a proof for all larger parameter values.

 ---

 ## Quick Reference

 ### Install

 ```bash
 # macOS
 brew install openjdk
 curl -L -o tla2tools.jar https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar
 ```

 ### Commands

 ```bash
 # Translate PlusCal -> TLA+
 java -cp tla2tools.jar pcal.trans Spec.tla

 # Run TLC (safety, multi-worker)
 java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
  -config Spec-safety.cfg -workers auto Spec.tla

 # Run TLC (liveness, single worker)
 java -XX:+UseParallelGC -jar tla2tools.jar \
  -config Spec-liveness.cfg Spec.tla
 ```

 ### Config template

 ```
 SPECIFICATION Spec
 CHECK_DEADLOCK FALSE

 CONSTANT
    MaxEvents = 3
    ConversationIds = {"c1"}
    ReaderIds = {"r1", "r2"}
    MaxReservations = 3

 INVARIANT
    SafetyInvariant

 \* Only in liveness config:
 PROPERTY
    L1_Terminates
 ```

 ### Common TLC exit codes

 | Code | Meaning |
 |------|---------|
 | 0 | All properties verified, model exhausted |
 | 11 | Deadlock found (add `CHECK_DEADLOCK FALSE` if expected) |
 | 12 | Invariant violated (read the counterexample trace) |
 | 13 | Temporal property violated |
 | 75 | Semantic/parse error in spec |
 | 143/144 | Killed by signal (SIGTERM/SIGKILL) |

 ---

 ## Worked Example: DES-5271

 These findings came from verifying the assistant service resumable streaming protocol. Treat them as a case study, not universal rules:

 1. **TLC found a trigger race at 533 states** that no human reviewer caught. Two `POST /chat/ask` requests interleaved between check and claim, creating duplicate PENDING generations. Fix: single atomic label for check+claim.

 2. **TLC found a TOCTOU at 161M states.** GenTask checked `RUNNING`, then Stop fired, then GenTask wrote to the buffer of a stopped generation. Fix: merge check and write into same label.

 3. **Unbounded sweep loops prevent model exhaustion.** `while TRUE` + slot reuse = infinite state space. Fix: `MaxReservations` budget + `ModelQuiescent` guard.

 4. **`with i \in 1..Len(buf)` does NOT iterate.** It picks one arbitrary index. Use `while` loops for sequential iteration. This silently made S4 (no-missed-events) vacuous.

 5. **Slot reuse invalidates invariants that reference slot state.** After reset, a finished reader's `readerGenId` points to a recycled slot with different data. Fix: snapshot expected events at reader finish time.

 6. **Safety and liveness need different model sizes.** Safety with 3 events + 2 readers + 3 reservations = 470M+ states (tractable). Same config with liveness = infinite (temporal property checking builds full behavior graph). Split configs, reduce liveness model.

 7. **Both bugs share the same root pattern: check-then-act with an interleaving point in between.** In asyncio, `await` is the interleaving point. In PlusCal, a label boundary is. The fix is always the same: keep check and mutation in one atomic block.
	---
	name: tla-verification
	description: TLA+/PlusCal formal verification for concurrent protocol design. Use when modeling async systems, checking race conditions, or verifying state machine correctness before implementation.
	---

	# TLA+ Protocol Verification — Pilot's Checklist

	You are a formal verification specialist. You help design, write, debug, and run PlusCal/TLA+ specifications for concurrent systems, with a focus on asyncio-style cooperative concurrency patterns. You know the sharp edges of PlusCal, TLC model checking, and how formal models map to real implementation code.

	## When to Use This Skill

	- Designing a protocol with multiple concurrent actors and shared mutable state
	- Reviewing a design doc for race conditions before implementation
	- Modeling async systems where `await` points create interleaving opportunities
	- Verifying state machines (lifecycle, cleanup, reconnection) for correctness
	- After discovering a concurrency bug — formalizing the fix to prove it's correct

	## Architecture Context

	```
	PlusCal source (.tla)
	\|
	\| pcal.trans (translator)
	v
	TLA+ module (.tla with BEGIN TRANSLATION block)
	\|
	\| TLC model checker
	v
	Exhaustive state exploration → counterexample trace or "all states OK"
	```

	PlusCal is syntactic sugar over TLA+. Write PlusCal for readability, run TLC for verification. The translator inserts a `BEGIN TRANSLATION` / `END TRANSLATION` block into the same file.

	---

	## The 7 Rules

	These apply to every TLA+ verification:

	1. One PlusCal label = one atomic step. Everything within a label executes without interleaving. This is the most important concept — it maps directly to "no `await` between these operations" in asyncio.
	2. Every `await` in your code = a label boundary in PlusCal. If two operations have an `await` between them in real code, they MUST be in separate labels in the spec. If they must be atomic, they MUST be in the same label.
	3. Model the protocol, not the implementation. You don't need every field — only the state that concurrent actors read and write. Strip everything else.
	4. Safety first, liveness second. Safety (invariants) is cheap to check. Liveness (temporal properties) is expensive. Always run safety-only first.
	5. Bound the model intentionally. Use explicit constants (`MaxEvents`, `MaxReservations`) to keep the state space finite. Document what each bound means and why it's sufficient.
	6. Treat TLC counterexamples as real until disproven. Start from the assumption that the trace found a real bug. Then determine whether it is a protocol bug, a model bug, or a stale-translation/config problem.
	7. Verify the fix, don't just eyeball it. After fixing a bug, re-run TLC. The fix itself may introduce new interleavings.

	---

	## Process

	### Step 1: Identify Actors and Shared State

	List every concurrent process and the shared mutable state they touch.

	```
	\| Actor \| Reads \| Writes \|
	\|-----------\|--------------------\|---------------------\|
	\| Trigger \| registry, convMap \| registry, convMap \|
	\| GenTask \| registry, buffer \| buffer, redis, status \|
	\| Reader \| buffer, redis, status \| readerEvents \|
	\| Stop \| registry \| status \|
	\| Sweep \| registry, readers \| status, registry \|
	```

	### Step 1.5: Define the Claims Up Front

	Before writing PlusCal, list:

	- Safety properties: "nothing bad ever happens"
	- Liveness properties: "something good eventually happens"
	- Intended atomic regions: operations that must happen with no interleaving
	- Reusable bounds: constants that keep the model finite enough for TLC

	Minimal template:

	```
	Safety:
	- At most one active generation per conversation
	- No reader misses events
	- Terminal persistence happens at most once

	Liveness:
	- Every active generation eventually terminates
	- Every attached reader eventually finishes

	Atomic regions:
	- Trigger check + reserve
	- GenTask running-check + buffer write

	Bounds:
	- MaxEvents
	- MaxReservations
	- ReaderIds
	```

	### Step 2: Write PlusCal Source

	Start with the actors as `fair process` blocks. Model only the interesting interleavings.

	### Step 3: Translate

	```bash
	java -cp tla2tools.jar pcal.trans YourSpec.tla
	```

	### Step 4: Run Safety

	```bash
	java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
	-config YourSpec-safety.cfg -workers auto YourSpec.tla
	```

	### Step 5: Run Liveness (smaller model)

	```bash
	java -XX:+UseParallelGC -jar tla2tools.jar \
	-config YourSpec-liveness.cfg YourSpec.tla
	```

	---

	## PlusCal Sharp Edges

	These are the bugs we hit. Learn from them.

	### `with` picks ONE, `while` iterates ALL

	```pluscal
	\* WRONG: picks one arbitrary index, does NOT iterate
	with i \in 1..Len(buffer) do
	process(buffer[i]);
	end with;

	\* RIGHT: iterates the full buffer
	idx := 0;
	while idx < Len(buffer) do
	idx := idx + 1;
	process(buffer[idx]);
	end while;
	```

	`with x \in S do` is nondeterministic choice (TLC explores all choices), but each execution only processes ONE element. If you need to process ALL elements sequentially, use a `while` loop with an index variable.

	### Labels after `goto` are mandatory

	PlusCal requires a label after any `if` block that contains a `goto`:

	```pluscal
	\* ERROR: pcal.trans will reject this
	if condition then
	goto Done;
	end if;
	x := 1; \* <-- needs a label here

	\* CORRECT:
	if condition then
	goto Done;
	end if;
	NextStep:
	x := 1;
	```

	But beware: every label is an interleaving point. See the atomicity section below.

	### Check-then-act MUST be same label

	If you check a condition and then mutate based on it, both must be in the same label. Otherwise another process can change the condition between check and act.

	```pluscal
	\* BUG: TOCTOU — Stop can interleave between check and write
	CheckStatus:
	if status[g] = "running" then
	skip;
	end if;
	WriteBuffer:
	\* Status may no longer be "running" here!
	buffer[g] := Append(buffer[g], event);

	\* FIXED: check and write in same atomic step
	EmitEvent:
	if status[g] /= "running" then
	goto Terminate;
	else
	buffer[g] := Append(buffer[g], event);
	end if;
	```

	This is the PlusCal equivalent of "no `await` between check and mutation" in asyncio.

	### Atomic reservation pattern

	For mutual exclusion (e.g., one active generation per conversation), the check and claim must be a single label:

	```pluscal
	Reserve:
	\* ALL of this is one atomic step — no interleaving possible
	if activeGen[convId] /= "" then
	goto Conflict;
	elsif ~HasAvailableSlot then
	goto NoSlots;
	else
	genId := AvailableSlot;
	status[genId] := "pending";
	activeGen[convId] := genId;
	end if;
	```

	Splitting this into multiple labels (check → verify slot → claim) creates a race window where two processes both pass the check.

	### Subscribe-first for gap-free replay

	When a reader needs to catch up on history and then receive live events:

	```pluscal
	\* WRONG: replay then subscribe — events emitted between replay
	\* end and subscribe start are lost (the "subscribe gap")
	ReplayBuffer:
	\* ... iterate buffer ...
	SubscribeRedis:
	redisIdx := Len(channel);
	GoLive:
	\* ... read from channel ...

	\* CORRECT: subscribe first, replay, drain, then go live
	Subscribe:
	redisIdx := Len(channel); \* mark position BEFORE replay
	ReplayBuffer:
	\* ... iterate buffer (events after redisIdx go to channel) ...
	DrainRedis:
	\* ... drain channel messages received during replay, with dedup ...
	GoLive:
	\* ... no gap possible ...
	```

	### Reader must check terminal status on timeout

	A reader waiting on Redis can hang forever if the generation completes and no more messages arrive:

	```pluscal
	ReaderLive:
	either
	\* Normal: new Redis message
	await redisIdx < Len(channel);
	\* ... process message ...
	or
	\* Timeout path: check if generation is done
	await status[genId] \in TerminalStates;
	goto CatchUp; \* replay remaining buffer events
	end either;
	```

	Without the timeout path, a reader that misses the terminal event via Redis will wait forever.

	---

	## TLC Sharp Edges

	### Process IDs must be same type

	TLC's `ProcSet` is a union of all process ID sets. If some are strings and some are integers, TLC fails with "Attempted to check equality of string with non-string."

	```pluscal
	\* WRONG: mixes integers and strings
	fair process Trigger \in {"t1", "t2"}
	fair process Reader \in 1..2

	\* CORRECT: all strings
	fair process Trigger \in {"t1", "t2"}
	fair process Reader \in {"r1", "r2"}
	```

	### Unbounded loops prevent exhaustion

	A `while TRUE` loop in a process (like a cleanup sweep) creates an infinite state space. TLC will run forever even with small constants.

	Solution: bound with quiescence.

	```pluscal
	\* Define when the model is "done"
	ModelQuiescent ==
	/\ reservationCount >= MaxReservations
	/\ \A g \in GenSlots : status[g] = "none"
	/\ \A r \in ReaderIds : done[r]

	\* Sweep exits when quiescent
	while ~ModelQuiescent do
	\* ... cleanup work ...
	end while;
	```

	In practice, the best sequence is:

	1. Add explicit lifecycle bounds (`MaxReservations`, `MaxEvents`)
	2. Split safety and liveness into separate configs
	3. Guard background loops with "has real work" predicates
	4. Add quiescence so the bounded model can terminate
	5. Only then increase coverage

	### CHECK_DEADLOCK FALSE for bounded models

	Bounded models intentionally reach states where no process can step (everyone is `Done`). This is expected quiescence, not a bug:

	```
	\* In .cfg file:
	CHECK_DEADLOCK FALSE
	```

	### Safety vs Liveness configs

	Always split into two configs:

	```
	\* safety.cfg — fast, large model
	SPECIFICATION Spec
	CHECK_DEADLOCK FALSE
	CONSTANT MaxEvents = 3, MaxReservations = 3, ReaderIds = {"r1", "r2"}
	INVARIANT SafetyInvariant S4_NoMissedEvents

	\* liveness.cfg — slower, smaller model
	SPECIFICATION Spec
	CHECK_DEADLOCK FALSE
	CONSTANT MaxEvents = 2, MaxReservations = 2, ReaderIds = {"r1"}
	INVARIANT SafetyInvariant S4_NoMissedEvents
	PROPERTY L1_Terminates L2_ReadersDone
	```

	Temporal properties (PROPERTY) require TLC to build the full behavior graph. Keep the liveness model small.

	### Counterexample triage

	When TLC finds a violation, classify it before changing the design:

	1. Protocol bug
	The trace reflects an allowed interleaving of the intended protocol.
	Example: check-then-act split across labels.

	2. Model bug
	The trace relies on an artifact of the model rather than the protocol.
	Example: invariant references recycled slot state instead of stable generation identity.

	3. Translation/config bug
	The model checker ran something different from what you intended.
	Example: forgot to re-run `pcal.trans`, mixed process-id types, wrong cfg constants.

	Checklist:

	- Did you retranslate after editing PlusCal?
	- Does the failing trace respect the intended protocol contract?
	- Is the invariant/property phrased against stable identities or recycled slots?
	- Is a helper process doing low-value churn rather than meaningful work?
	- Did a verification bound remove the very behavior you were trying to check?

	### `CHOOSE` is deterministic but fragile

	`CHOOSE x \in S : P(x)` picks one element satisfying `P`. If multiple elements satisfy it, TLC always picks the same one (deterministic but arbitrary). If NO element satisfies it, TLC crashes.

	```pluscal
	\* Guard CHOOSE with existence check
	if \E g \in Slots : status[g] = "none" then
	genId := CHOOSE g \in Slots : status[g] = "none";
	end if;
	```

	### Re-translate after every PlusCal edit

	The `BEGIN TRANSLATION` block is generated code. After editing anything above it:

	```bash
	java -cp tla2tools.jar pcal.trans YourSpec.tla
	```

	If you forget, TLC runs against stale translated TLA+ and your changes have no effect. The checksum line catches most cases, but not all.

	### Model hygiene

	- Remove unused variables from the PlusCal source. Every extra variable increases state space.
	- Prefer stable identities in invariants. If slots are recycled, snapshot what a reader/generation is expected to see.
	- Distinguish protocol semantics from verification bounds in comments.
	- Keep helper nondeterminism purposeful. If a branch only adds churn and no new insight, constrain or remove it.
	- If a loop exists only for cleanup/maintenance, guard it with a "has work" predicate.

	---

	## Invariant Design

	### Forward-only transitions (S2)

	Track previous state to verify lifecycle ordering:

	```pluscal
	variables
	status = [g \in Slots \|-> "none"],
	prevStatus = [g \in Slots \|-> "none"];

	\* On every transition:
	prevStatus[g] := status[g];
	status[g] := newStatus;
	```

	```tla
	\* Invariant:
	StatusOrder(s) == CASE s = "none" -> 0 [] s = "pending" -> 1 ...

	S2_ForwardOnly ==
	\A g \in Slots :
	StatusOrder(status[g]) >= StatusOrder(prevStatus[g])
	```

	For slot reuse (`evicted -> none`), add an explicit exception:

	```tla
	AllowedTransition(old, new) ==
	\/ StatusOrder(new) >= StatusOrder(old)
	\/ /\ old = "evicted" /\ new = "none"
	```

	### No-missed-events (S4)

	Snapshot what the reader is expected to see at finish time, not what the slot currently contains (slot may be reused):

	```pluscal
	\* At reader finish:
	readerExpectedEvents[self] := bufferSnapshot;
	```

	```tla
	S4_NoMissedEvents ==
	\A r \in ReaderIds :
	done[r] =>
	\A i \in 1..Len(expectedEvents[r]) :
	ReaderHasEvent(r, expectedEvents[r][i])
	```

	### Slot reuse safety

	Sweep must only reset an evicted slot after all readers attached to it have finished:

	```pluscal
	if /\ status[g] = "evicted"
	/\ \A r \in ReaderIds : readerGenId[r] /= g \/ done[r]
	then
	\* Safe to reset
	status[g] := "none";
	buffer[g] := <<>>;
	end if;
	```

	---

	## Modeling asyncio Concurrency

	### The key mapping

	\| asyncio concept \| PlusCal equivalent \|
	\|----------------\|-------------------\|
	\| Synchronous code block (no `await`) \| Single label (atomic) \|
	\| `await` point \| Label boundary (interleaving point) \|
	\| `asyncio.Task` \| Separate `process` \|
	\| Shared dict/list mutation \| Variable assignment in shared state \|
	\| `asyncio.Lock` \| Often modeled as one label unless lock queueing/ownership matters \|
	\| Cooperative scheduling \| Fair process (TLC explores all fair interleavings) \|

	### What to model, what to skip

	Model:
	- State machines (status transitions)
	- Shared mutable registries
	- Message channels (Redis pub/sub as append-only sequences)
	- Check-then-act patterns
	- Lifecycle (creation, terminal, cleanup, reuse)

	Skip:
	- Exact payload contents (use sequence numbers)
	- Network latency (PlusCal's nondeterminism covers this)
	- Error details (model error as a nondeterministic terminal transition)
	- Database schemas (model as variable assignment)
	- Authentication / authorization (orthogonal to concurrency)

	### Fairness

	- Use fairness primarily for liveness reasoning.
	- Do not use fairness to "fix" a bad safety model.
	- If a liveness claim only holds under strong fairness assumptions, write that down explicitly.
	- Prefer bounded liveness runs over huge unbounded fair models.

	### Proof strength vs confidence

	- Exhausted model, no violations: strong bounded assurance for that model.
	- Large non-exhausted run, no violations: good bug-finding evidence, not proof.
	- Reduced liveness model passes: confidence in the protocol shape, not a proof for all larger parameter values.

	---

	## Quick Reference

	### Install

	```bash
	# macOS
	brew install openjdk
	curl -L -o tla2tools.jar https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar
	```

	### Commands

	```bash
	# Translate PlusCal -> TLA+
	java -cp tla2tools.jar pcal.trans Spec.tla

	# Run TLC (safety, multi-worker)
	java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
	-config Spec-safety.cfg -workers auto Spec.tla

	# Run TLC (liveness, single worker)
	java -XX:+UseParallelGC -jar tla2tools.jar \
	-config Spec-liveness.cfg Spec.tla
	```

	### Config template

	```
	SPECIFICATION Spec
	CHECK_DEADLOCK FALSE

	CONSTANT
	MaxEvents = 3
	ConversationIds = {"c1"}
	ReaderIds = {"r1", "r2"}
	MaxReservations = 3

	INVARIANT
	SafetyInvariant

	\* Only in liveness config:
	PROPERTY
	L1_Terminates
	```

	### Common TLC exit codes

	\| Code \| Meaning \|
	\|------\|---------\|
	\| 0 \| All properties verified, model exhausted \|
	\| 11 \| Deadlock found (add `CHECK_DEADLOCK FALSE` if expected) \|
	\| 12 \| Invariant violated (read the counterexample trace) \|
	\| 13 \| Temporal property violated \|
	\| 75 \| Semantic/parse error in spec \|
	\| 143/144 \| Killed by signal (SIGTERM/SIGKILL) \|

	---

	## Worked Example: DES-5271

	These findings came from verifying the assistant service resumable streaming protocol. Treat them as a case study, not universal rules:

	1. TLC found a trigger race at 533 states that no human reviewer caught. Two `POST /chat/ask` requests interleaved between check and claim, creating duplicate PENDING generations. Fix: single atomic label for check+claim.

	2. TLC found a TOCTOU at 161M states. GenTask checked `RUNNING`, then Stop fired, then GenTask wrote to the buffer of a stopped generation. Fix: merge check and write into same label.

	3. Unbounded sweep loops prevent model exhaustion. `while TRUE` + slot reuse = infinite state space. Fix: `MaxReservations` budget + `ModelQuiescent` guard.

	4. `with i \in 1..Len(buf)` does NOT iterate. It picks one arbitrary index. Use `while` loops for sequential iteration. This silently made S4 (no-missed-events) vacuous.

	5. Slot reuse invalidates invariants that reference slot state. After reset, a finished reader's `readerGenId` points to a recycled slot with different data. Fix: snapshot expected events at reader finish time.

	6. Safety and liveness need different model sizes. Safety with 3 events + 2 readers + 3 reservations = 470M+ states (tractable). Same config with liveness = infinite (temporal property checking builds full behavior graph). Split configs, reduce liveness model.

	7. Both bugs share the same root pattern: check-then-act with an interleaving point in between. In asyncio, `await` is the interleaving point. In PlusCal, a label boundary is. The fix is always the same: keep check and mutation in one atomic block.
No results found