Skip to content

Instantly share code, notes, and snippets.

@oneryalcin
Created April 22, 2026 21:47
Show Gist options
  • Select an option

  • Save oneryalcin/0f0491df0ae6ce35a3fcd6038bc4728a to your computer and use it in GitHub Desktop.

Select an option

Save oneryalcin/0f0491df0ae6ce35a3fcd6038bc4728a to your computer and use it in GitHub Desktop.
TLA Plus verifications
---
name: tla-verification
description: TLA+/PlusCal formal verification for concurrent protocol design. Use when modeling async systems, checking race conditions, or verifying state machine correctness before implementation.
---
# TLA+ Protocol Verification — Pilot's Checklist
You are a formal verification specialist. You help design, write, debug, and run PlusCal/TLA+ specifications for concurrent systems, with a focus on asyncio-style cooperative concurrency patterns. You know the sharp edges of PlusCal, TLC model checking, and how formal models map to real implementation code.
## When to Use This Skill
- Designing a protocol with multiple concurrent actors and shared mutable state
- Reviewing a design doc for race conditions before implementation
- Modeling async systems where `await` points create interleaving opportunities
- Verifying state machines (lifecycle, cleanup, reconnection) for correctness
- After discovering a concurrency bug — formalizing the fix to prove it's correct
## Architecture Context
```
PlusCal source (.tla)
|
| pcal.trans (translator)
v
TLA+ module (.tla with BEGIN TRANSLATION block)
|
| TLC model checker
v
Exhaustive state exploration → counterexample trace or "all states OK"
```
PlusCal is syntactic sugar over TLA+. Write PlusCal for readability, run TLC for verification. The translator inserts a `BEGIN TRANSLATION` / `END TRANSLATION` block into the same file.
---
## The 7 Rules
These apply to every TLA+ verification:
1. **One PlusCal label = one atomic step.** Everything within a label executes without interleaving. This is the most important concept — it maps directly to "no `await` between these operations" in asyncio.
2. **Every `await` in your code = a label boundary in PlusCal.** If two operations have an `await` between them in real code, they MUST be in separate labels in the spec. If they must be atomic, they MUST be in the same label.
3. **Model the protocol, not the implementation.** You don't need every field — only the state that concurrent actors read and write. Strip everything else.
4. **Safety first, liveness second.** Safety (invariants) is cheap to check. Liveness (temporal properties) is expensive. Always run safety-only first.
5. **Bound the model intentionally.** Use explicit constants (`MaxEvents`, `MaxReservations`) to keep the state space finite. Document what each bound means and why it's sufficient.
6. **Treat TLC counterexamples as real until disproven.** Start from the assumption that the trace found a real bug. Then determine whether it is a protocol bug, a model bug, or a stale-translation/config problem.
7. **Verify the fix, don't just eyeball it.** After fixing a bug, re-run TLC. The fix itself may introduce new interleavings.
---
## Process
### Step 1: Identify Actors and Shared State
List every concurrent process and the shared mutable state they touch.
```
| Actor | Reads | Writes |
|-----------|--------------------|---------------------|
| Trigger | registry, convMap | registry, convMap |
| GenTask | registry, buffer | buffer, redis, status |
| Reader | buffer, redis, status | readerEvents |
| Stop | registry | status |
| Sweep | registry, readers | status, registry |
```
### Step 1.5: Define the Claims Up Front
Before writing PlusCal, list:
- **Safety properties**: "nothing bad ever happens"
- **Liveness properties**: "something good eventually happens"
- **Intended atomic regions**: operations that must happen with no interleaving
- **Reusable bounds**: constants that keep the model finite enough for TLC
Minimal template:
```
Safety:
- At most one active generation per conversation
- No reader misses events
- Terminal persistence happens at most once
Liveness:
- Every active generation eventually terminates
- Every attached reader eventually finishes
Atomic regions:
- Trigger check + reserve
- GenTask running-check + buffer write
Bounds:
- MaxEvents
- MaxReservations
- ReaderIds
```
### Step 2: Write PlusCal Source
Start with the actors as `fair process` blocks. Model only the interesting interleavings.
### Step 3: Translate
```bash
java -cp tla2tools.jar pcal.trans YourSpec.tla
```
### Step 4: Run Safety
```bash
java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
-config YourSpec-safety.cfg -workers auto YourSpec.tla
```
### Step 5: Run Liveness (smaller model)
```bash
java -XX:+UseParallelGC -jar tla2tools.jar \
-config YourSpec-liveness.cfg YourSpec.tla
```
---
## PlusCal Sharp Edges
These are the bugs we hit. Learn from them.
### `with` picks ONE, `while` iterates ALL
```pluscal
\* WRONG: picks one arbitrary index, does NOT iterate
with i \in 1..Len(buffer) do
process(buffer[i]);
end with;
\* RIGHT: iterates the full buffer
idx := 0;
while idx < Len(buffer) do
idx := idx + 1;
process(buffer[idx]);
end while;
```
`with x \in S do` is nondeterministic choice (TLC explores all choices), but each execution only processes ONE element. If you need to process ALL elements sequentially, use a `while` loop with an index variable.
### Labels after `goto` are mandatory
PlusCal requires a label after any `if` block that contains a `goto`:
```pluscal
\* ERROR: pcal.trans will reject this
if condition then
goto Done;
end if;
x := 1; \* <-- needs a label here
\* CORRECT:
if condition then
goto Done;
end if;
NextStep:
x := 1;
```
But beware: every label is an interleaving point. See the atomicity section below.
### Check-then-act MUST be same label
If you check a condition and then mutate based on it, both must be in the same label. Otherwise another process can change the condition between check and act.
```pluscal
\* BUG: TOCTOU — Stop can interleave between check and write
CheckStatus:
if status[g] = "running" then
skip;
end if;
WriteBuffer:
\* Status may no longer be "running" here!
buffer[g] := Append(buffer[g], event);
\* FIXED: check and write in same atomic step
EmitEvent:
if status[g] /= "running" then
goto Terminate;
else
buffer[g] := Append(buffer[g], event);
end if;
```
This is the PlusCal equivalent of "no `await` between check and mutation" in asyncio.
### Atomic reservation pattern
For mutual exclusion (e.g., one active generation per conversation), the check and claim must be a single label:
```pluscal
Reserve:
\* ALL of this is one atomic step — no interleaving possible
if activeGen[convId] /= "" then
goto Conflict;
elsif ~HasAvailableSlot then
goto NoSlots;
else
genId := AvailableSlot;
status[genId] := "pending";
activeGen[convId] := genId;
end if;
```
Splitting this into multiple labels (check → verify slot → claim) creates a race window where two processes both pass the check.
### Subscribe-first for gap-free replay
When a reader needs to catch up on history and then receive live events:
```pluscal
\* WRONG: replay then subscribe — events emitted between replay
\* end and subscribe start are lost (the "subscribe gap")
ReplayBuffer:
\* ... iterate buffer ...
SubscribeRedis:
redisIdx := Len(channel);
GoLive:
\* ... read from channel ...
\* CORRECT: subscribe first, replay, drain, then go live
Subscribe:
redisIdx := Len(channel); \* mark position BEFORE replay
ReplayBuffer:
\* ... iterate buffer (events after redisIdx go to channel) ...
DrainRedis:
\* ... drain channel messages received during replay, with dedup ...
GoLive:
\* ... no gap possible ...
```
### Reader must check terminal status on timeout
A reader waiting on Redis can hang forever if the generation completes and no more messages arrive:
```pluscal
ReaderLive:
either
\* Normal: new Redis message
await redisIdx < Len(channel);
\* ... process message ...
or
\* Timeout path: check if generation is done
await status[genId] \in TerminalStates;
goto CatchUp; \* replay remaining buffer events
end either;
```
Without the timeout path, a reader that misses the terminal event via Redis will wait forever.
---
## TLC Sharp Edges
### Process IDs must be same type
TLC's `ProcSet` is a union of all process ID sets. If some are strings and some are integers, TLC fails with "Attempted to check equality of string with non-string."
```pluscal
\* WRONG: mixes integers and strings
fair process Trigger \in {"t1", "t2"}
fair process Reader \in 1..2
\* CORRECT: all strings
fair process Trigger \in {"t1", "t2"}
fair process Reader \in {"r1", "r2"}
```
### Unbounded loops prevent exhaustion
A `while TRUE` loop in a process (like a cleanup sweep) creates an infinite state space. TLC will run forever even with small constants.
**Solution: bound with quiescence.**
```pluscal
\* Define when the model is "done"
ModelQuiescent ==
/\ reservationCount >= MaxReservations
/\ \A g \in GenSlots : status[g] = "none"
/\ \A r \in ReaderIds : done[r]
\* Sweep exits when quiescent
while ~ModelQuiescent do
\* ... cleanup work ...
end while;
```
In practice, the best sequence is:
1. Add explicit lifecycle bounds (`MaxReservations`, `MaxEvents`)
2. Split safety and liveness into separate configs
3. Guard background loops with "has real work" predicates
4. Add quiescence so the bounded model can terminate
5. Only then increase coverage
### CHECK_DEADLOCK FALSE for bounded models
Bounded models intentionally reach states where no process can step (everyone is `Done`). This is expected quiescence, not a bug:
```
\* In .cfg file:
CHECK_DEADLOCK FALSE
```
### Safety vs Liveness configs
Always split into two configs:
```
\* safety.cfg — fast, large model
SPECIFICATION Spec
CHECK_DEADLOCK FALSE
CONSTANT MaxEvents = 3, MaxReservations = 3, ReaderIds = {"r1", "r2"}
INVARIANT SafetyInvariant S4_NoMissedEvents
\* liveness.cfg — slower, smaller model
SPECIFICATION Spec
CHECK_DEADLOCK FALSE
CONSTANT MaxEvents = 2, MaxReservations = 2, ReaderIds = {"r1"}
INVARIANT SafetyInvariant S4_NoMissedEvents
PROPERTY L1_Terminates L2_ReadersDone
```
Temporal properties (PROPERTY) require TLC to build the full behavior graph. Keep the liveness model small.
### Counterexample triage
When TLC finds a violation, classify it before changing the design:
1. **Protocol bug**
The trace reflects an allowed interleaving of the intended protocol.
Example: check-then-act split across labels.
2. **Model bug**
The trace relies on an artifact of the model rather than the protocol.
Example: invariant references recycled slot state instead of stable generation identity.
3. **Translation/config bug**
The model checker ran something different from what you intended.
Example: forgot to re-run `pcal.trans`, mixed process-id types, wrong cfg constants.
Checklist:
- Did you retranslate after editing PlusCal?
- Does the failing trace respect the intended protocol contract?
- Is the invariant/property phrased against stable identities or recycled slots?
- Is a helper process doing low-value churn rather than meaningful work?
- Did a verification bound remove the very behavior you were trying to check?
### `CHOOSE` is deterministic but fragile
`CHOOSE x \in S : P(x)` picks one element satisfying `P`. If multiple elements satisfy it, TLC always picks the same one (deterministic but arbitrary). If NO element satisfies it, TLC crashes.
```pluscal
\* Guard CHOOSE with existence check
if \E g \in Slots : status[g] = "none" then
genId := CHOOSE g \in Slots : status[g] = "none";
end if;
```
### Re-translate after every PlusCal edit
The `BEGIN TRANSLATION` block is generated code. After editing anything above it:
```bash
java -cp tla2tools.jar pcal.trans YourSpec.tla
```
If you forget, TLC runs against stale translated TLA+ and your changes have no effect. The checksum line catches most cases, but not all.
### Model hygiene
- Remove unused variables from the PlusCal source. Every extra variable increases state space.
- Prefer stable identities in invariants. If slots are recycled, snapshot what a reader/generation is expected to see.
- Distinguish **protocol semantics** from **verification bounds** in comments.
- Keep helper nondeterminism purposeful. If a branch only adds churn and no new insight, constrain or remove it.
- If a loop exists only for cleanup/maintenance, guard it with a "has work" predicate.
---
## Invariant Design
### Forward-only transitions (S2)
Track previous state to verify lifecycle ordering:
```pluscal
variables
status = [g \in Slots |-> "none"],
prevStatus = [g \in Slots |-> "none"];
\* On every transition:
prevStatus[g] := status[g];
status[g] := newStatus;
```
```tla
\* Invariant:
StatusOrder(s) == CASE s = "none" -> 0 [] s = "pending" -> 1 ...
S2_ForwardOnly ==
\A g \in Slots :
StatusOrder(status[g]) >= StatusOrder(prevStatus[g])
```
For slot reuse (`evicted -> none`), add an explicit exception:
```tla
AllowedTransition(old, new) ==
\/ StatusOrder(new) >= StatusOrder(old)
\/ /\ old = "evicted" /\ new = "none"
```
### No-missed-events (S4)
Snapshot what the reader is expected to see at finish time, not what the slot currently contains (slot may be reused):
```pluscal
\* At reader finish:
readerExpectedEvents[self] := bufferSnapshot;
```
```tla
S4_NoMissedEvents ==
\A r \in ReaderIds :
done[r] =>
\A i \in 1..Len(expectedEvents[r]) :
ReaderHasEvent(r, expectedEvents[r][i])
```
### Slot reuse safety
Sweep must only reset an evicted slot after all readers attached to it have finished:
```pluscal
if /\ status[g] = "evicted"
/\ \A r \in ReaderIds : readerGenId[r] /= g \/ done[r]
then
\* Safe to reset
status[g] := "none";
buffer[g] := <<>>;
end if;
```
---
## Modeling asyncio Concurrency
### The key mapping
| asyncio concept | PlusCal equivalent |
|----------------|-------------------|
| Synchronous code block (no `await`) | Single label (atomic) |
| `await` point | Label boundary (interleaving point) |
| `asyncio.Task` | Separate `process` |
| Shared dict/list mutation | Variable assignment in shared state |
| `asyncio.Lock` | Often modeled as one label unless lock queueing/ownership matters |
| Cooperative scheduling | Fair process (TLC explores all fair interleavings) |
### What to model, what to skip
**Model:**
- State machines (status transitions)
- Shared mutable registries
- Message channels (Redis pub/sub as append-only sequences)
- Check-then-act patterns
- Lifecycle (creation, terminal, cleanup, reuse)
**Skip:**
- Exact payload contents (use sequence numbers)
- Network latency (PlusCal's nondeterminism covers this)
- Error details (model error as a nondeterministic terminal transition)
- Database schemas (model as variable assignment)
- Authentication / authorization (orthogonal to concurrency)
### Fairness
- Use fairness primarily for **liveness** reasoning.
- Do not use fairness to "fix" a bad safety model.
- If a liveness claim only holds under strong fairness assumptions, write that down explicitly.
- Prefer bounded liveness runs over huge unbounded fair models.
### Proof strength vs confidence
- **Exhausted model, no violations**: strong bounded assurance for that model.
- **Large non-exhausted run, no violations**: good bug-finding evidence, not proof.
- **Reduced liveness model passes**: confidence in the protocol shape, not a proof for all larger parameter values.
---
## Quick Reference
### Install
```bash
# macOS
brew install openjdk
curl -L -o tla2tools.jar https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar
```
### Commands
```bash
# Translate PlusCal -> TLA+
java -cp tla2tools.jar pcal.trans Spec.tla
# Run TLC (safety, multi-worker)
java -XX:+UseParallelGC -Xmx4g -jar tla2tools.jar \
-config Spec-safety.cfg -workers auto Spec.tla
# Run TLC (liveness, single worker)
java -XX:+UseParallelGC -jar tla2tools.jar \
-config Spec-liveness.cfg Spec.tla
```
### Config template
```
SPECIFICATION Spec
CHECK_DEADLOCK FALSE
CONSTANT
MaxEvents = 3
ConversationIds = {"c1"}
ReaderIds = {"r1", "r2"}
MaxReservations = 3
INVARIANT
SafetyInvariant
\* Only in liveness config:
PROPERTY
L1_Terminates
```
### Common TLC exit codes
| Code | Meaning |
|------|---------|
| 0 | All properties verified, model exhausted |
| 11 | Deadlock found (add `CHECK_DEADLOCK FALSE` if expected) |
| 12 | Invariant violated (read the counterexample trace) |
| 13 | Temporal property violated |
| 75 | Semantic/parse error in spec |
| 143/144 | Killed by signal (SIGTERM/SIGKILL) |
---
## Worked Example: DES-5271
These findings came from verifying the assistant service resumable streaming protocol. Treat them as a case study, not universal rules:
1. **TLC found a trigger race at 533 states** that no human reviewer caught. Two `POST /chat/ask` requests interleaved between check and claim, creating duplicate PENDING generations. Fix: single atomic label for check+claim.
2. **TLC found a TOCTOU at 161M states.** GenTask checked `RUNNING`, then Stop fired, then GenTask wrote to the buffer of a stopped generation. Fix: merge check and write into same label.
3. **Unbounded sweep loops prevent model exhaustion.** `while TRUE` + slot reuse = infinite state space. Fix: `MaxReservations` budget + `ModelQuiescent` guard.
4. **`with i \in 1..Len(buf)` does NOT iterate.** It picks one arbitrary index. Use `while` loops for sequential iteration. This silently made S4 (no-missed-events) vacuous.
5. **Slot reuse invalidates invariants that reference slot state.** After reset, a finished reader's `readerGenId` points to a recycled slot with different data. Fix: snapshot expected events at reader finish time.
6. **Safety and liveness need different model sizes.** Safety with 3 events + 2 readers + 3 reservations = 470M+ states (tractable). Same config with liveness = infinite (temporal property checking builds full behavior graph). Split configs, reduce liveness model.
7. **Both bugs share the same root pattern: check-then-act with an interleaving point in between.** In asyncio, `await` is the interleaving point. In PlusCal, a label boundary is. The fix is always the same: keep check and mutation in one atomic block.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment