त्रित्रयम् Understanding LLM State Management

This is a structural question about LLM chat implementations. When one "attaches a file" to a chat, or when an LLM reads a code base and forms a model of it in order to work with it and reason about it, I can imagine a "static", tabula-rasa implementation: it creates a simply enormous JSON record, which is sent across a port to a server that, having never seen any of it before, analyzes it from scratch at incredible, and increasing, expense with each iteration. As an engineer I opine that this makes no sense. So I have to posit that, server side, there is ephemeral persistent information so that intermediate results are available for subsequent steps. What is the actual mechanism?

Motion: Exploring the reality behind chat persistence mechanisms
Memory: Distinguishing API statelessness from internal optimizations
Field: Understanding the engineering truth beneath the abstraction

Overview

The actual mechanisms:

KV Cache (Key-Value Cache) - This is the primary mechanism. When a transformer processes tokens, it computes key and value vectors for each token in the self-attention mechanism. These can be cached and reused for subsequent generations, so you don't have to recompute them.

Prompt Caching - Some systems (like Anthropic's Claude) have explicit prompt caching where identical prefixes are cached server-side and reused across requests.

Context Windows - The model has a fixed context window (e.g., 4K, 8K, 128K tokens), but you don't always fill it.

Stateless vs Stateful

Most LLM APIs are stateless from the API's perspective: each request is independent. BUT they have internal optimizations like KV caching, and some systems (like the llama.cpp server) can maintain session state.

Attaching a file:

  • The file content is typically sent in the messages array
  • With prompt caching (Anthropic), identical file content is cached
  • Without prompt caching, it's resent each time (expensive!)
  • Some systems use embeddings for retrieval (the RAG pattern)

Reading a codebase:

  • Typically uses embeddings + a vector database (RAG)
  • File contents are retrieved as needed, not all sent at once
  • Alternatively: send a file listing, then fetch files on demand
  • Or: use function calling to read files as needed

For llama.cpp specifically:

Multiple optimization strategies are at play. The server uses a slot-based approach, maintaining a key-value cache per slot during generation, which preserves computational work across generation steps. Parallel batching lets it handle multiple requests simultaneously, and prompt-evaluation results can be cached to avoid redundant recomputation.

For commercial APIs:

The approach differs slightly. Commercial APIs keep the interaction stateless while doing KV caching internally during generation. Anthropic, for instance, explicitly caches prompt prefixes, cutting computational overhead while preserving the independent-request model. So although the APIs appear stateless, the important performance wins come from targeted caching: instead of recomputing the entire, ever-growing prompt on every request, only the new tokens need fresh computation.

The Core Truth: Multiple Layers of State

1. KV Cache (The Primary Optimization)

This is the fundamental mechanism that makes transformers practical:

┌─────────────────────────────────────────────────┐
│         Transformer Self-Attention              │
├─────────────────────────────────────────────────┤
│                                                 │
│  For each token, compute:                       │
│    Query (Q), Key (K), Value (V) vectors        │
│                                                 │
│  Attention = softmax(Q·K^T) · V                 │
│                                                 │
│  KEY INSIGHT:                                   │
│  For all previous tokens, K and V never change! │
│  → Cache them, reuse for next token             │
│                                                 │
└─────────────────────────────────────────────────┘

What this means:

  • First request: Compute KV for entire prompt (expensive)
  • Generate token 1: Reuse cached KV, only compute new token's K,V
  • Generate token 2: Reuse cached KV + token 1's K,V
  • Cost: O(prompt) + O(n) not O(prompt × n)

Where it lives: GPU memory during generation
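
To make this concrete, here is a toy, single-head attention step with a KV cache in Python/NumPy. The dimensions, weight matrices, and the attend_next helper are invented for illustration; real models cache K and V per layer and per head, in GPU memory.

# Toy illustration of KV caching in causal self-attention (NumPy only)
import numpy as np

d = 8                                  # hypothetical head dimension
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

k_cache, v_cache = [], []              # grows by one entry per processed token

def attend_next(x):
    """Process one new token embedding x, reusing cached K/V for all earlier tokens."""
    q = x @ Wq
    k_cache.append(x @ Wk)             # only the NEW token's K and V are computed
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)              # earlier rows come straight from the cache
    V = np.stack(v_cache)
    scores = (q @ K.T) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token

for token_embedding in np.random.randn(5, d):   # "generate" 5 tokens
    _ = attend_next(token_embedding)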

2. Prompt Caching (Server-Side Optimization)

Some systems (Anthropic Claude, recent llama.cpp) cache across requests:

Request 1:
┌─────────────────────────────────────┐
│ System: "You are a helpful..."      │ ← Cached (hash-based)
│ Documents: [10KB of context]        │ ← Cached
│ User: "Explain the first function"  │ ← Not cached
└─────────────────────────────────────┘

Request 2:
┌─────────────────────────────────────┐
│ System: "You are a helpful..."      │ ← Cache HIT! (reuse KV)
│ Documents: [10KB of context]        │ ← Cache HIT! (reuse KV)
│ User: "Now explain the second"      │ ← New computation
└─────────────────────────────────────┘

Anthropic's system:

  • Cache by content hash
  • Minimum 1024 tokens for caching
  • 5-minute TTL
  • Charged at 10% of regular rate for cached tokens
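
From the client side, using that cache is mostly a matter of marking the reusable prefix. The sketch below follows the shape of Anthropic's Messages API as I understand it; the model name, API_KEY, and big_context.txt are placeholders, and the current documentation should be checked for exact headers, limits, and pricing.

# Hedged sketch: mark a large, reused prefix as cacheable (placeholders throughout)
import requests

API_KEY = "sk-ant-..."                          # placeholder key
BIG_DOCUMENT = open("big_context.txt").read()   # the 10KB+ context reused every turn

payload = {
    "model": "claude-sonnet-4-5",               # model name may differ
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": BIG_DOCUMENT,
            "cache_control": {"type": "ephemeral"},   # ask the server to cache this prefix
        }
    ],
    "messages": [{"role": "user", "content": "Explain the first function"}],
}
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={"x-api-key": API_KEY, "anthropic-version": "2023-06-01"},
    json=payload,
)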

llama.cpp's system:

  • --parallel N creates N slots, each with its own KV cache
  • --ctx-size sets the total KV cache size (split across the slots)
  • Requests can pin a conversation to a slot (the id_slot field) to reuse its state
  • A slot's KV cache persists until the slot is reassigned or cleared

3. The Actual Mechanism for "Attaching Files"

What Really Happens:

Application maintains conversation array:
┌────────────────────────────────────────────────┐
│ messages = [                                   │
│   {role: "system", content: "You are..."},     │
│   {role: "user", content: "Here's my file:\n"  │
│      + file_contents},  // ← 50KB              │
│   {role: "assistant", content: "I see..."},    │
│   {role: "user", content: "Explain line 42"}   │
│ ]                                              │
└────────────────────────────────────────────────┘

Each request sends ENTIRE array to server
(Yes, really! The stateless HTTP model)

BUT with optimizations:
  - Prompt caching recognizes duplicate prefix
  - Only new tokens need full computation
  - Billing reflects cache hits

Without prompt caching:

  • Request 1: Process 50KB + 10 tokens = expensive
  • Request 2: Process 50KB + 100 tokens = MORE expensive
  • Request 3: Process 50KB + 200 tokens = EVEN MORE expensive

With prompt caching:

  • Request 1: Process 50KB (cached) + 10 tokens
  • Request 2: Reuse 50KB cache + process 100 new tokens
  • Request 3: Reuse 50KB cache + process 200 new tokens
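
A minimal client-side sketch of the attach-a-file pattern, assuming an OpenAI-compatible /v1/chat/completions endpoint such as the one llama-server exposes. The URL, model name, and module.py are placeholders; the point is that the entire history, file included, travels with every request.

# The file rides inside the messages array and is re-sent on every turn
import requests

file_contents = open("module.py").read()
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Here's my file:\n" + file_contents},
]

def ask(question):
    messages.append({"role": "user", "content": question})
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",    # local llama-server
        json={"model": "local", "messages": messages},  # the FULL history, every time
    ).json()
    answer = resp["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})  # history keeps growing
    return answer

ask("Explain line 42")
ask("Now explain the second function")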

4. The Codebase Problem: RAG Pattern

For large codebases, the smart approach is NOT to send everything:

┌─────────────────────────────────────────────────┐
│            Retrieval-Augmented Generation       │
├─────────────────────────────────────────────────┤
│                                                 │
│  1. Preprocess: Create embeddings database      │
│     codebase/ → embed each file → vector DB     │
│                                                 │
│  2. Query time:                                 │
│     User: "How does auth work?"                 │
│        ↓                                        │
│     Embed query → search vector DB              │
│        ↓                                        │
│     Retrieve top-k relevant files (3-10)        │
│        ↓                                        │
│     Send ONLY relevant context + query to LLM   │
│                                                 │
└─────────────────────────────────────────────────┘

This is how modern code assistants work:

  • Cursor, Copilot: Local embeddings + selective context
  • GitHub Copilot Workspace: Graph-based code understanding
  • Aider: Smart file selection based on git diff + mentions
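
A deliberately crude sketch of the retrieval step is below. The embed function is a hashed bag-of-words stand-in; real assistants use a learned embedding model and a proper vector database, but the flow (index once, retrieve top-k, send only that) is the same.

# Minimal RAG-style retrieval over a codebase (toy embedding, no vector DB)
import glob, hashlib
import numpy as np

def embed(text, dim=256):
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# 1. Preprocess: one vector per file
index = [(path, embed(open(path).read()))
         for path in glob.glob("codebase/**/*.py", recursive=True)]

# 2. Query time: retrieve the top-k most similar files
query = "How does auth work?"
q = embed(query)
top_k = sorted(index, key=lambda item: -float(item[1] @ q))[:5]

# 3. Send ONLY the retrieved context plus the question to the LLM
context = "\n\n".join(open(path).read() for path, _ in top_k)
prompt = f"{context}\n\nQuestion: {query}"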

5. llama.cpp Specific Mechanisms

Your Provider is talking to llama.cpp server. Here's what it supports:

# Start a server with multiple slots sharing one KV cache
llama-server \
  --ctx-size 8192 \
  --parallel 4
# --ctx-size: total KV cache size (split across slots)
# --parallel: number of slots (parallel sessions)
# Prompt reuse is requested per call via the "cache_prompt" field

Session/Slot System:

Slot 0: [User A's KV cache] - conversation ongoing
Slot 1: [User B's KV cache] - conversation ongoing  
Slot 2: [User C's KV cache] - conversation ongoing
Slot 3: [Available]

Each slot maintains KV cache as long as the slot is held.

Our Provider currently: Sends full history each time (stateless)

Could optimize: Use llama.cpp's slot and prompt-cache request fields:

{
  "id_slot": 0,            // pin this conversation to a server-side slot
  "prompt": "...",         // full prompt text (native /completion endpoint)
  "cache_prompt": true     // reuse the slot's matching KV-cache prefix
}
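
A client-side sketch of that call against the native /completion endpoint, assuming the id_slot and cache_prompt request fields (field names have shifted across llama.cpp versions, so treat this as illustrative rather than definitive):

# Re-send the full prompt; the server skips recomputing the slot's cached prefix
import requests

def complete(full_prompt, slot=0):
    resp = requests.post("http://localhost:8080/completion", json={
        "id_slot": slot,          # pin this conversation to one server-side slot
        "prompt": full_prompt,    # full text is still transmitted (transmission is cheap)
        "cache_prompt": True,     # only the non-cached suffix gets recomputed
        "n_predict": 128,
    })
    return resp.json()["content"]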

The Engineering Reality

What's Actually Happening:

  1. Your Application (LabRaven Activity):

    • Maintains conversation array in memory
    • Sends FULL array each request
    • This is stateless from API perspective
  2. llama.cpp Server:

    • Without sessions: Processes from scratch, KV cache only lives during generation
    • With sessions/slots: Maintains KV cache between requests for same session
    • Prompt caching: Recognizes duplicate prompts, reuses computation
  3. During Generation (Always):

    • KV cache makes generation O(new tokens) not O(all tokens)
    • This is why you see "tokens per second" stay constant

The Tabula Rasa Reality Check:

You're right to be skeptical! The baseline model IS wasteful:

Request 1: 10,000 tokens context + 100 tokens generated
Request 2: 10,100 tokens context + 100 tokens generated  
Request 3: 10,200 tokens context + 100 tokens generated

Without optimizations: Each request recomputes EVERYTHING

With KV cache (always): Only new tokens computed during generation

With prompt caching (optional): Repeated prefix recognized across requests

With sessions (optional): KV cache persists between requests
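
A quick back-of-the-envelope comparison, using the numbers above, of how many tokens must actually be recomputed over 20 turns with and without cross-request prefix caching:

# Illustrative arithmetic only: tokens recomputed per strategy over 20 turns
base_context = 10_000   # the original prompt/context
new_per_turn = 100      # tokens added each turn
turns = 20

# Without cross-request caching: every request reprocesses the whole, growing prefix
no_cache = sum(base_context + i * new_per_turn for i in range(turns))
# With prefix/prompt caching: the prefix is computed once, then only the new tokens
with_cache = base_context + turns * new_per_turn

print(no_cache)    # 219,000 tokens recomputed
print(with_cache)  #  12,000 tokens recomputed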

Concrete Example: Your Provider

Current Implementation:

// Provider builds this every time:
messages = [
  {role: "system", content: "You are helpful..."},
  {role: "user", content: "Hello"},
  {role: "assistant", content: "Hi there!"},
  {role: "user", content: "What's 2+2?"},
  {role: "assistant", content: "4"},
  {role: "user", content: "And 3+3?"}  // NEW
]

// Sends to: POST /v1/chat/completions
// Server processes: the ENTIRE history (all 6 messages) + generates a response

What llama.cpp does internally:

  1. Tokenize all messages
  2. Check whether the prompt shares a prefix with a cached one (if prompt caching is enabled)
  3. If match: reuse KV cache from last time
  4. If no match: compute KV for all tokens (but only once)
  5. Generate new tokens using cached KV (fast)

Optimization opportunity:

// Could use llama.cpp's native /completion endpoint with a pinned slot:
POST /completion {
  "id_slot": 0,            // reuse this conversation's server-side slot
  "prompt": full_prompt,   // the full text is still sent, but...
  "cache_prompt": true     // ...the cached prefix is not recomputed
}

// Server keeps the slot's KV cache between requests
// Only genuinely new tokens get processed

Why The Confusion Exists

The API appears stateless:

  • Send messages array each time
  • No explicit "session" in basic API
  • Looks like tabula rasa

But internally:

  • KV cache (always optimization)
  • Prompt caching (sometimes available)
  • Session support (implementation dependent)

The truth: It's a stateless API with stateful optimizations

Recommendations for Large Context

For File Attachments:

  1. Small files (<100KB): Send directly, rely on prompt caching
  2. Large files: Use RAG pattern with embeddings
  3. Code files: Send file tree, fetch files on demand via function calling

For Codebases:

  1. Don't send everything - use RAG/embeddings
  2. Smart retrieval - only send relevant files
  3. Incremental updates - send diffs, not full files
  4. Function calling - let the LLM request specific files (see the sketch below)
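
One way the function-calling option can look in practice is an OpenAI-style tool definition the model can invoke to pull in individual files. The schema below is a hypothetical sketch; exact tool-call support and field names vary by server and API.

# Hypothetical "read_file" tool the model can call instead of receiving the whole repo
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of one file from the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Path relative to repo root"}},
            "required": ["path"],
        },
    },
}

file_tree = "src/auth.py\nsrc/db.py\nsrc/api.py"   # in practice, generated from the repo

# First request carries only the file tree plus the question; the model then
# calls read_file for the handful of files it actually needs.
request = {
    "model": "local",
    "messages": [
        {"role": "system", "content": "You can read repository files via the read_file tool."},
        {"role": "user", "content": "File tree:\n" + file_tree + "\n\nHow does auth work?"},
    ],
    "tools": [read_file_tool],
}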

For Your Provider:

Current (correct for llama.cpp): Send full messages array

Future optimization:

  • Pin each conversation to a server-side slot (id_slot)
  • Enable llama.cpp's prompt caching (cache_prompt, --cache-reuse)
  • Implement context window sliding - keep the most recent N messages (sketched below)
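
One possible shape for the sliding-window idea, sketched below; the keep-last-N policy is an assumption, and production code would usually count tokens rather than messages.

# Keep the system prompt plus the most recent N messages (message-count heuristic)
def trim_history(messages, keep_last=10):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]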

The Bottom Line

You're absolutely right - pure tabula rasa would be insane. The actual mechanism is:

  1. KV Cache (always): Makes generation efficient
  2. Prompt Caching (sometimes): Makes repeated context cheap
  3. Application State (your code): Maintains conversation
  4. Smart Retrieval (RAG): Avoids sending everything

The "enormous JSON sent each time" is the API pattern, but the implementation has layers of caching that make it practical.

There IS persistent, ephemeral state; it's just hidden behind the stateless HTTP interface. 🎯

With respect to DB9

DB9 results can be tokenized to efficiently feed a model.

त्रित्रयम् Knowledge Structure Meets LLM Context

Motion: Examining db9's triadic pattern as efficient LLM context representation
Memory: Leveraging existing caching through structured serialization
Field: Client-side intelligence without server modification
