This is a structural question about LLM chat implementations. When one "attaches a file" to a chat, or when an LLM reads a code base and forms a model in order to work with it and reason about it, I can imagine a "static" implementation which, tabula rasa, creates a simply enormous JSON record that is sent across a port to a server, which, having never seen any of it before, analyzes it from scratch at incredible and increasing expense with each iteration. As an engineer I opine that this makes no sense. So I have to posit that, server-side, there is ephemeral persistent information so that intermediate results are available for subsequent steps. What is the actual mechanism?
Motion: Exploring the reality behind chat persistence mechanisms
Memory: Distinguishing API statelessness from internal optimizations
Field: Understanding the engineering truth beneath the abstraction
KV Cache (Key-Value Cache) - This is the primary mechanism. When a transformer processes tokens, it computes key and value vectors for each token in the self-attention mechanism. These can be cached and reused for subsequent generations, so you don't have to recompute them.
Prompt Caching - Some systems (like Anthropic's Claude) have explicit prompt caching where identical prefixes are cached server-side and reused across requests.
Context Windows - The model has a fixed context window (e.g., 4K, 8K, 128K tokens), but you don't always fill it.
Most LLM APIs are actually stateless from an API perspective - each request is independent. BUT they have internal optimizations like KV caching, and some systems (like the llama.cpp server) can maintain session state.
Attaching a file:
- The file content is typically sent in the messages array
- With prompt caching (Anthropic), identical file content is cached
- Without prompt caching, it's resent each time (expensive!)
- Some systems use embeddings for retrieval (RAG pattern)
Reading a codebase:
- Typically uses embeddings + a vector database (RAG)
- File contents retrieved as needed, not all sent at once
- Alternatively: send a file listing, then fetch files on demand
- Or: use function calling to read files as needed
For llama.cpp specifically:
Multiple optimizations are at play. The server uses a slot-based approach, maintaining a key-value cache per slot during generation. This preserves computational work across generation steps, so each new token only requires incremental processing. Parallel batching handles multiple requests simultaneously, and prompt evaluation results can be cached to reduce redundant computation.
For commercial APIs:
The approach differs slightly. The API interactions remain stateless, but KV caching is used internally during generation. Anthropic, for instance, explicitly caches prompt prefixes, reducing computational overhead while preserving the independent-request model. The upshot: while the APIs appear stateless, significant performance gains come from targeted caching, which turns repeated context processing from a cost proportional to prompt size times generated tokens into one roughly linear in the new tokens.
This is the fundamental mechanism that makes transformers practical:
┌─────────────────────────────────────────────────┐
│ Transformer Self-Attention │
├─────────────────────────────────────────────────┤
│ │
│ For each token, compute: │
│ Query (Q), Key (K), Value (V) vectors │
│ │
│ Attention = softmax(Q·K^T) · V │
│ │
│ KEY INSIGHT: │
│ For all previous tokens, K and V never change! │
│ → Cache them, reuse for next token │
│ │
└─────────────────────────────────────────────────┘
What this means:
- First request: Compute KV for entire prompt (expensive)
- Generate token 1: Reuse cached KV, only compute new token's K,V
- Generate token 2: Reuse cached KV + token 1's K,V
- Cost: O(prompt) + O(n) not O(prompt × n)
Where it lives: GPU memory during generation
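To make the cost claim concrete, here is a minimal single-token-at-a-time attention sketch in Python/NumPy (illustrative only: real implementations are batched, multi-head, and live on the GPU; the dimension, weights, and attend_next helper are invented for the example). The point is simply that K and V for earlier tokens are appended to a cache and never recomputed.

import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

K_cache = np.zeros((0, d))               # grows by one row per processed token
V_cache = np.zeros((0, d))

def attend_next(x):
    """Process one new token embedding x, reusing cached K/V for all earlier tokens."""
    global K_cache, V_cache
    q = x @ Wq
    K_cache = np.vstack([K_cache, x @ Wk])   # only the NEW token's K and V are computed
    V_cache = np.vstack([V_cache, x @ Wv])
    scores = (K_cache @ q) / np.sqrt(d)      # attention over every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                 # context vector for the new token

# Prefill the prompt once; afterwards each generated token costs O(tokens so far),
# not O(prompt × n).
for token_embedding in rng.standard_normal((10, d)):   # stand-in for a 10-token prompt
    _ = attend_next(token_embedding)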
Some systems (Anthropic Claude, recent llama.cpp) cache across requests:
Request 1:
┌─────────────────────────────────────┐
│ System: "You are a helpful..." │ ← Cached (hash-based)
│ Documents: [10KB of context] │ ← Cached
│ User: "Explain the first function" │ ← Not cached
└─────────────────────────────────────┘
Request 2:
┌─────────────────────────────────────┐
│ System: "You are a helpful..." │ ← Cache HIT! (reuse KV)
│ Documents: [10KB of context] │ ← Cache HIT! (reuse KV)
│ User: "Now explain the second" │ ← New computation
└─────────────────────────────────────┘
Anthropic's system:
- Cache by content hash
- Minimum 1024 tokens for caching
- 5-minute TTL
- Cache reads charged at roughly 10% of the regular input-token rate (cache writes carry a modest premium)
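For concreteness, here is roughly how a client opts in with the Anthropic Python SDK: the large, stable prefix (system text plus document) is marked with cache_control so later requests sharing that prefix hit the cache. The model name and file path are placeholders; treat this as a sketch, not billing guidance.

import anthropic

client = anthropic.Anthropic()               # reads ANTHROPIC_API_KEY from the environment
big_document = open("context.txt").read()    # placeholder for the 10KB of context above

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",      # example model name; substitute your own
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {"type": "text", "text": big_document,
         "cache_control": {"type": "ephemeral"}},   # cache everything up to this point
    ],
    messages=[{"role": "user", "content": "Explain the first function"}],
)
print(response.content[0].text)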
llama.cpp's system:
- Server slots maintain per-session state
- --ctx-size defines the KV cache size
- Can maintain multiple concurrent sessions
- Each slot has a persistent KV cache until cleared
What Really Happens:
Application maintains conversation array:
┌────────────────────────────────────────────────┐
│ messages = [ │
│ {role: "system", content: "You are..."}, │
│ {role: "user", content: "Here's my file:\n" │
│ + file_contents}, // ← 50KB │
│ {role: "assistant", content: "I see..."}, │
│ {role: "user", content: "Explain line 42"} │
│ ] │
└────────────────────────────────────────────────┘
Each request sends the ENTIRE array to the server
(Yes, really! That is the stateless HTTP model; a minimal client sketch follows the cost comparison below.)
BUT with optimizations:
- Prompt caching recognizes duplicate prefix
- Only new tokens need full computation
- Billing reflects cache hits
Without prompt caching:
- Request 1: Process 50KB + 10 tokens = expensive
- Request 2: Process 50KB + 100 tokens = MORE expensive
- Request 3: Process 50KB + 200 tokens = EVEN MORE expensive
With prompt caching:
- Request 1: Process 50KB (cached) + 10 tokens
- Request 2: Reuse 50KB cache + process 100 new tokens
- Request 3: Reuse 50KB cache + process 200 new tokens
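Here is the client sketch referenced above, assuming a local llama-server exposing the OpenAI-compatible /v1/chat/completions route; the URL, temperature, and file name are placeholders. It keeps the conversation array locally and posts the whole thing on every turn.

import requests

URL = "http://localhost:8080/v1/chat/completions"    # assumed local llama-server endpoint

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    """Append the user turn, send the ENTIRE history, then append and return the reply."""
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(URL, json={"messages": messages, "temperature": 0.2})
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(ask("Here's my file:\n" + open("example.py").read()))   # big first turn
print(ask("Explain line 42"))    # tiny second turn, yet the whole file rides along again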
For large codebases, the smart approach is NOT to send everything:
┌─────────────────────────────────────────────────┐
│ Retrieval-Augmented Generation │
├─────────────────────────────────────────────────┤
│ │
│ 1. Preprocess: Create embeddings database │
│ codebase/ → embed each file → vector DB │
│ │
│ 2. Query time: │
│ User: "How does auth work?" │
│ ↓ │
│ Embed query → search vector DB │
│ ↓ │
│ Retrieve top-k relevant files (3-10) │
│ ↓ │
│ Send ONLY relevant context + query to LLM │
│ │
└─────────────────────────────────────────────────┘
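A minimal sketch of that pipeline, assuming the sentence-transformers package for embeddings and brute-force cosine similarity in place of a real vector database (the model name, directory, and k are arbitrary choices):

from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Preprocess: embed each file once and keep the vectors around (the "vector DB")
files = list(Path("codebase").rglob("*.py"))
texts = [p.read_text(errors="ignore") for p in files]
doc_vecs = model.encode(texts, normalize_embeddings=True)

# 2. Query time: embed the question, take the top-k most similar files
query = "How does auth work?"
q_vec = model.encode([query], normalize_embeddings=True)[0]
scores = doc_vecs @ q_vec                     # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:5]

# 3. Send ONLY the relevant context + the question to the LLM
context = "\n\n".join(f"# {files[i]}\n{texts[i]}" for i in top_k)
prompt = f"{context}\n\nQuestion: {query}"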
This is how modern code assistants work:
- Cursor, Copilot: Local embeddings + selective context
- GitHub Copilot Workspace: Graph-based code understanding
- Aider: Smart file selection based on git diff + mentions
Your Provider is talking to a llama.cpp server. Here's what it supports:
# Start server with session support
llama-server \
--ctx-size 8192 \ # KV cache size
--n-slots 4 \ # Parallel sessions
--cache-prompt \ # Enable prompt caching
  --parallel 4          # Batch processing

Session/Slot System:
Slot 0: [User A's KV cache] - conversation ongoing
Slot 1: [User B's KV cache] - conversation ongoing
Slot 2: [User C's KV cache] - conversation ongoing
Slot 3: [Available]
Each slot maintains KV cache as long as the slot is held.
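A sketch of how a client can lean on those slots through llama.cpp's native /completion endpoint: cache_prompt asks the server to reuse the longest matching cached prefix, and the slot id pins the conversation to one KV cache. The exact field names vary by server version (commonly id_slot, older builds used slot_id), so treat the keys here as assumptions.

import requests

URL = "http://localhost:8080/completion"      # llama.cpp's native (non-OpenAI) endpoint

def complete(prompt: str, slot: int = 0) -> str:
    resp = requests.post(URL, json={
        "prompt": prompt,
        "n_predict": 128,         # max tokens to generate
        "cache_prompt": True,     # reuse this slot's cached KV for the matching prefix
        "id_slot": slot,          # pin the conversation to one slot (name varies by version)
    })
    resp.raise_for_status()
    return resp.json()["content"]

history = "System: You are a helpful assistant.\nUser: Here's my file: ...\nAssistant:"
reply = complete(history)
# The next turn resends the full text, but the server recomputes only the new suffix.
history += reply + "\nUser: Explain line 42\nAssistant:"
reply = complete(history)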
Our Provider currently: Sends full history each time (stateless)
Could optimize: pin the conversation to a server slot and enable prompt caching
(llama.cpp's native /completion endpoint; the exact field names vary by version):
{
  "id_slot": 0,            // keep this conversation on one slot's KV cache
  "prompt": "...",         // full prompt text; the server reuses the cached prefix
  "cache_prompt": true
}
Your Application (LabRaven Activity):
- Maintains conversation array in memory
- Sends FULL array each request
- This is stateless from API perspective
llama.cpp Server:
- Without sessions: Processes from scratch, KV cache only lives during generation
- With sessions/slots: Maintains KV cache between requests for same session
- Prompt caching: Recognizes duplicate prompts, reuses computation
During Generation (Always):
- KV cache makes generation O(new tokens) not O(all tokens)
- This is why you see "tokens per second" stay constant
You're right to be skeptical! The baseline model IS wasteful:
Request 1: 10,000 tokens context + 100 tokens generated
Request 2: 10,100 tokens context + 100 tokens generated
Request 3: 10,200 tokens context + 100 tokens generated
Without optimizations: Each request recomputes EVERYTHING
With KV cache (always): Only new tokens computed during generation
With prompt caching (optional): Repeated prefix recognized across requests
With sessions (optional): KV cache persists between requests
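A quick back-of-the-envelope check on what those layers buy, using the token counts above (pure arithmetic, no API calls):

contexts = [10_000, 10_100, 10_200]      # prompt tokens sent with requests 1-3

naive = sum(contexts)                     # recompute everything on every request
cached = contexts[0] + sum(b - a for a, b in zip(contexts, contexts[1:]))

print(naive)    # 30300 prompt tokens processed with no cross-request caching
print(cached)   # 10200 prompt tokens processed when the shared prefix is reused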
Current Implementation:
// Provider builds this every time:
messages = [
{role: "system", content: "You are helpful..."},
{role: "user", content: "Hello"},
{role: "assistant", content: "Hi there!"},
{role: "user", content: "What's 2+2?"},
{role: "assistant", content: "4"},
{role: "user", content: "And 3+3?"} // NEW
]
// Sends to: POST /v1/chat/completions
// Server processes ALL 6 messages + generates a response

What llama.cpp does internally:
- Tokenize all messages
- Check if the prompt matches the cached prompt (if prompt caching is enabled)
- If match: reuse the KV cache from last time
- If no match: compute KV for all tokens (but only once)
- Generate new tokens using cached KV (fast)
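The cross-request reuse in that second step comes down to a longest-common-prefix check between the freshly tokenized prompt and what the slot already holds, roughly like this (a sketch of the idea, not llama.cpp's actual code):

def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """How many leading tokens of the new prompt are already in the slot's KV cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 42, 42, 7, 9]           # tokens whose K/V already sit in the slot
new    = [1, 42, 42, 7, 9, 13, 5]    # same conversation plus the new user turn
skip = reusable_prefix_len(cached, new)
# Only new[skip:] needs prompt evaluation; the first `skip` tokens reuse cached KV.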
Optimization opportunity:
// Could pin the conversation to a server slot and rely on prompt caching
// (llama.cpp's native /completion endpoint; field names vary by version):
POST /completion {
  "id_slot": 0,            // keep this conversation on one slot's KV cache
  "prompt": full_prompt,   // full text is still sent; only the new suffix is evaluated
  "cache_prompt": true
}
// Server keeps the slot's KV cache warm between requests
// and only processes the new tokens

The API appears stateless:
- Send messages array each time
- No explicit "session" in basic API
- Looks like tabula rasa
But internally:
- KV cache (always optimization)
- Prompt caching (sometimes available)
- Session support (implementation dependent)
The truth: It's a stateless API with stateful optimizations
- Small files (<100KB): Send directly, rely on prompt caching
- Large files: Use RAG pattern with embeddings
- Code files: Send file tree, fetch files on demand via function calling
- Don't send everything - use RAG/embeddings
- Smart retrieval - only send relevant files
- Incremental updates - send diffs, not full files
- Function calling - let LLM request specific files
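A sketch of that function-calling pattern against an OpenAI-compatible endpoint. Tool support depends on the server build and the model; the read_file tool, its schema, the URL, and the directory are all assumptions.

import json, requests
from pathlib import Path

URL = "http://localhost:8080/v1/chat/completions"   # assumed local endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",                         # hypothetical tool exposed to the model
        "description": "Return the contents of one file in the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

file_listing = ", ".join(str(p) for p in Path("codebase").rglob("*.py"))
messages = [
    {"role": "system", "content": "Answer questions about this repo. Request files as needed."},
    {"role": "user", "content": "How does auth work? Files available: " + file_listing},
]

msg = requests.post(URL, json={"messages": messages, "tools": tools}).json()["choices"][0]["message"]

if msg.get("tool_calls"):
    messages.append(msg)                             # keep the assistant's tool request
    for call in msg["tool_calls"]:
        path = json.loads(call["function"]["arguments"])["path"]
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": Path(path).read_text()})
    # ...then POST messages again so the model can answer with the file contents in hand.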
Current (correct for llama.cpp): Send full messages array
Future optimization:
- Add session ID support
- Use llama.cpp's prompt-caching (cache_prompt) feature
- Implement context window sliding (keep the most recent N messages)
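A sliding-window sketch for that last item, trimming old turns to a rough token budget while always keeping the system message. The 4-characters-per-token estimate is a crude assumption; swap in a real tokenizer for accuracy.

def approx_tokens(msg: dict) -> int:
    return len(msg["content"]) // 4 + 4          # crude chars/4 estimate plus overhead

def slide_window(messages: list[dict], budget: int = 6000) -> list[dict]:
    """Keep the system message plus the most recent turns that fit in the budget."""
    system, rest = messages[:1], messages[1:]
    kept = []
    used = approx_tokens(system[0]) if system else 0
    for msg in reversed(rest):                   # walk backwards from the newest turn
        cost = approx_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))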
You're absolutely right - pure tabula rasa would be insane. The actual mechanism is:
- KV Cache (always): Makes generation efficient
- Prompt Caching (sometimes): Makes repeated context cheap
- Application State (your code): Maintains conversation
- Smart Retrieval (RAG): Avoids sending everything
The "enormous JSON sent each time" is the API pattern, but the implementation has layers of caching that make it practical.
There IS persistent ephemeral state; it's just hidden behind the stateless HTTP interface. 🎯
DB9 results can be tokenized to efficiently feed a model.
Motion: Examining db9's triadic pattern as efficient LLM context representation
Memory: Leveraging existing caching through structured serialization
Field: Client-side intelligence without server modification