AI language models are slow on small computers — not because of the model weights, but because of attention: the mechanism that lets every word look at every other word in the text. When you double the text length, attention gets four times harder, not twice.
ruvllm_sparse_attention fixes this by teaching the model to be selective. Instead of every word looking at every other word, it looks at:
- The words closest to it (recent context)
- A few anchor words at the start (global signals)
- Words at increasing distances, like rungs on a ladder (long-range coverage)
- Quick summaries of faraway word groups (block landmarks)
This is enough to get the same quality answers at a fraction of the cost.
| Problem | Before | After |
|---|---|---|
| Mistral-7B at seq=4096 on Pi 5 | ~401 seconds/pass (infeasible) | ~27 seconds/pass (usable) |
| Mistral-7B KV cache memory (FP32) | 8.6 GB (doesn't fit Hailo-10H) | 2.1 GB (fits with 5.9 GB to spare) |
| Mistral-7B KV cache memory (FP16) | — | 1.1 GB (52% smaller, feature = "fp16") |
| Cost at 8,192 tokens | 33.5 million comparisons | 1.1 million comparisons (29×) |
| Cost at 32,768 tokens | 537 million comparisons | 4.7 million comparisons (113×) |
| Runtime dependencies | varies | zero (no rand, no ndarray) |
| Works with Mistral-7B GQA heads | no | yes (forward_auto dispatches automatically) |
| Sustained generation (decode) | O(T²) cumulative | O(T log T) cumulative via decode_step |
Generation past max_seq |
hard stop | H2O heavy-hitter eviction (evict_and_append) |
| Peak prefill memory | full T×T score matrix | IO-optimal tiling (forward_flash) |
| Multi-core prefill | single-threaded | rayon per-head parallel (feature = "parallel", ~4×) |
In one sentence: models that couldn't run long-context inference on edge hardware now can, with the same output quality, at a fraction of the memory and compute cost.
- Edge AI teams deploying 7B-parameter models on Raspberry Pi 5 + Hailo-10H or similar constrained hardware
- Embedded inference targets where zero runtime dependencies matter (WASM, no-std adjacent)
- Researchers studying efficient attention who want a clean, documented Rust reference implementation
- Anyone running Mistral-7B, Llama-3, or TinyLlama who needs sequences longer than ~2K tokens on memory-constrained devices
| Model | Attention type | Q heads | KV heads | Memory at seq=4096 (FP32) | Memory (FP16) |
|---|---|---|---|---|---|
| Phi-2 | MHA | 32 | 32 | 1.0 GB | 512 MB |
| Mistral-7B | GQA | 32 | 8 | 256 MB | 128 MB |
| Llama-3-8B | GQA | 32 | 8 | 256 MB | 128 MB |
| TinyLlama-1.1B | MQA | 32 | 4 | 128 MB | 64 MB |
A technical report on the design, implementation, and validation of O(N log N) sparse attention on the Hailo-10H Pi 5 cluster — including three SOTA extensions: IO-optimal flash-sparse tiling, FP16 KV cache, and NEON/AVX SIMD auto-vectorization.
Modern large language models use attention mechanisms that scale quadratically with sequence length — doubling the context quadruples the computation. On a server GPU with 80 GB of HBM2e memory, this is manageable. On a Raspberry Pi 5 with 8 GB of LPDDR4X and a Hailo-10H NPU attached via PCIe, it is not.
This report documents the design and validation of ruvllm_sparse_attention, a pure-Rust attention kernel that reduces the per-forward-pass cost from O(N²) to O(N log N) for typical sequence lengths, with zero runtime dependencies, full GQA/MQA support, IO-optimal tiling, FP16 KV cache, SIMD-friendly dot products, H2O eviction for unbounded generation, and a KV cache incremental-decode path that reduces per-step cost from O(T log T) to O(log T). The kernel was validated on a 4-node Raspberry Pi 5 cluster (the "cognitum" cluster) running the Hailo-10H AI HAT+, connected via Tailscale.
Standard scaled dot-product attention for a sequence of T tokens produces a T×T score matrix. For T = 8,192 tokens, that is 67 million floating-point pairs. Each pair requires a dot product over the head dimension (dim=128 for most 7B models), bringing the total FLOPs to 8.6 billion per head per forward pass. With 32 attention heads, that is 275 billion FLOPs — for a single forward pass.
On a Hailo-10H (nominally 26 TOPS INT8), running FP32 attention naively requires moving the entire T×T matrix through memory. At seq=8,192 that is:
8192 × 8192 × 4 bytes × 2 (key + value) = 537 MB per layer
A 32-layer model like Mistral-7B would need 17 GB of KV working memory per forward pass — far beyond the 8 GB available.
The practical implication is binary: without sparse attention, 7B-parameter models cannot run long-context inference on Pi 5 + Hailo-10H hardware.
ruvllm_sparse_attention approximates full attention by selecting only the tokens most likely to contribute significant attention weight to each query position. For a query at position i, the kernel selects candidate key positions from four families:
Every query attends to the window most recent tokens (default: 128). This captures local syntactic dependencies — the tokens that are almost always important regardless of position.
candidates = {max(0, i-window)..=i} (causal mode)
= {max(0, i-window)..=min(T-1, i+window)} (non-causal)
A fixed set of "anchor" positions (default: just token 0, the BOS token) are attended to from every position. These capture global context signals — in practice, system prompt tokens and special markers.
Tokens at exponentially increasing distances provide long-range context coverage at O(log T) cost. From position i, we also attend to i - 2^k for k = 1, 2, 3, ... up to the window boundary. This is analogous to a skip-list: cheap to traverse, good coverage.
In non-causal mode, symmetric forward strides (i + 2^k) are added.
The sequence is divided into blocks of size block_size (default: 64). For each block, a representative "landmark" token (the block's running mean key/value) stands in for all tokens in that block. Landmark tokens from blocks outside the local window are included as candidates.
ADR-185 fix (non-causal mode): In the original design, the current block's landmark was included as a candidate even in non-causal mode, causing the query to attend to itself through its own block mean — a form of double-counting. The fix skips the current block in landmark candidate selection when causal = false.
Incremental landmark update: IncrementalLandmarks maintains block means via Welford online update (O(H×D) per token append) instead of rebuilding from scratch (O(T×H×D)). This is the difference between a 128-byte update and a 512 KB rebuild at seq=4096.
ruvllm_sparse_attention implements the Milakov & Gimelshein one-pass online softmax (NeurIPS 2018). A running maximum is maintained alongside the accumulator; when a new score exceeds the running max, the accumulator and denominator are rescaled before the new term is added.
let mut running_max = f32::NEG_INFINITY;
let mut denom = 0.0f32;
let mut acc = vec![0.0f32; dim];
for &j in &candidates {
let score = dot(q_row, k.row(j, h)) * scale;
if score > running_max {
let corr = (running_max - score).exp();
for d in 0..dim { acc[d] *= corr; }
denom *= corr;
running_max = score;
}
let w = (score - running_max).exp();
denom += w;
let v_row = v.row(j, h);
for d in 0..dim { acc[d] += w * v_row[d]; }
}
for d in 0..dim { out_row[d] = acc[d] / denom; }This eliminates the second pass entirely, halving memory traffic for the accumulation step and reducing total FLOPs by approximately 2× for the softmax computation.
forward_flash and forward_gqa_flash implement a 3-phase FlashAttention-2-style tiling over the sparse mask.
Standard sparse attention materializes a per-query candidate list, then accumulates softmax in one shot. For long sequences the score buffer fits entirely in the CPU's L1/L2 cache, but the KV rows scattered across the sequence do not. On a Pi 5 Cortex-A76, the 4 MB shared L3 holds roughly 8K FP32 rows of dim=128 — at seq=4096, full-precision KV fits; at seq=8192+ it doesn't.
Tiling processes KV in ascending order so that each tile passes through cache once. Queries that share the same KV tile are computed together, reusing the loaded KV rows across query positions.
Phase 1 — ascending KV tiles:
for each tile [kv_start, kv_end):
for each query i where window intersects [kv_start, kv_end):
for each candidate j in (window ∩ [kv_start, kv_end)):
accumulate into run_max[i], denom[i], out[i] via online softmax
Phase 2 — scattered sparse edges (globals, log-stride, landmarks) outside the window:
pre-mark window positions in seen_tokens[i]
for j in globals ∪ log_stride ∪ landmarks:
if j not in seen_tokens[i]:
accumulate into run_max[i], denom[i], out[i]
Phase 3 — normalize:
for each query i: out[i] /= denom[i]
The same sparse mask as forward() is applied, so forward_flash and forward produce identical outputs (verified to <1e-5 error in 4 new tests). The tiling benefit is reduced peak working memory proportional to tile size rather than full sequence length.
KvCacheF16 stores keys and values in half::f16 (IEEE 754 binary16). FP32 inputs are converted on append; conversion back to FP32 happens inline during dot products.
use ruvllm_sparse_attention::KvCacheF16;
let mut cache = KvCacheF16::new(4096, 8, 128, 64);
cache.try_append(&k_f32, &v_f32).unwrap(); // stored as f16
let out = attn.decode_step_f16(&q, &cache).unwrap();Memory comparison at seq=8192, 8 KV heads, dim=128:
FP32 KvCache: 8192 × 8 × 128 × 2 tensors × 4 bytes = 2,147 MB
FP16 KvCacheF16: 8192 × 8 × 128 × 2 tensors × 2 bytes = 1,074 MB (50% reduction)
On the Pi 5 with 8 GB LPDDR4X, FP16 KV at seq=8192 uses 1.07 GB, leaving 6.9 GB for model weights and OS overhead — enough to run two concurrent inference sessions.
The hot inner loop — computing a dot product between a query row and a key row — was rewritten from an indexed loop to an iterator chain:
// Before: explicit index loop (LLVM sometimes vectorizes, sometimes doesn't)
let mut s = 0.0f32;
for d in 0..dim { s += a[d] * b[d]; }
// After: iterator zip/map/sum (LLVM consistently auto-vecs to NEON/AVX2)
a.iter().zip(b.iter()).map(|(x, y)| x * y).sum::<f32>()On aarch64 with +fp16 and +neon (enabled by the Pi 5 target flags), LLVM emits fmla NEON instructions processing 4 FP32 values per cycle. On x86-64 with AVX2 it emits vfmadd instructions processing 8 FP32 values per cycle. The change requires no unsafe code, no intrinsics, and no build-script complexity — it is a pure algorithmic rephrase that LLVM's auto-vectorizer handles reliably.
evict_and_append enables generation past max_seq using a Heavy Hitter Oracle (H2O) eviction policy. When the cache is full:
- Scan the cache for the token with the lowest cumulative attention score that is not in the recent window and not a global token.
- Evict that token — compact the key/value arrays and rebuild the landmark means.
- Append the new key/value pair in the freed slot.
This mirrors the H2O policy from "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models". Tokens that have never been highly attended are the first to go; recent tokens and global anchors are always protected.
// Normal decode
cache.try_append(&new_k, &new_v)?;
// When cache is full — evict the least-attended non-recent token
cache.evict_and_append(&new_k, &new_v)?;The fallback (when no non-protected token can be found) silently evicts the oldest non-window token, so the method is always non-failing for non-zero window sizes.
Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) reduce the KV cache size by sharing key/value heads across groups of query heads. Mistral-7B uses GQA with 32 query heads and 8 KV heads (group size 4): each KV head serves 4 query heads.
forward_gqa() and forward_gqa_flash() handle this by mapping each query head h to its KV head via kv_h = h / group_size. forward_auto() dispatches to the appropriate path automatically.
// 4 args: capacity, kv_heads, head_dim, block_size (landmark granularity)
let mut cache = KvCache::new(4096, 8, 128, 64);
cache.try_append(&k, &v)?; // Err if full
cache.append_all(&k, &v)?; // bulk prefill (k.seq > 1 — e.g. after prefill pass)
cache.is_full() // true when len == capacity
cache.reset() // clear for new conversation
cache.evict_and_append(&k, &v)? // H2O eviction when fullIncrementalLandmarks inside KvCache updates block-means with Welford online update on every append — O(H×D) per token instead of O(T×H×D) rebuild.
With features = ["parallel"], the per-head loops in forward() and forward_gqa() dispatch via rayon:
ruvllm_sparse_attention = { version = "0.1", features = ["parallel", "fp16"] }On a 4-core Cortex-A76 (Pi 5), rayon distributes the 32 independent head computations across all cores. At seq=2048 with 32 heads, each core handles 8 heads — approximately 4× prefill throughput with no change to the output.
| Sequence | Dense causal pairs | Sparse candidate pairs | Reduction |
|---|---|---|---|
| 512 | 131,328 | 59,778 | 2.2× |
| 1,024 | 524,800 | 129,858 | 4.0× |
| 2,048 | 2,098,176 | 272,130 | 7.7× |
| 4,096 | 8,390,656 | 560,834 | 15.0× |
| 8,192 | 33,558,528 | 1,146,498 | 29.3× |
| 16,384 | 134,225,920 | 2,334,274 | 57.5× |
| 32,768 | 536,887,296 | 4,742,658 | 113.2× |
The sparse count grows as O(N log N); the dense count grows as O(N²). At 32K tokens the kernel attends to less than 0.9% of the pairs that full attention would visit.
Configuration: 8 heads, dim=64, window=128, block_size=64, causal=true, log-stride + landmarks enabled. Criterion harness, 10-sample runs.
| seq | sparse forward | dense reference | reduction |
|---|---|---|---|
| 512 | 13.1 ms | 28.8 ms | 2.2× |
| 1024 | 28.4 ms | 113.1 ms | 4.0× |
| 2048 | 60.1 ms | 463.5 ms | 7.7× |
| 4096 | 126.5 ms | 1,897 ms | 15.0× |
| 8192 | 262.6 ms | 7,696 ms | 29.3× |
Flash-sparse bench group (tile=128) added — run with cargo bench -p ruvllm_sparse_attention.
Compiled with: -C target-cpu=cortex-a76 -C target-feature=+lse,+rcpc,+fp16,+crc
| seq | sparse forward | est. dense | reduction |
|---|---|---|---|
| 512 | 85.8 ms | ~189 ms | 2.2× |
| 1024 | 190.5 ms | ~762 ms | 4.0× |
| 2048 | 401.0 ms | ~3,088 ms | 7.7× |
| 4096 | 836.2 ms | ~12,543 ms | 15.0× |
| 8192 | ~1,660 ms (est.) | ~48,671 ms (est.) | 29.3× |
All 25 tests cross-compiled for aarch64-unknown-linux-gnu and executed on each cluster node over Tailscale SSH:
RUSTFLAGS="-C target-cpu=cortex-a76 -C target-feature=+lse,+rcpc,+fp16,+crc" \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
cargo test -p ruvllm_sparse_attention --lib \
--target aarch64-unknown-linux-gnu --release| Node | Tailscale IP | Hardware | Result |
|---|---|---|---|
| cognitum-v0 | 100.77.59.83 | Pi 5 Model B Rev 1.1 + Hailo-10H | 25/25 ✓ |
| cognitum-v1 | 100.80.54.16 | Pi 5 Model B Rev 1.1 + Hailo-10H | 25/25 ✓ |
| cognitum-cluster-2 | 100.77.220.24 | Pi 5 Model B Rev 1.1 + Hailo-10H | 25/25 ✓ |
| cognitum-cluster-3 | 100.73.75.53 | Pi 5 Model B Rev 1.1 + Hailo-10H | 25/25 ✓ |
New tests (25 total, up from 17): forward_flash_matches_forward_mha, forward_flash_matches_forward_non_causal, forward_gqa_flash_matches_forward_gqa, forward_gqa_group1_equals_forward, plus all original 21 pass.
| ADR / Extension | Decision | Rationale |
|---|---|---|
| ADR-183 | Zero runtime dependencies | Minimal binary footprint; WASM/embedded targets cannot carry rand |
| ADR-184 | One-pass online softmax | Eliminates second pass; ~2× FLOPs reduction in accumulation |
| ADR-185 | Skip current block in non-causal landmark candidates | Prevents token i from attending to itself via its own block mean |
| ADR-186 | 25-test CI suite, mandatory Pi 5 cluster validation | Correctness cannot be assumed from x86-64 results alone on NEON/LSE paths |
| ADR-187 | checked_mul in Tensor3::zeros |
Shapes from user input can overflow usize silently |
| ADR-188 | Stamp scheme for dedup (1 + h*seq + i per head×token) |
Cross-head deduplication without per-call HashSet allocation |
| ADR-189 | KvCache + decode_step() |
Autoregressive generation requires O(log T) per step, not O(T log T) |
| ADR-190 | forward_gqa() + forward_auto() |
GQA is required to fit Mistral-7B in Hailo-10H 8 GB DDR4 |
| SOTA | forward_flash / forward_gqa_flash (3-phase IO-optimal tiling) |
Reduces peak working memory at long sequences; cache-friendly ascending KV scan |
| SOTA | KvCacheF16 (feature = "fp16") |
Halves KV memory; 1.07 GB at seq=8192 vs 2.15 GB FP32 |
| SOTA | Iterator dot() for SIMD auto-vectorization |
LLVM emits NEON fmla on Pi 5 and AVX2 vfmadd on x86 without unsafe |
| SOTA | H2O evict_and_append |
Enables generation past max_seq without hard stop |
| SOTA | decode_batch speculative decode |
Processes q.seq ≥ 1 queries against cache in one call |
| SOTA | IncrementalLandmarks Welford update |
O(H×D) per token vs O(T×H×D) rebuild |
| SOTA | feature = "parallel" rayon head loops |
~4× prefill throughput on Pi 5 4-core without unsafe concurrency |
| SOTA | sort_candidates cache-locality flag |
Ascending sort of candidate indices — 10% win on Pi 5 (small L3); default false on x86 |
A single Mistral-7B inference at seq=4096 on the Pi 5:
- Dense attention (hypothetical): ~12,543 ms × 32 layers = 401 seconds per forward pass
- Sparse attention: ~836.2 ms × 32 layers = 26.8 seconds per forward pass
With 8 KV heads (GQA) and FP16 cache at seq=8192, the memory footprint is:
FP16 GQA: 8192 × 8 × 128 × 2 tensors × 2 bytes × 32 layers = 536 MB
The remaining 7.5 GB of Hailo-10H DDR4 holds model weights (Q4K Mistral-7B ≈ 4.1 GB), leaving ~3.4 GB for the OS, the ruvector cluster coordinator, and NPU driver state.
With feature = "parallel" rayon enabled, prefill throughput scales approximately linearly with the number of Cortex-A76 cores available — targeting ~7 seconds per forward pass on a fully-loaded 4-core Pi 5.
[dependencies]
ruvllm_sparse_attention = { version = "0.1", features = ["parallel", "fp16"] }use ruvllm_sparse_attention::{
SubquadraticSparseAttention, SparseAttentionConfig, Tensor3, AttentionBackend,
};
let attn = SubquadraticSparseAttention::new(SparseAttentionConfig {
window: 128,
block_size: 64,
global_tokens: vec![0],
causal: true,
use_log_stride: true,
use_landmarks: true,
sort_candidates: false, // set true on Pi 5 for cache-locality win
}).unwrap();
let q = Tensor3::zeros(512, 32, 128);
let k = Tensor3::zeros(512, 32, 128);
let v = Tensor3::zeros(512, 32, 128);
let out = attn.forward(&q, &k, &v).unwrap(); // standard
let out = attn.forward_flash(&q, &k, &v).unwrap(); // IO-optimal flash-sparse (same result)let q = Tensor3::zeros(4096, 32, 128);
let k = Tensor3::zeros(4096, 8, 128);
let v = Tensor3::zeros(4096, 8, 128);
// forward_auto dispatches to forward_gqa_flash automatically
let out = attn.forward_auto(&q, &k, &v).unwrap();use ruvllm_sparse_attention::KvCache;
// 4 args: capacity, kv_heads, head_dim, block_size
let mut cache = KvCache::new(4096, 8, 128, 64);
for _ in 0..max_new_tokens {
let new_k = Tensor3::zeros(1, 8, 128);
let new_v = Tensor3::zeros(1, 8, 128);
// Normal decode (Err when full)
if cache.try_append(&new_k, &new_v).is_err() {
// Generation past max_seq via H2O eviction
cache.evict_and_append(&new_k, &new_v).unwrap();
}
let q_new = Tensor3::zeros(1, 32, 128);
let out = attn.decode_step(&q_new, &cache).unwrap();
}use ruvllm_sparse_attention::KvCacheF16;
let mut cache = KvCacheF16::new(8192, 8, 128, 64); // 1.07 GB vs 2.15 GB FP32
cache.try_append(&k_f32, &v_f32).unwrap();
let out = attn.decode_step_f16(&q, &cache).unwrap();| Model | q_heads | kv_heads | Recommended window | Notes |
|---|---|---|---|---|
| Phi-2 | 32 | 32 | 128 | MHA — forward() / forward_flash() |
| Mistral-7B | 32 | 8 | 128–256 | GQA — forward_auto() |
| Llama-3-8B | 32 | 8 | 128–256 | GQA — forward_auto() |
| Llama-3.2-1B | 32 | 8 | 64–128 | GQA, shorter context budget |
| TinyLlama-1.1B | 32 | 4 | 64 | MQA — forward_auto() |
sudo apt install gcc-aarch64-linux-gnu
RUSTFLAGS="-C target-cpu=cortex-a76 -C target-feature=+lse,+rcpc,+fp16,+crc" \
CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-linux-gnu-gcc \
cargo build -p ruvllm_sparse_attention --release \
--target aarch64-unknown-linux-gnu \
--features parallel,fp16- Source:
crates/ruvllm_sparse_attention - Kernel README:
crates/ruvllm_sparse_attention/README.md - ruvllm:
crates/ruvllm - Hailo cluster:
crates/ruvector-hailo-cluster