Paper reviewed: Likov, I. (March 2026). Memory for All: SAGE — Spatial Associative Geometric Embeddings — A Weight-Free Geometric Memory Architecture with Hippocampal-Inspired Consolidation.
I read the full paper. Stripped of its rhetoric, the system is this:
- A fixed 3D lattice of memory slots.
- Each slot stores a learned embedding vector.
- Retrieval is a similarity search against all stored embeddings.
- Writing moves one or more stored vectors toward a target using local updates.
- Additional heuristics handle momentum, repulsion, temperature scheduling, and spatial regularization.
- A partitioned variant separates "subject" and "object" slots along the x-axis.
- A sequence variant adds an explicit dictionary pointer for transitions.
That is a legitimate architecture. But it is not literally "knowledge stored as coordinates" in the sense the prose suggests. The coordinates are fixed addresses; the operative learned content is the embedding table. A more precise and harder-to-attack description would be:
A fixed-address external associative memory with a 3D address lattice, local non-backprop write rules, and explicit short-term/long-term partitioning.
That framing is less grandiose and much more accurate. It would also serve the author better in review, because the underlying design instinct — decouple online memory updates from gradient-based weight updates using an explicit external memory substrate — is timely and genuinely interesting.
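To make that framing concrete, here is a minimal sketch of the substrate as I read it; the names, sizes, and random initialization are illustrative, not taken from the paper's code:

```python
import numpy as np

SIDE, DIM = 16, 64
# Fixed addresses: a 16^3 lattice of integer coordinates, never updated.
coords = np.stack(np.meshgrid(*[np.arange(SIDE)] * 3, indexing="ij"),
                  axis=-1).reshape(-1, 3)
# Learned content: one unit-norm embedding per slot; this is what the write rule changes.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(SIDE**3, DIM)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
# Hardcoded role partition along the x-axis, as in the "divided" variant.
subject_slots = coords[:, 0] < SIDE // 2
object_slots = ~subject_slots
```

Nothing in this skeleton computes with the coordinates; they are lookup addresses, which is why "external associative memory" describes the system better than "knowledge stored as coordinates."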
The paper targets a real problem. There is intense current interest in memory systems that can be updated without retraining the full base model. Recent work on external, hybrid, and structured memory (MemoryLLM, LongMem, Titans, HiMeS, Panini, large-scale trainable memory layers) shows this is a live frontier, not a solved problem.
There is a coherent design instinct. Fixed storage addresses, local writes, explicit role partitioning, modular scaling by adding cubes instead of growing one monolith. The architecture is not nonsense. There is a real attempt to trade fitting accuracy for updateability and retention.
The paper is unusually explicit about mechanisms. Compared with memory papers that hide everything in an end-to-end black box, this one names the mechanisms and separates versions (V1–V4). That is good scientific practice.
The SAGESequenceCube section is refreshingly honest about one limitation. The paper explicitly notes that sequence states are collision-free but not spatially clustered. That is one of the better moments in the paper because it narrows the claim rather than inflating it.
The retrieval-index / association-pointer separation is the cleanest idea in the paper. Separating "finding the right slot" (geometric cosine search) from "knowing what comes next" (explicit dictionary) is an architectural pattern that transfers beyond this specific system. The paper treats this as one of six contributions when it may be the most practically useful one.
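The pattern fits in a few lines. A sketch with a hypothetical API (not the paper's code), just to show how cleanly the two concerns separate:

```python
def recall_next(memory, query, transitions):
    """Hypothetical API: `memory.nearest_slot` is the geometric cosine search,
    `transitions` is an explicit dict mapping slot id -> next slot id."""
    slot_id = memory.nearest_slot(query)       # step 1: find the right slot geometrically
    next_id = transitions.get(slot_id)         # step 2: follow the explicit pointer
    return None if next_id is None else memory.embedding_of(next_id)
```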
The V1→V2 analogy jump is the most interesting empirical result. Going from 16.7% to 58.3% on GloVe analogies shows that direction training is doing real geometric work. That a Hebbian-updated 3D grid can do any analogy arithmetic at all is non-trivial.
The 92% less forgetting claim (Table 2) is measured against an unspecified "neural network" with no architecture details, no hyperparameters, and no regularization disclosed. That is not the right baseline for 2026.
Since Kirkpatrick et al. (2017) introduced EWC, the continual learning field has moved substantially. A December 2025 paper in Scientific Reports demonstrated a 24% reduction in forgetting over the state of the art using Neural ODE + memory-augmented transformer hybrids on Split CIFAR-100 and CORe50. Google's Nested Learning paper (March 2026) proposes a full paradigm for avoiding catastrophic forgetting. Recent work on Kolmogorov-Arnold networks (KANs) investigated whether localized spline-based activations offer intrinsic resistance to forgetting, finding promise in low-dimensional settings but vulnerability in high-dimensional domains, a finding that likely applies to SAGE's 3D grid even more strongly.
At minimum the paper should compare against: rehearsal/replay, regularization-based continual learning (EWC-style), nearest-neighbor baselines, external memory baselines, and at least one modern memory-augmented architecture. Without that, the 92% forgetting reduction number has no reference frame. It tells you SAGE beats a vanilla network from 2017. It does not tell you where SAGE sits in the actual field.
"Perfect retention across 200 concepts" on a 16³ cube (4,096 slots) is 4.9% utilization. At that density, cosine-directed slot assignment makes collisions near-impossible by construction. The anti-forgetting guarantee is structural, but it is structural because the cube is mostly empty.
The right test is a load sweep: 5%, 20%, 40%, 60%, 80%, 95% occupancy, with retrieval quality, collision rate, overwrite rate, and latency all reported. The paper demonstrates that low-load operation is easy. It does not demonstrate that the architecture handles pressure.
The anti-collision argument is also not mathematically justified. The paper claims collision probability is proportional to semantic similarity because "semantically distinct concepts produce orthogonal embedding vectors." That is an intuition, not a theorem. Real embeddings are not generally orthogonal for semantically distinct items. Nearest-prototype assignment in finite-capacity systems can still create collisions. Anisotropy and hubness in embedding spaces make nearest-neighbor behavior highly distribution-dependent. To support this claim, the paper needs either a formal result under stated assumptions or empirical collision-rate curves under controlled load. It currently offers neither.
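The harness for this is small. A sketch under assumed APIs (`write` returning the slot id it used, `nearest_slot` doing the cosine lookup); the occupancy levels are the ones suggested above, and the random unit vectors are only a placeholder for real embeddings:

```python
import numpy as np

def load_sweep(make_memory, n_slots, dim=64,
               loads=(0.05, 0.2, 0.4, 0.6, 0.8, 0.95), seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for load in loads:
        mem = make_memory()
        n_items = int(load * n_slots)
        items = rng.normal(size=(n_items, dim)).astype(np.float32)
        items /= np.linalg.norm(items, axis=1, keepdims=True)
        slots = [mem.write(v) for v in items]        # assumed: write returns the slot id used
        collisions = n_items - len(set(slots))       # two concepts mapped to the same slot
        correct = sum(mem.nearest_slot(v) == s for v, s in zip(items, slots))
        results.append({"load": load,
                        "collision_rate": collisions / n_items,
                        "retrieval_acc": correct / n_items})
    return results
```

The same sweep, run with real embedding distributions rather than random vectors, would produce exactly the collision-rate curves the anti-collision claim needs.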
The paper claims 0.000% sparsity activation and "infinite efficiency gain" over MLPs. That does not survive inspection.
Retrieval computes similarity between the query and all N³ stored embeddings. Every stored vector participates in the read operation. That is dense content lookup, even if the output weighting concentrates on a few top items. This is the same distinction people routinely make in retrieval systems: sparse selection does not imply sparse compute. Top-k output concentration does not imply that only top-k parameters were active during scoring.
The right comparison would be: dense MLP forward pass versus dense similarity scan plus sparse response mixture. Not "100% active" versus "0.000% active."
The parameter accounting compounds this. The reported SAGE parameter count for the 32³, 64-d cube is 2,195,456. That figure is exactly 32³ × 64 = 2,097,152 embedding values plus 32³ × 3 = 98,304 coordinate values, so the table counts the fixed coordinates as parameters. If fixed coordinates enter the parameter budget denominator, it is inconsistent to simultaneously imply that essentially none of them are active. This needs cleaning up.
A compute-aware analysis should report: number of slots scored per query, number of vector multiplications per query, wall-clock latency against approximate nearest-neighbor baselines, and memory bandwidth cost. Right now the paper claims compute sparsity while performing a full-table similarity scan.
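The accounting is one formula per architecture. A sketch, where the slot count and dimension are the paper's reported ones and the multiply counts are my own back-of-envelope:

```python
# Rough per-query multiply counts: a dense MLP layer stack vs. a full-table similarity scan.
# The point is that top-k output weighting does not reduce the cost of the scan itself.
def mlp_mults(d_in, d_hidden, d_out):
    return d_in * d_hidden + d_hidden * d_out

def memory_scan_mults(n_slots, dim, k=5):
    score = n_slots * dim      # every stored vector participates in scoring
    mix = k * dim              # only the top-k contribute to the mixed response
    return score + mix

# For the paper's 32**3-slot, 64-d cube: 32**3 * 64 ≈ 2.1M multiplies per query,
# regardless of how concentrated the softmax output is.
```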
The system stores and updates a full embedding vector at each slot. That embedding table is precisely the learned numerical state that determines behavior. The Hebbian update rule modifies those vectors: e_i ← e_i + α · score_i · (t − e_i). The system has 2.19M floating-point parameters being iteratively optimized. The update is local and gradient-free, which is a meaningful distinction from backpropagation. But calling the system "weight-free" when 2.19M stored parameters change during learning is a framing choice that invites skepticism.
More precise and harder-to-attack descriptions: "matrix-free retrieval path," "gradient-free memory writing," or "non-parametric external memory with fixed addresses." "Weight-free" suggests the system has no learned continuous values. It clearly does.
The retrieval equation is not cosine unless normalization is assumed. The paper says retrieval uses cosine similarity but writes: scores = softmax((E · q) / τ). That is a dot product, not cosine, unless both stored vectors and queries are already L2-normalized. The manuscript renormalizes stored vectors after updates but does not clearly state that queries are normalized at every retrieval call. Either state the assumption explicitly or write the actual cosine formula.
The "Hebbian" update is not standard Hebbian learning. The update e_i ← e_i + α · score_i · (t − e_i) is better described as an activation-weighted local prototype update. Standard Hebbian language usually refers to strengthening proportional to co-activity (outer-product style or correlation-based). A reviewer with associative-memory background will notice this immediately. Describe it as "an activation-weighted local prototype update" and only secondarily note that it is "Hebbian-like" in locality.
The Lennard-Jones section is mathematically incomplete. The "force" is written as a scalar, not a vector — a force acting in 3D position space should include a direction term. Positions were earlier declared fixed and never changing, but the LJ section says the force acts in 3D position space. So either positions are not actually fixed, or the force does not operate on positions, in which case the prose is wrong. And the actual formulation (cosine-gated Gaussian falloff with a sinusoidal multiplier) is not Lennard-Jones in the standard technical sense. A fairer name would be "similarity-gated Gaussian attraction/repulsion field."
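The structural fix for the force is small: give it a direction. A generic sketch (this is not the paper's formula; the Gaussian gate stands in for whatever scalar magnitude the authors intend):

```python
import numpy as np

def pairwise_force(p_i, p_j, e_i, e_j, sigma=1.0):
    # A force acting in 3D position space must be a vector: magnitude * unit direction.
    # The similarity-gated Gaussian here is a placeholder magnitude; the direction
    # term is the piece missing from the paper's scalar formulation.
    diff = p_j - p_i
    dist = np.linalg.norm(diff) + 1e-8
    sim = float(e_i @ e_j) / (np.linalg.norm(e_i) * np.linalg.norm(e_j) + 1e-8)
    magnitude = sim * np.exp(-(dist / sigma) ** 2)   # attraction for similar, nearby pairs
    return magnitude * diff / dist                   # vector pointing from i toward j
```

If positions are truly fixed, a force of this kind has nothing to act on, which is the contradiction the section needs to resolve.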
Direction training is missing the actual objective. The paper says the system learns the direction vector (t − q) but gives no explicit objective, no loss, and no update rule for how that directional signal enters the memory vectors. This is the mechanism behind the most important empirical result (the V1→V2 analogy jump), and it is under-specified.
Contrastive repulsion is ambiguous. The update for non-target points does not state whether vectors are renormalized afterward, whether the update applies to all negatives or only a sample, how negatives are chosen, or what prevents hub formation or norm blow-up.
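To illustrate what a complete specification would look like, here is one possible pinned-down variant; the sampling scheme, step sizes, and renormalization choices are mine, not the paper's:

```python
import numpy as np

def contrastive_write(E, target_idx, t, alpha=0.5, beta=0.05, n_neg=32, rng=None):
    # One fully specified variant: pull the target slot toward t, push a sampled set
    # of negatives away from t, renormalize everything touched. These are exactly the
    # choices the manuscript needs to state explicitly.
    rng = rng or np.random.default_rng()
    E[target_idx] += alpha * (t - E[target_idx])
    candidates = np.delete(np.arange(len(E)), target_idx)
    negatives = rng.choice(candidates, size=min(n_neg, len(candidates)), replace=False)
    E[negatives] -= beta * (t[None, :] - E[negatives])
    touched = np.append(negatives, target_idx)
    E[touched] /= np.linalg.norm(E[touched], axis=1, keepdims=True)
    return E
```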
Training pairs for GloVe experiments. Section 4.1 says 2,060 pairs for 200 epochs. Tables 3 and 6 reference 40,036 pairs. If 40,036 is the total update count after epoch expansion, say that explicitly. As written, it reads like a contradiction.
V1 analogy accuracy. Table 3 gives SAGE V1 = 16.7%. Section 4.7 later says V1 achieves 33.3% on real GloVe embeddings. If these were different evaluation protocols, that must be spelled out.
Parameter accounting. Fixed coordinates are counted in the SAGE parameter budget but then the system is called "weight-free" and compared directly to MLP parameter counts. Pick one framing.
The GloVe analogy test uses 12 items. The standard Google analogy benchmark has ~19,544 questions. On 12 examples, a change of one or two items swings the headline percentage by 8+ points. A system that gets 7/12 has not yet established robust relational reasoning. Running V2 on the full benchmark would either validate the direction training mechanism or reveal where it fails. Either outcome makes the paper stronger.
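Running the full benchmark is mechanical. A sketch of the standard 3CosAdd protocol; the file name and the retrieval API (`vec`, `retrieve_word`) are assumptions about how the cube would be queried:

```python
# Standard 3CosAdd analogy evaluation over the full Google benchmark
# (questions-words.txt, ~19,544 lines of "a b c d"). `retrieve_word(query, exclude)`
# stands in for SAGE's cosine lookup; `vec(word)` for its stored embedding.
def analogy_accuracy(path, vec, retrieve_word):
    correct = total = 0
    with open(path) as f:
        for line in f:
            if line.startswith(":") or len(line.split()) != 4:
                continue                      # skip section headers / malformed lines
            a, b, c, d = line.lower().split()
            try:
                query = vec(b) - vec(a) + vec(c)
            except KeyError:
                continue                      # skip out-of-vocabulary items
            predicted = retrieve_word(query, exclude={a, b, c})
            correct += (predicted == d)
            total += 1
    return correct / max(total, 1)
```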
Three of the six claimed contributions depend on results from Likov (2026b), listed as "in preparation." The SAGEDivided + MultiCube ablation, the 60% recall improvement, and the 72-step agent demonstration are all load-bearing citations that readers cannot verify. If those results are real, they should be in this paper or in a simultaneously released companion. As it stands, the architectural claims outrun the available evidence.
The introduction says, in effect, that modern AI systems must retrain from scratch to learn new things, and that memory-augmented systems either require backprop through memory access or use static stores that cannot update during inference.
That is too sweeping for 2026. kNN-LM already showed a pretrained LM can be augmented with a nearest-neighbor datastore that changes behavior without retraining. LongMem explicitly frames its contribution as decoupled long-term memory that can be cached and updated without staleness. MemoryLLM is explicitly about self-updatable long-term memory with retention under large numbers of updates. Panini is explicitly a non-parametric continual-learning system. HiMeS explicitly uses hippocampus/neocortex inspiration with short-term and long-term memory modules.
The paper's true differentiator is not "we are the first system to update memory without retraining." That is no longer a defensible claim. The real differentiator, if the author can defend it, is narrower:
A fixed-address geometric memory with local non-backprop updates and a hardcoded spatial partition.
That narrower claim is still interesting. The broader one is not credible.
- Run the forgetting experiment against EWC and experience replay on Split CIFAR-100. This is a weekend of GPU time, and it would either validate or calibrate the 92% claim against the actual field.
- Evaluate GloVe analogies on the full 19,544-question benchmark. The V2 direction training result deserves a real statistical evaluation.
- Stress-test at high density. Load sweep from 5% to 95% utilization, measuring retrieval quality, collision rate, and retention degradation. 200 concepts at 4.9% proves the mechanism works when it is easy. 3,000 concepts at 73% proves it works when it matters.
- Publish the companion paper simultaneously or fold the key ablation results into this one.
- Clean up the math. Write the real cosine equation or state normalization assumptions. Provide the actual direction training update rule. Define spatial cohesion formally. Fix the LJ section so it is either a real vector field or honestly renamed. Clarify whether positions are fixed or movable.
- Replace the sparsity claim with a compute-aware analysis. Report slots scored per query, vector multiplications per query, wall-clock latency against ANN baselines.
- Narrow the novelty framing. Replace "the geometry computes; the weights are not needed" with something precise and defensible. The architecture is more interesting than the slogans suggest, and slogans that invite easy counterattack hurt the work.
The paper contains a potentially publishable idea but not yet a publishable paper. The core design instinct — fixed-address geometric memory with local updates, explicit role partitioning, and modular horizontal scaling — is coherent and timely. The SAGESequenceCube's retrieval/association separation is a genuinely clean pattern. The V1→V2 analogy result shows direction training doing real work.
What prevents the claims from landing: the forgetting baseline is too weak to support the headline number, the density regime is too low to test the architecture under pressure, the sparsity framing is misleading, the math has gaps that block reproducibility, and the novelty claims are overstated relative to 2024–2026 memory literature.
The underlying intuition has legs. The evidence needs to catch up to the claims. Do the load sweep, run the real benchmarks, fix the equations, narrow the framing — and this becomes a much harder paper to dismiss.
- Wang et al. (2024). MEMORYLLM: Towards Self-Updatable Large Language Models. arXiv:2402.04624.
- Rajesh et al. (2026). Panini: Continual Learning in Token Space via Structured Memory. arXiv:2602.15156.
- Berges et al. (2024). Memory Layers at Scale. arXiv:2412.09764.
- Wang et al. (2023). Augmenting Language Models with Long-Term Memory (LongMem). NeurIPS 2023.
- Behrouz et al. (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663.
- HiMeS (2026). Hippocampus-inspired Memory System for Personalized AI Assistants. arXiv:2601.06152.
- Zhou & Li (2025). Mitigating catastrophic forgetting in lifelong learning: Neural ODEs with memory-augmented transformers. Scientific Reports.
- Google Research (2026). Introducing Nested Learning: A new ML paradigm for continual learning.
- Khandelwal et al. (2020). Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020.
- Steck, Ekanadham & Kallus (2024). Is Cosine-Similarity of Embeddings Really About Similarity? arXiv:2403.05440.
- Erb et al. (2025). Training Neural Networks by Optimizing Neuron Positions. arXiv:2506.13410.
- Bricken & Pehlevan (2021). Attention Approximates Sparse Distributed Memory. NeurIPS 2021.
- Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS.