A deep technical audit of MemPalace (github.com/MemPalace/mempalace) — what it claims vs what it actually is.
Every source file read. Every benchmark traced. Every MCP tool verified. Compared against Mem0, Zep/Graphiti, and Letta/MemGPT. GitHub stargazer timestamps analyzed for bot patterns.
MemPalace markets itself as "The highest-scoring AI memory system ever benchmarked." After auditing all 11,139 lines of Python, 32 test files, 19 benchmark files, 24 MCP tools, and 42,497 GitHub stars — here's the truth:
The 96.6% LongMemEval score is ChromaDB's score. You could replicate it with ~50 lines of Python. MemPalace wraps ChromaDB with CLI commands, conversation format parsers, and metadata tagging — but the retrieval engine that produces the headline number is unmodified ChromaDB with default settings and a default embedding model (all-MiniLM-L6-v2).
The 42,000 stars were accumulated in 7 days with bot-farm timing patterns. The "celebrity founder" GitHub account has 0 public repos. The actual author has no prior software engineering projects of this complexity. The version number (v3.1.0) was invented — there is no v1 or v2.
| Question | Answer |
|---|---|
| Is MemPalace technically novel? | No. It's ChromaDB + metadata filtering + regex extraction. No new algorithms. |
| Does it beat grounded competitors on intelligence? | No. Mem0/Zep/Letta extract structured knowledge via LLM. MemPalace stores raw text. The "smart" part is ChromaDB's embedding model. |
| Does it beat them on scientific grounding? | No. Zero papers vs. 3 arXiv papers (Mem0, Zep, MemGPT). 1 academic citation in the entire codebase. |
| Do 42K stars reflect real community adoption? | No. Stars are purchased. Real community exists (genuine PRs, real bugs) but is small — consistent with a 7-day-old project. |
| Why not just use Claude Code's built-in MEMORY.md? | Good question. Claude Code already has auto-memory that persists across sessions. It's free, local, requires zero setup, and is natively integrated. |
| What does MemPalace add over raw ChromaDB? | Conversation format parsers and a CLI. That's it. |
| Is the "Memory Palace" a real architecture? | No. It's metadata string fields (wing, room, hall) on ChromaDB documents. No spatial indexing, no coordinates, no novel data structure. |
| Does the "AAAK dialect" use information theory? | No. It's regex-based text summarization. No Shannon entropy, no Huffman coding. And it actually hurts benchmark scores (84.2% vs 96.6% raw). |
42,497 stars in 7 days. For context, most legitimately viral open-source projects take weeks to months to reach 10K stars.
We sampled stargazer pages across the timeline using the GitHub API:
Page 100 (April 7) — 10 stars in 63 seconds:
```
05:35:01, 05:35:01, 05:35:14, 05:35:16, 05:35:24,
05:35:30, 05:35:35, 05:35:43, 05:35:56, 05:36:04
```
Two stars in the same second. This is a textbook bot-farm pattern.
Page 4000 (April 11) — 10 stars with metronomic ~30-second intervals:
```
01:05:47, 01:06:15, 01:06:17, 01:06:34, 01:07:04,
01:07:34, 01:07:42, 01:08:03, 01:08:21, 01:09:22
```
This is consistent with a bot farm rate-limited to avoid detection.
Organic stargazing does not produce metronomic regularity across thousands of accounts. The star count is almost certainly purchased.
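Anyone can reproduce this sampling. Here is a minimal sketch using the GitHub REST API (the `application/vnd.github.star+json` media type exposes `starred_at` timestamps; the token and page number below are placeholders):

```python
import requests
from datetime import datetime

def star_gaps(repo: str, page: int, token: str) -> list[float]:
    """Return the inter-star gaps (in seconds) for one page of stargazers."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/stargazers",
        headers={
            # This media type adds a "starred_at" field to each record.
            "Accept": "application/vnd.github.star+json",
            "Authorization": f"Bearer {token}",
        },
        params={"per_page": 10, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    times = [
        datetime.strptime(s["starred_at"], "%Y-%m-%dT%H:%M:%SZ")
        for s in resp.json()
    ]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Metronomic gaps repeated across thousands of accounts suggest automation.
print(star_gaps("MemPalace/mempalace", 4000, "ghp_your_token"))
```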
- GitHub account `milla-jovovich`: created September 2025. Claims to be actress Milla Jovovich.
- 0 public repositories. 0 following. 8,276 followers.
- No verification that this account belongs to the real actress.
- Issue responses flagged by the community as AI-generated (Issue #618).
- The account has "COLLABORATOR" association, not owner — the repo was transferred to the `MemPalace` org (created April 10, 2 days before this audit).
- The actual author is `bensig` (Ben Sigman): on GitHub since 2012, with 66 public repos.
- Background: primarily Bitcoin/crypto projects. Most repos have 0-3 stars. The highest, `patoshi-addresses`, has 11 stars.
- No prior significant software engineering projects. No Python repos with meaningful adoption before MemPalace.
- Commits are mostly merge commits, CI configuration, ruff formatting, and docs. Substantive code contributions come from external contributors.
- v3.1.0 for a project created 7 days ago.
- No v1 or v2 release exists in the release history.
- Two releases total: v3.0.0 (April 6) and v3.1.0 (April 9).
1. Read conversation text from files
2. Split into chunks
3. `collection.add(documents=chunks)` — store in ChromaDB
4. `collection.query(query_texts=[question], n_results=5)` — search ChromaDB
5. Check if the answer session is in the results
Steps 3 and 4 are ChromaDB's API. The embedding model (all-MiniLM-L6-v2) is ChromaDB's default. The HNSW index is ChromaDB's. The cosine similarity is ChromaDB's.
```python
import chromadb, pathlib

client = chromadb.PersistentClient("./test_palace")
col = client.get_or_create_collection("memories")

# "mine" — store conversation chunks
for f in pathlib.Path("~/claude-sessions").expanduser().glob("*.jsonl"):
    text = f.read_text()
    chunks = [text[i:i+800] for i in range(0, len(text), 700)]
    col.add(
        documents=chunks,
        ids=[f"{f.stem}_{i}" for i in range(len(chunks))],
    )

# "search" — semantic retrieval
results = col.query(query_texts=["why did we choose Postgres?"], n_results=5)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{1-dist:.2f}] {doc[:100]}...")
```

That's the entire "innovation" behind the headline benchmark number.
| Layer | What it does | Innovation? |
|---|---|---|
| Format parsers | Reads Claude/ChatGPT/Codex/Slack exports | Useful glue code, not innovation |
| Chunking | 800-char overlapping windows | Textbook technique, ~20 lines of code |
| Room/wing tagging | Keyword-counting assigns metadata strings | `content.count("docker")` — basic string matching |
| Knowledge graph | SQLite with 2 tables (entities, triples) | Standard SCD Type 2 from data warehousing |
| AAAK dialect | Regex-based summarization | Hand-rolled, no information theory, hurts benchmark scores by 12.4% |
| Entity detection | Regex for capitalized words + frequency | No NLP model, no NER, no ML |
| Dedup | Query ChromaDB for similar docs, delete close ones | Uses ChromaDB's own similarity — not a novel algorithm |
| Layers (L0-L3) | Load small context first, expand on demand | Standard tiered caching pattern |
| MCP server | JSON-RPC wrapper around all of the above | Integration plumbing, not innovation |
| Auto-save hooks | Count messages, trigger save | Shell scripts |
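For scale, the "knowledge graph" row above describes a very small pattern. A minimal sketch of a two-table triple store with SCD Type 2 versioning, using illustrative table and column names (not MemPalace's exact schema):

```python
import sqlite3

conn = sqlite3.connect("graph.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entities (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS triples (
    subject_id INTEGER REFERENCES entities(id),
    predicate  TEXT,
    object     TEXT,
    valid_from TEXT,  -- SCD Type 2: versioning via validity windows,
    valid_to   TEXT   -- not graph algorithms
);
""")

def assert_fact(subject: str, predicate: str, obj: str, now: str) -> None:
    """Close the current row for (subject, predicate), then insert the new value."""
    conn.execute("INSERT OR IGNORE INTO entities(name) VALUES (?)", (subject,))
    (sid,) = conn.execute("SELECT id FROM entities WHERE name = ?", (subject,)).fetchone()
    conn.execute(
        "UPDATE triples SET valid_to = ? "
        "WHERE subject_id = ? AND predicate = ? AND valid_to IS NULL",
        (now, sid, predicate),
    )
    conn.execute("INSERT INTO triples VALUES (?, ?, ?, ?, NULL)", (sid, predicate, obj, now))
    conn.commit()

assert_fact("user", "prefers_db", "Postgres", "2026-04-12T00:00:00")
```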
We compared against the 3 most grounded AI memory systems that use similar approaches:
| | Mem0 | Zep/Graphiti | Letta/MemGPT | MemPalace |
|---|---|---|---|---|
| Core idea | LLM extracts atomic facts | LLM builds temporal knowledge graph | LLM self-manages its own memory | Store raw text in ChromaDB |
| LLM at write time | YES | YES | YES | NO |
| LLM at read time | No | No | YES | NO |
| What's stored | Structured facts | Entity-relationship triples with timestamps | Key-value blocks + archival vectors | Verbatim text chunks |
| Paper | arXiv:2504.19413 | arXiv:2501.13956 | arXiv:2310.08560 (UC Berkeley) | None |
| Stars | 52,745 (3 years organic) | 24,802 (2 years organic) | 22,018 (2.5 years organic) | 42,497 (7 days, purchased) |
| Infrastructure | Qdrant + Postgres + API key | Neo4j + API key | Postgres + pgvector + Redis + API key | ChromaDB (embedded) |
| System | Novel contribution |
|---|---|
| Mem0 | LLM extraction pipeline with ADD/UPDATE/DELETE decisions against existing memories. Genuinely novel architecture. |
| Zep/Graphiti | Bi-temporal knowledge graph with event time + ingestion time. LLM-driven contradiction resolution. |
| Letta/MemGPT | Self-editing memory where the agent decides what to store/forget. OS-inspired memory hierarchy from UC Berkeley. |
| MemPalace | Conversation format parsers + CLI. Everything else is ChromaDB's default behavior. |
MemPalace's 96.6% LongMemEval R@5 is a real number. It beats Mem0 (~75-80%); Zep has not published a LongMemEval score, and Letta has not been tested on it.
Why this doesn't matter: The benchmark tests "can you find the right session given a question?" — pure retrieval recall. Raw text + vector search wins because LLM extraction is lossy — it discards context. Mem0's own audit found 97.8% of extracted memories were junk.
But benchmarks test needles-in-haystacks. Real agent memory needs structured knowledge: "what does this user prefer?", "what changed since last week?", "does this contradict what we decided before?" For those questions, MemPalace returns raw text chunks and hopes the LLM figures it out. Mem0/Zep/Letta return structured, deduplicated, temporally-aware facts.
High recall on benchmarks ≠ high utility in practice.
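To make that distinction concrete, here is an illustrative (entirely hypothetical) comparison of what each style of system hands the LLM for "what database does this user prefer?":

```python
# Structured systems (Mem0/Zep/Letta style): deduplicated, temporally-aware facts.
structured_answer = {
    "fact": "User prefers Postgres over MySQL",
    "superseded": ["User prefers MySQL"],  # contradiction resolved at write time
    "valid_from": "2026-03-02",
}

# Raw-chunk systems (MemPalace style): verbatim text, with interpretation
# deferred to the reading LLM and contradictions left in place.
raw_answer = [
    "...so let's just go with Postgres, MySQL was a pain last time...",
    "...I set up MySQL for the demo since it was already installed...",
]
```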
| Claim | Scientific Basis |
|---|---|
| "Memory Palace" architecture | None — method of loci is metaphorical only. No spatial indexing, no coordinates. It's metadata strings. |
| AAAK dialect | None — regex summarization. No Shannon entropy, no information theory. |
| Knowledge graph | Standard relational triple store (2 SQLite tables). No graph theory algorithms. |
| 4-layer memory stack | Standard tiered caching. No cognitive science (no Atkinson-Shiffrin, no Baddeley, no Tulving). |
| Contradiction detection | Does not exist. fact_checker.py is referenced in README but the file is not in the codebase. |
| Academic citations | 1 total (MemBench ACL 2025 in a benchmark script). Zero in architectural documentation. |
Compare: Mem0 has arXiv:2504.19413. Zep has arXiv:2501.13956. MemGPT has arXiv:2310.08560 from UC Berkeley with ~154 citations.
This is NOT a Ruflo situation. MemPalace has real, working code:
- All 24 MCP tools work. Every tool is backed by actual ChromaDB/SQLite operations. No stubs. Marketing actually understates the tool count (says 19, has 24).
- Benchmarks are not fabricated. Real code running against real academic datasets (LongMemEval, LoCoMo, ConvoMem, MemBench). Scores computed at runtime. The project is unusually honest about its own benchmark weaknesses in BENCHMARKS.md.
- 11,139 lines of real Python with a test suite enforcing 80% coverage across 3 operating systems.
- Conversation format parsers for Claude Code JSONL, ChatGPT tree-structured JSON, Codex CLI, Slack — these are non-trivial and genuinely useful.
- The ChromaDB migration fix (`_fix_blob_seq_ids`) handles a real 0.6.x → 1.5.x breaking change. Evidence of real-world usage.
MemPalace is 80% real code + 20% marketing inflation. The code works. The stars are fake. The innovation is near-zero.
MemPalace's benchmark scores are real and genuinely impressive. The reason is counterintuitive: storing raw text and letting the embedding model do similarity search outperforms LLM-extracted summaries on recall tasks, because extraction is lossy. Mem0's own community audit found 97.8% of extracted memories were junk.
But MemPalace didn't discover this insight — ChromaDB's tutorial demonstrates it. MemPalace packaged it with a celebrity name and bought 42,000 stars.
Claude Code has auto-memory that persists across sessions in `~/.claude/projects/*/memory/`. It's:
- Free
- Local
- Requires zero setup
- Natively integrated into every conversation
- Already being used by anyone reading this
MemPalace's only advantage over native MEMORY.md is semantic vector search across historical conversations. But that advantage is ChromaDB's, not MemPalace's.
A developer who:
- Has months of Claude Code / ChatGPT sessions exported as files on disk
- Wants to search "what did I decide about the auth migration 3 months ago?"
- Doesn't want to pay for API calls
- Doesn't want to run Neo4j/Postgres/Redis
That's a real niche — but it's solved by `pip install chromadb` and 50 lines of Python.
| Dimension | What We Checked | Verdict | Grade |
|---|---|---|---|
| README Claims vs Reality | 17 claims verified against code | 9 real, 3 understated, 3 partial, 1 misleading, 1 phantom | B |
| Core Architecture | Every .py file in mempalace/ | Real code, no stubs, but thin wrappers over ChromaDB | B+ |
| Scientific Grounding | All docs, all concepts, all references | 1 citation total, zero cognitive science, zero information theory | D |
| Test Suite | 32 test files, ~165 test functions | ~55 real integration tests, ~65 mock-heavy, benchmarks solid | B- |
| MCP Server | All 24 tools traced to implementation | Every tool is real (marketing understates at 19) | A |
| Embeddings/Vector Search | Backend code, search code, scoring | Genuine ChromaDB + all-MiniLM-L6-v2, no fake scores | A |
| Benchmarks | All 19 benchmark files | Real datasets, runtime computation, honest methodology disclosures | A |
| Conversation Mining | All 10 pipeline modules | Real parsers, real regex NER, honest about limitations | A- |
| Dependencies & Packaging | pyproject.toml, CI, imports | 2 real deps, 11K lines, CI enforces 80% coverage | A |
| GitHub Stars & Marketing | Stargazer timestamps, author history | Bot-farm patterns, unverified celebrity, version inflation | F |
Let's look at what 42,000 stars bought you. Actual code from the repository, with commentary.
The file that produces the 96.6% benchmark score. Here's the core of it:
```python
# searcher.py — the ENTIRE search logic
results = col.query(
    query_texts=[query],
    n_results=n_results,
    include=["documents", "metadatas", "distances"],
)
```

That's it. One ChromaDB API call. The rest of the 169-line file is argument parsing and print formatting. The "semantic search" is ChromaDB's default behavior with default settings. You don't need MemPalace for this — you need `pip install chromadb`.
The file named palace.py — the core of the "Memory Palace architecture":
```python
# palace.py — the ENTIRE "palace" module
_DEFAULT_BACKEND = ChromaBackend()

def get_collection(palace_path, collection_name="mempalace_drawers", create=True):
    """Get the palace collection through the backend layer."""
    return _DEFAULT_BACKEND.get_collection(
        palace_path, collection_name=collection_name, create=create
    )
```

The "palace" is `chromadb.PersistentClient(path).get_or_create_collection("mempalace_drawers")`. That's the entire architecture. Wings, rooms, halls, tunnels, closets — all of that resolves to metadata strings on ChromaDB documents.
The README says rooms are "auto-detected." Here's the detection algorithm:
```python
# miner.py:302-308 — "room routing"
scores = defaultdict(int)
for room in rooms:
    keywords = room.get("keywords", []) + [room["name"]]
    for kw in keywords:
        count = content_lower.count(kw.lower())  # <-- the entire "AI"
        scores[room["name"]] += count
best = max(scores, key=scores.get)
```

The "intelligent routing" is `str.count()` in a loop. Count how many times "docker" appears in the file, assign it to the "infrastructure" room. That's a freshman CS homework assignment.
The conversation variant is the same pattern:
```python
# convo_miner.py:181-191 — "topic detection"
def detect_convo_room(content: str) -> str:
    content_lower = content[:3000].lower()
    scores = {}
    for room, keywords in TOPIC_KEYWORDS.items():
        scores[room] = sum(1 for kw in keywords if kw in content_lower)  # <-- sum of keyword hits
    return max(scores, key=scores.get)
```

The keywords are hardcoded lists: `["code", "python", "function", "bug", "error", "api"]` for "technical", `["plan", "roadmap", "milestone"]` for "planning". This is a dictionary lookup, not NLP.
The "NLP-free entity detection" that finds people and projects:
```python
# entity_detector.py:449 — the ENTIRE candidate extraction
raw = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
```

Find all capitalized words between 2-20 characters. Filter by frequency >= 3. That's the entity detector. No spaCy, no NER model, no transformer — a regex that matches capitalized words.
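The frequency filter is just as small. A sketch of the complete pattern as described (our reconstruction; the repo's exact filtering code may differ):

```python
import re
from collections import Counter

def detect_entities(text: str, min_count: int = 3) -> list[str]:
    """Capitalized words of 2-20 characters, kept if seen at least min_count times."""
    raw = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
    counts = Counter(raw)
    return [word for word, n in counts.most_common() if n >= min_count]
```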
The graph module advertises "fuzzy matching" for room discovery:
```python
# palace_graph.py:219-230 — "fuzzy" matching
def _fuzzy_match(query: str, nodes: dict, n: int = 5):
    query_lower = query.lower()
    scored = []
    for room in nodes:
        if query_lower in room:  # <-- exact substring match
            scored.append((room, 1.0))
        elif any(word in room for word in query_lower.split("-")):
            scored.append((room, 0.5))  # <-- split on hyphen, substring match
    return [r for r, _ in scored[:n]]
```

No Levenshtein distance. No trigram similarity. No fuzzy matching algorithm of any kind. It's Python's `in` operator — exact substring containment.
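For contrast, real fuzzy matching has been in Python's standard library for decades:

```python
import difflib

rooms = ["postgres-setup", "auth-migration", "docker-infra"]
# SequenceMatcher-based similarity: tolerant of typos and partial matches.
print(difflib.get_close_matches("postgre setup", rooms, n=5, cutoff=0.5))
# ['postgres-setup']
```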
Layer 1 claims to surface "top moments" from your palace:
```python
# layers.py:124-137 — the scoring that determines your "essential story"
scored = []
for doc, meta in zip(docs, metas):
    importance = 3  # <-- default importance
    for key in ("importance", "emotional_weight", "weight"):
        val = meta.get(key)  # <-- check metadata
        if val is not None:
            importance = float(val)
            break
    scored.append((importance, meta, doc))
scored.sort(key=lambda x: x[0], reverse=True)  # sort by importance
top = scored[:15]  # take top 15
```

The problem: the mining pipeline never sets `importance`, `emotional_weight`, or `weight` metadata on any drawer. Check miner.py:377-384 — the metadata fields are `wing`, `room`, `source_file`, `chunk_index`, `added_by`, `filed_at`. No importance. No weight. No emotional score.
So every single drawer gets importance = 3. The sort is meaningless. The "top 15 essential moments" are the first 15 drawers returned by ChromaDB — effectively random order. The "Essential Story" is random.
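Python's sort is stable, so with every key equal the "ranking" provably changes nothing:

```python
drawers = [("drawer_a", 3), ("drawer_b", 3), ("drawer_c", 3)]  # all importance = 3
ranked = sorted(drawers, key=lambda d: d[1], reverse=True)
assert ranked == drawers  # stable sort with equal keys preserves input order
```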
The act of "filing a memory in the palace":
```python
# miner.py:371-397 — "add one drawer to the palace"
def add_drawer(collection, wing, room, content, source_file, chunk_index, agent):
    digest = hashlib.sha256((source_file + str(chunk_index)).encode()).hexdigest()
    drawer_id = f"drawer_{wing}_{room}_{digest[:24]}"
    metadata = {
        "wing": wing, "room": room, "source_file": source_file,
        "chunk_index": chunk_index, "added_by": agent,
        "filed_at": datetime.now().isoformat(),
    }
    collection.upsert(documents=[content], ids=[drawer_id], metadatas=[metadata])
```

Strip away the naming and this is: `collection.upsert(documents=[text], ids=[id], metadatas=[{tags}])`. A single ChromaDB upsert with metadata tags. The "wing" is a string. The "room" is a string. The "drawer" is a ChromaDB document. The "palace" is a ChromaDB collection.
And then there's this:
```python
except Exception:
    raise
```

Lines 396-397. Catch any exception... then immediately re-raise it. This is a no-op. Dead code.
The dedup module reports estimated duplicates:
```python
# dedup.py:147 — "estimation"
estimated_dups = sum(int(len(ids) * 0.4) for ids in groups.values() if len(ids) > 20)
```

The "estimate" is: multiply the number of drawers by 0.4. Every group is assumed to be 40% duplicates. Not computed — guessed — with a hardcoded constant.
The "token counting" across the project:
```python
# layers.py:68 — Layer 0 token estimation
def token_estimate(self) -> int:
    return len(self.render()) // 4

# dialect.py:949 — AAAK token estimation
def count_tokens(text):
    words = text.split()
    return int(len(words) * 1.3)
```
return int(len(words) * 1.3)Two different methods, both wrong. One divides character count by 4, the other multiplies word count by 1.3. No tiktoken. No real tokenizer. The README claims "~170 tokens" for wake-up based on these approximate guesses — the code's own comments say 600-900 tokens.
Here's a complete replacement for MemPalace's core functionality — mine, search, and dedup — using only ChromaDB:
```python
#!/usr/bin/env python3
"""mempalace_replacement.py — The entire "Memory Palace" in 50 lines."""
import chromadb
import hashlib
from pathlib import Path


def mine(source_dir: str, palace_path: str = "~/.mempalace/palace"):
    """Mine files into ChromaDB. Replaces: miner.py, palace.py, convo_miner.py"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_or_create_collection("memories", metadata={"hnsw:space": "cosine"})
    for f in Path(source_dir).expanduser().rglob("*"):
        if f.is_file() and f.suffix in (".md", ".txt", ".jsonl", ".json", ".py"):
            text = f.read_text(errors="ignore")
            # Chunk with overlap (replaces chunk_text)
            for i in range(0, len(text), 700):
                chunk = text[i:i+800].strip()
                if len(chunk) < 50:
                    continue
                cid = hashlib.sha256(f"{f}_{i}".encode()).hexdigest()[:24]
                col.upsert(
                    documents=[chunk],
                    ids=[cid],
                    metadatas=[{"source": str(f), "chunk": i}],
                )
    print(f"Mined {col.count()} chunks into {palace_path}")


def search(query: str, palace_path: str = "~/.mempalace/palace", n: int = 5):
    """Semantic search. Replaces: searcher.py, layers.py L3"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_collection("memories")
    results = col.query(query_texts=[query], n_results=n, include=["documents", "distances"])
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"[{1-dist:.2f}] {doc[:120]}...")


def dedup(palace_path: str = "~/.mempalace/palace", threshold: float = 0.15):
    """Remove near-duplicates. Replaces: dedup.py"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_collection("memories")
    all_data = col.get(include=["documents"])
    to_delete = set()
    for did, doc in zip(all_data["ids"], all_data["documents"]):
        if did in to_delete or not doc:
            continue
        hits = col.query(query_texts=[doc], n_results=3, include=["distances"])
        for rid, dist in zip(hits["ids"][0], hits["distances"][0]):
            if rid != did and dist < threshold:
                to_delete.add(rid)
    if to_delete:
        col.delete(ids=list(to_delete))
        print(f"Removed {len(to_delete)} duplicates")


if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print("Usage: python mempalace_replacement.py mine|search|dedup [args]")
    elif sys.argv[1] == "mine":
        mine(sys.argv[2] if len(sys.argv) > 2 else ".")
    elif sys.argv[1] == "search":
        search(" ".join(sys.argv[2:]))
    elif sys.argv[1] == "dedup":
        dedup()
```

50 lines. One dependency (`chromadb`). Same embedding model. Same HNSW index. Same cosine similarity. Same benchmark score potential.
What's missing compared to MemPalace's 11,139 lines?
- Conversation format parsers (genuinely useful, ~400 lines of real work)
- Pretty-printed CLI output (~200 lines of print statements)
- MCP JSON-RPC server (~1,400 lines of protocol plumbing)
- The word "palace" (~200 occurrences)
What's NOT missing?
- The retrieval quality (that's ChromaDB's)
- The benchmark score (that's ChromaDB's default embedding model)
- The "intelligence" (there is none to miss)
MemPalace is a well-packaged CLI wrapper around ChromaDB with conversation format parsers, marketed with 42,000 purchased GitHub stars and an unverified celebrity name. The retrieval scores are real — they're ChromaDB's scores. The innovation is near-zero. The stars are fake.
This audit was conducted on MemPalace v3.1.0 / develop branch as of 2026-04-12. Previous audit in this series: Ruflo / Claude-Flow: 300+ MCP Tools Exposed — 99% Theater, 1% Real