@roman-rr
Last active April 23, 2026 22:02
MemPalace Exposed: 42,000 Purchased Stars, Zero Innovation — It's ChromaDB With a Celebrity Name


A deep technical audit of MemPalace (github.com/MemPalace/mempalace) — what it claims vs what it actually is.

Every source file read. Every benchmark traced. Every MCP tool verified. Compared against Mem0, Zep/Graphiti, and Letta/MemGPT. GitHub stargazer timestamps analyzed for bot patterns.


TL;DR

MemPalace markets itself as "The highest-scoring AI memory system ever benchmarked." After auditing all 11,139 lines of Python, 32 test files, 19 benchmark files, 24 MCP tools, and 42,497 GitHub stars — here's the truth:

The 96.6% LongMemEval score is ChromaDB's score. You could replicate it with ~50 lines of Python. MemPalace wraps ChromaDB with CLI commands, conversation format parsers, and metadata tagging — but the retrieval engine that produces the headline number is unmodified ChromaDB with default settings and a default embedding model (all-MiniLM-L6-v2).

The 42,000 stars were accumulated in 7 days with bot-farm timing patterns. The "celebrity founder" GitHub account has 0 public repos. The actual author has no prior software engineering projects of this complexity. The version number (v3.1.0) was invented — there is no v1 or v2.


The Uncomfortable Questions

| Question | Answer |
| --- | --- |
| Is MemPalace technically novel? | No. It's ChromaDB + metadata filtering + regex extraction. No new algorithms. |
| Does it beat grounded competitors on intelligence? | No. Mem0/Zep/Letta extract structured knowledge via LLM. MemPalace stores raw text. The "smart" part is ChromaDB's embedding model. |
| Does it beat them on scientific grounding? | No. Zero papers vs. 3 arXiv papers (Mem0, Zep, MemGPT). 1 academic citation in the entire codebase. |
| Do 42K stars reflect real community adoption? | No. Stars are purchased. A real community exists (genuine PRs, real bugs) but is small — consistent with a 7-day-old project. |
| Why not just use Claude Code's built-in MEMORY.md? | Good question. Claude Code already has auto-memory that persists across sessions. It's free, local, requires zero setup, and is natively integrated. |
| What does MemPalace add over raw ChromaDB? | Conversation format parsers and a CLI. That's it. |
| Is the "Memory Palace" a real architecture? | No. It's metadata string fields (wing, room, hall) on ChromaDB documents. No spatial indexing, no coordinates, no novel data structure. |
| Does the "AAAK dialect" use information theory? | No. It's regex-based text summarization. No Shannon entropy, no Huffman coding. And it actually hurts benchmark scores (84.2% vs 96.6% raw). |

The Stars Are Purchased — Proof

42,497 stars in 7 days. For context, most legitimately viral open-source projects take weeks to months to reach 10K stars.

Stargazer Timestamp Analysis

We sampled stargazer pages across the timeline using the GitHub API:

Page 100 (April 7) — 10 stars in 63 seconds:

05:35:01, 05:35:01, 05:35:14, 05:35:16, 05:35:24,
05:35:30, 05:35:35, 05:35:43, 05:35:56, 05:36:04

Two stars in the same second. This is a textbook bot-farm pattern.

Page 4000 (April 11) — 10 stars with metronomic ~30-second intervals:

01:05:47, 01:06:15, 01:06:17, 01:06:34, 01:07:04,
01:07:34, 01:07:42, 01:08:03, 01:08:21, 01:09:22

This is consistent with a bot farm rate-limited to avoid detection.

Organic stargazing does not produce metronomic regularity across thousands of accounts. The star count is almost certainly purchased.
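The interval analysis is straightforward to reproduce with only the standard library. One caveat on the API: the `starred_at` field is only returned when you send GitHub's documented `application/vnd.github.star+json` Accept header. The sample data below is the page-100 timestamps quoted above; the fetch helper is a sketch and ignores rate limiting.

```python
"""Sketch: measure inter-star gaps from GitHub stargazer timestamps."""
import json
import urllib.request
from datetime import datetime

def fetch_stargazer_times(repo: str, page: int, token: str = "") -> list[str]:
    """Fetch one page of stargazer timestamps (sketch; no rate-limit handling)."""
    headers = {"Accept": "application/vnd.github.star+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/stargazers?page={page}",
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return [row["starred_at"] for row in json.load(resp)]

def gaps_seconds(timestamps: list[str]) -> list[float]:
    """Gaps between consecutive stars; organic traffic is bursty, not metronomic."""
    ts = [datetime.fromisoformat(t.replace("Z", "+00:00")) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

# The page-100 sample quoted in this audit (2026-04-07):
sample = ["2026-04-07T05:35:01Z", "2026-04-07T05:35:01Z", "2026-04-07T05:35:14Z",
          "2026-04-07T05:35:16Z", "2026-04-07T05:35:24Z", "2026-04-07T05:35:30Z",
          "2026-04-07T05:35:35Z", "2026-04-07T05:35:43Z", "2026-04-07T05:35:56Z",
          "2026-04-07T05:36:04Z"]
print(gaps_seconds(sample))  # ten stars in 63 seconds, including two in the same second
```

Running `gaps_seconds` over many sampled pages and plotting the distribution is how the metronomic ~30-second pattern on page 4000 shows up.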

The "Celebrity Founder"

  • GitHub account milla-jovovich: Created September 2025. Claims to be actress Milla Jovovich.
  • 0 public repositories. 0 following. 8,276 followers.
  • No verification that this account belongs to the real actress.
  • Issue responses flagged by community as AI-generated (Issue #618).
  • The account has "COLLABORATOR" association, not owner — transferred to the MemPalace org (created April 10, 2 days before this audit).

The Actual Author

  • bensig (Ben Sigman): GitHub since 2012. 66 public repos.
  • Background: Primarily Bitcoin/crypto projects. Most repos have 0-3 stars. Highest: patoshi-addresses at 11 stars.
  • No prior significant software engineering projects. No Python repos with meaningful adoption before MemPalace.
  • Commits are mostly: merge commits, CI configuration, ruff formatting, docs. Substantive code contributions come from external contributors.

Version Number Inflation

  • v3.1.0 for a project created 7 days ago.
  • No v1 or v2 release exists in the release history.
  • Two releases total: v3.0.0 (April 6) and v3.1.0 (April 9).

The 96.6% Score Belongs to ChromaDB

What MemPalace's benchmark actually does:

  1. Read conversation text from files
  2. Split into chunks
  3. collection.add(documents=chunks) — store in ChromaDB
  4. collection.query(query_texts=[question], n_results=5) — search ChromaDB
  5. Check if the answer session is in the results

Steps 3 and 4 are ChromaDB's API. The embedding model (all-MiniLM-L6-v2) is ChromaDB's default. The HNSW index is ChromaDB's. The cosine similarity is ChromaDB's.

You can replicate this in ~50 lines:

import chromadb, pathlib

client = chromadb.PersistentClient("./test_palace")
col = client.get_or_create_collection("memories")

# "mine" — store conversation chunks
for f in pathlib.Path("~/claude-sessions").expanduser().glob("*.jsonl"):
    text = f.read_text()
    chunks = [text[i:i+800] for i in range(0, len(text), 700)]
    col.add(
        documents=chunks,
        ids=[f"{f.stem}_{i}" for i in range(len(chunks))]
    )

# "search" — semantic retrieval
results = col.query(query_texts=["why did we choose Postgres?"], n_results=5)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{1-dist:.2f}] {doc[:100]}...")

That's the entire "innovation" behind the headline benchmark number.

What MemPalace adds on top (and what it's worth):

| Layer | What it does | Innovation? |
| --- | --- | --- |
| Format parsers | Reads Claude/ChatGPT/Codex/Slack exports | Useful glue code, not innovation |
| Chunking | 800-char overlapping windows | Textbook technique, ~20 lines of code |
| Room/wing tagging | Keyword counting assigns metadata strings | content.count("docker") — basic string matching |
| Knowledge graph | SQLite with 2 tables (entities, triples) | Standard SCD Type 2 from data warehousing |
| AAAK dialect | Regex-based summarization | Hand-rolled, no information theory, hurts benchmark scores by 12.4% |
| Entity detection | Regex for capitalized words + frequency | No NLP model, no NER, no ML |
| Dedup | Query ChromaDB for similar docs, delete close ones | Uses ChromaDB's own similarity — not a novel algorithm |
| Layers (L0-L3) | Load small context first, expand on demand | Standard tiered caching pattern |
| MCP server | JSON-RPC wrapper around all of the above | Integration plumbing, not innovation |
| Auto-save hooks | Count messages, trigger save | Shell scripts |
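To make the "knowledge graph" row concrete: a two-table entity/triple store with SCD Type 2 validity windows fits in a screenful of SQLite. The table and column names below are illustrative assumptions, not MemPalace's actual schema.

```python
"""Sketch: a minimal two-table entity/triple store (names are assumptions)."""
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE triples (
        subject_id INTEGER REFERENCES entities(id),
        predicate  TEXT,
        object_id  INTEGER REFERENCES entities(id),
        valid_from TEXT,   -- SCD Type 2 style: when this version became true
        valid_to   TEXT    -- NULL = still the current version of the fact
    );
""")

def entity_id(name: str) -> int:
    """Insert-or-fetch an entity row by name."""
    con.execute("INSERT OR IGNORE INTO entities (name) VALUES (?)", (name,))
    return con.execute("SELECT id FROM entities WHERE name = ?", (name,)).fetchone()[0]

con.execute(
    "INSERT INTO triples VALUES (?, ?, ?, ?, NULL)",
    (entity_id("MemPalace"), "wraps", entity_id("ChromaDB"), "2026-04-06"),
)
row = con.execute(
    """SELECT s.name, t.predicate, o.name
       FROM triples t
       JOIN entities s ON s.id = t.subject_id
       JOIN entities o ON o.id = t.object_id"""
).fetchone()
print(row)  # ('MemPalace', 'wraps', 'ChromaDB')
```

This is a plain relational triple store: no traversal algorithms, no graph database, just two joins.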

MemPalace vs. Real Competitors — The Honest Comparison

We compared against the 3 most grounded AI memory systems that use similar approaches:

Architecture Comparison

| | Mem0 | Zep/Graphiti | Letta/MemGPT | MemPalace |
| --- | --- | --- | --- | --- |
| Core idea | LLM extracts atomic facts | LLM builds temporal knowledge graph | LLM self-manages its own memory | Store raw text in ChromaDB |
| LLM at write time | Yes | Yes | Yes | No |
| LLM at read time | No | No | Yes | No |
| What's stored | Structured facts | Entity-relationship triples with timestamps | Key-value blocks + archival vectors | Verbatim text chunks |
| Paper | arXiv:2504.19413 | arXiv:2501.13956 | arXiv:2310.08560 (UC Berkeley) | None |
| Stars | 52,745 (3 years, organic) | 24,802 (2 years, organic) | 22,018 (2.5 years, organic) | 42,497 (7 days, purchased) |
| Infrastructure | Qdrant + Postgres + API key | Neo4j + API key | Postgres + pgvector + Redis + API key | ChromaDB (embedded) |

What You Can't Replicate in 50 Lines

| System | Novel contribution |
| --- | --- |
| Mem0 | LLM extraction pipeline with ADD/UPDATE/DELETE decisions against existing memories. Genuinely novel architecture. |
| Zep/Graphiti | Bi-temporal knowledge graph with event time + ingestion time. LLM-driven contradiction resolution. |
| Letta/MemGPT | Self-editing memory where the agent decides what to store/forget. OS-inspired memory hierarchy from UC Berkeley. |
| MemPalace | Conversation format parsers + CLI. Everything else is ChromaDB's default behavior. |

Where MemPalace Claims to Win (and why it doesn't matter)

MemPalace's 96.6% LongMemEval R@5 is a real number. It beats Mem0 (~75-80%), Zep (not published), and Letta (not tested on LongMemEval).

Why this doesn't matter: The benchmark tests "can you find the right session given a question?" — pure retrieval recall. Raw text + vector search wins because LLM extraction is lossy — it discards context. Mem0's own audit found 97.8% of extracted memories were junk.

But benchmarks test needles-in-haystacks. Real agent memory needs structured knowledge: "what does this user prefer?", "what changed since last week?", "does this contradict what we decided before?" For those questions, MemPalace returns raw text chunks and hopes the LLM figures it out. Mem0/Zep/Letta return structured, deduplicated, temporally-aware facts.

High recall on benchmarks ≠ high utility in practice.


Scientific Grounding: Zero

| Claim | Scientific basis |
| --- | --- |
| "Memory Palace" architecture | None — method of loci is metaphorical only. No spatial indexing, no coordinates. It's metadata strings. |
| AAAK dialect | None — regex summarization. No Shannon entropy, no information theory. |
| Knowledge graph | Standard relational triple store (2 SQLite tables). No graph theory algorithms. |
| 4-layer memory stack | Standard tiered caching. No cognitive science (no Atkinson-Shiffrin, no Baddeley, no Tulving). |
| Contradiction detection | Does not exist. fact_checker.py is referenced in the README but the file is not in the codebase. |
| Academic citations | 1 total (MemBench ACL 2025 in a benchmark script). Zero in architectural documentation. |
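For contrast, here is what even the most minimal information-theoretic measure looks like: Shannon entropy over a symbol distribution, a few lines of stdlib Python. Nothing resembling this appears in the AAAK code.

```python
"""Shannon entropy of a character distribution, in bits per symbol."""
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """H = sum(p * log2(1/p)) over observed symbol frequencies."""
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(shannon_entropy("aaaa"))  # 0.0: a single repeated symbol carries no information
print(shannon_entropy("abcd"))  # 2.0: four equiprobable symbols need 2 bits each
```

An entropy-aware summarizer would at minimum rank spans by surprisal before discarding them; AAAK ranks nothing, it pattern-matches with regexes.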

Compare: Mem0 has arXiv:2504.19413. Zep has arXiv:2501.13956. MemGPT has arXiv:2310.08560 from UC Berkeley with ~154 citations.


What's Real (Credit Where Due)

This is NOT a ruflo situation. MemPalace has real, working code:

  • All 24 MCP tools work. Every tool is backed by actual ChromaDB/SQLite operations. No stubs. The marketing actually understates the tool count (it says 19; there are 24).
  • Benchmarks are not fabricated. Real code running against real academic datasets (LongMemEval, LoCoMo, ConvoMem, MemBench). Scores computed at runtime. The project is unusually honest about its own benchmark weaknesses in BENCHMARKS.md.
  • 11,139 lines of real Python with a test suite enforcing 80% coverage across 3 operating systems.
  • Conversation format parsers for Claude Code JSONL, ChatGPT tree-structured JSON, Codex CLI, Slack — these are non-trivial and genuinely useful.
  • The ChromaDB migration fix (_fix_blob_seq_ids) handles a real 0.6.x → 1.5.x breaking change. Evidence of real-world usage.

MemPalace is 80% real code + 20% marketing inflation. The code works. The stars are fake. The innovation is near-zero.


The Core Paradox

MemPalace's benchmark scores are real and genuinely impressive. The reason is counterintuitive: storing raw text and letting the embedding model do similarity search outperforms LLM-extracted summaries on recall tasks, because extraction is lossy. Mem0's own community audit found 97.8% of extracted memories were junk.

But MemPalace didn't discover this insight — ChromaDB's tutorial demonstrates it. MemPalace packaged it with a celebrity name and bought 42,000 stars.


Why You Probably Don't Need This

Claude Code's built-in MEMORY.md already exists

Claude Code has auto-memory that persists across sessions in ~/.claude/projects/*/memory/. It's:

  • Free
  • Local
  • Requires zero setup
  • Natively integrated into every conversation
  • Already being used by anyone reading this

MemPalace's only advantage over native MEMORY.md is semantic vector search across historical conversations. But that advantage is ChromaDB's, not MemPalace's.

The actual use case is extremely narrow

A developer who:

  1. Has months of Claude Code / ChatGPT sessions exported as files on disk
  2. Wants to search "what did I decide about the auth migration 3 months ago?"
  3. Doesn't want to pay for API calls
  4. Doesn't want to run Neo4j/Postgres/Redis

That's a real niche — but it's solved by pip install chromadb and 50 lines of Python.


The Full Audit Scorecard

| Dimension | What we checked | Verdict | Grade |
| --- | --- | --- | --- |
| README claims vs reality | 17 claims verified against code | 9 real, 3 understated, 3 partial, 1 misleading, 1 phantom | B |
| Core architecture | Every .py file in mempalace/ | Real code, no stubs, but thin wrappers over ChromaDB | B+ |
| Scientific grounding | All docs, all concepts, all references | 1 citation total, zero cognitive science, zero information theory | D |
| Test suite | 32 test files, ~165 test functions | ~55 real integration tests, ~65 mock-heavy, benchmarks solid | B- |
| MCP server | All 24 tools traced to implementation | Every tool is real (marketing understates at 19) | A |
| Embeddings/vector search | Backend code, search code, scoring | Genuine ChromaDB + all-MiniLM-L6-v2, no fake scores | A |
| Benchmarks | All 19 benchmark files | Real datasets, runtime computation, honest methodology disclosures | A |
| Conversation mining | All 10 pipeline modules | Real parsers, real regex NER, honest about limitations | A- |
| Dependencies & packaging | pyproject.toml, CI, imports | 2 real deps, 11K lines, CI enforces 80% coverage | A |
| GitHub stars & marketing | Stargazer timestamps, author history | Bot-farm patterns, unverified celebrity, version inflation | F |

Show Me The Code — The Roast

Let's look at what 42,000 stars bought you. Actual code from the repository, with commentary.


The Entire "Search Engine" (searcher.py)

The file that produces the 96.6% benchmark score. Here's the core of it:

# searcher.py — the ENTIRE search logic
results = col.query(
    query_texts=[query],
    n_results=n_results,
    include=["documents", "metadatas", "distances"],
)

That's it. One ChromaDB API call. The rest of the 169-line file is argument parsing and print formatting. The "semantic search" is ChromaDB's default behavior with default settings. You don't need MemPalace for this — you need pip install chromadb.


The "Palace" Architecture (palace.py — all 74 lines of it)

The file named palace.py — the core of the "Memory Palace architecture":

# palace.py — the ENTIRE "palace" module

_DEFAULT_BACKEND = ChromaBackend()

def get_collection(palace_path, collection_name="mempalace_drawers", create=True):
    """Get the palace collection through the backend layer."""
    return _DEFAULT_BACKEND.get_collection(palace_path, collection_name=collection_name, create=create)

The "palace" is chromadb.PersistentClient(path).get_or_create_collection("mempalace_drawers"). That's the entire architecture. Wings, rooms, halls, tunnels, closets — all of that resolves to metadata strings on ChromaDB documents.


"Intelligent Room Detection" — It's content.count()

The README says rooms are "auto-detected." Here's the detection algorithm:

# miner.py:302-308 — "room routing"
scores = defaultdict(int)
for room in rooms:
    keywords = room.get("keywords", []) + [room["name"]]
    for kw in keywords:
        count = content_lower.count(kw.lower())   # <-- the entire "AI"
        scores[room["name"]] += count

best = max(scores, key=scores.get)

The "intelligent routing" is str.count() in a loop. Count how many times "docker" appears in the file, assign it to the "infrastructure" room. That's a freshman CS homework assignment.

The conversation variant is the same pattern:

# convo_miner.py:181-191 — "topic detection"
def detect_convo_room(content: str) -> str:
    content_lower = content[:3000].lower()
    scores = {}
    for room, keywords in TOPIC_KEYWORDS.items():
        scores[room] = sum(1 for kw in keywords if kw in content_lower)   # <-- sum of keyword hits
    return max(scores, key=scores.get)

The keywords are hardcoded lists: ["code", "python", "function", "bug", "error", "api"] for "technical", ["plan", "roadmap", "milestone"] for "planning". This is a dictionary lookup, not NLP.


"Entity Detection" — A Single Regex

The "NLP-free entity detection" that finds people and projects:

# entity_detector.py:449 — the ENTIRE candidate extraction
raw = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)

Find all capitalized words between 2-20 characters. Filter by frequency >= 3. That's the entity detector. No spaCy, no NER model, no transformer — a regex that matches capitalized words.
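The whole detector reduces to that regex plus the frequency cutoff. A minimal stdlib reconstruction of the behavior as described in this audit (the surrounding filtering in the real file may differ):

```python
"""Minimal equivalent of the audited 'entity detection': one regex + a count filter."""
import re
from collections import Counter

def detect_entities(text: str, min_count: int = 3) -> list[str]:
    """Capitalized words seen at least min_count times. No NER, no model."""
    candidates = re.findall(r"\b([A-Z][a-z]{1,19})\b", text)
    return [word for word, n in Counter(candidates).most_common() if n >= min_count]

text = "Postgres is fine. Postgres scales. We kept Postgres. Redis was dropped."
print(detect_entities(text))  # ['Postgres']: only names seen 3+ times survive
```

Note what this misses by construction: lowercase project names, multi-word entities ("New York"), all-caps acronyms, and any capitalized sentence-starter that happens to repeat.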


"Fuzzy Matching" — It's substring in

The graph module advertises "fuzzy matching" for room discovery:

# palace_graph.py:219-230 — "fuzzy" matching
def _fuzzy_match(query: str, nodes: dict, n: int = 5):
    query_lower = query.lower()
    scored = []
    for room in nodes:
        if query_lower in room:          # <-- exact substring match
            scored.append((room, 1.0))
        elif any(word in room for word in query_lower.split("-")):
            scored.append((room, 0.5))   # <-- split on hyphen, substring match
    return [r for r, _ in scored[:n]]

No Levenshtein distance. No trigram similarity. No fuzzy matching algorithm of any kind. It's Python's in operator — exact substring containment.
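For contrast, the standard library already ships real fuzzy matching: `difflib.get_close_matches` ranks candidates by an edit-based similarity ratio, so a one-character typo still finds the room. A sketch (room names invented for illustration):

```python
"""Actual fuzzy matching via the stdlib, no substring tricks."""
import difflib

def fuzzy_rooms(query: str, rooms: list[str], n: int = 5, cutoff: float = 0.6) -> list[str]:
    """Top-n rooms by SequenceMatcher ratio; tolerates typos, unlike `in`."""
    return difflib.get_close_matches(query.lower(), [r.lower() for r in rooms],
                                     n=n, cutoff=cutoff)

rooms = ["infrastructure", "planning", "debugging"]
print(fuzzy_rooms("infrastucture", rooms))  # ['infrastructure'], despite the missing 'r'
```

The substring version in palace_graph.py returns nothing for that same typo, because "infrastucture" is not contained in "infrastructure".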


Layer 1 "Importance Scoring" — Everything Scores the Same

Layer 1 claims to surface "top moments" from your palace:

# layers.py:124-137 — the scoring that determines your "essential story"
scored = []
for doc, meta in zip(docs, metas):
    importance = 3                            # <-- default importance
    for key in ("importance", "emotional_weight", "weight"):
        val = meta.get(key)                   # <-- check metadata
        if val is not None:
            importance = float(val)
            break
    scored.append((importance, meta, doc))

scored.sort(key=lambda x: x[0], reverse=True)   # sort by importance
top = scored[:15]                                 # take top 15

The problem: the mining pipeline never sets importance, emotional_weight, or weight metadata on any drawer. Check miner.py:377-384 — the metadata fields are wing, room, source_file, chunk_index, added_by, filed_at. No importance. No weight. No emotional score.

So every single drawer gets importance = 3. The sort is meaningless. The "top 15 essential moments" are the first 15 drawers returned by ChromaDB — effectively random order. The "Essential Story" is random.


"Filing a Drawer" — It's collection.upsert()

The act of "filing a memory in the palace":

# miner.py:371-397 — "add one drawer to the palace"
def add_drawer(collection, wing, room, content, source_file, chunk_index, agent):
    drawer_id = f"drawer_{wing}_{room}_{hashlib.sha256(
        (source_file + str(chunk_index)).encode()
    ).hexdigest()[:24]}"
    
    metadata = {
        "wing": wing, "room": room, "source_file": source_file,
        "chunk_index": chunk_index, "added_by": agent,
        "filed_at": datetime.now().isoformat(),
    }
    
    collection.upsert(documents=[content], ids=[drawer_id], metadatas=[metadata])

Strip away the naming and this is: collection.upsert(documents=[text], ids=[id], metadatas=[{tags}]). A single ChromaDB upsert with metadata tags. The "wing" is a string. The "room" is a string. The "drawer" is a ChromaDB document. The "palace" is a ChromaDB collection.

And then there's this:

    except Exception:
        raise

Lines 396-397. Catch any exception... then immediately re-raise it. This is a no-op. Dead code.


Dedup Statistics — A Hardcoded 40% Guess

The dedup module reports estimated duplicates:

# dedup.py:147 — "estimation"
estimated_dups = sum(int(len(ids) * 0.4) for ids in groups.values() if len(ids) > 20)

The "estimate" is: multiply the number of drawers by 0.4. Every group is assumed to be 40% duplicates. Not computed — guessed — with a hardcoded constant.


Token Estimation — Divide by 4

The "token counting" across the project:

# layers.py:68 — Layer 0 token estimation
def token_estimate(self) -> int:
    return len(self.render()) // 4

# dialect.py:949 — AAAK token estimation
def count_tokens(text):
    words = text.split()
    return int(len(words) * 1.3)

Two different methods, both crude, and they disagree with each other. One divides character count by 4; the other multiplies word count by 1.3. No tiktoken. No real tokenizer. The README claims "~170 tokens" for wake-up based on these heuristics — the code's own comments say 600-900 tokens.
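You can watch the two heuristics disagree on any realistic string (the sample metadata line below is invented for illustration; function names are ours, not the repo's):

```python
"""The two in-repo token heuristics, side by side."""
def tokens_by_chars(text: str) -> int:
    return len(text) // 4                  # layers.py style: chars / 4

def tokens_by_words(text: str) -> int:
    return int(len(text.split()) * 1.3)    # dialect.py style: words * 1.3

text = "wing: infrastructure room: docker-compose filed_at: 2026-04-09T05:35:01"
print(tokens_by_chars(text), tokens_by_words(text))  # 17 vs 7: off by more than 2x
```

Dense metadata strings tokenize nothing like prose, which is exactly the kind of content a memory system stores, so both estimates drift badly in opposite directions.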


Replace The Entire Repo in 50 Lines

Here's a complete replacement for MemPalace's core functionality — mine, search, and dedup — using only ChromaDB:

#!/usr/bin/env python3
"""mempalace_replacement.py — The entire "Memory Palace" in 50 lines."""

import chromadb
import hashlib
from pathlib import Path

def mine(source_dir: str, palace_path: str = "~/.mempalace/palace"):
    """Mine files into ChromaDB. Replaces: miner.py, palace.py, convo_miner.py"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_or_create_collection("memories", metadata={"hnsw:space": "cosine"})
    
    for f in Path(source_dir).expanduser().rglob("*"):
        if f.is_file() and f.suffix in (".md", ".txt", ".jsonl", ".json", ".py"):
            text = f.read_text(errors="ignore")
            # Chunk with overlap (replaces chunk_text)
            for i in range(0, len(text), 700):
                chunk = text[i:i+800].strip()
                if len(chunk) < 50:
                    continue
                cid = hashlib.sha256(f"{f}_{i}".encode()).hexdigest()[:24]
                col.upsert(
                    documents=[chunk],
                    ids=[cid],
                    metadatas=[{"source": str(f), "chunk": i}],
                )
    print(f"Mined {col.count()} chunks into {palace_path}")

def search(query: str, palace_path: str = "~/.mempalace/palace", n: int = 5):
    """Semantic search. Replaces: searcher.py, layers.py L3"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_collection("memories")
    results = col.query(query_texts=[query], n_results=n, include=["documents", "distances"])
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"[{1-dist:.2f}] {doc[:120]}...")

def dedup(palace_path: str = "~/.mempalace/palace", threshold: float = 0.15):
    """Remove near-duplicates. Replaces: dedup.py"""
    client = chromadb.PersistentClient(path=str(Path(palace_path).expanduser()))
    col = client.get_collection("memories")
    all_data = col.get(include=["documents"])
    to_delete = []
    for did, doc in zip(all_data["ids"], all_data["documents"]):
        if did in to_delete or not doc:
            continue
        hits = col.query(query_texts=[doc], n_results=3, include=["distances"])
        for rid, dist in zip(hits["ids"][0], hits["distances"][0]):
            if rid != did and dist < threshold:
                to_delete.append(rid)
    if to_delete:
        col.delete(ids=list(set(to_delete)))
    print(f"Removed {len(set(to_delete))} duplicates")

if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print("Usage: python mempalace_replacement.py mine|search|dedup [args]")
    elif sys.argv[1] == "mine":
        mine(sys.argv[2] if len(sys.argv) > 2 else ".")
    elif sys.argv[1] == "search":
        search(" ".join(sys.argv[2:]))
    elif sys.argv[1] == "dedup":
        dedup()

50 lines. One dependency (chromadb). Same embedding model. Same HNSW index. Same cosine similarity. Same benchmark score potential.

What's missing compared to MemPalace's 11,139 lines?

  • Conversation format parsers (genuinely useful, ~400 lines of real work)
  • Pretty-printed CLI output (~200 lines of print statements)
  • MCP JSON-RPC server (~1,400 lines of protocol plumbing)
  • The word "palace" (~200 occurrences)

What's NOT missing?

  • The retrieval quality (that's ChromaDB's)
  • The benchmark score (that's ChromaDB's default embedding model)
  • The "intelligence" (there is none to miss)

One-Line Summary

MemPalace is a well-packaged CLI wrapper around ChromaDB with conversation format parsers, marketed with 42,000 purchased GitHub stars and an unverified celebrity name. The retrieval scores are real — they're ChromaDB's scores. The innovation is near-zero. The stars are fake.


This audit was conducted on MemPalace v3.1.0 / develop branch as of 2026-04-12. Previous audit in this series: Ruflo / Claude-Flow: 300+ MCP Tools Exposed — 99% Theater, 1% Real

bensig commented Apr 23, 2026

lol
