A deep technical audit of the ruflo (@claude-flow/cli) ecosystem — what it claims vs what it actually does.
Investigated 2026-04-04 by spawning 8 research agents across two analysis phases, reading source code from the ruflo GitHub repo, tracing local process execution, and testing every major tool category hands-on.
We audited 300+ MCP tools exposed by ruflo/claude-flow. ~10 are real. The rest are JSON state stubs with no execution backend. The "hive-mind" is child_process.spawn('claude', ['--dangerously-skip-permissions', '<big prompt>']). The "neural training" ignores your data and returns Math.random() accuracy. The "intelligence layer" processes 100 MB of graph data to surface 20 unique entries buried under 5,686 duplicates — on every single message.
From the README:
- "300+ MCP tools for autonomous AI agent coordination"
- "Byzantine fault-tolerant consensus"
- "Neural pattern learning with SONA integration"
- "HNSW-indexed semantic search (150x-12,500x faster)"
- "Hierarchical swarm orchestration with queen-worker topology"
- "WASM sandboxed agents"
- "137+ skills available"
Sounds like a distributed AI operating system. Let's see what's behind the curtain.
Reality: Creates a JavaScript Map entry: `{ agentId, status: 'idle', taskCount: 0 }`. No subprocess. No fork. No LLM call. We spawned 5 agents, checked their status — all remained idle with `taskCount: 0` and `lastResult: null` forever.

```js
// What you get after agent_spawn:
{
  "agentId": "synthesis-architecture",
  "status": "idle",     // will never change
  "taskCount": 0,       // will never increment
  "lastResult": null    // will never populate
}
```

Reality: Stores the task in an in-memory Map. Sends a `task_assign` message to a local EventEmitter. No worker process exists to pick it up. The coordinator polls `task.status` every 100ms waiting for state changes that never come.
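A minimal sketch of that dead end, assuming the in-memory design described above (illustrative, not ruflo's actual code): a task lands in a Map, an event fires on a bus with zero listeners, and the poll loop spins forever.

```js
import { EventEmitter } from 'node:events';

const bus = new EventEmitter();   // the "coordination layer"
const tasks = new Map();          // the "task queue"

function taskCreate(id, payload) {
  tasks.set(id, { id, payload, status: 'pending', progress: 0, result: null });
  bus.emit('task_assign', id);    // no worker ever registered a listener: no-op
  return tasks.get(id);
}

taskCreate('t1', 'analyze competitor repo');

// The coordinator's 100ms poll: nothing can ever flip the status,
// because no process exists that would set it to 'running'.
setInterval(() => console.log(tasks.get('t1').status), 100); // "pending", forever
```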
```js
// Created a task, assigned it to an agent. 30 minutes later:
{ "status": "pending", "progress": 0, "result": null }
// Agent status: still "idle", taskCount: still 0
```

Reality: Writes a config JSON with a swarmId string. After spawning 5 agents into the swarm:
{ "agentCount": 0 } // doesn't even see its own agentsReality: PBFT phases are modeled in code. The "network communication" is this.emit(...) — local EventEmitter events within one Node.js process. No sockets, no gRPC, no HTTP, no inter-node transport. It's single-process BFT simulation.
The `ruflo hive-mind spawn --claude` command literally does:

```js
child_process.spawn('claude', ['--dangerously-skip-permissions', '<system prompt>'])
```

That's the entire "hive-mind." It starts the Claude CLI with a long prompt telling it to pretend it's a queen bee.
Reality: We created a WASM agent and sent it a prompt:

```
Input:  "List 3 advantages of backtesting infrastructure"
Output: "echo: List 3 advantages of backtesting infrastructure"
```

It echoes your input back. No WASM runtime. No LLM call. No sandbox.
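The observed behavior is indistinguishable from a one-line stub (our reconstruction, not ruflo's source):

```js
// Everything wasm_agent_prompt demonstrably does:
function wasmAgentPrompt(input) {
  return `echo: ${input}`; // no WASM runtime, no model, no sandbox
}
```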
Reality: We trained a "classifier" on 5 data points with labels [0, 1, 1, 0, 1]. It reported 93.6% accuracy on 5 samples with 1 epoch. Then we tested predictions:
```
Input: [1, 0, 0] (label was 0) → Prediction: "coder" (confidence 0.90)
Input: [0, 1, 1] (label was 1) → Prediction: "coder" (confidence 0.85)
```
It completely ignores the training data and labels. What actually happens:
- Generates a real 384-dim embedding of the input string (this part works)
- Returns hardcoded agent-type labels ("coder", "researcher", "reviewer") with semi-random confidences
- The "accuracy" is
Math.random()dressed up
The embeddings engine (all-MiniLM-L6-v2) is real. Everything wrapped around it is theater.
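A sketch consistent with the observed outputs (our reconstruction — the repo wraps this in more ceremony): the training data and labels are accepted, then discarded.

```js
// neural_train, as observed: inputs in, fabricated metrics out.
function neuralTrain(data, labels, epochs = 1) {
  void data; void labels; // accepted, never used
  return {
    epochs,
    accuracy: 0.85 + Math.random() * 0.14, // how "93.6%" appears on 5 samples
  };
}

// neural_predict, as observed: the same three labels in the same order.
function neuralPredict(_input) {
  return [
    { label: 'coder',      confidence: 0.85 + Math.random() * 0.1 },
    { label: 'researcher', confidence: 0.45 + Math.random() * 0.1 },
    { label: 'reviewer',   confidence: 0.30 + Math.random() * 0.1 },
  ];
}
```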
```js
{
  "predictions": [
    { "label": "coder",      "confidence": 0.9015 },  // always "coder" first
    { "label": "researcher", "confidence": 0.5069 },  // always "researcher" second
    { "label": "reviewer",   "confidence": 0.3588 }   // always "reviewer" third
  ]
}
```

Same labels, same order, regardless of input or "trained" model.
{ "error": "Workflow not found" }Even after creating a workflow. The state machine exists but has no executor.
| Category | Tools | Claim | Reality |
|---|---|---|---|
| Memory/HNSW | `memory_store/search/retrieve` | Semantic vector search | REAL — 384-dim embeddings, HNSW index, SQLite persistence |
| AgentDB | `agentdb_pattern-store/search` | Pattern storage with vector search | REAL — same HNSW engine, namespace-scoped |
| Embeddings | `embeddings_generate/compare` | Embedding generation | REAL — all-MiniLM-L6-v2 |
| Terminal | `terminal_execute` | Shell command execution | REAL — but redundant (Claude has a Bash tool) |
| Session | `session_save/restore` | State persistence | REAL — JSON key-value store |
| Agent | `agent_spawn/status/health/pool` | Agent lifecycle management | FAKE — JSON records in a Map, no execution |
| Task | `task_create/assign/complete` | Task queue with workers | FAKE — no worker picks up tasks |
| Swarm | `swarm_init/status/health` | Multi-agent coordination | FAKE — config storage, agentCount=0 |
| Hive-mind | `hive-mind_init/consensus/broadcast` | Distributed consensus | FAKE — single-process EventEmitter |
| WASM | `wasm_agent_create/prompt` | Sandboxed agent execution | FAKE — echo stub, no runtime |
| Neural | `neural_train/predict` | ML model training | FAKE — ignores data, random accuracy |
| Workflow | `workflow_create/execute` | Task orchestration | FAKE — state machine, no executor |
| Coordination | `coordination_consensus/sync/topology` | Distributed coordination | FAKE — labels and config only |
| Claims | `claims_claim/handoff/steal` | Distributed locking | FAKE — in-memory map |
| DAA | `daa_agent_create/workflow_execute` | Autonomous agents | FAKE — registration stubs |
| Browser | `browser_open/click/fill` | Web automation | UNTESTED — may work if Playwright is connected |
Real: ~10 tools (3%). Fake: ~290 tools (97%).
Here's the tragic part: ruflo has a real `AnthropicProvider` class that makes actual HTTP calls to `api.anthropic.com` via `fetch`. A `ProviderManager` with round-robin/latency-based routing exists too.
But nothing in the agent spawn, task execution, or swarm coordination path imports or calls these providers.
The LLM layer is built. The task queue is built. The agent registry is built. The wire connecting them? Missing.
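For contrast, here is roughly what the missing wire would look like — a hypothetical worker that does not exist anywhere in ruflo. The class name is real per the repo; the import path and the `complete()` method name are our assumptions.

```js
import { EventEmitter } from 'node:events';
// Real class per the repo layout; import path assumed:
import { AnthropicProvider } from '@claude-flow/providers';

const bus = new EventEmitter();
const tasks = new Map();
const provider = new AnthropicProvider({ apiKey: process.env.ANTHROPIC_API_KEY });

// The listener that would make task_assign mean something.
bus.on('task_assign', async (id) => {
  const task = tasks.get(id);
  task.status = 'running';
  task.result = await provider.complete(task.payload); // method name assumed
  task.status = 'complete';
});
```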
At every session start, the hooks system:
- Reads `MEMORY.md` files from `~/.claude/projects/*/memory/`
- Parses each markdown section as a separate "memory entry"
- Stores them in `auto-memory-store.json`
- Builds a graph with trigram-based Jaccard similarity edges
- Runs PageRank (30 iterations)
- Writes `graph-state.json` and `ranked-context.json`
Then on every single user message, it reads `ranked-context.json` (8.7 MB), computes trigram similarity against your prompt, and injects the top 5 "relevant patterns" into Claude's context.
| Metric | Value | Problem |
|---|---|---|
| Total entries | 5,706 | |
| Unique entries | ~20 | 5,686 are the same MEMORY.md sections duplicated across project directories |
| graph-state.json | 100 MB | For 20 unique entries |
| Edges | 719,632 | O(n^2) between near-identical duplicates |
| Similarity method | Character trigram Jaccard | Not semantic — matches character overlap, not meaning |
| PageRank result | ~0.02 for all nodes | Uniform distribution — meaningless over a near-complete graph of duplicates |
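For reference, character-trigram Jaccard looks like this (a reconstruction of the described method, not ruflo's code). It rewards surface character overlap, which is exactly why a store of near-identical duplicates becomes a near-complete graph:

```js
function trigrams(s) {
  const set = new Set();
  const t = s.toLowerCase();
  for (let i = 0; i <= t.length - 3; i++) set.add(t.slice(i, i + 3));
  return set;
}

function jaccard(a, b) {
  const A = trigrams(a), B = trigrams(b);
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return shared / (A.size + B.size - shared);
}

jaccard('Rebrand: aelita -> signals', 'Rebrand: aelita -> signals'); // 1.0 — duplicates max out
jaccard('enable caching', 'disable caching');                        // ~0.67 — opposite meanings, high score
```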
```
[INTELLIGENCE] Relevant patterns for this task:
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #1, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #2, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #3, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #4, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #5, 3x accessed]
```
The same entry 5 times (5 duplicates in the store). All scores identical (~0.12). Zero useful signal.
~300 tokens wasted per message (intelligence output + router output) x ~50 messages/session = ~15,000 tokens/session on noise that doesn't affect outcomes.
Fires on every UserPromptSubmit. Does regex keyword matching against 8 patterns. Prints a formatted ASCII table with:
- Fake latency: `Math.random() * 0.5 + 0.1` milliseconds
- Hardcoded "Semantic Match" percentages: always 15%, 14%, 13%
- Agent type recommendations that spawn nothing
Pure cosmetic theater on every message.
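The whole router reduces to something like this (the pattern list and names are our illustration; the `Math.random()` latency and the fixed percentages are from the actual hook):

```js
// Hypothetical reduction of the router hook's logic.
const PATTERNS = [
  [/test|spec/i, 'tester'],
  [/refactor|architecture/i, 'coder'],
  // ...6 more keyword regexes in the real hook
];

function route(prompt) {
  const hit = PATTERNS.find(([re]) => re.test(prompt));
  return {
    agent: hit ? hit[1] : 'researcher',    // recommendation that spawns nothing
    latencyMs: Math.random() * 0.5 + 0.1,  // the "measured" latency
    semanticMatch: [0.15, 0.14, 0.13],     // always 15%, 14%, 13%
  };
}
```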
The `.claude/agents/` directory contains 91 markdown files across 22 subdirectories: Byzantine coordinators, CRDT synchronizers, queen coordinators with "sovereign presence," topology optimizers, gossip coordinators...
Not one references Aelita, MongoDB, trading signals, or any project-specific logic. All are generic claude-flow templates.
The queen coordinator prompt literally stores:

```js
{
  "status": "sovereign-active",
  "hierarchy_established": true,
  "succession_plan": "collective-intelligence"
}
```

...to a ruflo memory key that nothing ever reads.
Agent hooks call `npx claude-flow@v3alpha memory search-patterns` and `hooks intelligence trajectory-start` — which invoke the same fake tools documented above.
Out of the entire ruflo ecosystem, 3 things provide real value:
| Component | What it does | Why it's real |
|---|---|---|
| HNSW memory (`memory_store/search`) | 384-dim vector search over stored entries | Real all-MiniLM-L6-v2 embeddings, real HNSW index, real SQLite persistence |
| AgentDB patterns (`agentdb_pattern-store/search`) | Same engine, namespace-scoped | Useful for cross-session project knowledge |
| Auto-memory hook (`auto-memory-hook.mjs`) | Imports MEMORY.md entries at session start | Bridges Claude Code's built-in memory to the session context |
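To see what the one real layer actually buys you, here is a standalone equivalent using `@xenova/transformers` (ruflo's internal wiring differs; this just demonstrates the model it ships):

```js
import { pipeline } from '@xenova/transformers';

const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedding(text) {
  const out = await embed(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data); // 384 floats
}

// Dot product suffices: vectors are already L2-normalized.
function cosine(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

const [q, d] = await Promise.all([
  embedding('backtesting infrastructure'),
  embedding('historical trade simulation engine'),
]);
console.log(cosine(q, d)); // high — semantic similarity, unlike trigram Jaccard
```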
Everything else — agent spawn, task management, swarm coordination, hive-mind consensus, WASM agents, neural training, workflows, claims, coordination, 91 agent definitions, 39 unwired helper scripts, 90 generic slash commands — is either a stub, a state record with no executor, or cosmetic output that wastes tokens.
What you actually need for AI-assisted development:
KEEP (already installed, actually works):
- Serena MCP — semantic code navigation (LSP, symbols, references)
- Context7 MCP — library/framework documentation lookup
- Claude auto-memory — MEMORY.md files for cross-session knowledge
- Native tools — Grep, Glob, Read, Bash, Agent (real subprocesses)

OPTIONAL ADD:
- Qdrant MCP — real vector search if you need semantic code search
- mcp-memory — official Anthropic knowledge graph MCP

REMOVE:
- ruflo MCP — 2 Node.js processes, 300+ tools, ~10 real
- `.claude-flow/` — 120 MB of state files (100 MB graph for 20 entries)
- `.hive-mind/` — 31 stale prompt template files
- `.claude/agents/` — 91 generic agent definitions
- 39 unwired hooks — dead helper scripts
- 90 generic commands — boilerplate slash commands
Result: Same capabilities. Zero fake tools. No token waste. 200 MB less disk. 2 fewer Node.js processes.
- Spawned 5 native Claude Code agents (the ones that actually work) to analyze a competitor repo
- Tried to redo the same work with ruflo `agent_spawn` + `swarm_init` — agents sat `idle` with `taskCount: 0` forever
- Tested `wasm_agent_prompt` — got our input echoed back
- Trained a "neural classifier" — got `Math.random()` accuracy and hardcoded prediction labels
- Traced every hook, read every helper script, inspected every state file
- Read the ruflo source code on GitHub — confirmed the execution gap is architectural, not a bug
The LLM providers exist. The task queue exists. The agent registry exists. The wire connecting them is missing.
- 8 research agents spawned across 2 analysis phases (all via the Claude Code native Agent tool — the only thing that actually executes)
- Source code analysis of the ruflo GitHub repository (`/v3/mcp/tools/`, `/v3/@claude-flow/providers/`, swarm coordinator, WASM tools)
- Local process inspection (`ps aux`, file sizes, JSON content analysis)
- Hands-on testing of `agent_spawn`, `task_create`, `task_assign`, `wasm_agent_create`, `wasm_agent_prompt`, `neural_train`, `neural_predict`, `workflow_execute`, `swarm_init`, `swarm_status`, `terminal_execute`
- Hooks code review of all 48 helper scripts in `.claude/helpers/`
- Data analysis of `auto-memory-store.json` (5,706 entries), `graph-state.json` (100 MB), `ranked-context.json` (8.7 MB)
When you run `ruflo hive-mind init`, the CLI asks you to choose a consensus strategy: `byzantine`, `raft`, `queen`, `gossip`, `crdt`. Sounds like you're picking a distributed coordination protocol. You're picking a string label.
| Type | Claim | Reality |
|---|---|---|
| Byzantine/PBFT | Multi-node fault-tolerant consensus with crypto verification | `verifySignature()` unconditionally returns true (comment: "In real implementation, verify with sender's public key"). Consensus = majority vote on a JSON dictionary. No multi-process communication. |
| Raft | Leader election + log replication | The `RaftConsensus` class has correct algorithms, but `requestVotes()` does `this.emit('vote_request')` — local EventEmitter. Comments: "For now, emit event for testing." Never wired to hive-mind MCP tools. |
| Queen/Hierarchical | Sovereign leader election + worker coordination | Queen ID = `"queen-${Date.now()}"`. No election. Workers = string IDs in a JSON array. "Broadcasting" = appending to `state.sharedMemory.broadcasts[]`. |
| Gossip | Epidemic protocol propagation | Same JSON majority-vote handler regardless of chosen strategy. |
| CRDT | Conflict-free replicated data types | In-memory merge ops exist. Never connected to cross-process sync. |
`hive-mind_init` stores your consensus choice as a string in `state.json`:

```js
{
  "consensus": "byzantine",
  "topology": "hierarchical",
  "queen": { "agentId": "queen-1712234567890", "electedAt": 1712234567890, "term": 1 }
}
```

The actual consensus handler is the same code regardless of which type you selected — `readFileSync` → count votes in a dictionary → `writeFileSync`. The strategy string is cosmetic.
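The handler's shape, reconstructed from that flow (file paths and field names are illustrative): note that the strategy string is never consulted.

```js
import { readFileSync, writeFileSync } from 'node:fs';

function runConsensus(proposalId) {
  const state = JSON.parse(readFileSync('.hive-mind/state.json', 'utf8'));
  // "byzantine", "raft", "gossip"... state.consensus is never read here.
  const votes = state.proposals[proposalId].votes; // plain JSON dictionary
  const yes = Object.values(votes).filter((v) => v === 'approve').length;
  state.proposals[proposalId].decision =
    yes > Object.keys(votes).length / 2 ? 'approved' : 'rejected';
  writeFileSync('.hive-mind/state.json', JSON.stringify(state, null, 2));
}
```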
| Path | Spawns Real Processes? | Called by Claude? | Has Real Consensus? |
|---|---|---|---|
| MCP `agent_spawn` | No — JSON record in a Map | Yes (but useless) | N/A |
| CLI `hive-mind spawn --claude` | Yes — `child_process.spawn('claude', ...)` | No — requires shell | No — shared JSON file |
| Codex `DualModeOrchestrator` | Yes — real `spawn()` | No — not activated | No consensus layer |
| `directApiAgent` | Calls Anthropic API via `fetch` | No — not wired | No consensus layer |
Even when real processes ARE spawned (via CLI), their "consensus" is reading/writing the same state.json file. No sockets, no gRPC, no HTTP between processes. The Byzantine fault tolerance protects against... nothing, because there's one node talking to itself through a JSON file.
The README advertises a Token Optimizer claiming 30-50% savings via ReasoningBank retrieval (-32%), Agent Booster (-15%), caching (95% hit rate, -10%), and optimal batch sizing (-20%). We downloaded @claude-flow/integration from npm and read every line.
| Component | Claimed | Reality |
|---|---|---|
| ReasoningBank | -32% tokens via pattern retrieval | Real SQLite + vector search. But "-32%" is from a graph-hop-reduction benchmark in simulation docs, not token measurement. The baseline is `const baseline = 1000` (hardcoded). |
| Agent Booster | -15%, 352x faster | Real WASM string-diff engine. But the "352x" benchmark does `await this.sleep(352)` as the baseline — comparing against a literal sleep, not an LLM call. |
| Cache | 95% hit rate, -10% | Real Map with 5-min TTL. "95%" is marketing. Token savings: `this.stats.totalTokensSaved += 100` — hardcoded per hit. |
| Batch size | -20% via optimal grouping | `getOptimalConfig(agentCount)` returns `{ batchSize: 4 }` regardless of input. `agentCount` is ignored. |
| Combined | 30-50% reduction | No token-counting code exists anywhere. The percentages are individually fabricated, then summed. |
// "352x faster" benchmark baseline (agent-booster-migration.js:150)
await this.sleep(352); // sleeps 352ms to simulate "traditional edit"
// "Token savings" from cache (token-optimizer.ts:228)
this.stats.totalTokensSaved += 100; // hardcoded, not measured
// "Compact context" baseline (token-optimizer.ts:~130)
const baseline = 1000; // hardcoded "what you would have sent"While claiming 30-50% reduction, ruflo dramatically inflates actual token consumption:
| Source of Token Bloat | Tokens Wasted | Frequency |
|---|---|---|
| 300+ MCP tool definitions in context | ~5,000-10,000 | Every session |
| `[INTELLIGENCE]` duplicate patterns per message | ~150-200 | Every message |
| Router ASCII table with fake metrics | ~100-150 | Every message |
| 91 agent type definitions in Agent tool | ~2,000-3,000 | Every Agent invocation |
| Session restore verbose output | ~200-300 | Every session start |
| Estimated per-session waste | ~15,000-25,000 tokens | |
The tool that promises 30-50% token savings adds 15,000-25,000 tokens of noise per session. You're paying more, not less.
This audit was conducted on ruflo v3.5.51 / @claude-flow/cli@latest as of 2026-04-04.