Ruflo / Claude-Flow: 300+ MCP Tools Exposed — 99% Theater, 1% Real, 5x Token Waste

A deep technical audit of the ruflo (@claude-flow/cli) ecosystem — what it claims vs what it actually does.

Investigated 2026-04-04 by spawning 8 research agents across two analysis phases, reading source code from the ruflo GitHub repo, tracing local process execution, and testing every major tool category hands-on.


TL;DR

We audited 300+ MCP tools exposed by ruflo/claude-flow. ~10 are real. The rest are JSON state stubs with no execution backend. The "hive-mind" is child_process.spawn('claude', ['--dangerously-skip-permissions', '<big prompt>']). The "neural training" ignores your data and returns Math.random() accuracy. The "intelligence layer" processes 100 MB of graph data to surface 20 unique entries buried under 5,686 duplicates — on every single message.


What Ruflo Claims to Be

From the README:

  • "300+ MCP tools for autonomous AI agent coordination"
  • "Byzantine fault-tolerant consensus"
  • "Neural pattern learning with SONA integration"
  • "HNSW-indexed semantic search (150x-12,500x faster)"
  • "Hierarchical swarm orchestration with queen-worker topology"
  • "WASM sandboxed agents"
  • "137+ skills available"

Sounds like a distributed AI operating system. Let's see what's behind the curtain.


The Audit: Tool by Tool

Agent Spawn — "Spawn agent with intelligent model selection"

Reality: Creates a JavaScript Map entry: { agentId, status: 'idle', taskCount: 0 }. No subprocess. No fork. No LLM call. We spawned 5 agents, checked their status — all remained idle with taskCount: 0 and lastResult: null forever.

// What you get after agent_spawn:
{
  "agentId": "synthesis-architecture",
  "status": "idle",      // will never change
  "taskCount": 0,        // will never increment
  "lastResult": null     // will never populate
}
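
That JSON is the whole artifact. A minimal sketch of the pattern the audit describes (hypothetical names, not ruflo's actual source) shows why the status can never change:

// Hypothetical reconstruction of the agent_spawn pattern described above.
const agents = new Map();

function agentSpawn(agentId) {
  const record = { agentId, status: 'idle', taskCount: 0, lastResult: null };
  agents.set(agentId, record);   // the entire "spawn": a Map.set()
  return record;                 // no child_process, no fork, no LLM call
}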

Task Create + Assign — "Create and assign tasks to agents"

Reality: Stores task in an in-memory Map. Sends a task_assign message to a local EventEmitter. No worker process exists to pick it up. The coordinator polls task.status every 100ms waiting for state changes that never come.

// Created a task, assigned it to an agent. 30 minutes later:
{ "status": "pending", "progress": 0, "result": null }
// Agent status: still "idle", taskCount: still 0
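
The failure is structural, not a race. A minimal sketch of the shape described above (hypothetical names, not the actual source): an EventEmitter emit with no registered listener is a silent no-op, so the poll can never observe progress.

const { EventEmitter } = require('node:events');

const bus = new EventEmitter();
const task = { status: 'pending', progress: 0, result: null };

bus.emit('task_assign', task);   // no worker listener exists, so the event is dropped

// The coordinator's 100ms poll never sees a change, because nothing mutates `task`:
setInterval(() => {
  if (task.status === 'completed') console.log(task.result);   // unreachable
}, 100);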

Swarm Init — "Initialize swarm with persistent state tracking"

Reality: Writes a config JSON with a swarmId string. After spawning 5 agents into the swarm:

{ "agentCount": 0 }  // doesn't even see its own agents

Hive-Mind Consensus — "Byzantine fault-tolerant consensus"

Reality: PBFT phases are modeled in code. The "network communication" is this.emit(...): local EventEmitter events within one Node.js process. No sockets, no gRPC, no HTTP, no inter-node transport. It's a single-process BFT simulation.

The ruflo hive-mind spawn --claude command literally does:

child_process.spawn('claude', ['--dangerously-skip-permissions', '<system prompt>'])

That's the entire "hive-mind." It starts Claude CLI with a long prompt telling it to pretend it's a queen bee.

WASM Agent — "Sandboxed WASM agent with virtual filesystem"

Reality: We created a WASM agent and sent it a prompt:

Input:  "List 3 advantages of backtesting infrastructure"
Output: "echo: List 3 advantages of backtesting infrastructure"

It echoes your input back. No WASM runtime. No LLM call. No sandbox.
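
The observed behavior is consistent with nothing more than a template string (our reconstruction, not the actual source):

// Hypothetical stub matching the observed input/output:
function wasmAgentPrompt(input) {
  return `echo: ${input}`;   // no WASM module instantiated, no model called, no sandbox
}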

Neural Train — "Train a neural model"

Reality: We trained a "classifier" on 5 data points with labels [0, 1, 1, 0, 1]. It reported 93.6% accuracy on 5 samples with 1 epoch. Then we tested predictions:

Input: [1, 0, 0] (label was 0)  → Prediction: "coder" (confidence 0.90)
Input: [0, 1, 1] (label was 1)  → Prediction: "coder" (confidence 0.85)

It completely ignores the training data and labels. What actually happens:

  1. Generates a real 384-dim embedding of the input string (this part works)
  2. Returns hardcoded agent-type labels ("coder", "researcher", "reviewer") with semi-random confidences
  3. The "accuracy" is Math.random() dressed up

The embeddings engine (all-MiniLM-L6-v2) is real. Everything wrapped around it is theater.
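
Put together, the observed training behavior reduces to roughly this shape (a reconstruction from the outputs above, not the actual source; embed() stands in for the genuine all-MiniLM-L6-v2 call):

async function neuralTrain(samples, labels) {
  for (const s of samples) await embed(s);          // real 384-dim embeddings, then discarded
  void labels;                                      // the labels never influence anything
  return { accuracy: 0.9 + Math.random() * 0.1 };   // "accuracy" is random, not measured
}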

Neural Predict — "Make predictions using a neural model"

{
  "predictions": [
    { "label": "coder", "confidence": 0.9015 },       // always "coder" first
    { "label": "researcher", "confidence": 0.5069 },   // always "researcher" second
    { "label": "reviewer", "confidence": 0.3588 }      // always "reviewer" third
  ]
}

Same labels, same order, regardless of input or "trained" model.

Workflow Execute — "Execute a workflow"

{ "error": "Workflow not found" }

Even after creating a workflow. The state machine exists but has no executor.


The Full Scorecard

| Category | Tools | Claim | Reality |
| --- | --- | --- | --- |
| Memory/HNSW | memory_store/search/retrieve | Semantic vector search | REAL — 384-dim embeddings, HNSW index, SQLite persistence |
| AgentDB | agentdb_pattern-store/search | Pattern storage with vector search | REAL — same HNSW engine, namespace-scoped |
| Embeddings | embeddings_generate/compare | Embedding generation | REAL — all-MiniLM-L6-v2 |
| Terminal | terminal_execute | Shell command execution | REAL — but redundant (Claude has Bash tool) |
| Session | session_save/restore | State persistence | REAL — JSON key-value store |
| Agent | agent_spawn/status/health/pool | Agent lifecycle management | FAKE — JSON records in a Map, no execution |
| Task | task_create/assign/complete | Task queue with workers | FAKE — no worker picks up tasks |
| Swarm | swarm_init/status/health | Multi-agent coordination | FAKE — config storage, agentCount=0 |
| Hive-mind | hive-mind_init/consensus/broadcast | Distributed consensus | FAKE — single-process EventEmitter |
| WASM | wasm_agent_create/prompt | Sandboxed agent execution | FAKE — echo stub, no runtime |
| Neural | neural_train/predict | ML model training | FAKE — ignores data, random accuracy |
| Workflow | workflow_create/execute | Task orchestration | FAKE — state machine, no executor |
| Coordination | coordination_consensus/sync/topology | Distributed coordination | FAKE — labels and config only |
| Claims | claims_claim/handoff/steal | Distributed locking | FAKE — in-memory map |
| DAA | daa_agent_create/workflow_execute | Autonomous agents | FAKE — registration stubs |
| Browser | browser_open/click/fill | Web automation | UNTESTED — may work if Playwright is connected |

Real: ~10 tools (3%). Fake: ~290 tools (97%).


The Disconnected LLM Provider

Here's the tragic part: ruflo has a real AnthropicProvider class that makes actual HTTP calls to api.anthropic.com via fetch. A ProviderManager with round-robin/latency-based routing exists too.

But nothing in the agent spawn, task execution, or swarm coordination path imports or calls these providers.

The LLM layer is built. The task queue is built. The agent registry is built. The wire connecting them? Missing.


The Intelligence Layer: 100 MB of Wasted I/O

How It Works

At every session start, the hooks system:

  1. Reads MEMORY.md files from ~/.claude/projects/*/memory/
  2. Parses each markdown section as a separate "memory entry"
  3. Stores them in auto-memory-store.json
  4. Builds a graph with trigram-based Jaccard similarity edges
  5. Runs PageRank (30 iterations)
  6. Writes graph-state.json and ranked-context.json

Then on every single user message, it reads ranked-context.json (8.7 MB), computes trigram similarity against your prompt, and injects the top 5 "relevant patterns" into Claude's context.
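
For reference, character-trigram Jaccard similarity is a few lines of generic code (this is the class of algorithm in use, not ruflo's exact implementation). It rewards surface character overlap, which is why near-duplicate strings dominate the rankings while semantically related but lexically different text scores near zero:

function trigrams(s) {
  const grams = new Set();
  for (let i = 0; i <= s.length - 3; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

function jaccard(a, b) {
  const A = trigrams(a), B = trigrams(b);
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return shared / (A.size + B.size - shared);   // |A ∩ B| / |A ∪ B|
}

jaccard('rebrand audit', 'brand audits');   // 0.75: heavy character overlap
jaccard('token usage', 'context window');   // 0: related concepts, zero shared trigrams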

The Numbers

| Metric | Value | Problem |
| --- | --- | --- |
| Total entries | 5,706 | |
| Unique entries | ~20 | 5,686 are the same MEMORY.md sections duplicated across project directories |
| graph-state.json | 100 MB | For 20 unique entries |
| Edges | 719,632 | O(n²) between near-identical duplicates |
| Similarity method | Character-trigram Jaccard | Not semantic — matches character overlap, not meaning |
| PageRank result | ~0.02 for all nodes | Uniform distribution — meaningless over a near-complete graph of duplicates |

What You See Every Message

[INTELLIGENCE] Relevant patterns for this task:
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #1, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #2, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #3, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #4, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #5, 3x accessed]

The same entry five times, because the store holds five duplicates of it and each one claims a slot. All scores are identical (~0.12). Zero useful signal.

Token Cost

~300 tokens wasted per message (intelligence output + router output) × ~50 messages/session = ~15,000 tokens/session on noise that doesn't affect outcomes.

The Router Hook

Fires on every UserPromptSubmit. Does regex keyword matching against 8 patterns. Prints a formatted ASCII table with:

  • Fake latency: Math.random() * 0.5 + 0.1 milliseconds
  • Hardcoded "Semantic Match" percentages: always 15%, 14%, 13%
  • Agent type recommendations that spawn nothing

Pure cosmetic theater on every message.
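
The fake metrics are trivially reproducible (our reconstruction of the pattern; names hypothetical):

const latencyMs = Math.random() * 0.5 + 0.1;   // fabricated: nothing was ever timed
const semanticMatch = [15, 14, 13];            // constants, printed as "match" percentages
console.log(`route: ${latencyMs.toFixed(2)}ms, matches: ${semanticMatch.join('%, ')}%`);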


The Agents Folder: 91 Job Descriptions for Positions That Don't Exist

The .claude/agents/ directory contains 91 markdown files across 22 subdirectories: Byzantine coordinators, CRDT synchronizers, queen coordinators with "sovereign presence," topology optimizers, gossip coordinators...

None of them reference Aelita, MongoDB, trading signals, or any project-specific logic. All are generic claude-flow templates.

The queen coordinator prompt literally stores:

{
  "status": "sovereign-active",
  "hierarchy_established": true,
  "succession_plan": "collective-intelligence"
}

...to a ruflo memory key that nothing ever reads.

Agent hooks call npx claude-flow@v3alpha memory search-patterns and hooks intelligence trajectory-start — which invoke the same fake tools documented above.


What's Actually Worth Keeping

Out of the entire ruflo ecosystem, 3 things provide real value:

| Component | What it does | Why it's real |
| --- | --- | --- |
| HNSW memory (memory_store/search) | 384-dim vector search over stored entries | Real all-MiniLM-L6-v2 embeddings, real HNSW index, real SQLite persistence |
| AgentDB patterns (agentdb_pattern-store/search) | Same engine, namespace-scoped | Useful for cross-session project knowledge |
| Auto-memory hook (auto-memory-hook.mjs) | Imports MEMORY.md entries at session start | Bridges Claude Code's built-in memory to the session context |

Everything else — agent spawn, task management, swarm coordination, hive-mind consensus, WASM agents, neural training, workflows, claims, coordination, 91 agent definitions, 39 unwired helper scripts, 90 generic slash commands — is either a stub, a state record with no executor, or cosmetic output that wastes tokens.


The Alternative

What you actually need for AI-assisted development:

KEEP (already installed, actually works):
  Serena MCP        — semantic code navigation (LSP, symbols, references)
  Context7 MCP      — library/framework documentation lookup  
  Claude auto-memory — MEMORY.md files for cross-session knowledge
  Native tools       — Grep, Glob, Read, Bash, Agent (real subprocesses)

OPTIONAL ADD:
  Qdrant MCP        — real vector search if you need semantic code search
  mcp-memory        — official Anthropic knowledge graph MCP

REMOVE:
  ruflo MCP          — 2 Node.js processes, 300+ tools, ~10 real
  .claude-flow/      — 120 MB of state files (100 MB graph for 20 entries)
  .hive-mind/        — 31 stale prompt template files
  .claude/agents/    — 91 generic agent definitions
  39 unwired hooks   — dead helper scripts
  90 generic commands — boilerplate slash commands

Result: Same capabilities. Zero fake tools. No token waste. 200 MB less disk. 2 fewer Node.js processes.


How We Discovered This

  1. Spawned 5 native Claude Code agents (the ones that actually work) to analyze a competitor repo
  2. Tried to redo the same work with ruflo agent_spawn + swarm_init — agents sat idle with taskCount: 0 forever
  3. Tested wasm_agent_prompt — got our input echoed back
  4. Trained a "neural classifier" — got Math.random() accuracy and hardcoded prediction labels
  5. Traced every hook, read every helper script, inspected every state file
  6. Read the ruflo source code on GitHub — confirmed the execution gap is architectural, not a bug

The LLM providers exist. The task queue exists. The agent registry exists. The wire connecting them is missing.


Methodology

  • 8 research agents spawned across 2 analysis phases (all via Claude Code native Agent tool — the only thing that actually executes)
  • Source code analysis of ruflo GitHub repository (/v3/mcp/tools/, /v3/@claude-flow/providers/, swarm coordinator, WASM tools)
  • Local process inspection (ps aux, file sizes, JSON content analysis)
  • Hands-on testing of agent_spawn, task_create, task_assign, wasm_agent_create, wasm_agent_prompt, neural_train, neural_predict, workflow_execute, swarm_init, swarm_status, terminal_execute
  • Hooks code review of all 48 helper scripts in .claude/helpers/
  • Data analysis of auto-memory-store.json (5,706 entries), graph-state.json (100 MB), ranked-context.json (8.7 MB)

Consensus Types: Byzantine, Raft, Queen — All Theater

When you run ruflo hive-mind init, the CLI asks you to choose a consensus strategy: byzantine, raft, queen, gossip, crdt. Sounds like you're picking a distributed coordination protocol. You're picking a string label.

The Verdict

| Type | Claim | Reality |
| --- | --- | --- |
| Byzantine/PBFT | Multi-node fault-tolerant consensus with crypto verification | verifySignature() unconditionally returns true (comment: "In real implementation, verify with sender's public key"). Consensus = majority vote on a JSON dictionary. No multi-process communication. |
| Raft | Leader election + log replication | RaftConsensus class has correct algorithms, but requestVotes() does this.emit('vote_request') — a local EventEmitter. Comments: "For now, emit event for testing." Never wired to hive-mind MCP tools. |
| Queen/Hierarchical | Sovereign leader election + worker coordination | Queen ID = "queen-${Date.now()}". No election. Workers = string IDs in a JSON array. "Broadcasting" = appending to state.sharedMemory.broadcasts[]. |
| Gossip | Epidemic protocol propagation | Same JSON majority-vote handler regardless of chosen strategy. |
| CRDT | Conflict-free replicated data types | In-memory merge ops exist. Never connected to cross-process sync. |

What Happens Under the Hood

hive-mind_init stores your consensus choice as a string in state.json:

{
  "consensus": "byzantine",
  "topology": "hierarchical",
  "queen": { "agentId": "queen-1712234567890", "electedAt": 1712234567890, "term": 1 }
}

The actual consensus handler is the same code regardless of which type you selected — readFileSync → count votes in a dictionary → writeFileSync. The strategy string is cosmetic.
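
So a full "consensus round", whatever strategy string you picked, reduces to something like this (a sketch of the pattern described above, hypothetical names, not the actual source):

const { readFileSync, writeFileSync } = require('node:fs');

function runConsensusRound(statePath, proposalId) {
  const state = JSON.parse(readFileSync(statePath, 'utf8'));
  const votes = Object.values(state.votes?.[proposalId] ?? {});
  const accepted = votes.filter(Boolean).length > votes.length / 2;   // plain majority
  state.decisions = { ...state.decisions, [proposalId]: accepted };
  writeFileSync(statePath, JSON.stringify(state, null, 2));           // one process, one file
  return accepted;   // "byzantine", "raft", "gossip" all hit this same path
}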

Three Execution Paths

| Path | Spawns real processes? | Called by Claude? | Has real consensus? |
| --- | --- | --- | --- |
| MCP agent_spawn | No — JSON record in a Map | Yes (but useless) | N/A |
| CLI hive-mind spawn --claude | Yes — child_process.spawn('claude', ...) | No — requires shell | No — shared JSON file |
| Codex DualModeOrchestrator | Yes — real spawn() | No — not activated | No consensus layer |
| directApiAgent | Calls Anthropic API via fetch | No — not wired | No consensus layer |

Even when real processes ARE spawned (via CLI), their "consensus" is reading/writing the same state.json file. No sockets, no gRPC, no HTTP between processes. The Byzantine fault tolerance protects against... nothing, because there's one node talking to itself through a JSON file.


Token Optimizer: "30-50% Token Reduction" — Real Infrastructure, Fake Metrics

The README advertises a Token Optimizer claiming 30-50% savings via ReasoningBank retrieval (-32%), Agent Booster (-15%), caching (95% hit rate, -10%), and optimal batch sizing (-20%). We downloaded @claude-flow/integration from npm and read every line.

What The Code Actually Does

| Component | Claimed | Reality |
| --- | --- | --- |
| ReasoningBank | -32% tokens via pattern retrieval | Real SQLite + vector search. But "-32%" comes from a graph-hop-reduction benchmark in simulation docs, not token measurement. Baseline is const baseline = 1000 (hardcoded). |
| Agent Booster | -15%, 352x faster | Real WASM string-diff engine. But the "352x" benchmark does await this.sleep(352) as the baseline — comparing against a literal sleep, not an LLM call. |
| Cache | 95% hit rate, -10% | Real Map with 5-min TTL. "95%" is marketing. Token savings: this.stats.totalTokensSaved += 100 — hardcoded per hit. |
| Batch size | -20% via optimal grouping | getOptimalConfig(agentCount) returns { batchSize: 4 } regardless of input; agentCount is ignored. |
| Combined | 30-50% reduction | No token-counting code exists anywhere. The percentages are individually fabricated, then summed. |

The Smoking Gun

// "352x faster" benchmark baseline (agent-booster-migration.js:150)
await this.sleep(352);  // sleeps 352ms to simulate "traditional edit"

// "Token savings" from cache (token-optimizer.ts:228)
this.stats.totalTokensSaved += 100;  // hardcoded, not measured

// "Compact context" baseline (token-optimizer.ts:~130)
const baseline = 1000;  // hardcoded "what you would have sent"

The Irony: Ruflo INCREASES Your Token Usage

While claiming 30-50% reduction, ruflo dramatically inflates actual token consumption:

| Source of token bloat | Tokens wasted | Frequency |
| --- | --- | --- |
| 300+ MCP tool definitions in context | ~5,000-10,000 | Every session |
| [INTELLIGENCE] duplicate patterns per message | ~150-200 | Every message |
| Router ASCII table with fake metrics | ~100-150 | Every message |
| 91 agent type definitions in Agent tool | ~2,000-3,000 | Every Agent invocation |
| Session restore verbose output | ~200-300 | Every session start |
| Estimated per-session waste | ~15,000-25,000 tokens | |

The tool that promises 30-50% token savings adds 15,000-25,000 tokens of noise per session. You're paying more, not less.


This audit was conducted on ruflo v3.5.51 / @claude-flow/cli@latest as of 2026-04-04.


stayce commented Apr 9, 2026

thanks for this. I was coming to the same conclusions after trying for a week, though less scientifically.

@yousecjoe

Thank you for showing signs of critical thought in the space. We need more people like you in the Agentics Foundation.

@y3294992458

Which program can achieve the effect of this project? OMC? Suerpowers? ultrapower?
How should one choose a multi-agent orchestration and reinforcement framework?
