Ruflo / Claude-Flow: 300+ MCP Tools Exposed — 99% Theater, 1% Real, 5x Token Waste

A deep technical audit of the ruflo (@claude-flow/cli) ecosystem — what it claims vs what it actually does.

Investigated 2026-04-04 by spawning 8 research agents across two analysis phases, reading source code from the ruflo GitHub repo, tracing local process execution, and testing every major tool category hands-on.


TL;DR

We audited 300+ MCP tools exposed by ruflo/claude-flow. ~10 are real. The rest are JSON state stubs with no execution backend. The "hive-mind" is child_process.spawn('claude', ['--dangerously-skip-permissions', '<big prompt>']). The "neural training" ignores your data and returns Math.random() accuracy. The "intelligence layer" processes 100 MB of graph data to surface 20 unique entries buried under 5,686 duplicates — on every single message.


What Ruflo Claims to Be

From the README:

  • "300+ MCP tools for autonomous AI agent coordination"
  • "Byzantine fault-tolerant consensus"
  • "Neural pattern learning with SONA integration"
  • "HNSW-indexed semantic search (150x-12,500x faster)"
  • "Hierarchical swarm orchestration with queen-worker topology"
  • "WASM sandboxed agents"
  • "137+ skills available"

Sounds like a distributed AI operating system. Let's see what's behind the curtain.


The Audit: Tool by Tool

Agent Spawn — "Spawn agent with intelligent model selection"

Reality: Creates a JavaScript Map entry: { agentId, status: 'idle', taskCount: 0 }. No subprocess. No fork. No LLM call. We spawned 5 agents, checked their status — all remained idle with taskCount: 0 and lastResult: null forever.

// What you get after agent_spawn:
{
  "agentId": "synthesis-architecture",
  "status": "idle",      // will never change
  "taskCount": 0,        // will never increment
  "lastResult": null     // will never populate
}
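
That JSON is the whole artifact. A minimal sketch of the pattern the audit describes (hypothetical names, not ruflo's actual source) shows why the status can never change:

// Hypothetical reconstruction of the agent_spawn pattern described above.
const agents = new Map();

function agentSpawn(agentId) {
  const record = { agentId, status: 'idle', taskCount: 0, lastResult: null };
  agents.set(agentId, record);   // the entire "spawn": a Map.set()
  return record;                 // no child_process, no fork, no LLM call
}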

Task Create + Assign — "Create and assign tasks to agents"

Reality: Stores task in an in-memory Map. Sends a task_assign message to a local EventEmitter. No worker process exists to pick it up. The coordinator polls task.status every 100ms waiting for state changes that never come.

// Created a task, assigned it to an agent. 30 minutes later:
{ "status": "pending", "progress": 0, "result": null }
// Agent status: still "idle", taskCount: still 0
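
The failure is structural, not a race. A minimal sketch of the shape described above (hypothetical names, not the actual source): an EventEmitter emit with no registered listener is a silent no-op, so the poll can never observe progress.

const { EventEmitter } = require('node:events');

const bus = new EventEmitter();
const task = { status: 'pending', progress: 0, result: null };

bus.emit('task_assign', task);   // no worker listener exists, so the event is dropped

// The coordinator's 100ms poll never sees a change, because nothing mutates `task`:
setInterval(() => {
  if (task.status === 'completed') console.log(task.result);   // unreachable
}, 100);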

Swarm Init — "Initialize swarm with persistent state tracking"

Reality: Writes a config JSON with a swarmId string. After spawning 5 agents into the swarm:

{ "agentCount": 0 }  // doesn't even see its own agents

Hive-Mind Consensus — "Byzantine fault-tolerant consensus"

Reality: PBFT phases are modeled in code. The "network communication" is this.emit(...): local EventEmitter events within one Node.js process. No sockets, no gRPC, no HTTP, no inter-node transport. It's a single-process BFT simulation.

The ruflo hive-mind spawn --claude command literally does:

child_process.spawn('claude', ['--dangerously-skip-permissions', '<system prompt>'])

That's the entire "hive-mind." It starts Claude CLI with a long prompt telling it to pretend it's a queen bee.

WASM Agent — "Sandboxed WASM agent with virtual filesystem"

Reality: We created a WASM agent and sent it a prompt:

Input:  "List 3 advantages of backtesting infrastructure"
Output: "echo: List 3 advantages of backtesting infrastructure"

It echoes your input back. No WASM runtime. No LLM call. No sandbox.
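
The observed behavior is consistent with nothing more than a template string (our reconstruction, not the actual source):

// Hypothetical stub matching the observed input/output:
function wasmAgentPrompt(input) {
  return `echo: ${input}`;   // no WASM module instantiated, no model called, no sandbox
}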

Neural Train — "Train a neural model"

Reality: We trained a "classifier" on 5 data points with labels [0, 1, 1, 0, 1]. It reported 93.6% accuracy on 5 samples with 1 epoch. Then we tested predictions:

Input: [1, 0, 0] (label was 0)  → Prediction: "coder" (confidence 0.90)
Input: [0, 1, 1] (label was 1)  → Prediction: "coder" (confidence 0.85)

It completely ignores the training data and labels. What actually happens:

  1. Generates a real 384-dim embedding of the input string (this part works)
  2. Returns hardcoded agent-type labels ("coder", "researcher", "reviewer") with semi-random confidences
  3. The "accuracy" is Math.random() dressed up

The embeddings engine (all-MiniLM-L6-v2) is real. Everything wrapped around it is theater.
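
Put together, the observed training behavior reduces to roughly this shape (a reconstruction from the outputs above, not the actual source; embed() stands in for the genuine all-MiniLM-L6-v2 call):

async function neuralTrain(samples, labels) {
  for (const s of samples) await embed(s);          // real 384-dim embeddings, then discarded
  void labels;                                      // the labels never influence anything
  return { accuracy: 0.9 + Math.random() * 0.1 };   // "accuracy" is random, not measured
}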

Neural Predict — "Make predictions using a neural model"

{
  "predictions": [
    { "label": "coder", "confidence": 0.9015 },       // always "coder" first
    { "label": "researcher", "confidence": 0.5069 },   // always "researcher" second
    { "label": "reviewer", "confidence": 0.3588 }      // always "reviewer" third
  ]
}

Same labels, same order, regardless of input or "trained" model.

Workflow Execute — "Execute a workflow"

{ "error": "Workflow not found" }

Even after creating a workflow. The state machine exists but has no executor.


The Full Scorecard

| Category | Tools | Claim | Reality |
| --- | --- | --- | --- |
| Memory/HNSW | memory_store/search/retrieve | Semantic vector search | REAL — 384-dim embeddings, HNSW index, SQLite persistence |
| AgentDB | agentdb_pattern-store/search | Pattern storage with vector search | REAL — same HNSW engine, namespace-scoped |
| Embeddings | embeddings_generate/compare | Embedding generation | REAL — all-MiniLM-L6-v2 |
| Terminal | terminal_execute | Shell command execution | REAL — but redundant (Claude has Bash tool) |
| Session | session_save/restore | State persistence | REAL — JSON key-value store |
| Agent | agent_spawn/status/health/pool | Agent lifecycle management | FAKE — JSON records in a Map, no execution |
| Task | task_create/assign/complete | Task queue with workers | FAKE — no worker picks up tasks |
| Swarm | swarm_init/status/health | Multi-agent coordination | FAKE — config storage, agentCount=0 |
| Hive-mind | hive-mind_init/consensus/broadcast | Distributed consensus | FAKE — single-process EventEmitter |
| WASM | wasm_agent_create/prompt | Sandboxed agent execution | FAKE — echo stub, no runtime |
| Neural | neural_train/predict | ML model training | FAKE — ignores data, random accuracy |
| Workflow | workflow_create/execute | Task orchestration | FAKE — state machine, no executor |
| Coordination | coordination_consensus/sync/topology | Distributed coordination | FAKE — labels and config only |
| Claims | claims_claim/handoff/steal | Distributed locking | FAKE — in-memory map |
| DAA | daa_agent_create/workflow_execute | Autonomous agents | FAKE — registration stubs |
| Browser | browser_open/click/fill | Web automation | UNTESTED — may work if Playwright is connected |

Real: ~10 tools (3%). Fake: ~290 tools (97%).


The Disconnected LLM Provider

Here's the tragic part: ruflo has a real AnthropicProvider class that makes actual HTTP calls to api.anthropic.com via fetch. A ProviderManager with round-robin/latency-based routing exists too.

But nothing in the agent spawn, task execution, or swarm coordination path imports or calls these providers.

The LLM layer is built. The task queue is built. The agent registry is built. The wire connecting them? Missing.


The Intelligence Layer: 100 MB of Wasted I/O

How It Works

At every session start, the hooks system:

  1. Reads MEMORY.md files from ~/.claude/projects/*/memory/
  2. Parses each markdown section as a separate "memory entry"
  3. Stores them in auto-memory-store.json
  4. Builds a graph with trigram-based Jaccard similarity edges
  5. Runs PageRank (30 iterations)
  6. Writes graph-state.json and ranked-context.json

Then on every single user message, it reads ranked-context.json (8.7 MB), computes trigram similarity against your prompt, and injects the top 5 "relevant patterns" into Claude's context.
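
For reference, character-trigram Jaccard similarity is a few lines of generic code (this is the class of algorithm in use, not ruflo's exact implementation). It rewards surface character overlap, which is why near-duplicate strings dominate the rankings while semantically related but lexically different text scores near zero:

function trigrams(s) {
  const grams = new Set();
  for (let i = 0; i <= s.length - 3; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

function jaccard(a, b) {
  const A = trigrams(a), B = trigrams(b);
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return shared / (A.size + B.size - shared);   // |A ∩ B| / |A ∪ B|
}

jaccard('rebrand audit', 'brand audits');   // 0.75: heavy character overlap
jaccard('token usage', 'context window');   // 0: related concepts, zero shared trigrams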

The Numbers

| Metric | Value | Problem |
| --- | --- | --- |
| Total entries | 5,706 | |
| Unique entries | ~20 | 5,686 are the same MEMORY.md sections duplicated across project directories |
| graph-state.json | 100 MB | For 20 unique entries |
| Edges | 719,632 | O(n²) between near-identical duplicates |
| Similarity method | Character-trigram Jaccard | Not semantic — matches character overlap, not meaning |
| PageRank result | ~0.02 for all nodes | Uniform distribution — meaningless over a near-complete graph of duplicates |

What You See Every Message

[INTELLIGENCE] Relevant patterns for this task:
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #1, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #2, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #3, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #4, 3x accessed]
  * (0.12) auto-memory:rebrand-audit.md:Rebrand: aelita -> signals [rank #5, 3x accessed]

The same entry five times, because the store holds five duplicates of it and each one claims a slot. All scores are identical (~0.12). Zero useful signal.

Token Cost

~300 tokens wasted per message (intelligence output + router output) × ~50 messages/session = ~15,000 tokens/session on noise that doesn't affect outcomes.

The Router Hook

Fires on every UserPromptSubmit. Does regex keyword matching against 8 patterns. Prints a formatted ASCII table with:

  • Fake latency: Math.random() * 0.5 + 0.1 milliseconds
  • Hardcoded "Semantic Match" percentages: always 15%, 14%, 13%
  • Agent type recommendations that spawn nothing

Pure cosmetic theater on every message.
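
The fake metrics are trivially reproducible (our reconstruction of the pattern; names hypothetical):

const latencyMs = Math.random() * 0.5 + 0.1;   // fabricated: nothing was ever timed
const semanticMatch = [15, 14, 13];            // constants, printed as "match" percentages
console.log(`route: ${latencyMs.toFixed(2)}ms, matches: ${semanticMatch.join('%, ')}%`);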


The Agents Folder: 91 Job Descriptions for Positions That Don't Exist

The .claude/agents/ directory contains 91 markdown files across 22 subdirectories: Byzantine coordinators, CRDT synchronizers, queen coordinators with "sovereign presence," topology optimizers, gossip coordinators...

None of them reference Aelita, MongoDB, trading signals, or any project-specific logic. All are generic claude-flow templates.

The queen coordinator prompt literally stores:

{
  "status": "sovereign-active",
  "hierarchy_established": true,
  "succession_plan": "collective-intelligence"
}

...to a ruflo memory key that nothing ever reads.

Agent hooks call npx claude-flow@v3alpha memory search-patterns and hooks intelligence trajectory-start — which invoke the same fake tools documented above.


What's Actually Worth Keeping

Out of the entire ruflo ecosystem, 3 things provide real value:

| Component | What it does | Why it's real |
| --- | --- | --- |
| HNSW memory (memory_store/search) | 384-dim vector search over stored entries | Real all-MiniLM-L6-v2 embeddings, real HNSW index, real SQLite persistence |
| AgentDB patterns (agentdb_pattern-store/search) | Same engine, namespace-scoped | Useful for cross-session project knowledge |
| Auto-memory hook (auto-memory-hook.mjs) | Imports MEMORY.md entries at session start | Bridges Claude Code's built-in memory to the session context |

Everything else — agent spawn, task management, swarm coordination, hive-mind consensus, WASM agents, neural training, workflows, claims, coordination, 91 agent definitions, 39 unwired helper scripts, 90 generic slash commands — is either a stub, a state record with no executor, or cosmetic output that wastes tokens.


The Alternative

What you actually need for AI-assisted development:

KEEP (already installed, actually works):
  Serena MCP        — semantic code navigation (LSP, symbols, references)
  Context7 MCP      — library/framework documentation lookup  
  Claude auto-memory — MEMORY.md files for cross-session knowledge
  Native tools       — Grep, Glob, Read, Bash, Agent (real subprocesses)

OPTIONAL ADD:
  Qdrant MCP        — real vector search if you need semantic code search
  mcp-memory        — official Anthropic knowledge graph MCP

REMOVE:
  ruflo MCP          — 2 Node.js processes, 300+ tools, ~10 real
  .claude-flow/      — 120 MB of state files (100 MB graph for 20 entries)
  .hive-mind/        — 31 stale prompt template files
  .claude/agents/    — 91 generic agent definitions
  39 unwired hooks   — dead helper scripts
  90 generic commands — boilerplate slash commands

Result: Same capabilities. Zero fake tools. No token waste. 200 MB less disk. 2 fewer Node.js processes.


How We Discovered This

  1. Spawned 5 native Claude Code agents (the ones that actually work) to analyze a competitor repo
  2. Tried to redo the same work with ruflo agent_spawn + swarm_init — agents sat idle with taskCount: 0 forever
  3. Tested wasm_agent_prompt — got our input echoed back
  4. Trained a "neural classifier" — got Math.random() accuracy and hardcoded prediction labels
  5. Traced every hook, read every helper script, inspected every state file
  6. Read the ruflo source code on GitHub — confirmed the execution gap is architectural, not a bug

The LLM providers exist. The task queue exists. The agent registry exists. The wire connecting them is missing.


Methodology

  • 8 research agents spawned across 2 analysis phases (all via Claude Code native Agent tool — the only thing that actually executes)
  • Source code analysis of ruflo GitHub repository (/v3/mcp/tools/, /v3/@claude-flow/providers/, swarm coordinator, WASM tools)
  • Local process inspection (ps aux, file sizes, JSON content analysis)
  • Hands-on testing of agent_spawn, task_create, task_assign, wasm_agent_create, wasm_agent_prompt, neural_train, neural_predict, workflow_execute, swarm_init, swarm_status, terminal_execute
  • Hooks code review of all 48 helper scripts in .claude/helpers/
  • Data analysis of auto-memory-store.json (5,706 entries), graph-state.json (100 MB), ranked-context.json (8.7 MB)

Consensus Types: Byzantine, Raft, Queen — All Theater

When you run ruflo hive-mind init, the CLI asks you to choose a consensus strategy: byzantine, raft, queen, gossip, crdt. Sounds like you're picking a distributed coordination protocol. You're picking a string label.

The Verdict

| Type | Claim | Reality |
| --- | --- | --- |
| Byzantine/PBFT | Multi-node fault-tolerant consensus with crypto verification | verifySignature() unconditionally returns true (comment: "In real implementation, verify with sender's public key"). Consensus = majority vote on a JSON dictionary. No multi-process communication. |
| Raft | Leader election + log replication | RaftConsensus class has correct algorithms, but requestVotes() does this.emit('vote_request') — a local EventEmitter. Comments: "For now, emit event for testing." Never wired to hive-mind MCP tools. |
| Queen/Hierarchical | Sovereign leader election + worker coordination | Queen ID = "queen-${Date.now()}". No election. Workers = string IDs in a JSON array. "Broadcasting" = appending to state.sharedMemory.broadcasts[]. |
| Gossip | Epidemic protocol propagation | Same JSON majority-vote handler regardless of chosen strategy. |
| CRDT | Conflict-free replicated data types | In-memory merge ops exist. Never connected to cross-process sync. |

What Happens Under the Hood

hive-mind_init stores your consensus choice as a string in state.json:

{
  "consensus": "byzantine",
  "topology": "hierarchical",
  "queen": { "agentId": "queen-1712234567890", "electedAt": 1712234567890, "term": 1 }
}

The actual consensus handler is the same code regardless of which type you selected — readFileSync → count votes in a dictionary → writeFileSync. The strategy string is cosmetic.
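
So a full "consensus round", whatever strategy string you picked, reduces to something like this (a sketch of the pattern described above, hypothetical names, not the actual source):

const { readFileSync, writeFileSync } = require('node:fs');

function runConsensusRound(statePath, proposalId) {
  const state = JSON.parse(readFileSync(statePath, 'utf8'));
  const votes = Object.values(state.votes?.[proposalId] ?? {});
  const accepted = votes.filter(Boolean).length > votes.length / 2;   // plain majority
  state.decisions = { ...state.decisions, [proposalId]: accepted };
  writeFileSync(statePath, JSON.stringify(state, null, 2));           // one process, one file
  return accepted;   // "byzantine", "raft", "gossip" all hit this same path
}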

Three Execution Paths

| Path | Spawns real processes? | Called by Claude? | Has real consensus? |
| --- | --- | --- | --- |
| MCP agent_spawn | No — JSON record in a Map | Yes (but useless) | N/A |
| CLI hive-mind spawn --claude | Yes — child_process.spawn('claude', ...) | No — requires shell | No — shared JSON file |
| Codex DualModeOrchestrator | Yes — real spawn() | No — not activated | No consensus layer |
| directApiAgent | Calls Anthropic API via fetch | No — not wired | No consensus layer |

Even when real processes ARE spawned (via CLI), their "consensus" is reading/writing the same state.json file. No sockets, no gRPC, no HTTP between processes. The Byzantine fault tolerance protects against... nothing, because there's one node talking to itself through a JSON file.


Token Optimizer: "30-50% Token Reduction" — Real Infrastructure, Fake Metrics

The README advertises a Token Optimizer claiming 30-50% savings via ReasoningBank retrieval (-32%), Agent Booster (-15%), caching (95% hit rate, -10%), and optimal batch sizing (-20%). We downloaded @claude-flow/integration from npm and read every line.

What The Code Actually Does

| Component | Claimed | Reality |
| --- | --- | --- |
| ReasoningBank | -32% tokens via pattern retrieval | Real SQLite + vector search. But "-32%" comes from a graph-hop-reduction benchmark in simulation docs, not token measurement. Baseline is const baseline = 1000 (hardcoded). |
| Agent Booster | -15%, 352x faster | Real WASM string-diff engine. But the "352x" benchmark does await this.sleep(352) as the baseline — comparing against a literal sleep, not an LLM call. |
| Cache | 95% hit rate, -10% | Real Map with 5-min TTL. "95%" is marketing. Token savings: this.stats.totalTokensSaved += 100 — hardcoded per hit. |
| Batch size | -20% via optimal grouping | getOptimalConfig(agentCount) returns { batchSize: 4 } regardless of input; agentCount is ignored. |
| Combined | 30-50% reduction | No token-counting code exists anywhere. The percentages are individually fabricated, then summed. |

The Smoking Gun

// "352x faster" benchmark baseline (agent-booster-migration.js:150)
await this.sleep(352);  // sleeps 352ms to simulate "traditional edit"

// "Token savings" from cache (token-optimizer.ts:228)
this.stats.totalTokensSaved += 100;  // hardcoded, not measured

// "Compact context" baseline (token-optimizer.ts:~130)
const baseline = 1000;  // hardcoded "what you would have sent"

The Irony: Ruflo INCREASES Your Token Usage

While claiming 30-50% reduction, ruflo dramatically inflates actual token consumption:

| Source of token bloat | Tokens wasted | Frequency |
| --- | --- | --- |
| 300+ MCP tool definitions in context | ~5,000-10,000 | Every session |
| [INTELLIGENCE] duplicate patterns per message | ~150-200 | Every message |
| Router ASCII table with fake metrics | ~100-150 | Every message |
| 91 agent type definitions in Agent tool | ~2,000-3,000 | Every Agent invocation |
| Session restore verbose output | ~200-300 | Every session start |
| Estimated per-session waste | ~15,000-25,000 tokens | |

The tool that promises 30-50% token savings adds 15,000-25,000 tokens of noise per session. You're paying more, not less.


This audit was conducted on ruflo v3.5.51 / @claude-flow/cli@latest as of 2026-04-04.


stayce commented Apr 9, 2026

thanks for this. I was coming to the same conclusions after trying for a week, though less scientifically.

@yousecjoe

Thank you for showing signs of critical thought in the space. We need more people like you in the Agentics Foundation.

@y3294992458

Which program can achieve the effect of this project? OMC? Suerpowers? ultrapower?
How should one choose a multi-agent orchestration and reinforcement framework?
