Live benchmark for ruflo-cost-tracker v0.15.0 against four endpoints,
on the v3 corpus (18 structural Tier 1 cases + 7 adversarial). All
measurements were taken locally; LLM keys were pulled from the same Secret
Manager secrets that the deployed ruvocal Cloud Run service uses.

| Endpoint | Tier 1 win | Adversarial | Avg latency | Cost / edit | Booster speedup vs endpoint |
|---|---|---|---|---|---|
| Agent Booster (WASM, local) | 18 / 18 ✓ | escalates 7/7 ✓ | 0.36 ms | $0 | — |
| Gemini 2.0 Flash | 18 / 18 ✓ | 3 / 7 | 807.56 ms | $0.000028 | 2243.2× |
| Claude Sonnet 4.6 | 18 / 18 ✓ | 2 / 7 | 1270.64 ms | $0.000933 | 3529.6× |
| Claude Opus 4.7 | 18 / 18 ✓ | 5 / 7 | 1563.72 ms | $0.005943 | 4343.7× |
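
The speedup column is consistent with dividing each endpoint's average latency by Booster's average latency. The sketch below reproduces it from the latencies in the table; the function and variable names are illustrative and are not taken from `bench.mjs`.

```javascript
// Average latencies (ms) copied from the results table above.
const BOOSTER_MS = 0.36;
const endpointLatencyMs = {
  "Gemini 2.0 Flash": 807.56,
  "Claude Sonnet 4.6": 1270.64,
  "Claude Opus 4.7": 1563.72,
};

// Ratio of an endpoint's average latency to Booster's, rounded to one decimal.
// (Hypothetical helper; the real harness may compute this differently.)
function speedupVsBooster(latencyMs, boosterMs = BOOSTER_MS) {
  return Number((latencyMs / boosterMs).toFixed(1));
}

const speedups = Object.fromEntries(
  Object.entries(endpointLatencyMs).map(([name, ms]) => [name, speedupVsBooster(ms)])
);

console.log(speedups);
```

Because Booster's average is well under a millisecond, small changes in it move the ratio a lot, so the ratio is best read alongside the absolute latencies.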

Three findings the v3 corpus surfaced:
- All four endpoints (Booster plus three LLMs) score 18/18 on Tier 1 cases. Booster matches frontier-LLM accuracy on simple structural transforms, so the differentiator is latency and cost.
- Sonnet 4.6 solves only 2/7 adversarial cases. Booster's "correctly refuse" path (escalates 7/7, with minimum confidence 0.000) beats Sonnet on routing correctness: a win on correctness, not just on speed.
- Booster's speedup grew with the corpus (1524× → 2243× vs Gemini; 2316× → 3530× vs Sonnet; 3036× → 4344× vs Opus). Adversarial cases push LLM latency up, while Booster's average latency dropped (0.50 → 0.36 ms).
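
The "correctly refuse" behaviour in the second finding can be pictured as a confidence gate: accept the local result when confidence is high enough, otherwise escalate to an LLM. This is a minimal sketch under stated assumptions. The threshold value, the result shape `{ output, confidence }`, and the function names are all hypothetical; the source reports only that Booster escalated 7/7 adversarial cases with a minimum confidence of 0.000.

```javascript
// Hypothetical threshold; the actual value used by Booster is not given in this summary.
const CONFIDENCE_THRESHOLD = 0.5;

// Decide what to do with a local edit result of the assumed shape
// { output: string, confidence: number }.
function decideRoute(localResult, threshold = CONFIDENCE_THRESHOLD) {
  const confident =
    Number.isFinite(localResult.confidence) && localResult.confidence >= threshold;
  return confident
    ? { route: "booster", output: localResult.output }
    : { route: "escalate", output: null }; // caller hands the request to an LLM
}

// Count how many results in a batch would be escalated.
function countEscalations(results, threshold = CONFIDENCE_THRESHOLD) {
  return results.filter((r) => decideRoute(r, threshold).route === "escalate").length;
}

// Illustrative inputs only, not real benchmark data.
console.log(decideRoute({ output: "const x = 1;", confidence: 0.92 }));
console.log(decideRoute({ output: "", confidence: 0.0 }));
```

Treating a missing or non-finite confidence as "escalate" is a deliberate choice in this sketch: a wrong refusal costs one LLM call, while a wrong acceptance ships a bad edit.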
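
Projected savings when Booster replaces an endpoint (the figures correspond to 100,000 edits at the per-edit latency and cost above):

| Endpoint replaced by Booster | Wall-time saved | Cost saved |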
|---|---|---|
| Gemini 2.0 Flash floor | ~22.4 hours | $2.80 |
| Claude Sonnet 4.6 | ~35.3 hours | $93.30 |
| Claude Opus 4.7 | ~43.4 hours | $594.30 |
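
The savings figures can be checked against the per-edit numbers in the results table. The sketch below assumes a volume of 100,000 edits; that count is an inference, chosen because it reproduces every row of the table. The helper names are illustrative.

```javascript
// Assumed edit volume: inferred from the table, not stated in the source.
const EDITS = 100_000;

// Per-edit averages copied from the results table.
const BOOSTER = { latencyMs: 0.36, costUsd: 0 };
const endpoints = [
  { name: "Gemini 2.0 Flash", latencyMs: 807.56, costUsd: 0.000028 },
  { name: "Claude Sonnet 4.6", latencyMs: 1270.64, costUsd: 0.000933 },
  { name: "Claude Opus 4.7", latencyMs: 1563.72, costUsd: 0.005943 },
];

const MS_PER_HOUR = 3_600_000;

// Wall time and cost saved when Booster handles `edits` edits instead of `endpoint`.
function projectedSavings(endpoint, edits = EDITS, booster = BOOSTER) {
  const savedMs = (endpoint.latencyMs - booster.latencyMs) * edits;
  const savedUsd = (endpoint.costUsd - booster.costUsd) * edits;
  return {
    name: endpoint.name,
    hours: Number((savedMs / MS_PER_HOUR).toFixed(1)),
    usd: Number(savedUsd.toFixed(2)),
  };
}

const projected = endpoints.map((e) => projectedSavings(e));
console.table(projected);
```

These are sequential wall-time totals. With concurrent requests the elapsed time saved would be smaller, while the cost saved would be unchanged.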

- `README.md`: this summary
- `bench-result-latest.json`: full per-case results, all four endpoints
- `booster-corpus.json`: 25 cases (18 Tier 1 + 7 adversarial)
- `bench.mjs`: harness; supports `BENCH_LLM_BASELINE=1` and `BENCH_ANTHROPIC=1`

```bash
( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \
  node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )
```

| Version | Capability shipped |
|---|---|
| 0.4.0 | corpus v1 (12 cases) + booster integration baseline |
| 0.5.0 | cost-track — auto-capture per-session token usage to cost-tracking namespace |
| 0.6.0 | cost-budget-check — real budget enforcement with 50/75/90/100% alert ladder |
| 0.7.0 | outcome.mjs — auto-emit hooks_model-outcome from cost-optimize |
| 0.8.0 | compact.mjs — drops inline Node block from cost-compact-context |
| 0.9.0 | cost-trend — drift detection across all bench runs |
| 0.10.0 | corpus v2 → v3 (16 → 25 cases incl. adversarial split) |
| 0.11.0 | GitHub Actions — smoke + booster-only bench on PR |
| 0.12.0 | cost-conversation — per-conversation lens |
| 0.13.0 | cost-export — Prometheus textfile + webhook |
| 0.14.0 | cost-federation — ADR-097 Phase 3 consumer (ready when upstream emits) |
| 0.15.0 | cost-summary — programmatic JSON contract |
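
The 50/75/90/100% alert ladder listed for 0.6.0 can be illustrated as a threshold lookup over the fraction of budget spent. This is only a sketch of the idea, not the plugin's implementation: the function name, the argument units, and the choice to return the highest threshold crossed are assumptions, and the source does not say whether reaching 100% blocks further spend or only raises an alert.

```javascript
// Thresholds (percent of budget) taken from the changelog entry for 0.6.0.
const ALERT_LADDER = [50, 75, 90, 100];

// Return the highest ladder step reached, or null if spend is below the first step.
// Hypothetical helper: names and return shape are illustrative.
function alertLevel(spentUsd, budgetUsd, ladder = ALERT_LADDER) {
  if (!(budgetUsd > 0)) {
    throw new RangeError("budgetUsd must be a positive number");
  }
  const percentUsed = (spentUsd / budgetUsd) * 100;
  let crossed = null;
  for (const step of ladder) {
    if (percentUsed >= step) crossed = step;
  }
  return crossed;
}

console.log(alertLevel(4, 10));   // below the first step
console.log(alertLevel(7.5, 10)); // at the 75% step
console.log(alertLevel(12, 10));  // over budget
```

Returning only the highest step keeps the caller simple: it can compare the value with the last alert it sent and notify once per step.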

13 skills, 15 CLI subcommands, 9 scripts, 44 smoke checks, ~85 ms wall-time.

- Plugin: `plugins/ruflo-cost-tracker` v0.15.0
- ADR-0002: agentic-flow + Agent Booster integration
- ADR-097: federation budget circuit breaker (Phase 3 consumer wired)
- Issue: ruvnet/ruflo#1743
- npm: `agent-booster` v0.2.2 (via `agentic-flow/agent-booster`)