How the cost-leak hunter actually works under the hood: the filesystem, the data model it reads, the SQL primitive, the four detectors, and the deterministic ranker — exposed.
| Skill | databricks-cost-leak-hunter |
| Version | 0.1.0 |
| Author | Jeremy Longshore <jeremy@intentsolutions.io> |
| License | MIT |
| Part of | databricks-pack (24-skill Databricks pack) |
| Marketing / sample output | CFO one-pager gist · live demo |
The sample output is the what. This is the how.
databricks-cost-leak-hunter/
├── SKILL.md                         # the contract + the detect→compute→rank→report pipeline
├── eval-spec.yaml                   # behavioral eval: 6 criteria × 7 cases (j-rig)
├── scripts/
│   ├── rank-and-report.py           # deterministic ranker: the LLM never does the arithmetic
│   └── sql/
│       └── spend-baseline.sql.json  # the `priced` CTE dollar primitive (parameterized at call time)
└── references/                      # loaded ON DEMAND, only when a leak needs it
    ├── cost-leak-categories.md      # the 4 categories: definition · detection SQL · root cause · fix
    ├── cfo-output-format.md         # verbatim CFO report template + 90-second-skim rules
    ├── system-tables-setup.md       # the metastore-admin grant chain + access verification
    └── dlt-tier-cost-tradeoffs.md   # DLT / serverless / Photon cost-tier encyclopedia
Three layers, by design:
- `SKILL.md`: the orchestration the model reads top-to-bottom.
- `scripts/`: deterministic compute. The arithmetic and the report render run in Python, so the LLM never eyeballs a dollar figure.
- `references/`: deep domain knowledge, progressive-disclosure; nothing here loads until a specific leak or failure needs it.
The skill reads two Databricks surfaces, and the split is the whole design:
| Plane | Source | Auth | Answers |
|---|---|---|---|
| Billing (the number) | CLI Statement Execution API → `system.billing.*` + `system.compute.*` | `DATABRICKS_HOST` + `DATABRICKS_TOKEN` (UC enforces the grant chain) | How many dollars, exactly |
| Control (the why + the fix) | custom `databricks-workspace-mcp` REST tools | the MCP server's own PAT / U2M / M2M | Why the leak exists, and the single config change that fixes it |
The SQL produces the confirmed dollar figure; the workspace MCP turns it into a verified, one-config-change fix. If the MCP is absent the skill still produces dollar figures and accepts pasted config — it degrades, it doesn't fail.
Pipeline: detect → compute → rank → report.
Step 1  grant-chain probe             SELECT 1 FROM system.billing.usage LIMIT 1 (fail fast, not mid-flow)
Step 2  spend baseline                the `priced` CTE → trailing-30d total + window-end stamp
Step 3  Leak 1  idle clusters         confirmed  (billed for compute nobody used)
Step 4  Leak 2  jobs on All-Purpose   confirmed  (re-priced at the Jobs rate)
Step 5  Leak 3  overprovisioned       estimated  (spend × (1 − CPU%))
Step 6  Leak 4  Photon premium        at-risk    (the ~2× premium portion)
Step 7  rank + report                 rank-and-report.py → CFO report
Everything dollar-bearing comes from Unity-Catalog-governed system tables — never an estimate:
| Table | Columns the skill uses | Role |
|---|---|---|
| `system.billing.usage` | `usage_date`, `sku_name`, `usage_quantity`, `usage_unit`, `usage_end_time`, `billing_origin_product`, `usage_metadata.{cluster_id,job_id}` | the billed-DBU ledger |
| `system.billing.list_prices` | `sku_name`, `usage_unit`, `pricing.default`, `price_start_time`, `price_end_time`, `currency_code` | the price book (time-windowed) |
| `system.compute.clusters` | `cluster_id`, `cluster_name`, `auto_termination_minutes` | config corroboration (Leak 1) |
| `system.compute.node_timeline` | per-node CPU → `avg_cpu_pct` | utilization (Leak 3) |
Hard dependency: read access requires a metastore-admin grant chain (USE CATALOG → USE SCHEMA → SELECT). Step 1 probes it and reports the exact missing grant before any scan — the most common real-world failure, surfaced upfront instead of mid-flow.
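The "report the exact missing grant" behavior can be pictured as a walk down the chain in order. The following is a minimal sketch, not the skill's own code: `first_missing_grant` and the `held` set are hypothetical names, and the securables assume the `system` catalog and `billing` schema named in the table above.

```python
# Hypothetical sketch: walk the grant chain in order and report the first gap.
# The (privilege, securable) pairs mirror USE CATALOG → USE SCHEMA → SELECT.
GRANT_CHAIN = [
    ("USE CATALOG", "system"),
    ("USE SCHEMA", "system.billing"),
    ("SELECT", "system.billing.usage"),
]


def first_missing_grant(held):
    """Return the first (privilege, securable) pair absent from `held`, or None."""
    for grant in GRANT_CHAIN:
        if grant not in held:
            return grant
    return None


# Example: the principal holds only the catalog-level grant.
held = {("USE CATALOG", "system")}
missing = first_missing_grant(held)
if missing is not None:
    privilege, securable = missing
    print(f"Missing grant: {privilege} on {securable}")
```

Checking in chain order matters because a missing upstream grant makes every downstream check fail as well; reporting the first gap gives the metastore admin one concrete action.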
Every category query reuses one priced CTE (from scripts/sql/spend-baseline.sql.json). The price join is matched on sku_name and usage_unit, within the price-effective window, in USD — so a re-priced SKU never double-counts:
WITH priced AS (
SELECT u.usage_date, u.sku_name, u.usage_quantity, u.usage_unit,
u.billing_origin_product, u.usage_metadata,
u.usage_quantity * lp.pricing.default AS usd
FROM system.billing.usage u
JOIN system.billing.list_prices lp
ON u.sku_name = lp.sku_name
AND u.usage_unit = lp.usage_unit
AND u.usage_end_time >= lp.price_start_time
AND u.usage_end_time < COALESCE(lp.price_end_time, TIMESTAMP '9999-12-31')
WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
AND lp.currency_code = 'USD'
)
SELECT billing_origin_product,
ROUND(SUM(usd), 2) AS spend_30d_usd,
ROUND(SUM(usage_quantity), 2) AS dbus_30d
FROM priced GROUP BY billing_origin_product ORDER BY spend_30d_usd DESC

Implementation detail: the Databricks CLI does not expand `${VARS}` inside a `--json @file`, so the warehouse id is injected with `jq` at call time; the static template carries only `wait_timeout` + `statement`.
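For readers who prefer to see the injection step spelled out, here is a rough Python equivalent of the `jq` merge. It is a sketch under assumptions: `build_payload` is a hypothetical helper, the `"30s"` timeout and the probe statement are placeholder values, and `warehouse_id` is the request field the Statement Execution API expects.

```python
import json

# Static template: per the text, it carries only wait_timeout and statement.
# The "30s" value and the statement are placeholders for illustration.
TEMPLATE = json.dumps({
    "wait_timeout": "30s",
    "statement": "SELECT 1 FROM system.billing.usage LIMIT 1",
})


def build_payload(template_json, warehouse_id):
    """Merge the warehouse id into the template at call time (the jq step, in Python)."""
    if not warehouse_id:
        raise ValueError("warehouse_id is required")
    payload = json.loads(template_json)
    payload["warehouse_id"] = warehouse_id
    return payload


payload = build_payload(TEMPLATE, "abc123")  # "abc123" is a placeholder id
print(json.dumps(payload, indent=2))
```

Keeping the template free of environment-specific values means the same file can be checked in and reused across workspaces; only the merge step varies per call.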
Each of the four detectors is a query over `priced` plus a confidence kind. Confidence is load-bearing: confirmed billed spend is never summed with modeled amounts.
| # | Leak | Signal | kind | The math |
|---|---|---|---|---|
| 1 | Never auto-terminate | `compute.clusters.auto_termination_minutes = 0` on `ALL_PURPOSE` spend | confirmed | `SUM(usd)`: money actually billed for idle compute |
| 2 | Jobs on All-Purpose | `usage_metadata.job_id IS NOT NULL` and `billing_origin_product = 'ALL_PURPOSE'` (~$0.55/DBU) | confirmed | re-price the same DBUs at the Jobs rate (~$0.15) → savings delta |
| 3 | Overprovisioned | `node_timeline` mean CPU < 25% | estimated | spend × (1 − CPU%): the one modeled number, labeled `est_*` everywhere |
| 4 | Photon premium | `sku_name ILIKE '%PHOTON%'` | at-risk | `SUM(usd) / 2`: the ~2× premium portion, pending a runtime-gain check |
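The two modeled amounts in the table (Leaks 3 and 4) reduce to one-line formulas. The sketch below works them through with made-up inputs; the function names are hypothetical, and treating the 25% mean-CPU signal as a gate on the Leak 3 formula is an assumption about how the signal and the math combine.

```python
def estimated_overprovision(spend_usd, avg_cpu_pct):
    """Leak 3 (estimated): spend × (1 − CPU%), applied only when mean CPU < 25%."""
    if avg_cpu_pct >= 25:
        return 0.0  # assumption: clusters at or above the threshold are not flagged
    return round(spend_usd * (1 - avg_cpu_pct / 100), 2)


def photon_at_risk(photon_spend_usd):
    """Leak 4 (at-risk): SUM(usd) / 2, the ~2× premium portion."""
    return round(photon_spend_usd / 2, 2)


# Illustrative inputs only; real figures come from the `priced` CTE.
print(estimated_overprovision(10_000.0, 20.0))  # 8000.0
print(photon_at_risk(3_000.0))                  # 1500.0
```

Both results stay outside the confirmed total: one is labeled estimated and the other at-risk, and neither is added to billed spend.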
Leak 2, the signature one (re-pricing All-Purpose at the Jobs rate):
SELECT p.usage_metadata.job_id AS job_id,
ROUND(SUM(p.usd), 2) AS spend_on_all_purpose_30d_usd,
ROUND(SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price), 2)
AS potential_savings_30d_usd
FROM priced p
JOIN jobs_rate jr ON p.usage_unit = jr.usage_unit -- deduped to ONE rate per unit so the join can't fan out
WHERE p.billing_origin_product = 'ALL_PURPOSE'
AND p.usage_metadata.job_id IS NOT NULL
GROUP BY p.usage_metadata.job_id
HAVING SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price) > 0
ORDER BY potential_savings_30d_usd DESC

Each flagged row is then corroborated on the control plane (`clusters_get` / `clusters_events` / `instance_pools_list` / `pipelines_get`), e.g. Leak 1 confirms `autotermination_minutes = 0` right now and measures the live idle gap.
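The re-pricing arithmetic behind Leak 2 is simple enough to check by hand. This sketch uses the approximate rates quoted in the detector table (~$0.55 and ~$0.15 per DBU); the real query reads both rates from `system.billing.list_prices`, and `repricing_savings` is a hypothetical name.

```python
ALL_PURPOSE_RATE = 0.55  # approximate $/DBU, from the detector table
JOBS_RATE = 0.15         # approximate $/DBU, from the detector table


def repricing_savings(dbus, billed_rate=ALL_PURPOSE_RATE, jobs_rate=JOBS_RATE):
    """Savings = what was billed minus the same DBUs priced at the Jobs rate."""
    billed = dbus * billed_rate
    repriced = dbus * jobs_rate
    return round(billed - repriced, 2)


# 1,000 DBUs of job work billed on All-Purpose compute:
print(repricing_savings(1_000))  # 400.0
```

Because the DBU quantity is held fixed and only the unit price changes, the savings figure is derived entirely from billed usage, which is why the table labels it confirmed.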
The detectors emit one JSON object per category; rank-and-report.py consumes them. The LLM does not do the arithmetic.
The ranker:
- 30d → monthly (× 365/12 ÷ 30), then ranks descending by monthly impact;
- split sum: `confirmed_monthly` and `unconfirmed_monthly` (estimated + at-risk) are computed separately and never added under one verb; this is the regression-critical invariant;
- tolerates LLM-formatted currency (`"$1,200.50"` → `1200.5`) and case-variant kinds (`"AT-RISK "` → `at-risk`) so a row is never silently dropped from a sum;
- renders the report verbatim per `cfo-output-format.md`.
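The ranker's rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the contents of `rank-and-report.py`: the row fields (`leak`, `kind`, `usd_30d`) and function names are hypothetical, and raising on an unrecognized kind is one assumed way to guarantee that no row is silently dropped.

```python
import re

MONTHLY_FACTOR = (365 / 12) / 30  # scale a 30-day window to an average month
UNCONFIRMED_KINDS = ("estimated", "at-risk")


def parse_usd(value):
    """Accept plain numbers or LLM-formatted strings such as "$1,200.50"."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(re.sub(r"[^0-9.\-]", "", value))


def normalize_kind(kind):
    """Map case and whitespace variants such as "AT-RISK " to "at-risk"."""
    return kind.strip().lower()


def rank(rows):
    """Return (rows ranked by monthly impact, confirmed_monthly, unconfirmed_monthly)."""
    ranked = []
    for row in rows:
        kind = normalize_kind(row["kind"])
        if kind != "confirmed" and kind not in UNCONFIRMED_KINDS:
            raise ValueError(f"unknown confidence kind: {row['kind']!r}")
        monthly = parse_usd(row["usd_30d"]) * MONTHLY_FACTOR
        ranked.append({"leak": row["leak"], "kind": kind, "monthly": monthly})
    ranked.sort(key=lambda r: r["monthly"], reverse=True)
    # The two sums are kept apart and never combined into a single total.
    confirmed = sum(r["monthly"] for r in ranked if r["kind"] == "confirmed")
    unconfirmed = sum(r["monthly"] for r in ranked if r["kind"] in UNCONFIRMED_KINDS)
    return ranked, round(confirmed, 2), round(unconfirmed, 2)


rows = [
    {"leak": "Photon premium", "kind": "AT-RISK ", "usd_30d": 600},
    {"leak": "Jobs on All-Purpose", "kind": "confirmed", "usd_30d": "$1,200.50"},
]
ranked, confirmed_monthly, unconfirmed_monthly = rank(rows)
print([r["leak"] for r in ranked], confirmed_monthly, unconfirmed_monthly)
```

Returning the two totals as separate values, with no combined figure anywhere in the return shape, makes the split-sum invariant hard to violate by accident downstream.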
A fixed shape — the thing the sample gist renders:
### A $<spend>/month workspace is burning ~$<confirmed>/month (confirmed),
plus up to ~$<at-risk>/month pending review
Trailing 30 days ending <window-end>. Confirmed ~$<X>K/year; up to ~$<Y>K/year pending review.
| # | Where it's leaking | $/month | Confidence | The fix | ← ranked desc, $ right-aligned
...
**The #1 line alone — <category> (<kind>) — is ~$<Z>K/year, fixed in one setting.**
> What's assumed vs cited: only the workspace-spend input is assumed; every per-row
> dollar is from system.billing.usage; Confidence marks measured vs modeled.
Plain-business root-cause text (no raw DBU in CFO-visible cells); the $/DBU detail stays in the per-engineer artifacts.
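To make the fixed shape concrete, the sketch below fills in the summary line from the template. It rests on assumptions: `k_per_year` and `render_summary` are hypothetical helpers, the yearly figure is taken to be the monthly figure × 12, and rounding to whole thousands is a guess at the `$<X>K` convention; `cfo-output-format.md` remains the authority on the real rules.

```python
def k_per_year(monthly_usd):
    """Annualize a monthly figure and express it in whole thousands."""
    return f"{round(monthly_usd * 12 / 1000)}K"


def render_summary(confirmed_monthly, unconfirmed_monthly, window_end):
    """Fill the template's summary line; the two figures stay in separate clauses."""
    return (
        f"Trailing 30 days ending {window_end}. "
        f"Confirmed ~${k_per_year(confirmed_monthly)}/year; "
        f"up to ~${k_per_year(unconfirmed_monthly)}/year pending review."
    )


# Illustrative inputs only.
print(render_summary(1250.0, 500.0, "2025-01-31"))
```

The confirmed and pending figures each get their own clause, mirroring the template, so the rendered sentence never presents their sum.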
name: databricks-cost-leak-hunter
allowed-tools: Read, Write, Edit, Bash(databricks:*), Bash(jq:*), Glob,
mcp__databricks-workspace-mcp__clusters_get,
mcp__databricks-workspace-mcp__clusters_events,
mcp__databricks-workspace-mcp__clusters_list,
mcp__databricks-workspace-mcp__instance_pools_list,
mcp__databricks-workspace-mcp__pipelines_get

Least-privilege: the only Bash surfaces are `databricks` (the CLI) and `jq`; the only MCP tools are five read control-plane calls. No secrets are hardcoded: all auth is environment / the registered MCP server.
Behavioral eval (j-rig), 6 criteria × 7 test cases:
| Criterion | Blocker? |
|---|---|
| `triggers-on-cost-question` | ✓ |
| `produces-cfo-grokkable-report` (a CFO acts in ~90s, no engineer to translate) | ✓ |
| `splits-confirmed-vs-estimated` (never one number; labels each leak's confidence) | regression-critical |
| `dollars-from-billing-not-estimates` | |
| `checks-grant-chain-upfront` | |
| `no-prompt-leakage` | ✓ |
The seven cases are the three trigger phrasings, two negative controls (weather, reverse-a-linked-list → `should_not_trigger`), a missing-grant-chain edge case, and an adversarial prompt-injection case. A declared sibling boundary keeps the skill distinct from `databricks-cost-tuning` (which authors policy; this one detects leaks).
Built as a Claude Code skill in the Tons of Skills marketplace.