databricks-cost-leak-hunter — technical breakdown (the skill, flattened)

How the cost-leak hunter actually works under the hood: the filesystem, the data model it reads, the SQL primitive, the four detectors, and the deterministic ranker — exposed.


Skill	`databricks-cost-leak-hunter`
Version	0.1.0
Author	Jeremy Longshore <jeremy@intentsolutions.io>
License	MIT
Part of	`databricks-pack` (24-skill Databricks pack)
Marketing / sample output	CFO one-pager gist · live demo

The sample output is the what. This is the how.

1. The skill, flattened

databricks-cost-leak-hunter/
├── SKILL.md                              # the contract + the detect→compute→rank→report pipeline
├── eval-spec.yaml                        # behavioral eval: 6 criteria × 7 cases (j-rig)
├── scripts/
│   ├── rank-and-report.py                # deterministic ranker — the LLM never does the arithmetic
│   └── sql/
│       └── spend-baseline.sql.json       # the `priced` CTE dollar primitive (parameterized at call time)
└── references/                           # loaded ON DEMAND, only when a leak needs it
    ├── cost-leak-categories.md           # the 4 categories: definition · detection SQL · root cause · fix
    ├── cfo-output-format.md              # verbatim CFO report template + 90-second-skim rules
    ├── system-tables-setup.md            # the metastore-admin grant chain + access verification
    └── dlt-tier-cost-tradeoffs.md        # DLT / serverless / Photon cost-tier encyclopedia

Three layers, by design:

SKILL.md — the orchestration the model reads top-to-bottom.
scripts/ — deterministic compute. The arithmetic and the report render run in Python, so the LLM never eyeballs a dollar figure.
references/ — deep domain knowledge, progressive-disclosure: nothing here loads until a specific leak or failure needs it.

2. Architecture — two data planes

The skill reads two Databricks surfaces, and the split is the whole design:

Plane	Source	Auth	Answers
Billing (the number)	CLI Statement Execution API → `system.billing.*` + `system.compute.*`	`DATABRICKS_HOST` + `DATABRICKS_TOKEN` (UC enforces the grant chain)	How many dollars, exactly
Control (the why + the fix)	custom `databricks-workspace-mcp` REST tools	the MCP server's own PAT / U2M / M2M	Why the leak exists, and the single config change that fixes it

The SQL produces the confirmed dollar figure; the workspace MCP turns it into a verified, one-config-change fix. If the MCP is absent the skill still produces dollar figures and accepts pasted config — it degrades, it doesn't fail.

Pipeline: detect → compute → rank → report.

Step 1  grant-chain probe            SELECT 1 FROM system.billing.usage LIMIT 1   (fail fast, not mid-flow)
Step 2  spend baseline               the `priced` CTE → trailing-30d total + window-end stamp
Step 3  Leak 1  idle clusters        confirmed   (billed for compute nobody used)
Step 4  Leak 2  jobs on All-Purpose  confirmed   (re-priced at the Jobs rate)
Step 5  Leak 3  overprovisioned      estimated   (spend × (1 − CPU%))
Step 6  Leak 4  Photon premium       at-risk     (the ~2× premium portion)
Step 7  rank + report                rank-and-report.py → CFO report

3. The data model it reads (`system.*`)

Every dollar figure starts from Unity-Catalog-governed system tables, never from a guessed spend number (even the one modeled figure, Leak 3, scales billed spend):

Table	Columns the skill uses	Role
`system.billing.usage`	`usage_date`, `sku_name`, `usage_quantity`, `usage_unit`, `usage_end_time`, `billing_origin_product`, `usage_metadata.{cluster_id,job_id}`	the billed-DBU ledger
`system.billing.list_prices`	`sku_name`, `usage_unit`, `pricing.default`, `price_start_time`, `price_end_time`, `currency_code`	the price book (time-windowed)
`system.compute.clusters`	`cluster_id`, `cluster_name`, `auto_termination_minutes`	config corroboration (Leak 1)
`system.compute.node_timeline`	per-node CPU → `avg_cpu_pct`	utilization (Leak 3)

Hard dependency: read access requires a metastore-admin grant chain (USE CATALOG → USE SCHEMA → SELECT). Step 1 probes it and reports the exact missing grant before any scan — the most common real-world failure, surfaced upfront instead of mid-flow.
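
As a rough illustration of the grant chain described above, the sketch below builds the three-step GRANT sequence for a set of system tables. It is not taken from system-tables-setup.md: the function name `grant_chain` and the principal `finops-readers` are hypothetical, and a metastore admin would still need to run the resulting statements.

```python
def grant_chain(principal: str, tables: list[str]) -> list[str]:
    """Build USE CATALOG -> USE SCHEMA -> SELECT grants for three-part table names.

    Illustrative only: `principal` is a hypothetical user or group name.
    """
    statements: list[str] = []
    seen: set[str] = set()

    def add(stmt: str) -> None:
        # Emit each grant once, even when several tables share a catalog or schema.
        if stmt not in seen:
            seen.add(stmt)
            statements.append(stmt)

    for full_name in tables:
        catalog, schema, _table = full_name.split(".")
        add(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`")
        add(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`")
        add(f"GRANT SELECT ON TABLE {full_name} TO `{principal}`")
    return statements


if __name__ == "__main__":
    for stmt in grant_chain("finops-readers", ["system.billing.usage",
                                               "system.billing.list_prices"]):
        print(stmt + ";")
```

A failure at any link of this chain is what the Step 1 probe is meant to surface before the scans begin.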

4. The dollar primitive — the `priced` CTE

Every category query reuses one priced CTE (from scripts/sql/spend-baseline.sql.json). The price join is matched on sku_name and usage_unit, within the price-effective window, in USD — so a re-priced SKU never double-counts:

WITH priced AS (
  SELECT u.usage_date, u.sku_name, u.usage_quantity, u.usage_unit,
         u.billing_origin_product, u.usage_metadata,
         u.usage_quantity * lp.pricing.default AS usd
  FROM system.billing.usage u
  JOIN system.billing.list_prices lp
    ON u.sku_name = lp.sku_name
   AND u.usage_unit = lp.usage_unit
   AND u.usage_end_time >= lp.price_start_time
   AND u.usage_end_time <  COALESCE(lp.price_end_time, TIMESTAMP '9999-12-31')
  WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
    AND lp.currency_code = 'USD'
)
SELECT billing_origin_product,
       ROUND(SUM(usd), 2)            AS spend_30d_usd,
       ROUND(SUM(usage_quantity), 2) AS dbus_30d
FROM priced GROUP BY billing_origin_product ORDER BY spend_30d_usd DESC
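
To make the window condition in the join above concrete, here is a minimal in-memory sketch of the same matching rule: a usage row picks the one price whose half-open window `[price_start_time, price_end_time)` contains its `usage_end_time`. The SKU name, prices, and dates are invented for illustration; only the column names come from the query.

```python
from datetime import datetime


def unit_price(prices, sku_name, usage_unit, usage_end_time):
    """Return the single USD list price whose window contains usage_end_time."""
    matches = [
        p["price"]
        for p in prices
        if p["sku_name"] == sku_name
        and p["usage_unit"] == usage_unit
        and p["currency_code"] == "USD"
        and p["price_start_time"] <= usage_end_time
        # An open-ended price (price_end_time is None) is still in effect.
        and (p["price_end_time"] is None or usage_end_time < p["price_end_time"])
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one price, found {len(matches)}")
    return matches[0]


# Hypothetical price book: one SKU re-priced on 2024-06-15.
PRICES = [
    {"sku_name": "EXAMPLE_SKU", "usage_unit": "DBU", "currency_code": "USD",
     "price": 0.50,
     "price_start_time": datetime(2024, 1, 1),
     "price_end_time": datetime(2024, 6, 15)},
    {"sku_name": "EXAMPLE_SKU", "usage_unit": "DBU", "currency_code": "USD",
     "price": 0.55,
     "price_start_time": datetime(2024, 6, 15),
     "price_end_time": None},
]
```

Because the window is closed on the left and open on the right, usage ending exactly at the re-pricing instant matches only the new price, which is why a re-priced SKU contributes each usage row once.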

Implementation detail: the Databricks CLI does not expand ${VARS} inside a --json @file, so the warehouse id is injected with jq at call time — the static template carries only wait_timeout + statement.
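
The skill does this injection with jq; purely to illustrate the idea, the sketch below shows the equivalent merge in Python. The template contents and the warehouse id are placeholders, and `build_statement_payload` is a hypothetical helper name, not part of the skill.

```python
import json


def build_statement_payload(template_text: str, warehouse_id: str) -> str:
    """Merge a runtime warehouse id into a static statement template.

    The template carries only wait_timeout + statement; warehouse_id is
    added at call time because the file itself cannot reference ${VARS}.
    """
    payload = json.loads(template_text)
    payload["warehouse_id"] = warehouse_id
    return json.dumps(payload)


# Placeholder template, standing in for spend-baseline.sql.json.
TEMPLATE = '{"wait_timeout": "30s", "statement": "SELECT 1"}'

if __name__ == "__main__":
    print(build_statement_payload(TEMPLATE, "example-warehouse-id"))
```

Keeping the template static means the SQL text is reviewable on disk, while the only per-call value is merged in just before the request is sent.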

5. The four leak detectors

Each is a query over priced + a confidence kind. Confidence is load-bearing — confirmed billed spend is never summed with modeled amounts.

#	Leak	Signal	`kind`	The math
1	Never auto-terminate	`compute.clusters.auto_termination_minutes = 0` on `ALL_PURPOSE` spend	confirmed	`SUM(usd)` — money actually billed for idle compute
2	Jobs on All-Purpose	`usage_metadata.job_id IS NOT NULL` and `billing_origin_product = 'ALL_PURPOSE'` (~$0.55/DBU)	confirmed	re-price the same DBUs at the Jobs rate (~$0.15) → savings delta
3	Overprovisioned	`node_timeline` mean CPU `< 25%`	estimated	`spend × (1 − CPU%)` — the one modeled number, labeled `est_*` everywhere
4	Photon premium	`sku_name ILIKE '%PHOTON%'`	at-risk	`SUM(usd) / 2` — the ~2× premium portion, pending a runtime-gain check
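
The "math" column above reduces to three small formulas. The sketch below restates them in Python with made-up inputs; it assumes `avg_cpu_pct` is on a 0 to 100 scale, and the function names are illustrative rather than anything shipped in the skill.

```python
def jobs_reprice_savings(usd_billed: float, dbus: float, jobs_unit_price: float) -> float:
    """Leak 2: billed All-Purpose dollars minus the same DBUs at the Jobs rate."""
    return round(usd_billed - dbus * jobs_unit_price, 2)


def overprovisioned_estimate(spend_usd: float, avg_cpu_pct: float) -> float:
    """Leak 3: spend x (1 - CPU%), with avg_cpu_pct assumed to be 0-100."""
    return round(spend_usd * (1 - avg_cpu_pct / 100), 2)


def photon_at_risk(photon_usd: float) -> float:
    """Leak 4: half of Photon spend, the premium portion of a ~2x price."""
    return round(photon_usd / 2, 2)


if __name__ == "__main__":
    # 1,000 DBUs billed at ~$0.55 and re-priced at ~$0.15 (the table's rough rates).
    print(jobs_reprice_savings(550.0, 1000, 0.15))
    print(overprovisioned_estimate(1000.0, 20.0))
    print(photon_at_risk(900.0))
```

Leak 1 needs no formula of its own: it is the plain `SUM(usd)` over the flagged clusters.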

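Leak 2, the signature one (re-pricing All-Purpose at the Jobs rate). Both `priced` and `jobs_rate` are CTEs defined ahead of this SELECT; `jobs_rate` is not shown here and holds one Jobs price per usage unit: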

SELECT p.usage_metadata.job_id AS job_id,
       ROUND(SUM(p.usd), 2) AS spend_on_all_purpose_30d_usd,
       ROUND(SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price), 2)
         AS potential_savings_30d_usd
FROM priced p
JOIN jobs_rate jr ON p.usage_unit = jr.usage_unit   -- deduped to ONE rate per unit so the join can't fan out
WHERE p.billing_origin_product = 'ALL_PURPOSE'
  AND p.usage_metadata.job_id IS NOT NULL
GROUP BY p.usage_metadata.job_id
HAVING SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price) > 0
ORDER BY potential_savings_30d_usd DESC

Each flagged row is then corroborated on the control plane (clusters_get / clusters_events / instance_pools_list / pipelines_get) — e.g. Leak 1 confirms autotermination_minutes = 0 right now and measures the live idle gap.

6. The leak-object schema + the deterministic ranker

The detectors emit one JSON object per category; rank-and-report.py consumes them. The LLM does not do the arithmetic.

// input to rank-and-report.py — one per leak category
{
  "category":       "Clusters that never shut themselves off",
  "root_cause":     "paying around the clock for compute nobody is using",  // FinOps language
  "fix":            "Set auto-shutoff (e.g. 30 min)",                       // one config change
  "waste_30d_usd":  12000,                                                  // from system.billing.usage
  "kind":           "confirmed"   // confirmed | estimated | at-risk
}

The ranker:

normalizes each 30-day figure to an average calendar month (× 365/12 ÷ 30, roughly +1.4%), then ranks descending by monthly impact;
split sum: confirmed_monthly and unconfirmed_monthly (estimated + at-risk) are computed separately and never added into one headline figure; this is the regression-critical invariant;
tolerates LLM-formatted currency ("$1,200.50" → 1200.5) and case-variant kinds ("AT-RISK " → at-risk) so a row is never silently dropped from a sum;
renders the report verbatim per cfo-output-format.md.
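
The ranker's behavior as listed above can be sketched in a few lines. This is not the contents of rank-and-report.py, only a minimal reconstruction from the description: the helper names are hypothetical, and raising on an unknown `kind` is an assumption made here so that a bad row fails loudly instead of vanishing from a sum.

```python
import re

DAYS_PER_MONTH = 365 / 12  # average calendar month


def parse_usd(value) -> float:
    """Accept 1200.5 or LLM-formatted strings such as "$1,200.50"."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(re.sub(r"[^0-9.\-]", "", value))


def normalize_kind(kind: str) -> str:
    """Map case and whitespace variants ("AT-RISK ") onto canonical labels."""
    canonical = kind.strip().lower().replace("_", "-").replace(" ", "-")
    if canonical not in {"confirmed", "estimated", "at-risk"}:
        raise ValueError(f"unknown kind: {kind!r}")
    return canonical


def rank(leaks: list[dict]) -> dict:
    rows = []
    for leak in leaks:
        # 30-day figure -> monthly: x 365/12 / 30
        monthly = parse_usd(leak["waste_30d_usd"]) * DAYS_PER_MONTH / 30
        rows.append({**leak,
                     "kind": normalize_kind(leak["kind"]),
                     "monthly_usd": round(monthly, 2)})
    rows.sort(key=lambda r: r["monthly_usd"], reverse=True)
    # Two separate sums; the totals are never combined.
    confirmed = sum(r["monthly_usd"] for r in rows if r["kind"] == "confirmed")
    unconfirmed = sum(r["monthly_usd"] for r in rows if r["kind"] != "confirmed")
    return {"rows": rows,
            "confirmed_monthly": round(confirmed, 2),
            "unconfirmed_monthly": round(unconfirmed, 2)}
```

With the sample object from this section (`waste_30d_usd: 12000`, `confirmed`), the monthly figure works out to 12000 × 365/12 ÷ 30 ≈ 12,166.67.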

7. The output schema (the CFO report)

A fixed shape — the thing the sample gist renders:

### A $<spend>/month workspace is burning ~$<confirmed>/month (confirmed),
    plus up to ~$<at-risk>/month pending review
Trailing 30 days ending <window-end>. Confirmed ~$<X>K/year; up to ~$<Y>K/year pending review.

| # | Where it's leaking | $/month | Confidence | The fix |   ← ranked desc, $ right-aligned
...
**The #1 line alone — <category> (<kind>) — is ~$<Z>K/year, fixed in one setting.**

> What's assumed vs cited: only the workspace-spend input is assumed; every per-row
>   dollar is from system.billing.usage; Confidence marks measured vs modeled.

Plain-business root-cause text (no raw DBU in CFO-visible cells); the $/DBU detail stays in the per-engineer artifacts.
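
For a sense of how the two split totals feed the headline, the sketch below fills the first lines of the template. The number formatting (whole dollars, thousands separators, rounding to the nearest $K) is an assumption; the authoritative rules live in cfo-output-format.md, which is not reproduced here.

```python
def k_per_year(monthly_usd: float) -> str:
    """Annualize a monthly figure and express it in thousands, e.g. "~$146K/year"."""
    return f"~${monthly_usd * 12 / 1000:,.0f}K/year"


def headline(spend_monthly: float, confirmed_monthly: float,
             unconfirmed_monthly: float) -> str:
    """Render the report's opening line; the two totals stay in separate clauses."""
    return (
        f"### A ${spend_monthly:,.0f}/month workspace is burning "
        f"~${confirmed_monthly:,.0f}/month (confirmed),\n"
        f"    plus up to ~${unconfirmed_monthly:,.0f}/month pending review"
    )


if __name__ == "__main__":
    print(headline(100000, 12166.67, 1217.17))
    print(k_per_year(12166.67))
```

Note that the confirmed and pending-review amounts each get their own clause, mirroring the split-sum invariant in the ranker.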

8. The frontmatter contract

name: databricks-cost-leak-hunter
allowed-tools: Read, Write, Edit, Bash(databricks:*), Bash(jq:*), Glob,
  mcp__databricks-workspace-mcp__clusters_get,
  mcp__databricks-workspace-mcp__clusters_events,
  mcp__databricks-workspace-mcp__clusters_list,
  mcp__databricks-workspace-mcp__instance_pools_list,
  mcp__databricks-workspace-mcp__pipelines_get

Least-privilege: the only Bash surfaces are databricks (the CLI) and jq; the only MCP tools are five read control-plane calls. No secrets are hardcoded — all auth is environment / the registered MCP server.

9. The eval contract (`eval-spec.yaml`)

Behavioral eval (j-rig), 6 criteria × 7 test cases:

Criterion	Blocker?
`triggers-on-cost-question`	✓
`produces-cfo-grokkable-report` (a CFO acts in ~90s, no engineer to translate)	✓
`splits-confirmed-vs-estimated` (never one number; labels each leak's confidence)	regression-critical
`dollars-from-billing-not-estimates`
`checks-grant-chain-upfront`
`no-prompt-leakage`	✓

Cases include the three trigger phrasings, two negative controls (weather, reverse-a-linked-list → should_not_trigger), a missing-grant-chain edge case, and a prompt-injection adversarial case. A declared sibling boundary keeps the skill distinct from databricks-cost-tuning (which authors policy; this one detects leaks).

Built as a Claude Code skill in the Tons of Skills marketplace.
