databricks-cost-leak-hunter — technical breakdown (the skill flattened: filesystem · system.* data model · the priced CTE · the 4 detectors · the deterministic ranker · the eval). Tech companion to the CFO sample-output gist.

databricks-cost-leak-hunter — technical breakdown (the skill, flattened)

How the cost-leak hunter actually works under the hood: the filesystem, the data model it reads, the SQL primitive, the four detectors, and the deterministic ranker — exposed.

| Skill | databricks-cost-leak-hunter |
|---|---|
| Version | 0.1.0 |
| Author | Jeremy Longshore <jeremy@intentsolutions.io> |
| License | MIT |
| Part of | databricks-pack (24-skill Databricks pack) |
| Marketing / sample output | CFO one-pager gist · live demo |

The sample output is the what. This is the how.


1. The skill, flattened

databricks-cost-leak-hunter/
├── SKILL.md                              # the contract + the detect→compute→rank→report pipeline
├── eval-spec.yaml                        # behavioral eval: 6 criteria × 7 cases (j-rig)
├── scripts/
│   ├── rank-and-report.py                # deterministic ranker — the LLM never does the arithmetic
│   └── sql/
│       └── spend-baseline.sql.json       # the `priced` CTE dollar primitive (parameterized at call time)
└── references/                           # loaded ON DEMAND, only when a leak needs it
    ├── cost-leak-categories.md           # the 4 categories: definition · detection SQL · root cause · fix
    ├── cfo-output-format.md              # verbatim CFO report template + 90-second-skim rules
    ├── system-tables-setup.md            # the metastore-admin grant chain + access verification
    └── dlt-tier-cost-tradeoffs.md        # DLT / serverless / Photon cost-tier encyclopedia

Three layers, by design:

  • SKILL.md — the orchestration the model reads top-to-bottom.
  • scripts/ — deterministic compute. The arithmetic and the report render run in Python, so the LLM never eyeballs a dollar figure.
  • references/ — deep domain knowledge, progressive-disclosure: nothing here loads until a specific leak or failure needs it.

2. Architecture — two data planes

The skill reads two Databricks surfaces, and the split is the whole design:

| Plane | Source | Auth | Answers |
|---|---|---|---|
| Billing (the number) | CLI → Statement Execution API → system.billing.* + system.compute.* | DATABRICKS_HOST + DATABRICKS_TOKEN (UC enforces the grant chain) | How many dollars, exactly |
| Control (the why + the fix) | custom databricks-workspace-mcp REST tools | the MCP server's own PAT / U2M / M2M | Why the leak exists, and the single config change that fixes it |

The SQL produces the confirmed dollar figure; the workspace MCP turns it into a verified, one-config-change fix. If the MCP is absent the skill still produces dollar figures and accepts pasted config — it degrades, it doesn't fail.

Pipeline: detect → compute → rank → report.

Step 1  grant-chain probe            SELECT 1 FROM system.billing.usage LIMIT 1   (fail fast, not mid-flow)
Step 2  spend baseline               the `priced` CTE → trailing-30d total + window-end stamp
Step 3  Leak 1  idle clusters        confirmed   (billed for compute nobody used)
Step 4  Leak 2  jobs on All-Purpose  confirmed   (re-priced at the Jobs rate)
Step 5  Leak 3  overprovisioned      estimated   (spend × (1 − CPU%))
Step 6  Leak 4  Photon premium       at-risk     (the ~2× premium portion)
Step 7  rank + report                rank-and-report.py → CFO report

3. The data model it reads (system.*)

Everything dollar-bearing comes from Unity-Catalog-governed system tables — never an estimate:

| Table | Columns the skill uses | Role |
|---|---|---|
| system.billing.usage | usage_date, sku_name, usage_quantity, usage_unit, usage_end_time, billing_origin_product, usage_metadata.{cluster_id,job_id} | the billed-DBU ledger |
| system.billing.list_prices | sku_name, usage_unit, pricing.default, price_start_time, price_end_time, currency_code | the price book (time-windowed) |
| system.compute.clusters | cluster_id, cluster_name, auto_termination_minutes | config corroboration (Leak 1) |
| system.compute.node_timeline | per-node CPU → avg_cpu_pct | utilization (Leak 3) |

Hard dependency: read access requires a metastore-admin grant chain (USE CATALOG → USE SCHEMA → SELECT). Step 1 probes it and reports the exact missing grant before any scan, so the most common real-world failure surfaces upfront instead of mid-flow.
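As a rough illustration of that grant chain, the sketch below builds the three GRANT statements a metastore admin would run, outermost privilege first. The function name `grant_chain` and the principal `finops-readers` are hypothetical, invented for this example; the skill's own remediation text lives in references/system-tables-setup.md and may be worded differently.

```python
def grant_chain(principal: str, schema: str, table: str) -> list[str]:
    """Return the USE CATALOG -> USE SCHEMA -> SELECT grants for one system table.

    `principal` is a hypothetical user or group name; it is backtick-quoted
    because names containing hyphens or '@' need quoting in SQL.
    """
    who = f"`{principal}`"
    return [
        f"GRANT USE CATALOG ON CATALOG system TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA system.{schema} TO {who}",
        f"GRANT SELECT ON TABLE system.{schema}.{table} TO {who}",
    ]


if __name__ == "__main__":
    # The billing ledger that the Step 1 probe reads.
    for statement in grant_chain("finops-readers", "billing", "usage"):
        print(statement + ";")
```

A missing link anywhere in the chain is enough to make the Step 1 probe fail, which is why all three statements are listed together rather than only the final SELECT.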


4. The dollar primitive — the priced CTE

Every category query reuses one priced CTE (from scripts/sql/spend-baseline.sql.json). The price join is matched on sku_name and usage_unit, within the price-effective window, in USD — so a re-priced SKU never double-counts:

WITH priced AS (
  SELECT u.usage_date, u.sku_name, u.usage_quantity, u.usage_unit,
         u.billing_origin_product, u.usage_metadata,
         u.usage_quantity * lp.pricing.default AS usd
  FROM system.billing.usage u
  JOIN system.billing.list_prices lp
    ON u.sku_name = lp.sku_name
   AND u.usage_unit = lp.usage_unit
   AND u.usage_end_time >= lp.price_start_time
   AND u.usage_end_time <  COALESCE(lp.price_end_time, TIMESTAMP '9999-12-31')
  WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
    AND lp.currency_code = 'USD'
)
SELECT billing_origin_product,
       ROUND(SUM(usd), 2)            AS spend_30d_usd,
       ROUND(SUM(usage_quantity), 2) AS dbus_30d
FROM priced GROUP BY billing_origin_product ORDER BY spend_30d_usd DESC
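To see why the price-effective window prevents double-counting, here is a minimal in-memory sketch of the same join conditions. The SKU name `EXAMPLE_SKU`, the two prices, and the dates are made-up illustration values, and `priced_usd` is a hypothetical helper, not part of the skill.

```python
from datetime import datetime
from decimal import Decimal

FAR_FUTURE = datetime(9999, 12, 31)  # stands in for an open-ended price_end_time


def priced_usd(usage, price_book):
    """Mimic the priced join for one usage row.

    Sums quantity x price over every price row that satisfies the join
    conditions, so a fan-out would show up as an inflated total.
    """
    total = Decimal("0")
    for p in price_book:
        end = p["price_end_time"] or FAR_FUTURE
        if (p["sku_name"] == usage["sku_name"]
                and p["usage_unit"] == usage["usage_unit"]
                and p["currency_code"] == "USD"
                and p["price_start_time"] <= usage["usage_end_time"] < end):
            total += usage["usage_quantity"] * p["price"]
    return total


# Hypothetical price book: the same SKU re-priced on 2024-06-01.
PRICE_BOOK = [
    {"sku_name": "EXAMPLE_SKU", "usage_unit": "DBU", "currency_code": "USD",
     "price": Decimal("0.50"),
     "price_start_time": datetime(2024, 1, 1), "price_end_time": datetime(2024, 6, 1)},
    {"sku_name": "EXAMPLE_SKU", "usage_unit": "DBU", "currency_code": "USD",
     "price": Decimal("0.55"),
     "price_start_time": datetime(2024, 6, 1), "price_end_time": None},
]

USAGE = {"sku_name": "EXAMPLE_SKU", "usage_unit": "DBU",
         "usage_quantity": Decimal("100"), "usage_end_time": datetime(2024, 6, 10)}

if __name__ == "__main__":
    print(priced_usd(USAGE, PRICE_BOOK))
```

Because the window is half-open (start inclusive, end exclusive), a usage row that ends exactly on the re-pricing boundary matches only the newer price row. Dropping the two time conditions would match both rows and bill the same 100 DBUs twice.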

Implementation detail: the Databricks CLI does not expand ${VARS} inside a --json @file, so the warehouse id is injected with jq at call time — the static template carries only wait_timeout + statement.
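The skill does this injection with jq; purely for illustration, the same step can be sketched in Python. The template values below (`"30s"`, `SELECT 1`) and the warehouse id are placeholders, and `with_warehouse` is a hypothetical helper. `warehouse_id`, `statement`, and `wait_timeout` are the Statement Execution API's request field names.

```python
import json

# Placeholder stand-in for the static template: only wait_timeout + statement.
TEMPLATE = '{"wait_timeout": "30s", "statement": "SELECT 1"}'


def with_warehouse(template_json: str, warehouse_id: str) -> str:
    """Return the request body with warehouse_id injected at call time."""
    payload = json.loads(template_json)
    payload["warehouse_id"] = warehouse_id  # the one field the template cannot carry
    return json.dumps(payload)


if __name__ == "__main__":
    # "abc123" is a made-up id; a real one would come from the environment.
    print(with_warehouse(TEMPLATE, "abc123"))
```

Keeping the template static means the SQL text is reviewable in the repository, while the environment-specific value never has to be committed.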


5. The four leak detectors

Each is a query over priced + a confidence kind. Confidence is load-bearing — confirmed billed spend is never summed with modeled amounts.

| # | Leak | Signal | kind | The math |
|---|---|---|---|---|
| 1 | Never auto-terminate | compute.clusters.auto_termination_minutes = 0 on ALL_PURPOSE spend | confirmed | SUM(usd): money actually billed for idle compute |
| 2 | Jobs on All-Purpose | usage_metadata.job_id IS NOT NULL and billing_origin_product = 'ALL_PURPOSE' (~$0.55/DBU) | confirmed | re-price the same DBUs at the Jobs rate (~$0.15/DBU) → savings delta |
| 3 | Overprovisioned | node_timeline mean CPU < 25% | estimated | spend × (1 − CPU%): the one modeled number, labeled est_* everywhere |
| 4 | Photon premium | sku_name ILIKE '%PHOTON%' | at-risk | SUM(usd) / 2: the ~2× premium portion, pending a runtime-gain check |
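The two non-confirmed formulas (Leaks 3 and 4) are simple enough to sketch directly. This is an illustration of the arithmetic only, with hypothetical function names and made-up inputs; in the skill the inputs come from the `priced` CTE and `node_timeline`.

```python
from decimal import Decimal

CPU_THRESHOLD_PCT = Decimal("25")  # Leak 3 signal: mean CPU below 25%


def est_overprovisioned_usd(spend_usd: Decimal, avg_cpu_pct: Decimal):
    """Leak 3 (estimated): spend x (1 - CPU%), or None if the cluster is not flagged."""
    if avg_cpu_pct >= CPU_THRESHOLD_PCT:
        return None
    return spend_usd * (1 - avg_cpu_pct / 100)


def photon_at_risk_usd(photon_spend_usd: Decimal) -> Decimal:
    """Leak 4 (at-risk): half of Photon spend, i.e. the ~2x premium portion."""
    return photon_spend_usd / 2


if __name__ == "__main__":
    # Made-up inputs: $1,000 of spend at 20% mean CPU, $3,000 of Photon spend.
    print(est_overprovisioned_usd(Decimal("1000"), Decimal("20")))
    print(photon_at_risk_usd(Decimal("3000")))
```

Neither result is billed waste in the way Leaks 1 and 2 are, which is why both carry a kind other than confirmed and stay out of the confirmed total.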

Leak 2, the signature one (re-pricing All-Purpose at the Jobs rate):

SELECT p.usage_metadata.job_id AS job_id,
       ROUND(SUM(p.usd), 2) AS spend_on_all_purpose_30d_usd,
       ROUND(SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price), 2)
         AS potential_savings_30d_usd
FROM priced p
JOIN jobs_rate jr ON p.usage_unit = jr.usage_unit   -- deduped to ONE rate per unit so the join can't fan out
WHERE p.billing_origin_product = 'ALL_PURPOSE'
  AND p.usage_metadata.job_id IS NOT NULL
GROUP BY p.usage_metadata.job_id
HAVING SUM(p.usd) - SUM(p.usage_quantity * jr.jobs_unit_price) > 0
ORDER BY potential_savings_30d_usd DESC
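A worked example of the same re-pricing, as a sketch under stated assumptions: the rates are the approximate ~$0.55 and ~$0.15 per DBU figures quoted in the detector table, the job ids and quantities are made up, and `jobs_rate` is modeled as a dict so that each usage unit can only ever map to one rate.

```python
from collections import defaultdict
from decimal import Decimal


def leak2_savings(rows, jobs_rate):
    """Per-job savings from re-pricing All-Purpose DBUs at the Jobs rate.

    Mirrors the WHERE / GROUP BY / HAVING / ORDER BY of the query above.
    """
    billed = defaultdict(Decimal)
    repriced = defaultdict(Decimal)
    for r in rows:
        if r["billing_origin_product"] != "ALL_PURPOSE" or r["job_id"] is None:
            continue  # WHERE clause
        if r["usage_unit"] not in jobs_rate:
            continue  # an inner join drops rows with no matching rate
        billed[r["job_id"]] += r["usd"]
        repriced[r["job_id"]] += r["usage_quantity"] * jobs_rate[r["usage_unit"]]
    savings = {job: billed[job] - repriced[job] for job in billed}
    positive = [(job, s) for job, s in savings.items() if s > 0]  # HAVING
    return dict(sorted(positive, key=lambda kv: kv[1], reverse=True))


# Made-up rows: 10,000 DBUs billed at ~$0.55 for one job, plus interactive usage.
ROWS = [
    {"job_id": "101", "billing_origin_product": "ALL_PURPOSE", "usage_unit": "DBU",
     "usage_quantity": Decimal("10000"), "usd": Decimal("5500")},
    {"job_id": None, "billing_origin_product": "ALL_PURPOSE", "usage_unit": "DBU",
     "usage_quantity": Decimal("2000"), "usd": Decimal("1100")},
]
JOBS_RATE = {"DBU": Decimal("0.15")}  # approximate Jobs rate from the table

if __name__ == "__main__":
    print(leak2_savings(ROWS, JOBS_RATE))
```

Here job 101 was billed $5,500 and would have cost $1,500 at the Jobs rate, so the savings delta is $4,000. The interactive row has no job_id and is excluded, since re-pricing only applies to work that could have run as a job.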

Each flagged row is then corroborated on the control plane (clusters_get / clusters_events / instance_pools_list / pipelines_get) — e.g. Leak 1 confirms autotermination_minutes = 0 right now and measures the live idle gap.


6. The leak-object schema + the deterministic ranker

The detectors emit one JSON object per category; rank-and-report.py consumes them. The LLM does not do the arithmetic.

// input to rank-and-report.py — one per leak category
{
  "category":       "Clusters that never shut themselves off",
  "root_cause":     "paying around the clock for compute nobody is using",  // FinOps language
  "fix":            "Set auto-shutoff (e.g. 30 min)",                       // one config change
  "waste_30d_usd":  12000,                                                  // from system.billing.usage
  "kind":           "confirmed"   // confirmed | estimated | at-risk
}

The ranker:

  • 30d → monthly (× 365/12 ÷ 30), then ranks descending by monthly impact;
  • split sum: confirmed_monthly and unconfirmed_monthly (estimated + at-risk) are computed separately and never added under one verb; this is the regression-critical invariant;
  • tolerates LLM-formatted currency ("$1,200.50" → 1200.5) and case-variant kinds ("AT-RISK " → "at-risk") so a row is never silently dropped from a sum;
  • renders the report verbatim per cfo-output-format.md.
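The first three behaviors can be sketched in a few lines. This is not the contents of rank-and-report.py, only a minimal illustration under assumptions: the function names are hypothetical, and raising on an unknown kind is one possible way to honor "never silently dropped".

```python
import re

VALID_KINDS = {"confirmed", "estimated", "at-risk"}
MONTHLY_FACTOR = (365 / 12) / 30  # trailing-30d total -> average calendar month


def to_number(value) -> float:
    """Accept 12000, 1200.5, or LLM-formatted strings such as "$1,200.50"."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(re.sub(r"[^0-9.\-]", "", str(value)))


def normalize_kind(kind) -> str:
    """Map case and whitespace variants ("AT-RISK ") onto the canonical kinds."""
    k = str(kind).strip().lower()
    if k not in VALID_KINDS:
        # Assumption: fail loudly rather than drop the row from a sum.
        raise ValueError(f"unknown kind: {kind!r}")
    return k


def rank_leaks(leaks):
    """Return (rows ranked by monthly impact, confirmed_monthly, unconfirmed_monthly)."""
    rows = []
    for leak in leaks:
        rows.append({
            **leak,
            "kind": normalize_kind(leak["kind"]),
            "monthly_usd": to_number(leak["waste_30d_usd"]) * MONTHLY_FACTOR,
        })
    rows.sort(key=lambda r: r["monthly_usd"], reverse=True)
    # The invariant: two separate sums, never one combined figure.
    confirmed = sum(r["monthly_usd"] for r in rows if r["kind"] == "confirmed")
    unconfirmed = sum(r["monthly_usd"] for r in rows if r["kind"] != "confirmed")
    return rows, confirmed, unconfirmed
```

With the $12,000 confirmed leak from the schema example, the monthly figure is 12000 × 365/12 ÷ 30, roughly $12,167. Any estimated or at-risk rows would land in the second total rather than inflating that one.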

7. The output schema (the CFO report)

A fixed shape — the thing the sample gist renders:

### A $<spend>/month workspace is burning ~$<confirmed>/month (confirmed),
    plus up to ~$<at-risk>/month pending review
Trailing 30 days ending <window-end>. Confirmed ~$<X>K/year; up to ~$<Y>K/year pending review.

| # | Where it's leaking | $/month | Confidence | The fix |   ← ranked desc, $ right-aligned
...
**The #1 line alone — <category> (<kind>) — is ~$<Z>K/year, fixed in one setting.**

> What's assumed vs cited: only the workspace-spend input is assumed; every per-row
>   dollar is from system.billing.usage; Confidence marks measured vs modeled.

Plain-business root-cause text (no raw DBU in CFO-visible cells); the $/DBU detail stays in the per-engineer artifacts.


8. The frontmatter contract

name: databricks-cost-leak-hunter
allowed-tools: Read, Write, Edit, Bash(databricks:*), Bash(jq:*), Glob,
  mcp__databricks-workspace-mcp__clusters_get,
  mcp__databricks-workspace-mcp__clusters_events,
  mcp__databricks-workspace-mcp__clusters_list,
  mcp__databricks-workspace-mcp__instance_pools_list,
  mcp__databricks-workspace-mcp__pipelines_get

Least-privilege: the only Bash surfaces are databricks (the CLI) and jq; the only MCP tools are five read control-plane calls. No secrets are hardcoded — all auth is environment / the registered MCP server.
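The least-privilege claim is easy to check mechanically. The sketch below parses the allowed-tools value shown above and lists the Bash surfaces and MCP tools it grants; `audit_allowed_tools` is a hypothetical helper written for this example, not something shipped with the skill.

```python
ALLOWED_TOOLS = """Read, Write, Edit, Bash(databricks:*), Bash(jq:*), Glob,
  mcp__databricks-workspace-mcp__clusters_get,
  mcp__databricks-workspace-mcp__clusters_events,
  mcp__databricks-workspace-mcp__clusters_list,
  mcp__databricks-workspace-mcp__instance_pools_list,
  mcp__databricks-workspace-mcp__pipelines_get"""


def audit_allowed_tools(allowed_tools: str):
    """Return (sorted Bash command names, list of MCP tool names)."""
    tools = [t.strip() for t in allowed_tools.replace("\n", " ").split(",") if t.strip()]
    # "Bash(databricks:*)" -> "databricks"
    bash = sorted(t[len("Bash("):-1].split(":")[0] for t in tools if t.startswith("Bash("))
    mcp = [t for t in tools if t.startswith("mcp__")]
    return bash, mcp


if __name__ == "__main__":
    bash, mcp = audit_allowed_tools(ALLOWED_TOOLS)
    print("Bash surfaces:", bash)
    print("MCP tools:", len(mcp))
```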


9. The eval contract (eval-spec.yaml)

Behavioral eval (j-rig), 6 criteria × 7 test cases:

| Criterion | Blocker? |
|---|---|
| triggers-on-cost-question | |
| produces-cfo-grokkable-report (a CFO acts in ~90s, no engineer to translate) | |
| splits-confirmed-vs-estimated (never one number; labels each leak's confidence) | regression-critical |
| dollars-from-billing-not-estimates | |
| checks-grant-chain-upfront | |
| no-prompt-leakage | |

Cases include the three trigger phrasings, two negative controls (weather, reverse-a-linked-list → should_not_trigger), a missing-grant-chain edge case, and a prompt-injection adversarial. A declared sibling boundary keeps it distinct from databricks-cost-tuning (which authors policy; this one detects leaks).


Built as a Claude Code skill in the Tons of Skills marketplace.
