fsenrich: Inode Guardrail Causing Misleading Disk Space Enrichment

Date: 2026-05-19
Discovered via: Beta vs prod enrichment comparison during ENGTRIASVC-3067 validation
Affects: All prod disk space enrichment on high-inode filesystems
Salt file: automation-tenancies/triatm — salt/orch/triatm/fsenrich.sls

Summary

fsenrich's Salt orchestration caps directory scan depth at 3 when a filesystem has >85M inodes (auto-depth mode). This threshold is too aggressive — real-world data shows the cap fires on filesystems that would complete a full depth-50 scan in under 5 minutes, resulting in enrichment notes that miss the largest directories entirely and show inflated, incorrect percentages.

The Guardrail (Current Code)

In fsenrich.sls, the get_tree state applies this logic when depth=0 (auto):

# 85M inodes ≈ 40 min scan time based on regression analysis.
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 85000000 ]; then
  effective_depth=3
else
  effective_depth=50
fi

When depth=3, find lists only entries within 3 levels of the filesystem root. On most /bb filesystems this means directories whose files sit near the top, such as /bb/data and /bb/bin, are measured almost completely, while top-level directories such as /bb/jenkins_work or /bb/risk (often the actual offenders) are discovered but credited only with their shallowest files. They then drop out of the top-N results because smaller, shallower subtrees dominate.

Why depth=3 doesn't surface `/bb/jenkins_work` even though it's at depth=1

This is subtle. find /bb -maxdepth 3 does discover /bb/jenkins_work as a directory entry — it exists at depth=1, well within the limit. The problem is how directory sizes are computed.

The Salt AWK script calculates directory sizes by accumulating file sizes from the find output stream. A directory's reported size is the sum of all file sizes that find lists inside it. With -maxdepth 3, find only outputs files at depth ≤ 3 from the root — files deeper than that are silently omitted.

For jenkins_work, the actual build artifacts live at depth 8–15+ within the filesystem (e.g. /bb/jenkins_work/pas-ci/jaas/chartscore/workspace/PR-2445/cmake-build/.../listmanc.t.tsk). None of these appear in a depth=3 scan. The AWK accumulator therefore assigns /bb/jenkins_work a size of nearly zero — only counting the handful of shallow files directly under it.

Meanwhile /bb/data has logs, CSVs, and database files at depth 2–3, so its measured size looks proportionally large even though it holds far less space overall.

The cascading effect:

total_size (the denominator for percentages) is computed as the sum of all file sizes find reported — a tiny fraction of the real 700+ GB used
Percentages are wildly inflated: a 5 GB directory measured against a ~22 GB scanned subtree appears as "24%" instead of the correct ~0.7%
/bb/jenkins_work never enters the top-N heap because its accumulated size is near zero, even though it holds 300–424 GB

In short: depth=3 does not hide large directories; it measures them as nearly empty. They appear in the directory listing but with the wrong (near-zero) size.
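The under-measurement described above can be reproduced on a throwaway tree. The sketch below is not the Salt AWK script, which is not shown in this note. sizes_by_top_dir is a hypothetical stand-in that sums file sizes per top-level directory, and it assumes GNU find for -printf.

```shell
# Sketch only: a simplified stand-in for the real accumulator.
# Requires GNU find (-printf).

# Print "<bytes> <top-level dir>" for each top-level directory under ROOT,
# counting only the files that find emits within DEPTH levels.
sizes_by_top_dir() {
  local root="$1" depth="$2"
  find "$root" -maxdepth "$depth" -type f -printf '%s %P\n' |
    awk '{ split($2, parts, "/"); total[parts[1]] += $1 }
         END { for (d in total) print total[d], d }' |
    sort -rn
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/jenkins_work/a/b/c/d/e" "$demo_root/data/logs"
# 4096 bytes at depth 7: never listed by -maxdepth 3.
head -c 4096 /dev/zero > "$demo_root/jenkins_work/a/b/c/d/e/artifact.tsk"
# 100 bytes at depth 3: listed by -maxdepth 3.
head -c 100 /dev/zero > "$demo_root/data/logs/app.log"

shallow=$(sizes_by_top_dir "$demo_root" 3)
deep=$(sizes_by_top_dir "$demo_root" 50)
echo "maxdepth 3:";  echo "$shallow"
echo "maxdepth 50:"; echo "$deep"
rm -rf "$demo_root"
```

At maxdepth 3 only data is reported, and the total is 100 bytes. At maxdepth 50 jenkins_work leads with 4096 bytes. In this simplified version a directory with no listed files drops out of the output altogether, whereas the note describes the real script as assigning it a near-zero size.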

Evidence

Case 1: prtbld-ob-955 / /bb (459M inodes)

DRQS 184303504 — 90% bytes used — 2026-05-19 18:07 UTC

Run	depth	inodes_total	Largest Dir Found	Scan Duration
Prod (auto)	0→3	~459M (estimated)	`/bb/data` 20 GB	~60s
Beta metric (auto)	0→3	459,355,736	`/bb/data` 20 GB	~60s
Beta bot `-d 50`	50	459,296,608	`/bb/jenkins_work` 309 GB	~3.5 min

With depth=3: Enrichment reports /bb/data at 20 GB as the top directory, with inflated percentages (e.g. "24%" for a 5 GB directory calculated against the 22 GB scanned subtree, not the 700 GB total used). /bb/jenkins_work at 300–424 GB — the actual space consumer — is completely absent.

With depth=50: Enrichment correctly identifies /bb/jenkins_work as the dominant directory. Percentages are accurate (calculated against total used). Scan completed in ~3.5 minutes — well within the 40-minute timeout.

Inode count: 459M — 5.4× the 85M threshold. Scan time at depth=50: ~3.5 min.
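The inflated percentage in this case is a denominator effect, which a quick calculation makes concrete. The pct helper below is hypothetical and uses the rounded figures quoted above (5 GB, ~22 GB scanned, ~700 GB used), so the result lands near the reported "24%" rather than matching it exactly.

```shell
# pct PART WHOLE: PART as a percentage of WHOLE, to one decimal place.
pct() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

against_scanned=$(pct 5 22)   # denominator: bytes find reported at depth=3
against_used=$(pct 5 700)     # denominator: bytes actually used on /bb
echo "5 GB of ~22 GB scanned: ${against_scanned}%"
echo "5 GB of ~700 GB used:   ${against_used}%"
```

The same 5 GB directory reads as roughly 23% or 0.7% depending only on which total it is divided by.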

Case 2: emsdev-ob-261 / /bb/data7 (1.035B inodes)

DRQS 184285499 — filesystem had 18 GB used (512 GB total)

Run	depth	inodes_total	Result
Beta metric (auto)	0→3	1,035,531,339	`No directories to analyze` — empty
Prod bot (auto)	0→3	—	Full directory tree with 5 dirs, 5 files found

Beta and prod diverged here: prod found the Jenkins workspace directories (10–17 GB each) and large .tsk build artifacts while beta returned completely empty. The 1B+ inode count was caused by a Jenkins workspace accumulating thousands of tiny build artifacts. At depth=3, the scan found nothing worth surfacing.

Note: The prod result comes from a different mechanism, the triagercsvc tree command, which is not inode-gated, so this case is not a like-for-like comparison. Even so, the empty beta result on a filesystem with 18 GB of real content illustrates the issue.

Root Cause

The 85M threshold comment claims "85M inodes ≈ 40 min scan time based on regression analysis." However:

prtbld-ob-955 has 459M inodes (5.4× the threshold), yet depth=50 completed in ~3.5 minutes, roughly 11× faster than the 40 minutes the comment predicts for 85M inodes
The threshold appears to be based on stale benchmarks or a specific worst-case host that no longer represents the general case
The fallback to depth=3 is too shallow: on most filesystems, the top-level directories that hold the most space (jenkins_work, risk, etc.) exist at depth=1, but the files that make up their size sit below depth 3 and are never counted, so these directories don't appear in the top-N results

The binary 85M/depth-3 decision produces enrichment notes that:

Miss the largest directories entirely
Show wildly incorrect percentages (calculated against a tiny scanned subtree)
Give triage engineers a completely misleading picture of what's consuming disk space

Proposed Fix

Replace the binary threshold with a tiered approach:

# Current (too aggressive):
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 85000000 ]; then
  effective_depth=3
else
  effective_depth=50
fi

# Proposed (tiered):
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 500000000 ]; then
  effective_depth=3
elif [ "$inodes_total" -gt 85000000 ]; then
  effective_depth=10
else
  effective_depth=50
fi

Tiers:

inodes_total	effective_depth	Rationale
≤85M	50	Current behavior, unchanged
85M–500M	10	Covers prtbld class (459M). Depth=10 finds top-level offenders. Estimated scan time: <10 min
>500M	3	Extreme cases (emsdev class at 1B+). Conservative cap retained

Why depth=10 for the middle tier:

Top-level directories (/bb/jenkins_work, /bb/risk, etc.) exist at depth=1. Depth=10 gives 9 levels of traversal within them — enough to surface the actual large subdirectories.
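prtbld at depth=50 completed in 3.5 min. A depth=10 scan visits a subset of the same tree, so it should take no longer.
Percentages will be much closer to the truth, but not exact: find still omits files below depth 10, and the jenkins_work artifacts sit at depth 8–15+, so total_size can still undercount on the deepest trees.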
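The tier boundaries are easy to get off by one, so it may help to wrap the proposed logic in a function and exercise the edges. select_depth is a hypothetical wrapper, not a name from fsenrich.sls. The conditions are copied from the proposal, and the sketch assumes inodes_df_ok=0 means the df lookup failed, as the original condition suggests.

```shell
# select_depth INODES_DF_OK INODES_TOTAL
# Echoes the effective depth for auto-depth mode under the proposed tiers.
select_depth() {
  local inodes_df_ok="$1" inodes_total="$2"
  if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 500000000 ]; then
    echo 3
  elif [ "$inodes_total" -gt 85000000 ]; then
    echo 10
  else
    echo 50
  fi
}

# Boundary values plus the two inode counts from the Evidence section.
for n in 85000000 85000001 459355736 500000000 500000001 1035531339; do
  printf '%s -> depth %s\n' "$n" "$(select_depth 1 "$n")"
done
printf 'df failed -> depth %s\n' "$(select_depth 0 0)"
```

Both thresholds use a strict greater-than, so exactly 85M stays at depth 50 and exactly 500M stays at depth 10. The prtbld count (459,355,736) lands in the middle tier and the emsdev count (1,035,531,339) keeps the depth=3 cap.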

Alternative approach — trust the timeout:
Remove the inode guardrail entirely and rely on FSENRICH_TIMEOUT (the existing 40-minute timeout sentinel) as the primary safety mechanism. The timeout already kills find cleanly and the downstream code handles it gracefully. The inode guardrail was added as a preemptive guard but real-world data suggests it fires far too early.
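If the timeout becomes the primary safety mechanism, the control flow might look like the sketch below. It does not show how FSENRICH_TIMEOUT is implemented today. run_with_deadline is a hypothetical helper built on the coreutils timeout command, which exits with status 124 when it has to stop the command it wraps.

```shell
# run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND under a wall-clock cap and returns its exit status.
# Status 124 means the cap was hit and the command was stopped.
run_with_deadline() {
  local deadline_secs="$1"
  shift
  timeout "$deadline_secs" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "scan exceeded ${deadline_secs}s and was stopped" >&2
  fi
  return "$rc"
}

# Intended use (not run here): a 40-minute cap on a full-depth scan.
#   run_with_deadline 2400 find /bb -maxdepth 50 -type f -printf '%s %p\n'

# Small demonstration, with sleep standing in for a slow scan.
slow_rc=0
run_with_deadline 1 sleep 3 || slow_rc=$?
fast_rc=0
run_with_deadline 5 true || fast_rc=$?
echo "slow scan rc=$slow_rc, fast scan rc=$fast_rc"
```

A caller can branch on status 124 to emit a partial-result warning instead of choosing a shallow depth up front.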

Impact Assessment

Who is affected: Any host with a /bb or similar filesystem exceeding 85M inodes in auto-depth mode. Jenkins build hosts, comdb2 hosts, and any machine with millions of small files will hit this.
Symptom: Enrichment note shows small directories as top offenders with inflated percentages, missing the actual large directories. The note includes the warning: "ⓘ Analysis depth has been limited to 3 directories deep."
Workaround: Users can post @fsenrich -t host -f /fs -d 50 to force depth=50 explicitly, bypassing the inode cap. This is not documented and requires knowledge of the -d flag.
Detection: Look for effective_depth=3, requested_depth=0 combined with large inodes_total (>85M) in parse_tree_data_successful log entries in Humio (#logConfigName=fsenrichbpaas).

Files to Change

salt/orch/triatm/fsenrich.sls — get_tree state, inode guardrail block (lines ~200–207)
triage/fsenrich/business_logic.py — SALT_AUTO_DEPTH_MAX constant and the depth-limit warning message (update threshold documentation)
Tests — add test cases for 85M–500M inode range verifying depth=10 is selected

Supporting Data

PRQS tickets:

PRQS 346784458 — prtbld-ob-955 /bb depth=0 (auto→3), 459M inodes
PRQS 346784647 — prtbld-ob-955 /bb depth=50 (explicit), 459M inodes, completed in ~3.5 min

DRQS tickets:

DRQS 184303504 — prtbld-ob-955 /bb alarm ticket (2026-05-19 event)
DRQS 183110294 — beta test ticket, notes #278 (metric auto), #283 (bot auto), #286 (bot -d 50)

Humio query to reproduce:

#logConfigName=fsenrichbpaas | bpaasStage = "beta-s0a" | /183110294/ | /parse_tree_data_successful/ | tail(10)

Look for inodes_total and effective_depth fields in the response.

raylobosco/fsenrich-inode-guardrail-issue-20260519.md
