Date: 2026-05-19
Discovered via: Beta vs prod enrichment comparison during ENGTRIASVC-3067 validation
Affects: All prod disk space enrichment on high-inode filesystems
Salt file: automation-tenancies/triatm — salt/orch/triatm/fsenrich.sls
fsenrich's Salt orchestration caps directory scan depth at 3 when a filesystem has >85M inodes (auto-depth mode). This threshold is too aggressive — real-world data shows the cap fires on filesystems that would complete a full depth-50 scan in under 5 minutes, resulting in enrichment notes that miss the largest directories entirely and show inflated, incorrect percentages.
In fsenrich.sls, the get_tree state applies this logic when depth=0 (auto):
# 85M inodes ≈ 40 min scan time based on regression analysis.
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 85000000 ]; then
effective_depth=3
else
effective_depth=50
fi
When depth=3, the scan traverses only 3 directory levels deep. On most /bb filesystems this means shallow subtrees such as /bb/data and /bb/bin dominate the results. Top-level directories such as /bb/jenkins_work or /bb/risk (often the actual offenders) are found, but their deep contents are never reached, so they drop out of the top-N results when smaller subtrees dominate.
This is subtle. find /bb -maxdepth 3 does discover /bb/jenkins_work as a directory entry — it exists at depth=1, well within the limit. The problem is how directory sizes are computed.
The Salt AWK script calculates directory sizes by accumulating file sizes from the find output stream. A directory's reported size is the sum of all file sizes that find lists inside it. With -maxdepth 3, find only outputs files at depth ≤ 3 from the root — files deeper than that are silently omitted.
For jenkins_work, the actual build artifacts live at depth 8–15+ within the filesystem (e.g. /bb/jenkins_work/pas-ci/jaas/chartscore/workspace/PR-2445/cmake-build/.../listmanc.t.tsk). None of these appear in a depth=3 scan. The AWK accumulator therefore assigns /bb/jenkins_work a size of nearly zero — only counting the handful of shallow files directly under it.
Meanwhile /bb/data has logs, CSVs, and database files at depth 2–3, so its measured size looks proportionally large even though it holds far less space overall.
The cascading effect:
- total_size (the denominator for percentages) is computed as the sum of all file sizes find reported, a tiny fraction of the real 700+ GB used
- Percentages are wildly inflated: a 5 GB directory measured against a ~22 GB scanned subtree appears as "24%" instead of the correct ~0.7%
- /bb/jenkins_work never enters the top-N heap because its accumulated size is near zero, even though it holds 300–424 GB
In short: depth=3 makes large directories measurably empty, not invisible. They appear in the directory listing but with the wrong (near-zero) size.
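The effect can be reproduced on a toy tree. The sketch below is not the Salt AWK script; it is a minimal stand-in that assumes only what is described above (file sizes from a depth-limited find, summed per top-level directory). The function name sizes_by_top_dir and the toy file layout are hypothetical.

```shell
# sizes_by_top_dir ROOT MAXDEPTH
# Prints "<bytes> <top-level dir>" for each top-level directory under ROOT,
# counting only the files that a depth-limited find actually lists.
sizes_by_top_dir() {
    find "$1" -maxdepth "$2" -type f -exec wc -c {} \; |
    awk -v root="$1" '
        {
            rel = substr($2, length(root) + 2)  # path relative to ROOT
            split(rel, parts, "/")
            sum[parts[1]] += $1                 # credit the top-level directory
        }
        END { for (d in sum) printf "%d %s\n", sum[d], d }
    ' | sort -k2
}

# Toy tree: one large file buried deep, small files near the top.
demo_root=$(mktemp -d)
trap 'rm -rf "$demo_root"' EXIT
mkdir -p "$demo_root/jenkins_work/a/b/c/d/e" "$demo_root/data/logs"
head -c 100000 /dev/zero > "$demo_root/jenkins_work/a/b/c/d/e/big.tsk"  # depth 7
printf '0123456789'      > "$demo_root/jenkins_work/README"             # depth 2
head -c 1000 /dev/zero   > "$demo_root/data/logs/app.log"               # depth 3

echo "maxdepth 3:";  sizes_by_top_dir "$demo_root" 3
echo "maxdepth 50:"; sizes_by_top_dir "$demo_root" 50
```

At maxdepth 3 the toy jenkins_work directory is credited with only its 10-byte shallow file, so data looks like the largest directory. At maxdepth 50 the buried 100000-byte file is counted and the ranking flips.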
DRQS 184303504 — 90% bytes used — 2026-05-19 18:07 UTC
| Run | depth | inodes_total | Largest Dir Found | Scan Duration |
|---|---|---|---|---|
| Prod (auto) | 0→3 | ~459M (estimated) | /bb/data 20 GB | ~60s |
| Beta metric (auto) | 0→3 | 459,355,736 | /bb/data 20 GB | ~60s |
| Beta bot -d 50 | 50 | 459,296,608 | /bb/jenkins_work 309 GB | ~3.5 min |
With depth=3: Enrichment reports /bb/data at 20 GB as the top directory, with inflated percentages (e.g. "24%" for a 5 GB directory calculated against the 22 GB scanned subtree, not the 700 GB total used). /bb/jenkins_work, the actual space consumer at 300–424 GB, is completely absent from the top-N results.
With depth=50: Enrichment correctly identifies /bb/jenkins_work as the dominant directory. Percentages are accurate (calculated against total used). Scan completed in ~3.5 minutes — well within the 40-minute timeout.
Inode count: 459M — 5.4× the 85M threshold. Scan time at depth=50: ~3.5 min.
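The percentage inflation is plain arithmetic on the denominator. A minimal sketch follows, using the rounded figures quoted above (5 GB directory, ~22 GB scanned, ~700 GB used). The helper name pct is hypothetical, and because the inputs are rounded, the result only approximates the "24%" shown in the enrichment note.

```shell
# pct PART TOTAL -> percentage of TOTAL that PART represents, one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

echo "5 GB against 22 GB scanned: $(pct 5 22)%"
echo "5 GB against 700 GB used:   $(pct 5 700)%"
```

The same directory goes from roughly a quarter of the reported total to under one percent once the denominator reflects the space actually used.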
DRQS 184285499 — filesystem had 18 GB used (512 GB total)
| Run | depth | inodes_total | Result |
|---|---|---|---|
| Beta metric (auto) | 0→3 | 1,035,531,339 | *No directories to analyze* — empty |
| Prod bot (auto) | 0→3 | — | Full directory tree with 5 dirs, 5 files found |
Beta and prod diverged here: prod found the Jenkins workspace directories (10–17 GB each) and large .tsk build artifacts while beta returned completely empty. The 1B+ inode count was caused by a Jenkins workspace accumulating thousands of tiny build artifacts. At depth=3, the scan found nothing worth surfacing.
Note: The prod result is from a different mechanism (triagercsvc tree command). The emsdev case is more complex — prod uses tree which is not inode-gated — but the empty beta result on a filesystem with 18 GB of real content illustrates the issue.
The 85M threshold comment claims "85M inodes ≈ 40 min scan time based on regression analysis." However:
- prtbld-ob-955 has 459M inodes and depth=50 completed in ~3.5 minutes — 11× faster than the assumed 40 minutes
- The threshold appears to be based on stale benchmarks or a specific worst-case host that no longer represents the general case
- The fallback to depth=3 is too shallow — on most filesystems, the top-level directories that hold the most space (jenkins_work, risk, etc.) exist at depth=1 but their contents are not traversed, so they don't appear in the top-N results
The binary 85M/depth-3 decision produces enrichment notes that:
- Miss the largest directories entirely
- Show wildly incorrect percentages (calculated against a tiny scanned subtree)
- Give triage engineers a completely misleading picture of what's consuming disk space
Replace the binary threshold with a tiered approach:
# Current (too aggressive):
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 85000000 ]; then
effective_depth=3
else
effective_depth=50
fi
# Proposed (tiered):
if [ "$inodes_df_ok" -eq 0 ] || [ "$inodes_total" -gt 500000000 ]; then
effective_depth=3
elif [ "$inodes_total" -gt 85000000 ]; then
effective_depth=10
else
effective_depth=50
fi
Tiers:
| inodes_total | effective_depth | Rationale |
|---|---|---|
| ≤85M | 50 | Current behavior, unchanged |
| 85M–500M | 10 | Covers prtbld class (459M). Depth=10 finds top-level offenders. Estimated scan time: <10 min |
| >500M | 3 | Extreme cases (emsdev class at 1B+). Conservative cap retained |
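The tier boundaries can be checked in isolation by wrapping the proposed logic in a function. This is a sketch rather than the code in fsenrich.sls: the function name select_depth is hypothetical, and inodes_df_ok is assumed to be 1 when df returned usable inode counts and 0 otherwise, which is what the existing -eq 0 check suggests.

```shell
# select_depth INODES_DF_OK INODES_TOTAL -> prints the effective scan depth.
# Mirrors the proposed tiers: >500M -> 3, 85M-500M -> 10, <=85M -> 50.
select_depth() {
    if [ "$1" -eq 0 ] || [ "$2" -gt 500000000 ]; then
        echo 3      # df failed, or extreme inode count: keep the conservative cap
    elif [ "$2" -gt 85000000 ]; then
        echo 10     # middle tier (assumed sufficient for the prtbld class)
    else
        echo 50     # current behavior, unchanged
    fi
}

# Inode counts taken from the two cases above, plus the lower boundary.
for n in 85000000 459355736 1035531339; do
    echo "inodes_total=$n -> effective_depth=$(select_depth 1 "$n")"
done
```

The same function can back the new test cases for the 85M–500M range, with assertions on the boundary values 85000000, 85000001, 500000000, and 500000001.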
Why depth=10 for the middle tier:
- Top-level directories (/bb/jenkins_work, /bb/risk, etc.) exist at depth=1. Depth=10 gives 9 levels of traversal within them, enough to surface the actual large subdirectories.
- prtbld at depth=50 completed in 3.5 min. Depth=10 should be faster still.
- Percentages will be far closer to reality, since most of the filesystem is traversed. Files deeper than 10 levels are still omitted from the totals, so some undercounting remains where build artifacts sit at depth 8–15+.
Alternative approach — trust the timeout:
Remove the inode guardrail entirely and rely on FSENRICH_TIMEOUT (the existing 40-minute timeout sentinel) as the primary safety mechanism. The timeout already kills find cleanly and the downstream code handles it gracefully. The inode guardrail was added as a preemptive guard but real-world data suggests it fires far too early.
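As an illustration of the timeout-only approach, the sketch below wraps an arbitrary scan command in the coreutils timeout utility. How the real FSENRICH_TIMEOUT sentinel is implemented is not shown in this document, so the wrapper name scan_with_timeout, the use of timeout, and the seconds unit are all assumptions.

```shell
# scan_with_timeout SECONDS CMD [ARGS...]
# Runs CMD under GNU coreutils `timeout`; returns CMD's exit status,
# or 124 if the limit was reached and the command was stopped.
scan_with_timeout() {
    _limit=$1
    shift
    timeout "$_limit" "$@"
    _rc=$?
    if [ "$_rc" -eq 124 ]; then
        echo "scan exceeded ${_limit}s and was stopped" >&2
    fi
    return "$_rc"
}

# Hypothetical usage, assuming FSENRICH_TIMEOUT holds seconds (40 min = 2400):
#   scan_with_timeout "${FSENRICH_TIMEOUT:-2400}" find /bb -maxdepth 50 -type f
```

With this shape, depth is always 50 and the only guard is elapsed time, so a filesystem like prtbld (459M inodes, ~3.5 min) completes normally while a truly pathological one is cut off at the limit.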
- Who is affected: Any host with a /bb or similar filesystem exceeding 85M inodes in auto-depth mode. Jenkins build hosts, comdb2 hosts, and any machine with millions of small files will hit this.
- Symptom: Enrichment note shows small directories as top offenders with inflated percentages, missing the actual large directories. The note includes the warning: "ⓘ Analysis depth has been limited to 3 directories deep."
- Workaround: Users can post @fsenrich -t host -f /fs -d 50 to force depth=50 explicitly, bypassing the inode cap. This is not documented and requires knowledge of the -d flag.
- Detection: Look for effective_depth=3, requested_depth=0 combined with large inodes_total (>85M) in parse_tree_data_successful log entries in Humio (#logConfigName=fsenrichbpaas).
- salt/orch/triatm/fsenrich.sls: get_tree state, inode guardrail block (lines ~200–207)
- triage/fsenrich/business_logic.py: SALT_AUTO_DEPTH_MAX constant and the depth-limit warning message (update threshold documentation)
- Tests: add test cases for 85M–500M inode range verifying depth=10 is selected
PRQS tickets:
- PRQS 346784458 — prtbld-ob-955 /bb depth=0 (auto→3), 459M inodes
- PRQS 346784647 — prtbld-ob-955 /bb depth=50 (explicit), 459M inodes, completed in ~3.5 min
DRQS tickets:
- DRQS 184303504 — prtbld-ob-955 /bb alarm ticket (today's event)
- DRQS 183110294 — beta test ticket, notes #278 (metric auto), #283 (bot auto), #286 (bot -d 50)
Humio query to reproduce:
#logConfigName=fsenrichbpaas | bpaasStage = "beta-s0a" | /183110294/ | /parse_tree_data_successful/ | tail(10)
Look for inodes_total and effective_depth fields in the response.