@0xdeafbeef
Created May 19, 2025 20:03
| Metric | With Preload | Without Preload | Delta (Abs.) | Delta (%) | Interpretation |
|---|---|---|---|---|---|
| Time Elapsed (s) | ~337.14 | ~296.96 | ~+40.18 s | +13.53% | Overall slowdown with preload. |
| task-clock (CPU s) | ~300.28 | ~259.31 | ~+40.97 s | +15.80% | More CPU time used, indicating CPU-bound work or stalls. |
| cycles (B) | ~1,630 | ~1,405 | ~+225 B | +16.01% | More CPU cycles spent, largely due to stalls. |
| instructions (B) | ~1,420 | ~1,396 | ~+24 B | +1.72% | Slightly more instructions; amount of actual work is very similar. |
| IPC (insn per cycle) | 0.87 | 0.99 | -0.12 | -12.12% | Lower CPU efficiency with preload; CPU stalled more often. |
| cache-references (B) | ~19.57 | ~17.36 | ~+2.21 B | +12.73% | More accesses to the cache hierarchy. |
| cache-misses (B) | ~5.21 | ~4.20 | ~+1.01 B | +24.05% | CRITICAL: Large increase in (likely LLC) misses. Strong evidence of cache pollution by memset. |
| L1-dcache-load-misses (B) | ~7.65 | ~7.26 | ~+0.39 B | +5.37% | Modest increase in L1 data cache misses. |
| L1-icache-load-misses (M) | ~78.7 | ~95.5 | ~-16.8 M | -17.59% | Fewer L1 instruction cache misses with preload; not a dominant factor. |
| page-faults (M) | ~7.34 | ~0.70 | ~+6.64 M | +945.8% | Preload faults all pages upfront; no-preload faults on demand. High % due to the low no-preload baseline. |
| minor-faults (M) | ~7.34 | ~0.70 | ~+6.64 M | +945.8% | Same as page-faults; reflects memset activity on ~30 GB. |
| dTLB-load-misses (B) | ~1.23 | ~0.58 | ~+0.65 B | +112.07% | CRITICAL: More than doubled dTLB misses. memset on a huge region thrashes the TLB. |
| iTLB-load-misses (M) | ~26.0 | ~25.0 | ~+1.0 M | +4.00% | Minor difference in instruction TLB misses. |
| branch-misses (B) | ~20.58 | ~19.97 | ~+0.61 B | +3.05% | Slightly more branch misses with preload. |

Units: B = billions, M = millions. Cycles and instructions are given in billions (~1.63 trillion cycles with preload), since ~300 CPU-seconds at a few GHz cannot yield thousands of trillions of cycles.
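
The Delta columns and the IPC row can be re-derived from the raw counters. The snippet below is a minimal sketch that recomputes them from the rounded values in the table and also derives a cache miss ratio (cache-misses / cache-references), which the table does not list. The dictionary keys and helper names (`COUNTERS`, `delta`, `ratio`) are illustrative and not part of any perf tooling.

```python
# Rounded values copied from the table: (with preload, without preload).
# Both columns of a row share the same unit, so ratios are unit-free.
COUNTERS = {
    "elapsed_s": (337.14, 296.96),
    "cycles": (1630.0, 1405.0),
    "instructions": (1420.0, 1396.0),
    "cache_references": (19.57, 17.36),
    "cache_misses": (5.21, 4.20),
}


def delta(with_preload, without_preload):
    """Return (absolute delta, percent delta relative to the no-preload run)."""
    diff = with_preload - without_preload
    return diff, 100.0 * diff / without_preload


def ratio(numerator, denominator):
    """Per-run ratio of two counters, returned as (with preload, without preload)."""
    return tuple(n / d for n, d in zip(COUNTERS[numerator], COUNTERS[denominator]))


for name, (w, wo) in COUNTERS.items():
    diff, pct = delta(w, wo)
    print(f"{name:18s} {diff:+9.2f} {pct:+7.2f}%")

ipc = ratio("instructions", "cycles")
miss_ratio = ratio("cache_misses", "cache_references")
print(f"IPC:              {ipc[0]:.2f} vs {ipc[1]:.2f}")
print(f"cache miss ratio: {miss_ratio[0]:.1%} vs {miss_ratio[1]:.1%}")
```

Because the inputs are rounded, the recomputed percentages can differ from the table in the last digit. The miss ratio rises from roughly 24% to roughly 27% with preload, so the preload run both references the cache more often and misses more often per reference.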

Summary of Findings (with Percentages):

  • Preloading with memset increased overall execution time by ~13.5% (~40 s).
  • The slowdown is accompanied by:
    • A 16.0% increase in CPU cycles.
    • A 12.1% decrease in instructions per cycle (IPC), indicating reduced CPU efficiency.
    • A 24.0% increase in cache-misses (likely LLC misses), a major bottleneck.
    • A 112.1% increase in dTLB-load-misses (more than double), indicating heavy pressure on the data TLB.
  • The +945.8% increase in page-faults with preload is expected: memset faults in all ~7.3 million pages upfront, whereas the no-preload run faults pages on demand. The percentage is large mainly because the no-preload baseline is small (~0.70 M). The cost likely comes more from the subsequent cache/TLB pollution than from the fault handling itself.
  • Overall, memory-system behavior (caches and TLB) is markedly worse in the preload run for this workload. Instructions rose only ~1.7% while cycles rose ~16%, so the extra time is spent stalled rather than doing additional work.
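
The link between the page-fault count and the ~30 GB region can be checked with simple arithmetic. This is a rough sketch that assumes 4 KiB base pages and one minor fault per touched page (no transparent huge pages); neither assumption is stated in the measurements, and `PAGE_SIZE` is a value chosen for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumption: 4 KiB base pages, no huge pages

# Minor-fault counts from the table (rounded).
faults_with_preload = 7.34e6
faults_without_preload = 0.70e6

# Extra faults attributable to the upfront memset pass.
extra_faults = faults_with_preload - faults_without_preload

# Memory touched in the preload run, if every fault maps one base page.
touched_bytes = faults_with_preload * PAGE_SIZE
touched_gb = touched_bytes / 1e9     # decimal gigabytes
touched_gib = touched_bytes / 2**30  # binary gibibytes

print(f"extra faults with preload: {extra_faults / 1e6:.2f} M")
print(f"memory faulted in: {touched_gb:.1f} GB ({touched_gib:.1f} GiB)")
```

Under these assumptions the preload run faults in about 30 GB (about 28 GiB), which matches the region size quoted in the table, while the no-preload run touches roughly a tenth of that.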