preload.md

| Metric | With Preload | Without Preload | Delta (Abs.) | Delta (%) | Interpretation |
|---|---|---|---|---|---|
| Time Elapsed (s) | ~337.14 | ~296.96 | ~+40.18 s | +13.53% | Overall slowdown with preload. |
| `task-clock` (CPU s) | ~300.28 | ~259.31 | ~+40.97 s | +15.80% | More CPU time used, indicating CPU-bound work or stalls. |
| `cycles` (B) | ~1,630 | ~1,405 | ~+225 B | +16.01% | More CPU cycles spent, largely due to stalls. |
| `instructions` (B) | ~1,420 | ~1,396 | ~+24 B | +1.72% | Slightly more instructions; the amount of actual work is very similar. |
| IPC (instructions per cycle) | 0.87 | 0.99 | -0.12 | -12.12% | Lower CPU efficiency with preload; the CPU stalled more often. |
| `cache-references` (B) | ~19.57 | ~17.36 | ~+2.21 B | +12.73% | More accesses to the cache hierarchy. |
| `cache-misses` (B) | ~5.21 | ~4.20 | ~+1.01 B | +24.05% | CRITICAL: Large increase in (likely LLC) misses. Consistent with cache pollution by `memset`. |
| `L1-dcache-load-misses` (B) | ~7.65 | ~7.26 | ~+0.39 B | +5.37% | Modest increase in L1 data cache misses. |
| `L1-icache-load-misses` (M) | ~78.7 | ~95.5 | ~-16.8 M | -17.59% | Fewer L1 instruction cache misses with preload; not a dominant factor. |
| `page-faults` (M) | ~7.34 | ~0.70 | ~+6.64 M | +945.8% | Preload faults all pages upfront; no-preload faults on demand. The percentage is high because the no-preload base is low. |
| `minor-faults` (M) | ~7.34 | ~0.70 | ~+6.64 M | +945.8% | Same as `page-faults`; reflects `memset` activity on ~30 GB. |
| `dTLB-load-misses` (B) | ~1.23 | ~0.58 | ~+0.65 B | +112.07% | CRITICAL: dTLB misses more than doubled. `memset` on a huge region thrashes the TLB. |
| `iTLB-load-misses` (M) | ~26.0 | ~25.0 | ~+1.0 M | +4.00% | Minor difference in instruction TLB misses. |
| `branch-misses` (B) | ~20.58 | ~19.97 | ~+0.61 B | +3.05% | Slightly more branch misses with preload. |
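The Delta (%) and IPC columns can be recomputed from the rounded values in the table. The sketch below is illustrative only: the dictionary keys and the helper names `delta` and `ipc` are not part of the original analysis, and the inputs are the table's rounded figures, so results agree with the table only to within rounding.

```python
# Rounded (with_preload, without_preload) values copied from the table.
# Key names are illustrative, not perf event names.
counters = {
    "time_elapsed_s": (337.14, 296.96),
    "task_clock_s": (300.28, 259.31),
    "cycles": (1630.0, 1405.0),
    "instructions": (1420.0, 1396.0),
    "cache_misses": (5.21, 4.20),
    "dtlb_load_misses": (1.23, 0.58),
}


def delta(with_preload, without_preload):
    """Return (absolute delta, percent delta relative to the no-preload run)."""
    diff = with_preload - without_preload
    return diff, 100.0 * diff / without_preload


def ipc(instructions, cycles):
    """Instructions retired per cycle; the unit cancels out."""
    return instructions / cycles


for name, (w, wo) in counters.items():
    diff, pct = delta(w, wo)
    print(f"{name:18s} {diff:+10.2f} {pct:+8.2f}%")

ipc_with = ipc(counters["instructions"][0], counters["cycles"][0])
ipc_without = ipc(counters["instructions"][1], counters["cycles"][1])
print(f"IPC with preload: {ipc_with:.2f}, without: {ipc_without:.2f}")
```

The page-fault rows are left out because the rounded 0.70 M base is too coarse to reproduce the table's percentage, which was presumably computed from the unrounded counts.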

Summary of Findings (with Percentages):

Preloading with `memset` increased overall execution time by ~13.5% (~40 s).
This slowdown coincides with:
- A 16.0% increase in CPU cycles.
- A 12.1% drop in instructions per cycle (IPC), from 0.99 to 0.87, while the instruction count grew by only 1.7%.
- A 24.0% increase in `cache-misses` (likely LLC misses), a major contributor.
- A 112.1% increase in `dTLB-load-misses`, indicating heavy pressure on the data TLB.
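The cycle and instruction figures in the list above can be combined into a rough estimate of how much of the extra cycle count is explained by the extra instructions alone. This is a back-of-the-envelope sketch, not part of the original measurement: it assumes the additional instructions would have retired at the no-preload cycles-per-instruction rate, and the function name `extra_cycle_split` is made up for illustration. The remainder is an upper-level proxy for added stall time, not a direct stall measurement.

```python
def extra_cycle_split(cycles_with, cycles_without, insn_with, insn_without):
    """Split the extra cycles into two parts, in the same unit as the inputs:
    (cycles explained by extra instructions at baseline CPI, remainder)."""
    baseline_cpi = cycles_without / insn_without
    extra_cycles = cycles_with - cycles_without
    from_extra_insns = (insn_with - insn_without) * baseline_cpi
    return from_extra_insns, extra_cycles - from_extra_insns


# Rounded cycle and instruction counts from the table (same unit for all four).
from_insns, remainder = extra_cycle_split(1630.0, 1405.0, 1420.0, 1396.0)
share = remainder / (1630.0 - 1405.0)

print(f"explained by extra instructions: {from_insns:.1f}")
print(f"remaining extra cycles:          {remainder:.1f} ({share:.0%} of the increase)")
```

Under that assumption, only about a tenth of the additional cycles come from additional work; the rest reflects the same instructions taking longer to complete.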
The +945.8% increase in `page-faults` for the preload case is expected: `memset` touches every page of the ~30 GB region upfront, which accounts for roughly 7.3 million faults. The percentage is large mainly because the no-preload base is small (~0.70 M faults). The cost of these faults lies less in the faulting time itself than in the cache and TLB pollution that follows.
Taken together, the counters show that memory system behavior (caches and TLB) is markedly worse in the preload scenario for this workload, which leads to a significant performance loss.
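The link between the fault count and the ~30 GB region size can be sanity-checked with simple arithmetic. The sketch below assumes 4 KiB base pages and no huge pages, which the measurements do not state explicitly; `PAGE_SIZE_BYTES` and `faulted_bytes` are illustrative names.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB base pages, no huge pages


def faulted_bytes(fault_count, page_size=PAGE_SIZE_BYTES):
    """Memory touched if every fault maps exactly one page."""
    return fault_count * page_size


# Rounded fault counts from the table: ~7.34 M with preload, ~0.70 M without.
with_preload = faulted_bytes(7_340_000)
extra = faulted_bytes(7_340_000 - 700_000)

print(f"with preload:   {with_preload / 1e9:.2f} GB ({with_preload / 2**30:.2f} GiB)")
print(f"extra vs. none: {extra / 1e9:.2f} GB")
```

With these inputs the preload run faults in about 30 GB (28 GiB), which matches the region size quoted for `memset`.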
