Nathan Moinvaziri nmoinvaz

zlib-ng vs zlib-rs Benchmark Comparison (ARM64, Apple M3)

Machine Specs

CPU: Apple M3 (8 cores)
RAM: 24 GB
OS: Darwin 24.6.0 arm64 (macOS Sequoia)
Compiler: Apple clang 17.0.0 (clang-1700.6.3.2)
Rust: rustc 1.93.1 (01f6ddf75 2026-02-11)

zlib-ng: CRC32 ARM Interleaved Copy Benchmark Results

Comparison

Baseline: develop @ 54352daf (Make extra length/distance bits computation branchless)
Contender: improvements/crc32-arm-copy @ b4043c6f (Implement crc32 interleaved copy for ARM PMULL+EOR3)
Repetitions: 5 per benchmark, aggregates only

Machine

Project Basics

Use the CMake build system.
Always check the commits for HEAD and BASE (or other branch names), as they can change often.
To build for an architecture other than the host, use llvm-clang unless gcc is specified.

Key Directories

arch/ - Architecture specific optimizations
test/ - Unit tests written using Google Test Framework (gtest_zlib project)

Benchmark: `improvements/crc32-arm-copy` vs `develop`

Date: 2026-02-23
Platform: Apple Silicon (ARM64), 8 cores, L1D 64 KiB, L2 4096 KiB
Build: CMake Release, static libs
Repetitions: 5 (median CPU time reported)

`crc32/armv8_pmull_eor3` (CRC32 only)

Compress Benchmark: HEAD (improvements/tally-v2) vs develop

Environment

Platform: macOS Darwin 24.6.0, Apple Silicon (ARM64)
CPU: 8 cores, L1D 64 KiB, L1I 128 KiB, L2 4096 KiB
Build: CMake Release, static libs

Commits

HEAD (improvements/tally-v2): c51ce99e — Combine extra_lbits/base_length and extra_dbits/base_dist lookup tables
develop: 1b880ba9 — Make extra length/distance bits computation branchless using bit masking

Assembly Analysis: Keep bi_buf/bi_valid in Registers Across compress_block

Change

Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
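
The hoisting pattern described above can be illustrated with a minimal sketch. This is not the zlib-ng code: `sketch_state`, `sketch_emit_bits`, and `sketch_compress_block` are hypothetical names, the state struct is a simplified stand-in for deflate_state, and the flush policy (write out whole bytes after every emit) is an assumption chosen to keep the example short. It only shows the shape of the change: the accumulator is loaded from the state once, threaded through the emit helper by pointer, and stored back once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the deflate_state fields involved. */
typedef struct {
    uint8_t  out[64];   /* pending output bytes */
    size_t   pending;   /* number of bytes written to out[] */
    uint64_t bi_buf;    /* bit accumulator */
    uint32_t bi_valid;  /* number of valid bits in bi_buf */
} sketch_state;

/* Append the low `len` bits (1..32) of `val` to the caller-held accumulator.
 * The accumulator lives in the caller's locals, so this helper never reads
 * or writes s->bi_buf / s->bi_valid. */
static void sketch_emit_bits(sketch_state *s, uint64_t *bi_buf, uint32_t *bi_valid,
                             uint32_t val, uint32_t len) {
    assert(len > 0 && len <= 32);
    uint64_t buf = *bi_buf;
    uint32_t valid = *bi_valid;

    buf |= ((uint64_t)val & ((1ULL << len) - 1)) << valid;
    valid += len;

    /* Assumed flush policy: write out every complete byte immediately. */
    while (valid >= 8) {
        assert(s->pending < sizeof(s->out));
        s->out[s->pending++] = (uint8_t)buf;
        buf >>= 8;
        valid -= 8;
    }

    *bi_buf = buf;
    *bi_valid = valid;
}

/* Load the accumulator once, run the whole loop on locals, store it once. */
static void sketch_compress_block(sketch_state *s, const uint32_t *codes,
                                  const uint32_t *lens, size_t n) {
    uint64_t bi_buf = s->bi_buf;
    uint32_t bi_valid = s->bi_valid;

    for (size_t i = 0; i < n; i++)
        sketch_emit_bits(s, &bi_buf, &bi_valid, codes[i], lens[i]);

    s->bi_buf = bi_buf;
    s->bi_valid = bi_valid;
}
```

Whether the locals actually stay in registers depends on the emit helper being inlined; if it is not, the compiler has to spill them to the stack to take their address, which is why checking the generated assembly is the useful measurement here.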

Results

bi_buf/bi_valid Memory Operations (offsets 168/176 from deflate_state*)

Conditional Preload Optimization Analysis

Comparison of develop (08fa4859) vs HEAD (conditional preload with MIN_HAVE=15).

The patch decodes the next iteration's Huffman symbol before performing the chunk copy, allowing the table lookup latency to overlap with copy operations. A can_preload flag skips the preload when the bit accumulator is low (the UNLIKELY 2+ literal path), keeping INFLATE_FAST_MIN_HAVE at 15 instead of 22.
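
The control flow of a conditional preload can be shown on a toy decoder. This is a sketch under heavy assumptions, not the inflate_fast code: the stream uses fixed 4-bit codes instead of Huffman codes, the threshold is 4 bits rather than anything derived from INFLATE_FAST_MIN_HAVE, and `sketch_entry` and `sketch_decode` are hypothetical names. Each code selects a table entry that says how many copies of a fill byte to write; the lookup for the next code is issued before the copy whenever the accumulator already holds enough bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: write `len` copies of `fill`. */
typedef struct {
    uint8_t len;
    uint8_t fill;
} sketch_entry;

/* Decode `n_codes` 4-bit codes (low nibble first) from `in` into `out`.
 * Returns the number of bytes written. The caller must size `in` and `out`
 * appropriately; this sketch does no bounds checking. */
static size_t sketch_decode(const sketch_entry table[16], const uint8_t *in,
                            size_t n_codes, uint8_t *out) {
    assert(table != NULL && in != NULL && out != NULL);

    uint32_t hold = 0;        /* bit accumulator */
    unsigned bits = 0;        /* valid bits in hold */
    size_t in_pos = 0, out_pos = 0;
    sketch_entry here = {0, 0};
    int can_preload = 0;      /* set when `here` was looked up early */

    for (size_t i = 0; i < n_codes; i++) {
        if (!can_preload) {
            /* Slow path: refill if needed, then do the lookup now. */
            if (bits < 4) {
                hold |= (uint32_t)in[in_pos++] << bits;
                bits += 8;
            }
            here = table[hold & 15];
        }

        /* Consume the current code. */
        sketch_entry cur = here;
        hold >>= 4;
        bits -= 4;

        /* Preload the next entry before the copy, but only when the
         * accumulator already holds a full code and another code follows. */
        can_preload = (bits >= 4) && (i + 1 < n_codes);
        if (can_preload)
            here = table[hold & 15];

        /* The copy; the preloaded lookup can overlap with it. */
        memset(out + out_pos, cur.fill, cur.len);
        out_pos += cur.len;
    }
    return out_pos;
}
```

Skipping the preload when the accumulator is low means the loop never needs extra input just to look ahead, which is the same reasoning the text gives for keeping the minimum-input requirement unchanged.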

Benchmark Results

Functable Dispatch Matrix — x86 `-march` Variants

Extracted by inspecting undefined symbols in functable.c.o for each build — these are the function pointers the functable actually assigns at runtime. Builds use clang -target x86_64-apple-macos with runtime CPU detection enabled (the default).

`-march` native features

`-march`	SSE2	SSSE3	SSE4.1	SSE4.2	PCLMUL	AVX2	AVX-512	AVX512VNNI	VPCLMUL
x86-64	-	-	-	-	-	-	-	-	-
nehalem	native	native	native	native	-	-	-	-	-

	import argparse
	import imaplib
	import email.utils
	import sys
	from collections import Counter

	from rich.console import Console
	from rich.table import Table
	from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, MofNCompleteColumn

	/* ===========================================================================
	* Symbol buffer write/read macros.
	*
	* The symbol buffer stores literal and distance/length pairs. The storage
	* format differs based on LIT_MEM (separate buffers) vs sym_buf (interleaved),
	* and on whether the platform supports fast unaligned 32-bit access
	* (OPTIMAL_CMP >= 32), which allows packing a 3-byte symbol into a single
	* 32-bit write/read.
	*
	* SYM_WRITE_LIT and SYM_WRITE_DIST write a symbol and advance sym_next.