- CPU: Apple M3 (8 cores)
- RAM: 24 GB
- OS: Darwin 24.6.0 arm64 (macOS Sequoia)
- Compiler: Apple clang 17.0.0 (clang-1700.6.3.2)
- Rust: rustc 1.93.1 (01f6ddf75 2026-02-11)

    import argparse
    import imaplib
    import email.utils
    import sys
    from collections import Counter
    from rich.console import Console
    from rich.table import Table
    from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, MofNCompleteColumn

- Baseline: `develop@54352daf` (Make extra length/distance bits computation branchless)
- Contender: `improvements/crc32-arm-copy@b4043c6f` (Implement crc32 interleaved copy for ARM PMULL+EOR3)
- Repetitions: 5 per benchmark, aggregates only
- Use CMake build system.
- Always check the commits for `HEAD` and `BASE` or other branch names, as they can change often.
- To build for architectures other than the current one, use `llvm-clang` unless `gcc` is specified.
- `arch/` - Architecture-specific optimizations
- `test/` - Unit tests written using Google Test Framework (gtest_zlib project)
- Platform: macOS Darwin 24.6.0, Apple Silicon (ARM64)
- CPU: 8 cores, L1D 64 KiB, L1I 128 KiB, L2 4096 KiB
- Build: CMake Release, static libs
- HEAD (`improvements/tally-v2`): `c51ce99e` (Combine extra_lbits/base_length and extra_dbits/base_dist lookup tables)
- develop: `1b880ba9` (Make extra length/distance bits computation branchless using bit masking)
Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
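
As a rough illustration of the hoisting described above, here is a minimal sketch. It is not the actual patch: `sketch_state`, `emit_bits_local`, and `compress_block_sketch` are hypothetical names, and the byte-at-a-time flush is a simplification chosen for readability. The point is only that the accumulator lives in locals for the whole loop and is written back to the state once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the deflate state; only the bit-buffer fields matter here. */
typedef struct {
    uint8_t  out[64];   /* pending output bytes */
    size_t   pending;   /* number of bytes written to out */
    uint64_t bi_buf;    /* bit accumulator */
    int      bi_valid;  /* number of valid bits in bi_buf */
} sketch_state;

/* Append the low `len` bits of `val` to the caller's local accumulator.
 * Whole bytes are flushed to s->out; the accumulator itself is never
 * loaded from or stored to *s inside this function. */
static void emit_bits_local(sketch_state *s, uint64_t *bi_buf, int *bi_valid,
                            uint32_t val, int len) {
    assert(len > 0 && len <= 16);
    *bi_buf |= (uint64_t)val << *bi_valid;
    *bi_valid += len;
    while (*bi_valid >= 8) {
        assert(s->pending < sizeof(s->out));
        s->out[s->pending++] = (uint8_t)(*bi_buf & 0xff);
        *bi_buf >>= 8;
        *bi_valid -= 8;
    }
}

/* Main loop: load the accumulator once, keep it in locals, store it once. */
static void compress_block_sketch(sketch_state *s, const uint16_t *codes,
                                  const uint8_t *lens, size_t n) {
    uint64_t bi_buf = s->bi_buf;    /* single load before the loop */
    int bi_valid = s->bi_valid;

    for (size_t i = 0; i < n; i++)
        emit_bits_local(s, &bi_buf, &bi_valid, codes[i], lens[i]);

    s->bi_buf = bi_buf;             /* single store after the loop */
    s->bi_valid = bi_valid;
}
```

When the emit helpers are inlined, passing the locals by pointer typically lets the compiler keep both values in registers across iterations, which is the effect the change is after.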
Comparison of develop (08fa4859) vs HEAD (conditional preload with MIN_HAVE=15).
The patch decodes the next iteration's Huffman symbol before performing the
chunk copy, allowing the table lookup latency to overlap with copy operations.
A can_preload flag skips the preload when the bit accumulator is low (the
UNLIKELY 2+ literal path), keeping INFLATE_FAST_MIN_HAVE at 15 instead of 22.
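
The control flow of a conditional preload can be sketched with a toy decoder. This is an illustration under loud assumptions, not the patch itself: it uses fixed-width 4-bit codes instead of Huffman codes, refills one byte at a time, and `toy_code`, `toy_inflate`, and the `preloads` counter are all hypothetical names. It shows only the ordering: look up the next entry before the copy when enough bits remain, and fall back to the normal refill-then-lookup path otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy table entry (hypothetical layout): either a literal byte or a copy. */
typedef struct {
    uint8_t is_copy;  /* 0 = literal, 1 = copy from earlier output */
    uint8_t value;    /* literal byte, or copy length */
    uint8_t dist;     /* copy distance (copies only) */
} toy_code;

/* Decode nsyms fixed-width 4-bit symbols (low nibble first) into out.
 * Returns the number of bytes written; *preloads counts the lookups that
 * were issued ahead of a copy. The caller must size out appropriately. */
static size_t toy_inflate(const toy_code table[16], const uint8_t *in,
                          size_t in_len, size_t nsyms, uint8_t *out,
                          size_t *preloads) {
    uint32_t hold = 0;      /* bit accumulator */
    int bits = 0;           /* valid bits in hold */
    size_t ip = 0, op = 0;
    toy_code here = table[0];
    int have_here = 0;      /* set when `here` was preloaded last iteration */

    *preloads = 0;
    for (size_t i = 0; i < nsyms; i++) {
        if (!have_here) {
            /* Normal path: refill if needed, then look up the symbol. */
            if (bits < 4 && ip < in_len) {
                hold |= (uint32_t)in[ip++] << bits;
                bits += 8;
            }
            here = table[hold & 15];
        }
        assert(bits >= 4);
        hold >>= 4;         /* consume the current symbol's bits */
        bits -= 4;
        have_here = 0;

        if (!here.is_copy) {
            out[op++] = here.value;
            continue;
        }

        /* Preload only when the accumulator already holds a full symbol. */
        int can_preload = bits >= 4;
        toy_code next = here;
        if (can_preload) {
            next = table[hold & 15];   /* lookup issued before the copy */
            (*preloads)++;
        }

        assert(here.dist >= 1 && here.dist <= op);
        for (uint8_t n = 0; n < here.value; n++, op++)
            out[op] = out[op - here.dist];

        if (can_preload) {
            here = next;
            have_here = 1;
        }
    }
    return op;
}
```

Because the preload is skipped instead of forcing a refill, the loop never needs extra guaranteed input beyond what the current symbol consumes, which mirrors why the minimum-input constant can stay at the lower value.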
Extracted by inspecting undefined symbols in functable.c.o for each build — these are the function pointers the functable actually assigns at runtime. Builds use clang -target x86_64-apple-macos with runtime CPU detection enabled (the default).
| -march | SSE2 | SSSE3 | SSE4.1 | SSE4.2 | PCLMUL | AVX2 | AVX-512 | AVX512VNNI | VPCLMUL |
|---|---|---|---|---|---|---|---|---|---|
| x86-64 | - | - | - | - | - | - | - | - | - |
| nehalem | native | native | native | native | - | - | - | - | - |

    /* ===========================================================================
     * Symbol buffer write/read macros.
     *
     * The symbol buffer stores literal and distance/length pairs. The storage
     * format differs based on LIT_MEM (separate buffers) vs sym_buf (interleaved),
     * and on whether the platform supports fast unaligned 32-bit access
     * (OPTIMAL_CMP >= 32), which allows packing a 3-byte symbol into a single
     * 32-bit write/read.
     *
     * SYM_WRITE_LIT and SYM_WRITE_DIST write a symbol and advance sym_next.