@nascheme
nascheme / gc_inc_work_to_do.diff
Created April 23, 2026 06:18
Proposed change to Python 3.14 incremental gc
diff --git a/Include/internal/pycore_interp_structs.h b/Include/internal/pycore_interp_structs.h
index 6b3d5711b92..aadc577306f 100644
--- a/Include/internal/pycore_interp_structs.h
+++ b/Include/internal/pycore_interp_structs.h
@@ -229,7 +229,8 @@ struct _gc_runtime_state {
     PyObject *callbacks;
     Py_ssize_t heap_size;
-    Py_ssize_t work_to_do;
+    /* Total number of young objects since the last complete collection */
nascheme / bm_gc_cycles.py
Created April 13, 2026 21:58
Python cyclic GC benchmark script
import argparse
import gc
import sys
import time


def get_memory_usage():
    """Memory usage of the current process in KB."""
    result = {'peak': 0, 'rss': 0}
    with open('/proc/self/status') as status:
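The preview above cuts off before the body of the `with` block. As a rough illustration only (not the gist's actual code), one way to fill a `{'peak', 'rss'}` dictionary from `/proc/self/status` is to read the `VmPeak` and `VmRSS` fields, which Linux reports in kB. The helper names `parse_status` and `read_memory_usage` are hypothetical.

```python
def parse_status(text):
    """Extract VmPeak and VmRSS (in KB) from /proc/<pid>/status text.

    Hypothetical helper for illustration; not taken from the gist.
    """
    result = {'peak': 0, 'rss': 0}
    fields = {'VmPeak': 'peak', 'VmRSS': 'rss'}
    for line in text.splitlines():
        name, _, value = line.partition(':')
        if name in fields:
            # Values look like "   12345 kB"; keep the integer part.
            result[fields[name]] = int(value.split()[0])
    return result


def read_memory_usage(path='/proc/self/status'):
    """Linux-only: read and parse the status file of the current process."""
    with open(path) as status:
        return parse_status(status.read())
```

Keeping the parser separate from the file read means it can be exercised on a sample string on platforms without `/proc`.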
Tested with https://github.com/nascheme/cpython/tree/gc-gen-3.14 (b4ea6d82324cf84f74731a88decd7577eccde2f9)
Benchmark script: https://gist.github.com/nascheme/6d09885d3696055b9665d3b88c7aacaa
inc_gc OFF (0) vs ON (1) comparison
cycle  extra   live  t0(s)  t1(s)    Δt%  rss0  rss1  Δrss%  trash0  trash1
---------------------------------------------------------------------------
   10      0    100   0.66   0.73  +10.0   16M   17M     +6      6k     28k
   10      0   1.0k   0.66   0.76  +14.2   17M   20M    +19     14k     82k
   10      0  10.0k   0.78   0.84   +7.8   20M   26M    +29     88k    224k
   10   1.0k    100   0.73   0.76   +4.9   16M   20M    +21      6k     28k
This is the 3.14 branch with https://github.com/python/cpython/pull/142001 applied
Benchmark script: https://gist.github.com/nascheme/6d09885d3696055b9665d3b88c7aacaa
inc_gc OFF (0) vs ON (1) comparison
cycle  extra   live  t0(s)  t1(s)    Δt%  rss0  rss1  Δrss%  trash0  trash1
---------------------------------------------------------------------------
   10      0    100   0.65   0.75  +16.3   16M   17M     +2      6k     11k
   10      0   1.0k   0.68   0.78  +14.3   17M   19M    +11     14k     54k
   10      0  10.0k   0.76   0.89  +17.8   20M   22M    +10     88k    138k
nascheme / gc_gen_inc_compare.txt
Last active April 11, 2026 20:31
Comparing generational vs incremental GC in 3.14, with fastbench
base: d76c56e958c9a603ded42d27b39ab51c1e3794e4 (head of 3.14 branch)
gen_gc: https://github.com/nascheme/cpython/tree/gc-gen-3.14 (b4ea6d82324cf84f74731a88decd7577eccde2f9)
inc_gc: as above but with -Xinc_gc on
Benchmark results produced by "fastbench" 2ab489f1f315582b8a537b537a7664898d90a0a4
Benchmark  base ms  gen_gc ms  ratio  inc_gc ms  ratio
------------------------------------------------------
2to3          95.2       96.6  1.015       97.7  1.026
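The ratio columns appear to be each variant's time divided by the base time, so values above 1.0 mean slower than base. This is an assumption read off the 2to3 row, not something the gist states; the published ratios may also have been computed from unrounded timings. A minimal check using the rounded values shown:

```python
def ratio(variant_ms, base_ms):
    """Relative run time of a variant; above 1.0 means slower than base."""
    return variant_ms / base_ms


# 2to3 row: base 95.2 ms, gen_gc 96.6 ms, inc_gc 97.7 ms
print(f"gen_gc: {ratio(96.6, 95.2):.3f}")  # gen_gc: 1.015
print(f"inc_gc: {ratio(97.7, 95.2):.3f}")  # inc_gc: 1.026
```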

vLLM Tensor-Parallel Benchmark Profiler Analysis

Source: tp_generate_2026-10-25_2.json.gz, samply profiler output from the TP (tensor-parallel) benchmark
Model: LlamaForCausalLM (Llama architecture)
vLLM version: v1 engine (inferred from vllm/v1/ paths)
Executor: MultiprocExecutor (TP=2, 2 Worker_TP processes)
Python: 3.14t (free-threaded build)
Time window: 39s–100s (steady-state LLM inference)
Total samples (in window): ~245,437

vLLM Benchmark Profiler Analysis — CUDA Graphs Enabled

Source: mp_gen_cuda.json.gz, samply profiler output from mp_generate.py with --cuda-graphs enabled (multi-process, single-GPU-per-worker)
Model: LlamaForCausalLM (Llama architecture, FP16 / c10::Half)
vLLM version: v1 engine (inferred from vllm/v1/ paths)
Time window: 56s–96s (steady-state LLM inference, excluding startup/shutdown)
Total samples (in window): ~96,675


nascheme / cuda_contention_bench2.py
Created February 20, 2026 19:40
CUDA threaded benchmark, dual GPUs
# CUDA contention benchmark v2: realistic LLM-like workload.
#
# Simulates a transformer decode step more faithfully than v1:
# - Multiple linear layers (like attention Q/K/V projections + MLP)
# - Small element-wise ops (RMSNorm, activations, residual adds)
# - Index/gather operations (like token embedding lookup, KV cache scatter)
# - Tensor creation/destruction churn
# - Periodic CPU readback (like reading sampled token IDs)
#
# This should stress the same code paths as a real LLM step — many CUDA
nascheme / threaded_generate.py
Last active February 12, 2026 22:56
Multi-threaded vllm benchmark
# Dual-engine multi-GPU threaded vLLM throughput benchmark.
#
# Architecture: two independent LLMEngine instances (one per GPU) fed from
# a shared tokenized-request queue, with a single tokenizer thread.
#
#   Tokenizer Thread (CPU)       Engine Thread 0 (cuda:0)    Engine Thread 1 (cuda:1)
#   input_processor.process()    add_request (from queue)    add_request (from queue)
#   tokenized_queue.put(ecr)     engine0.step()              engine1.step()
# (continuous streaming) (continuous streaming)
#
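The thread layout described in the header comment (one tokenizer thread feeding a shared queue, drained by one thread per engine) can be sketched with the standard library alone. This is an illustration of the pattern only, not the gist's code: the "tokenizer" just splits strings, the "engine" only counts tokens, and all function names are hypothetical stand-ins for the vLLM calls.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker, one per engine thread


def tokenizer_worker(prompts, out_queue, n_engines):
    """Stand-in for the tokenizer thread: 'tokenize' and enqueue prompts."""
    for prompt in prompts:
        out_queue.put(prompt.split())  # hypothetical tokenization
    for _ in range(n_engines):
        out_queue.put(SENTINEL)


def engine_worker(name, in_queue, results, lock):
    """Stand-in for an engine thread: pull requests and 'step' on them."""
    while True:
        tokens = in_queue.get()
        if tokens is SENTINEL:
            break
        with lock:
            results.append((name, len(tokens)))


def run_pipeline(prompts, n_engines=2):
    """Run one tokenizer thread and n_engines consumer threads to completion."""
    tokenized_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    engines = [
        threading.Thread(target=engine_worker,
                         args=(f"engine{i}", tokenized_queue, results, lock))
        for i in range(n_engines)
    ]
    tokenizer = threading.Thread(target=tokenizer_worker,
                                 args=(prompts, tokenized_queue, n_engines))
    for thread in engines:
        thread.start()
    tokenizer.start()
    tokenizer.join()
    for thread in engines:
        thread.join()
    return results
```

Calling `run_pipeline(["a b c", "d e", "f"])` returns one `(engine_name, token_count)` pair per prompt. Which engine handles which prompt depends on thread scheduling, so only the set of counts is deterministic.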
nascheme / pool_warning_filter_example.py
Last active January 8, 2026 21:57
Example of context_aware_warnings behaviour
# See: https://github.com/pytest-dev/pytest/issues/14077
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pytest


@pytest.fixture(scope="module")
def pool():
    yield ThreadPoolExecutor(max_workers=1)


def test_1(pool):