Source: tp_generate_2026-10-25_2.json.gz — samply profiler output from TP (tensor-parallel) benchmark
Model: LlamaForCausalLM (Llama architecture)
vLLM version: v1 engine (inferred from vllm/v1/ paths)
Executor: MultiprocExecutor (TP=2, 2 Worker_TP processes)
Python: 3.14t (free-threaded build)
Time window: 39s–100s (steady-state LLM inference)
Total samples (in window): ~245,437
Samples % Activity
──────── ──── ──────────────────────────────────────────────────
66,639 27.2% SHM spin-wait: sched_yield (OS-level yield)
52,789 21.5% GPU sync wait (cuEventSynchronize)
19,141 7.8% SHM spin-wait: MessageQueue.acquire_read (self)
10,611 4.3% Clock reads (_GI___clock_gettime) — GPU sync + SHM
7,375 3.0% SHM spin-wait: memory_fence
6,210 2.5% SHM spin-wait: SpinTimer.spin
6,002 2.4% SHM spin-wait: sched_yield (Python wrapper)
2,972 1.2% CPython: Py_TryIncrefCompareStackRef
2,218 0.9% SHM spin-wait: Event.is_set check
2,141 0.9% CPython: initialize_locals
1,377 0.6% Triton kernel loading (cuModuleLoadData)
1,290 0.5% CUDA module lookup (cuLibraryGetModule)
1,159 0.5% SHM spin-wait: os_sched_yield_impl
64,513 26.3% Everything else (below top-40 cutoff)
The profile is dominated by two kinds of spin-waiting:
- SHM IPC spin-wait (~44%) — EngineCore and Worker_TP processes waiting for shared-memory message queue responses
- GPU sync wait (~25%) — WorkerAsyncOutput thread waiting for GPU completion
┌─────────────────────────────────────────────────────────────────┐
│ tp_generate (parent process) │
│ spawns EngineCore + 2 Worker_TP processes │
└──────┬──────────────────────────┬──────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ EngineCore │ │ Worker_TP 0 │ │ Worker_TP 1 │
│ Process │ │ Process │ │ Process │
│ :3603434 │ │ :3603636 │ │ :3603637 │
│ │ │ │ │ │
│ 60,998 samples │ │ 60,970 samples │ │ 60,978 samples │
│ CPU: 61.0s │ │ CPU: 61.0s │ │ CPU: 61.0s │
│ Wait: 34ms │ │ Wait: 33ms │ │ Wait: 36ms │
│ │ │ │ │ │
│ ┌──────────────┐│ │ ┌──────────────┐│ │ ┌──────────────┐│
│ │ run_busy_loop││ │ │ worker_busy_ ││ │ │ worker_busy_ ││
│ │ _process_ ││ │ │ loop ││ │ │ loop ││
│ │ engine_step ││ │ │ ││ │ │ ││
│ │ ││ │ │ SHM dequeue ││ │ │ SHM dequeue ││
│ │ SHM dequeue ││ │ │ → execute ││ │ │ → execute ││
│ │ (wait for ││ │ │ → SHM enqueue││ │ │ → SHM enqueue││
│ │ workers) ││ │ │ ││ │ │ ││
│ └──────────────┘│ │ └──────────────┘│ │ └──────────────┘│
│ │ │ │ │ │
│ │ │ ┌──────────────┐│ │ │
│ │ │ │ WorkerAsync ││ │ │
│ │ │ │ Output ││ │ │
│ │ │ │ :3603654 ││ │ │
│ │ │ │ 60,938 samp ││ │ │
│ │ │ │ GPU sync wait││ │ │
│ │ │ └──────────────┘│ │ │
└──────────────────┘ └──────────────────┘ └──────────────────┘
4 main threads across 3 processes, all burning ~61s of CPU:
- 1 EngineCore — orchestrates inference steps, sends commands to workers via SHM, waits for responses
- 2 Worker_TP — receive commands via SHM, execute model (TP shards), respond via SHM
- 1 WorkerAsyncOutput — spin-waits for GPU completion
(
cudaEventSynchronize)
All 4 threads show near-zero wait time (34ms, 33ms, 36ms, 69ms) — they are all CPU-bound spin-waiting, not sleeping.
Unlike the mp_generate profile (where each GPU has its own EngineCore), the TP
architecture requires IPC between the EngineCore and Worker_TP processes. vLLM
uses shared-memory message queues (shm_broadcast.py) for this, and the
receive side is a busy-wait spin loop.
EngineCore / Worker_TP thread
──────────────────────────────
MessageQueue.dequeue() ◄── Python entry
└── MessageQueue.acquire_read() ◄── 7.8% self (Python loop body)
│
│ Spin loop: while not message_ready:
│
├── memory_fence() ◄── 3.0% (memory barriers)
│
├── SpinTimer.spin() ◄── 2.5% (tracks spin duration)
│ └── sched_yield() ◄── 2.4% (Python wrapper)
│ └── os.sched_yield() ◄── 0.5% (C impl)
│ └── _GI_sched_yield ◄── 27.2% (OS syscall)
│
├── Event.is_set() ◄── 0.9% (check stop event)
│
└── time.monotonic() ◄── part of 4.3% clock_gettime
└── _GI___clock_gettime
Where each process is spin-waiting:
| Thread | Waiting for | Stacks |
|---|---|---|
| EngineCore :3603434 | Worker responses (FutureWrapper.result → MessageQueue.dequeue) | #2, #6, #9, #12, #13, #18 |
| Worker_TP :3603636 | EngineCore commands (worker_busy_loop → MessageQueue.dequeue) | #3, #8, #11, #14, #16, #19 |
| Worker_TP :3603637 | EngineCore commands (worker_busy_loop → MessageQueue.dequeue) | #4, #7, #10, #15, #17, #20 |
The EngineCore sends a command to both workers, then spin-waits for both to respond. The workers spin-wait for the next command. Since inference is GPU-bound, the GPU is the bottleneck and all three processes spend most of their time spin-waiting.
Samples % Component
──────── ──── ──────────────────────────────────────
66,639 27.2% _GI_sched_yield (the actual syscall)
19,141 7.8% acquire_read self (Python loop overhead)
7,375 3.0% memory_fence (barrier calls)
6,210 2.5% SpinTimer.spin (timer bookkeeping)
6,002 2.4% sched_yield Python wrapper
2,218 0.9% Event.is_set (stop-event check)
2,341 1.0% _GI___clock_gettime (from SHM timer)
1,159 0.5% os_sched_yield_impl
821 0.3% PyTime_AsSecondsDouble
──────── ────
111,906 45.6% Total SHM spin-wait
WorkerAsyncOutput thread :3603654
──────────────────────────────────
WorkerProc.async_output_busy_loop
└── WorkerProc.enqueue_output
└── AsyncGPUModelRunnerOutput.get_output
└── THPEvent_synchronize
└── CUDAGuardImpl::synchronizeEvent
└── cudaEventSynchronize
└── cuEventSynchronize ◄── 21.5% self
└── _GI___clock_gettime ◄── 3.2% (spin-wait clock)
The same pattern as in mp_generate: the WorkerAsyncOutput thread spin-waits for GPU inference completion. At ~25% it's the second-largest cost category (vs 65% in the mp_generate profile, because the SHM overhead now takes a large share).
This profile shows significant CPython interpreter overhead, largely from executing the SHM spin loop in Python. Key functions:
Samples % Function Context
──────── ──── ────────────────────────────────── ───────────────────
2,972 1.2% Py_TryIncrefCompareStackRef refcount (free-threaded)
2,141 0.9% initialize_locals frame setup
1,553 0.6% PyFrame_GetStackPointer frame access
1,381 0.6% PyEvalFramePushAndInit frame creation
1,188 0.5% PyFrame_ClearExceptCode frame teardown
1,166 0.5% PyStackRef_FromPyObjectImmortal stack ref mgmt
1,156 0.5% PyUnstable_Code_GetExtra code object extras
1,094 0.4% Py_atomic_load_uint16_relaxed atomic ops (free-threaded)
1,047 0.4% Py_CheckFunctionResult call result check
1,017 0.4% Py_DECREF refcount
1,010 0.4% PyFrame_ClearLocals frame cleanup
... ... (many more small items)
──────── ────
~24,000 9.8% Total CPython overhead
Notable: free-threaded Python (3.14t) shows up in the atomic operations
(Py_atomic_load_uint16_relaxed, Py_atomic_store_uint64_release,
Py_atomic_load_uint32_relaxed, Py_atomic_load_uint64_acquire) and
reference counting (Py_TryIncrefCompareStackRef, PySeqLock_BeginRead).
These are the per-object atomic operations required when the GIL is disabled.
Much of this overhead is from executing the SHM spin loop body in Python
hundreds of thousands of times — each iteration does frame push/pop, refcount
ops, function calls to memory_fence, sched_yield, Event.is_set, etc.
┌─────────────────────────────────────────────────────────────────────────┐
│ vLLM v1 Tensor-Parallel Architecture │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ EngineCore Process :3603434 │ │
│ │ │ │
│ │ EngineCoreProc.run_engine_core() │ │
│ │ └── run_busy_loop() │ │
│ │ └── _process_engine_step() │ │
│ │ └── EngineCore.step_with_batch_queue() │ │
│ │ │ │ │
│ │ ├── MultiprocExecutor │ │
│ │ │ .collective_rpc() │ │
│ │ │ └── SHM enqueue (command)───│──► to workers │
│ │ │ └── FutureWrapper.result ◄──│── from workers│
│ │ │ └── get_response │ │
│ │ │ └── MessageQueue │ │
│ │ │ .dequeue() │ ~44% CPU │
│ │ │ └── acquire_read │
│ │ │ └── SPIN WAIT ◄─────│
│ │ │ │ │
│ │ └── (scheduler, output processing)│ │
│ └─────────────────────────────────────────────────────┘ │
│ │ SHM │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Worker_TP 0 :3603636│ │ Worker_TP 1 :3603637 │ │
│ │ │ │ │ │
│ │ WorkerProc │ │ WorkerProc │ │
│ │ .worker_busy_loop() │ │ .worker_busy_loop() │ │
│ │ └── MessageQueue │ │ └── MessageQueue │ │
│ │ .dequeue() │ │ .dequeue() │ ~44% CPU │
│ │ └── SPIN WAIT │ │ └── SPIN WAIT │ (shared with │
│ │ └── execute_model() │ │ └── execute_model() │ EngineCore) │
│ │ │ │ │ │
│ │ ┌────────────────┐ │ │ │ │
│ │ │ WorkerAsync │ │ │ │ │
│ │ │ Output :3603654 │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ async_output_ │ │ │ │ │
│ │ │ busy_loop │ │ │ │ │
│ │ │ └── get_output│ │ │ │ ~25% CPU │
│ │ │ └── cuEvent │ │ │ │ (GPU sync) │
│ │ │ Synchronize │ │ │ │
│ │ └────────────────┘ │ │ │ │
│ └──────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Triton/CUDA module loading (~1.1%):
cuModuleLoadData: 1,377 (0.6%) — loading compiled Triton PTXcuLibraryGetModule: 1,290 (0.5%) — CUDA library module lookup
String operations:
_strchr_avx2: 1,129 (0.5%) — string scanning, likely from Python argument parsing or path operations
████████████████████████████████████████████████████ 44.6% SHM IPC spin-wait
██████████████████████████████ 24.9% GPU sync wait
████████████████ 9.8% CPython interpreter overhead
██ 1.1% Triton/CUDA module loading
██████████████████████████ 19.6% Other
1. SHM message queue spin-waiting dominates at ~45%. The TP architecture
requires the EngineCore to send commands to 2 Worker_TP processes and wait for
responses via shared-memory message queues. All 3 processes spin-wait using
sched_yield when no message is available. Since inference is GPU-bound, the
CPU spends most of its time in these spin loops.
2. GPU sync wait is ~25% — same cuEventSynchronize spin-wait as in the
mp_generate profile, but proportionally smaller because the SHM spin-wait now
consumes a large share. Only 1 WorkerAsyncOutput thread (vs 2 in mp_generate).
3. Free-threaded Python (3.14t) atomic overhead is visible. Functions like
Py_TryIncrefCompareStackRef, Py_atomic_load_*, PySeqLock_BeginRead show
the cost of per-object atomic operations in the no-GIL build. These are
particularly expensive in the spin loop which does high-frequency Python
function calls.
4. The SHM spin loop executes in Python — unlike the CUDA spin-wait which
is in native code, the SHM acquire_read loop is implemented in Python
(shm_broadcast.py). Each iteration pays Python frame setup/teardown,
refcounting, and function call overhead. A native implementation of the spin
loop could significantly reduce the ~10% CPython overhead.
5. Three processes all spin at 100% CPU. The EngineCore and both Worker_TP processes each consume ~61s of CPU over the 61s window (near-zero wait time). This means 3 CPU cores are saturated just for spin-waiting, plus 1 more for GPU sync — 4 cores total at 100%.
6. Comparison with mp_generate (single-GPU-per-process):
| Metric | mp_generate | TP generate |
|---|---|---|
| GPU sync wait | 64.7% | 24.9% |
| SHM IPC spin-wait | 0% | 44.6% |
| Torch dispatch | 4.7% | <1% (CUDA graphs) |
| CPython overhead | 4.1% | 9.8% |
| Cores at 100% | 4 (2×EngineCore + 2×AsyncOutput) | 4 (1×EngineCore + 2×Worker_TP + 1×AsyncOutput) |
| IPC mechanism | None (self-contained) | SHM message queues |
| Model execution | Eager mode (torch dispatch) | CUDA graphs (CUDAGraph.replay, not visible in top-40) |
7. The async output pattern is working correctly. GPU sync happens on the dedicated WorkerAsyncOutput thread, not on the EngineCore or Worker_TP threads.
8. CUDA graphs eliminate torch dispatch overhead. Unlike the eager-mode
mp_generate profile which showed 4.7% torch dispatch, this profile uses CUDA
graphs — CUDAGraph.replay doesn't appear in the top-40 hotspots, meaning
the model forward pass CPU cost is negligible.
9. Potential optimization: native SHM spin loop. Moving
MessageQueue.acquire_read from Python to C/Cython could eliminate most of
the ~10% CPython interpreter overhead and reduce the spin loop's per-iteration
cost. The spin loop currently pays for Python frame creation, refcounting
(especially expensive with free-threaded Python), and function dispatch on
every iteration.