vLLM Tensor-Parallel Benchmark Profiler Analysis

Source: tp_generate_2026-10-25_2.json.gz — samply profiler output from TP (tensor-parallel) benchmark Model: LlamaForCausalLM (Llama architecture) vLLM version: v1 engine (inferred from vllm/v1/ paths) Executor: MultiprocExecutor (TP=2, 2 Worker_TP processes) Python: 3.14t (free-threaded build) Time window: 39s–100s (steady-state LLM inference) Total samples (in window): ~245,437

1. High-Level Time Breakdown

 Samples   %     Activity
────────  ────   ──────────────────────────────────────────────────
 66,639   27.2%  SHM spin-wait: sched_yield (OS-level yield)
 52,789   21.5%  GPU sync wait (cuEventSynchronize)
 19,141    7.8%  SHM spin-wait: MessageQueue.acquire_read (self)
 10,611    4.3%  Clock reads (_GI___clock_gettime) — GPU sync + SHM
  7,375    3.0%  SHM spin-wait: memory_fence
  6,210    2.5%  SHM spin-wait: SpinTimer.spin
  6,002    2.4%  SHM spin-wait: sched_yield (Python wrapper)
  2,972    1.2%  CPython: Py_TryIncrefCompareStackRef
  2,218    0.9%  SHM spin-wait: Event.is_set check
  2,141    0.9%  CPython: initialize_locals
  1,377    0.6%  Triton kernel loading (cuModuleLoadData)
  1,290    0.5%  CUDA module lookup (cuLibraryGetModule)
  1,159    0.5%  SHM spin-wait: os_sched_yield_impl
 64,513   26.3%  Everything else (below top-40 cutoff)

The profile is dominated by two kinds of spin-waiting:

SHM IPC spin-wait (~44%) — EngineCore and Worker_TP processes waiting for shared-memory message queue responses
GPU sync wait (~25%) — WorkerAsyncOutput thread waiting for GPU completion

2. Process & Thread Architecture

┌─────────────────────────────────────────────────────────────────┐
│  tp_generate (parent process)                                   │
│    spawns EngineCore + 2 Worker_TP processes                    │
└──────┬──────────────────────────┬──────────────┬────────────────┘
       │                          │              │
       ▼                          ▼              ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  EngineCore      │  │  Worker_TP 0     │  │  Worker_TP 1     │
│  Process         │  │  Process         │  │  Process         │
│  :3603434        │  │  :3603636        │  │  :3603637        │
│                  │  │                  │  │                  │
│  60,998 samples  │  │  60,970 samples  │  │  60,978 samples  │
│  CPU: 61.0s      │  │  CPU: 61.0s      │  │  CPU: 61.0s      │
│  Wait: 34ms      │  │  Wait: 33ms      │  │  Wait: 36ms      │
│                  │  │                  │  │                  │
│  ┌──────────────┐│  │  ┌──────────────┐│  │  ┌──────────────┐│
│  │ run_busy_loop││  │  │ worker_busy_ ││  │  │ worker_busy_ ││
│  │ _process_    ││  │  │ loop         ││  │  │ loop         ││
│  │ engine_step  ││  │  │              ││  │  │              ││
│  │              ││  │  │ SHM dequeue  ││  │  │ SHM dequeue  ││
│  │ SHM dequeue  ││  │  │ → execute    ││  │  │ → execute    ││
│  │ (wait for    ││  │  │ → SHM enqueue││  │  │ → SHM enqueue││
│  │  workers)    ││  │  │              ││  │  │              ││
│  └──────────────┘│  │  └──────────────┘│  │  └──────────────┘│
│                  │  │                  │  │                  │
│                  │  │  ┌──────────────┐│  │                  │
│                  │  │  │ WorkerAsync  ││  │                  │
│                  │  │  │ Output       ││  │                  │
│                  │  │  │ :3603654     ││  │                  │
│                  │  │  │ 60,938 samp  ││  │                  │
│                  │  │  │ GPU sync wait││  │                  │
│                  │  │  └──────────────┘│  │                  │
└──────────────────┘  └──────────────────┘  └──────────────────┘

4 main threads across 3 processes, all burning ~61s of CPU:

1 EngineCore — orchestrates inference steps, sends commands to workers via SHM, waits for responses
2 Worker_TP — receive commands via SHM, execute model (TP shards), respond via SHM
1 WorkerAsyncOutput — spin-waits for GPU completion (cudaEventSynchronize)

All 4 threads show near-zero wait time (34ms, 33ms, 36ms, 69ms) — they are all CPU-bound spin-waiting, not sleeping.

3. The Dominant Cost: SHM Message Queue Spin-Wait (~44%)

Unlike the mp_generate profile (where each GPU has its own EngineCore), the TP architecture requires IPC between the EngineCore and Worker_TP processes. vLLM uses shared-memory message queues (shm_broadcast.py) for this, and the receive side is a busy-wait spin loop.

 EngineCore / Worker_TP thread
 ──────────────────────────────

 MessageQueue.dequeue()                            ◄── Python entry
   └── MessageQueue.acquire_read()                 ◄── 7.8% self (Python loop body)
         │
         │  Spin loop: while not message_ready:
         │
         ├── memory_fence()                        ◄── 3.0% (memory barriers)
         │
         ├── SpinTimer.spin()                      ◄── 2.5% (tracks spin duration)
         │     └── sched_yield()                   ◄── 2.4% (Python wrapper)
         │           └── os.sched_yield()          ◄── 0.5% (C impl)
         │                 └── _GI_sched_yield     ◄── 27.2% (OS syscall)
         │
         ├── Event.is_set()                        ◄── 0.9% (check stop event)
         │
         └── time.monotonic()                      ◄── part of 4.3% clock_gettime
               └── _GI___clock_gettime

Where each process is spin-waiting:

Thread	Waiting for	Stacks
EngineCore :3603434	Worker responses (FutureWrapper.result → MessageQueue.dequeue)	#2, #6, #9, #12, #13, #18
Worker_TP :3603636	EngineCore commands (worker_busy_loop → MessageQueue.dequeue)	#3, #8, #11, #14, #16, #19
Worker_TP :3603637	EngineCore commands (worker_busy_loop → MessageQueue.dequeue)	#4, #7, #10, #15, #17, #20

The EngineCore sends a command to both workers, then spin-waits for both to respond. The workers spin-wait for the next command. Since inference is GPU-bound, the GPU is the bottleneck and all three processes spend most of their time spin-waiting.

SHM spin-wait time budget breakdown

 Samples   %     Component
────────  ────   ──────────────────────────────────────
 66,639   27.2%  _GI_sched_yield (the actual syscall)
 19,141    7.8%  acquire_read self (Python loop overhead)
  7,375    3.0%  memory_fence (barrier calls)
  6,210    2.5%  SpinTimer.spin (timer bookkeeping)
  6,002    2.4%  sched_yield Python wrapper
  2,218    0.9%  Event.is_set (stop-event check)
  2,341    1.0%  _GI___clock_gettime (from SHM timer)
  1,159    0.5%  os_sched_yield_impl
    821    0.3%  PyTime_AsSecondsDouble
────────  ────
111,906   45.6%  Total SHM spin-wait

4. GPU Sync Wait (~25%)

 WorkerAsyncOutput thread :3603654
 ──────────────────────────────────

 WorkerProc.async_output_busy_loop
   └── WorkerProc.enqueue_output
         └── AsyncGPUModelRunnerOutput.get_output
               └── THPEvent_synchronize
                     └── CUDAGuardImpl::synchronizeEvent
                           └── cudaEventSynchronize
                                 └── cuEventSynchronize          ◄── 21.5% self
                                       └── _GI___clock_gettime   ◄── 3.2% (spin-wait clock)

The same pattern as in mp_generate: the WorkerAsyncOutput thread spin-waits for GPU inference completion. At ~25% it's the second-largest cost category (vs 65% in the mp_generate profile, because the SHM overhead now takes a large share).

5. CPython Interpreter Overhead (~10%)

This profile shows significant CPython interpreter overhead, largely from executing the SHM spin loop in Python. Key functions:

 Samples   %     Function                              Context
────────  ────   ──────────────────────────────────     ───────────────────
  2,972    1.2%  Py_TryIncrefCompareStackRef            refcount (free-threaded)
  2,141    0.9%  initialize_locals                      frame setup
  1,553    0.6%  PyFrame_GetStackPointer                frame access
  1,381    0.6%  PyEvalFramePushAndInit                 frame creation
  1,188    0.5%  PyFrame_ClearExceptCode                frame teardown
  1,166    0.5%  PyStackRef_FromPyObjectImmortal        stack ref mgmt
  1,156    0.5%  PyUnstable_Code_GetExtra               code object extras
  1,094    0.4%  Py_atomic_load_uint16_relaxed          atomic ops (free-threaded)
  1,047    0.4%  Py_CheckFunctionResult                 call result check
  1,017    0.4%  Py_DECREF                              refcount
  1,010    0.4%  PyFrame_ClearLocals                    frame cleanup
    ...    ...   (many more small items)
────────  ────
~24,000   9.8%  Total CPython overhead

Notable: free-threaded Python (3.14t) shows up in the atomic operations (Py_atomic_load_uint16_relaxed, Py_atomic_store_uint64_release, Py_atomic_load_uint32_relaxed, Py_atomic_load_uint64_acquire) and reference counting (Py_TryIncrefCompareStackRef, PySeqLock_BeginRead). These are the per-object atomic operations required when the GIL is disabled.

Much of this overhead is from executing the SHM spin loop body in Python hundreds of thousands of times — each iteration does frame push/pop, refcount ops, function calls to memory_fence, sched_yield, Event.is_set, etc.

6. Full vLLM v1 TP Call Flow (from profiler)

┌─────────────────────────────────────────────────────────────────────────┐
│                   vLLM v1 Tensor-Parallel Architecture                  │
│                                                                         │
│  ┌─────────────────────────────────────────────────────┐                │
│  │           EngineCore Process :3603434               │                │
│  │                                                     │                │
│  │  EngineCoreProc.run_engine_core()                   │                │
│  │    └── run_busy_loop()                              │                │
│  │          └── _process_engine_step()                 │                │
│  │                └── EngineCore.step_with_batch_queue() │                │
│  │                      │                                │                │
│  │                      ├── MultiprocExecutor             │                │
│  │                      │   .collective_rpc()             │                │
│  │                      │     └── SHM enqueue (command)───│──► to workers │
│  │                      │     └── FutureWrapper.result ◄──│── from workers│
│  │                      │           └── get_response      │                │
│  │                      │                 └── MessageQueue │                │
│  │                      │                     .dequeue()   │   ~44% CPU   │
│  │                      │                       └── acquire_read           │
│  │                      │                             └── SPIN WAIT ◄─────│
│  │                      │                                │                │
│  │                      └── (scheduler, output processing)│                │
│  └─────────────────────────────────────────────────────┘                │
│                              │  SHM                                      │
│                    ┌─────────┴─────────┐                                │
│                    ▼                   ▼                                 │
│  ┌──────────────────────┐  ┌──────────────────────┐                     │
│  │  Worker_TP 0 :3603636│  │  Worker_TP 1 :3603637 │                     │
│  │                      │  │                       │                     │
│  │  WorkerProc            │  │  WorkerProc            │                     │
│  │  .worker_busy_loop()   │  │  .worker_busy_loop()   │                     │
│  │    └── MessageQueue    │  │    └── MessageQueue    │                     │
│  │        .dequeue()      │  │        .dequeue()      │    ~44% CPU       │
│  │          └── SPIN WAIT │  │          └── SPIN WAIT │    (shared with   │
│  │    └── execute_model() │  │    └── execute_model() │     EngineCore)   │
│  │                       │  │                       │                     │
│  │  ┌────────────────┐   │  │                       │                     │
│  │  │ WorkerAsync     │   │  │                       │                     │
│  │  │ Output :3603654 │   │  │                       │                     │
│  │  │                 │   │  │                       │                     │
│  │  │ async_output_   │   │  │                       │                     │
│  │  │ busy_loop       │   │  │                       │                     │
│  │  │   └── get_output│   │  │                       │   ~25% CPU         │
│  │  │     └── cuEvent │   │  │                       │   (GPU sync)       │
│  │  │       Synchronize   │  │                       │                     │
│  │  └────────────────┘   │  │                       │                     │
│  └──────────────────────┘  └──────────────────────┘                     │
└─────────────────────────────────────────────────────────────────────────┘

7. Other Notable Costs

Triton/CUDA module loading (~1.1%):

cuModuleLoadData: 1,377 (0.6%) — loading compiled Triton PTX
cuLibraryGetModule: 1,290 (0.5%) — CUDA library module lookup

String operations:

_strchr_avx2: 1,129 (0.5%) — string scanning, likely from Python argument parsing or path operations

8. Time Budget Bar Chart

████████████████████████████████████████████████████  44.6%  SHM IPC spin-wait
██████████████████████████████                        24.9%  GPU sync wait
████████████████                                       9.8%  CPython interpreter overhead
██                                                     1.1%  Triton/CUDA module loading
██████████████████████████                            19.6%  Other

9. Summary & Interpretation

1. SHM message queue spin-waiting dominates at ~45%. The TP architecture requires the EngineCore to send commands to 2 Worker_TP processes and wait for responses via shared-memory message queues. All 3 processes spin-wait using sched_yield when no message is available. Since inference is GPU-bound, the CPU spends most of its time in these spin loops.

2. GPU sync wait is ~25% — same cuEventSynchronize spin-wait as in the mp_generate profile, but proportionally smaller because the SHM spin-wait now consumes a large share. Only 1 WorkerAsyncOutput thread (vs 2 in mp_generate).

3. Free-threaded Python (3.14t) atomic overhead is visible. Functions like Py_TryIncrefCompareStackRef, Py_atomic_load_*, PySeqLock_BeginRead show the cost of per-object atomic operations in the no-GIL build. These are particularly expensive in the spin loop which does high-frequency Python function calls.

4. The SHM spin loop executes in Python — unlike the CUDA spin-wait which is in native code, the SHM acquire_read loop is implemented in Python (shm_broadcast.py). Each iteration pays Python frame setup/teardown, refcounting, and function call overhead. A native implementation of the spin loop could significantly reduce the ~10% CPython overhead.

5. Three processes all spin at 100% CPU. The EngineCore and both Worker_TP processes each consume ~61s of CPU over the 61s window (near-zero wait time). This means 3 CPU cores are saturated just for spin-waiting, plus 1 more for GPU sync — 4 cores total at 100%.

6. Comparison with mp_generate (single-GPU-per-process):

Metric	mp_generate	TP generate
GPU sync wait	64.7%	24.9%
SHM IPC spin-wait	0%	44.6%
Torch dispatch	4.7%	<1% (CUDA graphs)
CPython overhead	4.1%	9.8%
Cores at 100%	4 (2×EngineCore + 2×AsyncOutput)	4 (1×EngineCore + 2×Worker_TP + 1×AsyncOutput)
IPC mechanism	None (self-contained)	SHM message queues
Model execution	Eager mode (torch dispatch)	CUDA graphs (CUDAGraph.replay, not visible in top-40)

7. The async output pattern is working correctly. GPU sync happens on the dedicated WorkerAsyncOutput thread, not on the EngineCore or Worker_TP threads.

8. CUDA graphs eliminate torch dispatch overhead. Unlike the eager-mode mp_generate profile which showed 4.7% torch dispatch, this profile uses CUDA graphs — CUDAGraph.replay doesn't appear in the top-40 hotspots, meaning the model forward pass CPU cost is negligible.

9. Potential optimization: native SHM spin loop. Moving MessageQueue.acquire_read from Python to C/Cython could eliminate most of the ~10% CPython interpreter overhead and reduce the spin loop's per-iteration cost. The spin loop currently pays for Python frame creation, refcounting (especially expensive with free-threaded Python), and function dispatch on every iteration.

nascheme/tp_analysis.md

Select an option

No results found