We want the cost of applying pprof labels to be low enough that we stop conditionally labeling. Today, CockroachDB only applies labels when a profile is actively being collected, checked via an atomic. This means operations that passed the labeling point before profiling started are unlabeled, goroutine dumps are unlabeled, and when labeling IS on, it costs more than it should.
pprof.Do(ctx, Labels("req", strconv.Itoa(id)), func(ctx) { ... }) triggers 4-5 heap allocations:
1. strconv.Itoa(id): the caller must format numeric IDs to strings because Labels() only accepts string, string pairs. Allocates the string.
2. func(ctx) { ... }: the closure passed to Do captures local variables. Go's inliner does not reliably inline Do: the setup/teardown calls cost 57 each against an 80-point budget, exceeding it without PGO. PGO-guided inlining may help, but isn't predictable. When Do isn't inlined, the closure escapes to the heap.
3. context.WithValue: Do calls WithLabels, which wraps the context with labels. Allocates a valueCtx.
4. &labelMap{...}: WithLabels creates a new labelMap to hold the merged labels. Heap-allocated.
5. mergeLabelSets: if parent labels exist, allocates a new []label.Label slice for the merged result.
The actual runtime operation — getg().labels = ptr — is just a pointer write. The cost is entirely in the Go-level label construction and context wrapping.
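The allocation cost described above can be observed with the standard library alone. The following is a minimal sketch, not a benchmark from this proposal: labeledWork, allocsPerLabeledCall, and sink are names invented for illustration, and the exact count it prints depends on the Go version and on inlining decisions.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
	"strconv"
	"testing"
)

// sink keeps the closure's work observable so the compiler cannot discard it.
var sink int

// labeledWork applies one label the way the existing API requires:
// format the ID, build a LabelSet, and run the work inside a closure.
func labeledWork(ctx context.Context, id int) {
	pprof.Do(ctx, pprof.Labels("req", strconv.Itoa(id)), func(ctx context.Context) {
		sink += id
	})
}

// allocsPerLabeledCall reports the average number of heap allocations per call.
func allocsPerLabeledCall() float64 {
	ctx := context.Background()
	return testing.AllocsPerRun(1000, func() {
		labeledWork(ctx, 123456)
	})
}

func main() {
	fmt.Printf("pprof.Do allocations per call: %.0f\n", allocsPerLabeledCall())
}
```

testing.AllocsPerRun works outside of a test binary, so the sketch runs as an ordinary program.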
Allocations 1-3 are forced by the API's shape: string-only values, closure-based scoping, context-based storage. We address these with a new API that accepts integer values directly, doesn't take a func(), and doesn't involve context.
Allocations 4-5 are the labelMap and its backing slice. We address these by pooling labelMaps in the runtime, with ref counting and lifetime tracking that handles goroutine inheritance and profiler sampling safely.
The existing Do(ctx, labels, func(ctx)) API forces a context.WithValue allocation and, without PGO, a closure allocation (the inliner doesn't reliably eliminate it). The context wrapping also introduces a correctness concern: pprof.Label(ctx, key) reads labels from the context, but the profiler reads from g.labels. When a context crosses a goroutine boundary (via channel), these diverge silently — Label() returns the sender's labels while the profiler sees the receiver's inherited labels.
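The divergence can be illustrated with the existing API. In this sketch (the observed type and crossGoroutine are hypothetical names), a worker goroutine is spawned under worker=pool and then receives a context that was labeled req=42 on a different goroutine. The code only shows what pprof.Label reports from the context; that the worker's goroutine labels remain worker=pool follows from label inheritance at spawn time as described above.

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// observed records what the receiving goroutine reads from the context it was handed.
type observed struct {
	req       string
	hasReq    bool
	hasWorker bool
}

// crossGoroutine starts a worker under the label worker=pool, then sends it a
// context that was labeled req=42 on the sending goroutine.
func crossGoroutine() observed {
	ctxCh := make(chan context.Context)
	out := make(chan observed)

	// The worker goroutine inherits worker=pool from its parent when it is spawned.
	pprof.Do(context.Background(), pprof.Labels("worker", "pool"), func(context.Context) {
		go func() {
			ctx := <-ctxCh
			var o observed
			o.req, o.hasReq = pprof.Label(ctx, "req")
			_, o.hasWorker = pprof.Label(ctx, "worker")
			out <- o
		}()
	})

	// The sender labels its own goroutine and passes the labeled context along.
	pprof.Do(context.Background(), pprof.Labels("req", "42"), func(ctx context.Context) {
		ctxCh <- ctx
	})
	return <-out
}

func main() {
	o := crossGoroutine()
	// The context reports the sender's label and knows nothing about the
	// label the worker goroutine itself was spawned with.
	fmt.Printf("req=%q hasReq=%v hasWorker=%v\n", o.req, o.hasReq, o.hasWorker)
}
```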
The new API sets labels directly on the goroutine. No context, no closure:
pprof.SetLabel("req", pprof.Str(reqID)) // set a string label
pprof.SetLabel("job", pprof.Int(123)) // set an int label, no string formatting
Labels persist until replaced or the goroutine exits. SetLabel returns the previous value for the given key, enabling scoped reset via the defer f(f()) idiom:
defer pprof.SetLabel("req", pprof.SetLabel("req", pprof.Str(reqID)))
defer pprof.SetLabel("job", pprof.SetLabel("job", pprof.Int(123)))
The inner call sets the label and returns the old value. The deferred outer call restores it. All values are passed/returned by value, so there are no closures and no heap allocations.
Since the returned old value comes from g.labels (goroutine state), it must be opaque to avoid creating goroutine-local storage. LabelValue is a value type with unexported fields (string + int64, 24 bytes). Str() and Int() are constructors. The caller can create values and receive old ones back, but can't inspect what was returned — they can only pass it back to SetLabel for restore.
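The set-and-restore idiom can be modeled in user code to see how the pieces fit. This is only a sketch under stated assumptions: goroutineLabels is a hypothetical map standing in for g.labels, the zero LabelValue is treated as "unset", and handle is an invented caller. The real SetLabel would live in runtime/pprof and operate on goroutine state.

```go
package main

import "fmt"

// LabelValue is opaque: code outside this package can construct one and hand
// one back, but cannot read its unexported fields.
type LabelValue struct {
	s string
	n int64
}

func Str(s string) LabelValue { return LabelValue{s: s} }
func Int(n int64) LabelValue  { return LabelValue{n: n} }

// goroutineLabels is a hypothetical stand-in for the goroutine's label storage.
var goroutineLabels = map[string]LabelValue{}

// SetLabel stores v under key and returns the value it replaced.
// In this model the zero LabelValue means "no label".
func SetLabel(key string, v LabelValue) LabelValue {
	old := goroutineLabels[key]
	if v == (LabelValue{}) {
		delete(goroutineLabels, key)
	} else {
		goroutineLabels[key] = v
	}
	return old
}

// handle labels its own scope. The inner SetLabel runs when the defer
// statement is evaluated; the outer one runs when handle returns.
func handle(reqID string) LabelValue {
	defer SetLabel("req", SetLabel("req", Str(reqID)))
	return goroutineLabels["req"] // the value visible while the scope is active
}

func main() {
	SetLabel("req", Str("outer"))
	during := handle("inner")
	fmt.Println(during == Str("inner"), goroutineLabels["req"] == Str("outer"))
}
```

Because defer evaluates its arguments immediately, the label is already in place for the rest of the function body, and the previous value is restored on every return path.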
Setting multiple labels one at a time is wasteful: each SetLabel call claims a new labelMap from the pool, merges the new label into it, and releases the old one. Calling it N times in a row means N pool round-trips, each copying all existing labels plus one. Ideally we set many labels in a single operation.
Making SetLabel variadic (e.g. SetLabel(labels ...LabelValue)) would be appealing, but each variadic call builds a backing slice for its arguments, and that slice is heap-allocated whenever it escapes. Instead, we could offer a handful of fixed-arity variants (SetLabel, SetLabel2, SetLabel3), or a builder pattern:
pprof.SetLabels(pprof.NewLabels().Str("req", reqID).Int("job", 123))
SetLabels returns the old values for all keys the builder set, as an opaque value of the same builder type. This enables the same defer f(f()) pattern for batch labels:
defer pprof.SetLabels(pprof.SetLabels(
pprof.NewLabels().Str("req", reqID).Int("job", 123),
))
The exact multi-label API shape is still being discussed with peers.
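One possible shape for the builder, sketched in user code since the API is still undecided. Everything here is an assumption made for illustration: the fixed capacity of 4, the entry type, the panic on overflow, and the goroutineLabels map standing in for goroutine state. The builder is a plain value with an inline array, so chaining copies it instead of growing a slice.

```go
package main

import "fmt"

type LabelValue struct {
	s string
	n int64
}

type entry struct {
	key string
	val LabelValue
}

// Labels is a fixed-capacity builder passed by value (capacity 4 is arbitrary).
type Labels struct {
	n       int
	entries [4]entry
}

func NewLabels() Labels { return Labels{} }

func (l Labels) add(key string, v LabelValue) Labels {
	if l.n == len(l.entries) {
		panic("too many labels for this sketch")
	}
	l.entries[l.n] = entry{key, v}
	l.n++
	return l
}

func (l Labels) Str(key, s string) Labels       { return l.add(key, LabelValue{s: s}) }
func (l Labels) Int(key string, n int64) Labels { return l.add(key, LabelValue{n: n}) }

// goroutineLabels is a hypothetical stand-in for the goroutine's label storage.
var goroutineLabels = map[string]LabelValue{}

// SetLabels applies every entry and returns the replaced values in reverse
// order, so that passing the result back undoes the batch even if one key
// was set twice.
func SetLabels(l Labels) Labels {
	old := Labels{n: l.n}
	for i := 0; i < l.n; i++ {
		e := l.entries[i]
		old.entries[l.n-1-i] = entry{e.key, goroutineLabels[e.key]}
		if e.val == (LabelValue{}) {
			delete(goroutineLabels, e.key)
		} else {
			goroutineLabels[e.key] = e.val
		}
	}
	return old
}

func handle(reqID string, job int64) (string, int64) {
	defer SetLabels(SetLabels(NewLabels().Str("req", reqID).Int("job", job)))
	return goroutineLabels["req"].s, goroutineLabels["job"].n
}

func main() {
	SetLabels(NewLabels().Str("req", "outer"))
	req, job := handle("inner", 123)
	_, jobStillSet := goroutineLabels["job"]
	fmt.Println(req, job, goroutineLabels["req"].s, jobStillSet)
}
```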
A natural question given that Do provides scoped labels: why not scope := SetLabel(...); defer scope.Close()? We explored this but there are complications.
A scope object saves a snapshot of the old labelMap and restores it on Close. This works if scopes are strictly LIFO — which defer guarantees within a single function. But unlike Do, which enforces LIFO through its func() argument, a scope object can be passed to other functions or goroutines. If scopes are closed out of order, the snapshot restore silently corrupts labels: restoring an old snapshot discards any labels set by intervening scopes.
The defer f(f()) pattern with per-key SetLabel avoids this: each call operates on a single key and returns that key's old value. Out-of-order restores on different keys are harmless (they're independent). Only same-key nesting has ordering sensitivity, which is inherent to any approach.
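The difference between the two restore strategies shows up in a small model. In this sketch, state, scope, open, and set are hypothetical names, labels are plain strings, and an empty string means "unset". Both functions open req and then job, and undo them in the wrong order.

```go
package main

import "fmt"

func copyLabels(m map[string]string) map[string]string {
	c := make(map[string]string, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

// state is a hypothetical stand-in for one goroutine's labels.
type state struct{ labels map[string]string }

// scope remembers the whole label set as it was when the scope opened.
type scope struct {
	st    *state
	saved map[string]string
}

func (st *state) open(key, val string) scope {
	s := scope{st: st, saved: copyLabels(st.labels)}
	st.labels[key] = val
	return s
}

// Close restores the full snapshot, discarding anything set since.
func (s scope) Close() { s.st.labels = copyLabels(s.saved) }

// set is the per-key alternative: it returns only that key's old value.
func (st *state) set(key, val string) string {
	old := st.labels[key]
	if val == "" {
		delete(st.labels, key)
	} else {
		st.labels[key] = val
	}
	return old
}

func snapshotOutOfOrder() map[string]string {
	st := &state{labels: map[string]string{}}
	a := st.open("req", "1")
	b := st.open("job", "2")
	a.Close() // out of order: job=2 is dropped while b is still open
	b.Close() // restores b's snapshot, which brings req=1 back
	return st.labels
}

func perKeyOutOfOrder() map[string]string {
	st := &state{labels: map[string]string{}}
	oldReq := st.set("req", "1")
	oldJob := st.set("job", "2")
	st.set("req", oldReq) // restored first, out of order
	st.set("job", oldJob)
	return st.labels
}

func main() {
	fmt.Println(snapshotOutOfOrder()) // req=1 survives both Close calls
	fmt.Println(perKeyOutOfOrder())   // empty, as intended
}
```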
A Do-style API that takes a func() argument would enforce LIFO scoping, but the closure allocates — Go's inliner doesn't reliably inline the wrapper (the setup/teardown calls cost 57 each against the inliner's 80-point budget, and PGO-guided inlining isn't predictable), so the closure escapes to the heap.
Add IntVal int64 to the label type alongside StrVal string. New API functions (SetIntLabel, NewLabels().Int(...)) accept integers directly. String formatting is deferred to profile serialization time (rare). The pprof protobuf format already supports Label.num for integer values.
The label struct:
type label struct {
Name string
StrVal string
IntVal int64
}
No IsInt bool: zero values for both StrVal and IntVal are distinguishable enough.
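Deferring the integer formatting could look like the sketch below. The render function is hypothetical, and the rule it uses (an empty StrVal marks an integer label) is one assumed reading of "distinguishable enough", not something the proposal specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

type label struct {
	Name   string
	StrVal string
	IntVal int64
}

// render formats a label for display. Assumption for this sketch: an empty
// StrVal marks an integer label, so the integer is only converted to text
// here, at output time, never when the label is set.
func render(l label) string {
	if l.StrVal != "" {
		return l.Name + "=" + l.StrVal
	}
	return l.Name + "=" + strconv.FormatInt(l.IntVal, 10)
}

func main() {
	fmt.Println(render(label{Name: "req", StrVal: "abc"}))
	fmt.Println(render(label{Name: "job", IntVal: 123}))
}
```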
P-local cache in the runtime, modeled after sudogcache/deferpool. Each P holds a free list of labelMaps. This gives sync.Pool-level performance (P-local, minimal cross-core contention) without needing to import sync from the runtime.
The labelMap struct, defined in the runtime:
type labelMap struct {
next *labelMap // P-local free list linkage
refs int64 // ref count (atomic)
list []label // the actual labels; capacity reused across pool cycles
}
g.labels stays unsafe.Pointer; the runtime casts to *labelMap internally. LabelMaps are immutable while in use: mutation only happens at pool transitions.
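The free-list mechanics can be sketched outside the runtime. Here cache is a hypothetical single-threaded stand-in for the per-P list (the real one would hang off the p struct and need no locking because a P runs one goroutine at a time), and get and put are invented names, not the proposed profLabelGet and profLabelPut.

```go
package main

import "fmt"

type label struct {
	Name   string
	StrVal string
	IntVal int64
}

type labelMap struct {
	next *labelMap // free list linkage
	refs int64     // unused in this sketch
	list []label   // capacity is reused across pool cycles
}

// cache models one P's free list. It is not safe for concurrent use.
type cache struct{ free *labelMap }

// get pops a map from the free list, or allocates one if the list is empty.
func (c *cache) get() *labelMap {
	if m := c.free; m != nil {
		c.free = m.next
		m.next = nil
		return m
	}
	return &labelMap{}
}

// put clears the entries (so old strings are not retained), keeps the
// backing array, and pushes the map onto the free list.
func (c *cache) put(m *labelMap) {
	for i := range m.list {
		m.list[i] = label{}
	}
	m.list = m.list[:0]
	m.next = c.free
	c.free = m
}

func main() {
	var c cache
	m := c.get()
	m.list = append(m.list, label{Name: "req", StrVal: "abc"}, label{Name: "job", IntVal: 123})
	c.put(m)
	reused := c.get()
	fmt.Println(reused == m, len(reused.list), cap(reused.list) >= 2)
}
```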
A pooled labelMap can be referenced by:
- The goroutine that created it (via g.labels)
- Child goroutines that inherited it (newproc1 copies newg.labels = mp.curg.labels)
- The profiler, which copies the pointer during CPU sampling (via profBuf.tags)
We can't return a labelMap to the pool until all references are gone. Simple "return on scope exit" doesn't work because child goroutines and the profiler may outlive the scope.
Each labelMap has an atomic refs field. Goroutine spawn decrements it (one more sharer), goroutine exit and creator release increment it (one fewer sharer). When all goroutine references are released AND the profiler is done with any sampled copies, the map returns to the pool.
The profiler complicates this because it copies the g.labels pointer during CPU sampling ticks (signal handler context — can't allocate, can't do per-sample ref count ops). We handle this with a global profileEpoch counter that acts as a moving "zero point" for the ref count. When profiling starts, the epoch advances, which prevents any in-flight labelMaps from being pooled prematurely. After serialization, the profiler releases its claim on each sampled map by incrementing its refs toward the new epoch. This avoids needing a per-goroutine atomic in the sampling tick.
| Event | Operation | Who |
|---|---|---|
| Claim from pool | refs = epoch - 1 | runtime/pprof |
| Child goroutine spawned | refs-- (atomic) | runtime newproc1 |
| Child goroutine exits | refs++; if refs == currentEpoch → pool | runtime goexit0 |
| Creator releases | refs++; if refs == currentEpoch → pool | runtime/pprof |
| Profile starts | profileEpoch++ (raises the bar) | runtime |
| Profile serialization done with map | refs++; if refs == currentEpoch → pool | runtime/pprof |
The last one out pools it — whether that's a child goroutine exiting, the creator releasing, or the profiler finishing serialization.
Non-pooled labelMaps (from existing Do/WithLabels API) have refs = 0. They'll never reach the epoch (which only increases), so profLabelRelease is a no-op on them. They get GC'd as before.
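The arithmetic in the table can be traced with a small model. This sketch covers pooled maps only; claim, addRef, release, and bumpEpoch are hypothetical names mirroring the table rows rather than the proposed runtime functions, and the pooled flag replaces an actual pool.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// profileEpoch is the moving "zero point" for ref counts.
var profileEpoch int64

type labelMap struct {
	refs   int64 // accessed atomically
	pooled bool  // set when the map would return to the pool
}

// claim takes a map for a new label set: one reference below the epoch.
func claim() *labelMap {
	return &labelMap{refs: atomic.LoadInt64(&profileEpoch) - 1}
}

// addRef records one more sharer (a child goroutine was spawned).
func addRef(m *labelMap) { atomic.AddInt64(&m.refs, -1) }

// release records one fewer sharer and reports whether this was the last one.
func release(m *labelMap) bool {
	if atomic.AddInt64(&m.refs, 1) == atomic.LoadInt64(&profileEpoch) {
		m.pooled = true
		return true
	}
	return false
}

// bumpEpoch raises the bar when a profile starts.
func bumpEpoch() { atomic.AddInt64(&profileEpoch, 1) }

func main() {
	// Creator plus one child: the last one out pools the map.
	m := claim()
	addRef(m)
	fmt.Println(release(m)) // child exits: false, creator still holds it
	fmt.Println(release(m)) // creator releases: true

	// A profile starts while the map is in use.
	p := claim()
	bumpEpoch()
	fmt.Println(release(p)) // creator releases: false, profiler may hold a copy
	fmt.Println(release(p)) // profiler done serializing: true
}
```

In the second scenario the creator's release lands one short of the raised epoch, which is what keeps a sampled map out of the pool until the profiler has released it.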
- runtime/runtime2.go: Add labelCache *labelMap to the p struct
- runtime/proflabel.go: labelMap and label structs, profileEpoch, pool functions (profLabelGet, profLabelPut, profLabelRelease, profLabelAddRef, profLabelBumpEpoch)
- runtime/proc.go: newproc1 (~line 5394): call profLabelAddRef after label inheritance. goexit0 (~line 4529): call profLabelRelease before clearing labels.
- runtime/cpuprof.go: profLabelBumpEpoch() when CPU profiling starts
- runtime/mprof.go: profLabelBumpEpoch() when goroutine profiling starts
- runtime/pprof/label.go: SetLabel, SetIntLabel, SetLabels, NewLabels builder, IntLabel constructor
- runtime/pprof/runtime.go: linkname declarations for new runtime functions
- runtime/pprof/proto.go: release refs after serialization; int label support via tagLabel_Num
- Write barriers in goexit0. profLabelRelease may push to P-local cache (pointer writes). Must happen before dropg() at line 4545 since the P-local cache is per-P.
- Signal handler unchanged. CPU profiler's signal handler copies gp.labels pointer into profBuf.tags. No change. The epoch bump prevents premature pooling.
- GC interaction. Pooled labelMaps contain string pointers; GC traces them normally. Consider clearing P-local caches during GC (register a poolcleanup callback) to bound memory retention.
- Race detector. Extend existing racereleasemerge/raceacquire on labelSync to cover refcount ops.
- Existing Do API unchanged. Do continues to work exactly as today. Its labelMaps are heap-allocated, have refs = 0, and are GC'd normally. The pooling infrastructure doesn't affect it.
- First-use allocation. First SetLabel call allocates the labelMap and its []label slice. Subsequent calls reuse capacity from the pool. Steady-state is zero allocations.
- Runtime infrastructure: labelMap/label types, P-local cache, epoch, refcount functions
- Refcount wiring in newproc1/goexit0
- Epoch bumping in profiler start paths
- Profiler serialization release in proto.go
- New SetLabel/SetIntLabel/SetLabels API
- Tests + benchmarks
- go test runtime/pprof/...: existing tests pass (Do unchanged)
- go test runtime/...: scheduler/runtime tests pass
- New BenchmarkSetLabel reports 0 allocs/op (after warmup)
- New test: SetLabel → spawn children → children exit → verify pool return
- New test: SetLabel → start profile → verify map not pooled → finish serialization → verify pooled
- New test: nested SetLabel with correct per-key restore (defer f(f()) pattern)
- New test: labels visible in CPU profile and goroutine dump
- New test: set-and-forget (no defer) → labels persist until goroutine exits
- New test: int labels roundtrip through profile serialization (verify Label.num in proto)
- New test: Do still works exactly as before (regression)
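The Do regression check needs nothing beyond the existing public API, so it can be written before any of the new code lands. A possible starting point, with checkDo as a hypothetical helper rather than an existing test:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime/pprof"
)

// checkDo verifies the observable behavior of the existing API: labels are
// visible inside Do, nested calls merge with the parent's labels, and the
// parent context is left untouched.
func checkDo() error {
	base := context.Background()
	var err error
	pprof.Do(base, pprof.Labels("req", "1"), func(ctx context.Context) {
		if v, ok := pprof.Label(ctx, "req"); !ok || v != "1" {
			err = errors.New("req label missing inside Do")
			return
		}
		pprof.Do(ctx, pprof.Labels("job", "2"), func(inner context.Context) {
			_, hasReq := pprof.Label(inner, "req")
			v, hasJob := pprof.Label(inner, "job")
			if !hasReq || !hasJob || v != "2" {
				err = errors.New("nested Do did not merge labels")
			}
		})
	})
	if _, ok := pprof.Label(base, "req"); err == nil && ok {
		err = errors.New("label leaked into the parent context")
	}
	return err
}

func main() {
	if err := checkDo(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```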