a.k.a. have most hot allocation paths be transparently served by dedicated freelists managed by the compiler and runtime
> [!IMPORTANT]
> This document is a work in progress. Early feedback and discussions are welcome.
>
> Known TODOs:
>
> - How to handle slices
> - How to return dynamic metadata from goroutines
Not sure if this has already been suggested previously (the closest thing I have found is golang/go#6331), but I suspect it would help significantly when heap allocations happen in hot loops.
So far, because they have lower overhead, object allocations have been demoted to stack allocations whenever possible. There are, however, a number of situations in which this does not happen, e.g.
- when the object escapes to the heap
- when the object is too big
- when the object lifetime exceeds the lifetime of the stack frame where the object is created
- (possibly more)
Whenever one of the above applies, the object is heap allocated, incurring the overhead of allocation and garbage collection, as well as memory overhead and worse memory locality.
Both the runtime and standard library recognize this as a performance issue that is normally addressed with ad-hoc solutions:
- the runtime uses dedicated pools for the `g` struct[^1], the goroutine stacks[^2], as well as for `_defer` structs[^3]
- the standard library offers (and internally uses) `sync.Pool`, `arena`, and `unique` to allow user code to mitigate this class of issues
- user code is on occasion written to manually hoist allocations out of hot paths (possibly at the cost of readability)
  - a fairly common example of this is the act of manually hoisting an allocation from inside a loop to before the loop, and then reusing the same allocation for each iteration of the loop (see the sketch after this list)
- a significant number of online resources contain recommendations, best practices, and suggestions to reduce memory allocation pressure
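For illustration, a minimal sketch of this manual hoisting pattern (the `emit` callback and the surrounding functions are placeholders, and the hoisted version is only safe if `emit` does not retain the buffer):

```go
import "bytes"

// Naive version: a bytes.Buffer is heap allocated on every iteration.
func emitAllNaive(messages []string, emit func(*bytes.Buffer)) {
	for _, msg := range messages {
		buf := new(bytes.Buffer)
		buf.WriteString(msg)
		emit(buf)
	}
}

// Hoisted version: the allocation is moved out of the loop and reused on
// each iteration, at the cost of slightly less obvious code.
func emitAllHoisted(messages []string, emit func(*bytes.Buffer)) {
	buf := new(bytes.Buffer)
	for _, msg := range messages {
		buf.Reset()
		buf.WriteString(msg)
		emit(buf)
	}
}
```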
> [!NOTE]
> This assumes familiarity with the Go compiler and runtime. The tl;dr is: let's have the compiler automatically identify locations in the code where the use of (something similar to) a `sync.Pool` would reduce the number of allocations performed without changing program behaviour, and have the compiler automatically add these pools while compiling the code.
This situation could be improved if the compiler and runtime were able to, when and only when they can cheaply prove where in the code an allocation becomes fully dead, put the dying object back into a per-type pool/freelist at that specific point, so that it can be cheaply reused whenever a new object of the same type is needed.
This would effectively be a generalization of the runtime pools mentioned above (`g`, `g` stacks, `_defer`), that are normally implemented with a per-P pool and a global fallback pool. These pools would be fully or partially emptied at the start of each GC cycle, similarly to what the existing pools do, to ensure that unneeded objects are eventually garbage collected.
This mechanism would work in addition to the existing stack allocation, as an intermediate step between stack allocation and regular heap allocation. Objects that can be stack allocated will continue to be stack allocated; but if heap object reuse is enabled for a type, then when a new object of that type is needed the pool/freelist for that type would be checked first, before falling back to regular heap allocation. Objects obtained from the freelist would follow the same requirements (including `needzero`) as any other object obtained from `mallocgc`.
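For illustration, the allocation path with reuse enabled could conceptually look like the following pseudocode, using the runtime helpers described in the implementation outline further below (`poolIDForT`, `sizeofT`, `typeOfT`, and `memclr` are placeholders for compiler/runtime internals):

```go
// Pseudocode sketch of allocating a heap object of some type T once heap
// reuse is enabled for T. This logic would conceptually live in mallocgc.
func newT() unsafe.Pointer {
	// 1. Consult the per-type freelist first.
	if ptr := heapReuseGet(poolIDForT); ptr != nil {
		memclr(ptr, sizeofT) // honor needzero, exactly as for a fresh allocation
		return ptr
	}
	// 2. Fall back to a regular heap allocation.
	return mallocgc(sizeofT, typeOfT, true)
}
```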
Simplifying, these are the current states of a memory allocation:
```mermaid
flowchart LR
U{{Unallocated}} -->|Stack Alloc| X[In use / Stack]
A -->|Unreachable| D[Dead]
D -->|GC| U
U -->|Heap Alloc| A[In use / Heap]
X -->|Return| U
```
With the proposed heap reuse mechanism, we would effectively be adding an additional state, and the associated transitions:
```mermaid
flowchart LR
U{{Unallocated}} -->|Stack Alloc| X[In use / Stack]
A -->|Lifetime End| R[Reusable]:::red
R -->|Reuse| A
A -->|Unreachable| D[Dead]
D -->|GC| U
R -->|GC| U
U -->|Heap Alloc| A[In use / Heap]
X -->|Return| U
classDef red stroke:red
linkStyle 1,2,5 stroke:red
```
Heap object reuse is enabled for a type when the compiler can determine that objects of that type become dead at one or more known points of the application. PGO and heuristics can be used to further restrict the set of types for which reuse is enabled, e.g. by only considering death points in hot paths. To avoid making latency/throughput worse, reuse can be best effort (e.g. in case of contention on the central pool, objects can be dropped instead of added to the pool/freelist) and appropriate fast paths can be added (e.g. checking if the global pool is empty should not require the acquisition of a lock).
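As a concrete (hypothetical) example of a kill location (`item`, `largeBuf`, `process`, and `bufPoolID` are placeholders; the injected call is shown as a comment because it would not appear in user code):

```go
// Each iteration heap allocates a largeBuf because it exceeds the stack
// allocation limits, yet no reference to it survives the iteration
// (assuming process does not retain its argument).
func handleAll(items []item) {
	for _, it := range items {
		buf := new(largeBuf) // heap allocated today (too big for the stack)
		process(buf, it)     // assume buf does not escape process
		// <- kill location: buf is provably dead here, so the compiler could
		//    inject heapReusePut(bufPoolID, unsafe.Pointer(buf)); the next
		//    iteration's new(largeBuf) could then be served from the pool
		//    instead of performing a fresh heap allocation.
	}
}
```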
This document is mostly a collection of notes, not really a formal proposal ready for consideration. Almost none of the contents have been discussed widely, so it is likely that the ideas herein are at least somewhat deficient in some respect. Any feedback is welcome.
An object with an associated finalizer or cleanup operation is not fully dead, and therefore is not eligible for reuse until the finalizer/cleanup has completed (and the object has not been resuscitated in the process). When the object becomes fully dead (most likely at some point in the finalizer/cleanup code) it can potentially be added back to the reuse pool.
While this would work, it would be somewhat self-defeating: the whole point of this proposal is to reduce the number of allocations that need to be garbage collected, but for finalizers to be executed the garbage collector itself needs to detect that the object is unreachable. This could be improved though: see the Finalizers/cleanup integration section below.
- No visible changes in functional behaviour, nor code changes required in user code.
- Some[^4] uses of manual object pooling or object reuse (including e.g. `sync.Pool`) will become unnecessary, but continue working.
- CPU time spent on memory allocation and garbage collection should decrease; it should also become possible to run the same workload with a smaller heap.
- Some additional metrics may become available to allow users to monitor the behavior of this mechanism.
The following is the outline of the implementation for the MVP version of this mechanism:
- Add `GOEXPERIMENT=heapreuse`
- Add the following functions to the runtime (a rough user-space model is sketched after this outline):
  - `heapReusePut(heapReusePoolIdx int, ptr unsafe.Pointer)` adds the pointer to the corresponding heap reuse pool
  - `heapReuseGet(heapReusePoolIdx int) unsafe.Pointer` checks the corresponding heap reuse pool for a reusable heap allocation
  - `heapReusePoolId(size uintptr, typ *_type) int` returns the correct heap reuse pool to use for a specific heap allocation
  - `heapReusePoolCleanup()` performs cleanup of the reuse pools, similar to how `sync.Pool` cleanup works
    - objects in pools must be garbage collected if unused for a bounded number of GC cycles
    - pools must consume no memory (beyond the one used by the global and per-P roots) if unused for a bounded number of GC cycles
- Add per-P pools, that can be accessed only by the P that owns them
  - The per-P pools are accessed via the `p` struct
- Add global pools, that can be accessed by all Ps
  - The global pools are accessed via a global variable
  - The global pools must be safe for concurrent use
  - The global pools can be best effort (e.g. be considered empty/full) in case of contention, to avoid excessive increases in heap allocation tail latencies
- Add a call to `heapReusePoolId` and `heapReuseGet` in `mallocgc`; if the latter returns a valid object, skip the allocation
  - if `GOEXPERIMENT=heapreuse` is not active, both calls are skipped/no-ops
- Add a call to `heapReusePoolCleanup()` in `clearpools`
- In the compiler, start identifying points in the code being compiled where heap object lifetime can be statically proven to end (the kill location)
  - Create a list of the types for which at least one such point can be found
  - For each type in this list, assign a unique heap reuse pool id
  - At each kill location identified, emit a call to `heapReusePut` (using the pool id assigned in the previous point)
    - For the MVP, the kill locations considered could be limited to heap allocations that would already be eligible today for stack allocation if they did not run into the stack restrictions (the allocation must be at most 64KB-128KB in size, must be of constant size, must not require alignment over the size of a pointer, etc.)
    - if `GOEXPERIMENT=heapreuse` is not active, calls to `heapReusePut` are skipped/no-ops
  - Emit the list of types so that it can be used by `heapReusePoolId` at runtime
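The following is a rough, user-space model of the pool behavior described in the outline above; it is only meant to illustrate the intended best-effort semantics. Real code would live in the runtime, keep a per-P fast path reachable from the `p` struct, and use runtime-internal locking rather than `sync`; everything here besides the function names from the outline is illustrative:

```go
package heapreuse // illustrative model, not the actual runtime package

import (
	"sync"
	"sync/atomic"
	"unsafe"
)

// reusePool models the global (shared) part of one per-type pool.
type reusePool struct {
	n     atomic.Int64 // approximate item count, checked before locking
	mu    sync.Mutex
	items []unsafe.Pointer // append may itself allocate; see the alternative storage strategies discussed later
}

var pools []reusePool // one entry per heap-reuse-eligible type

// heapReusePut records a provably-dead object for later reuse; best effort.
func heapReusePut(heapReusePoolIdx int, ptr unsafe.Pointer) {
	p := &pools[heapReusePoolIdx]
	if !p.mu.TryLock() {
		return // contended: drop the object, the GC will reclaim it
	}
	p.items = append(p.items, ptr)
	p.n.Add(1)
	p.mu.Unlock()
}

// heapReuseGet returns a reusable object, or nil to fall back to mallocgc.
func heapReuseGet(heapReusePoolIdx int) unsafe.Pointer {
	p := &pools[heapReusePoolIdx]
	if p.n.Load() == 0 {
		return nil // obstruction-free fast path: pool looks empty, skip the lock
	}
	if !p.mu.TryLock() {
		return nil // contended: behave as if the pool were empty
	}
	defer p.mu.Unlock()
	if len(p.items) == 0 {
		return nil
	}
	ptr := p.items[len(p.items)-1]
	p.items = p.items[:len(p.items)-1]
	p.n.Add(-1)
	return ptr
}

// heapReusePoolCleanup drops pooled objects so they can be collected; the
// runtime would call it from clearpools at the start of a GC cycle (possibly
// emptying the pools only partially).
func heapReusePoolCleanup() {
	for i := range pools {
		p := &pools[i]
		p.mu.Lock()
		p.items = nil
		p.n.Store(0)
		p.mu.Unlock()
	}
}
```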
Once the approach has been validated, the `GOEXPERIMENT=heapreuse` gate can be removed and the mechanism enabled by default[^5].
This section contains notes about potential changes that are likely out of scope at least for the MVP. These are recorded here just as a way to show potential avenues for further improvements to this mechanism, or to other aspects of the compiler/runtime.
The MVP above only considers lifetimes that we already track today (heap allocations that are not stack allocated because they are too big). This can be improved over time to also handle heap allocations that e.g. have their ownership transferred to a callee, caller, goroutine, channel, etc.
Such improvements would likely build on top of improvements to other parts of the compiler, e.g. escape analysis, inlining, etc.
If lifetime can be statically tracked, the compiler should do so. But in many cases lifetime may not be statically trackable in practice, e.g. because it is path-dependent, or because it would require whole-program analysis.
In these cases, when the compiler cannot statically prove the lifetime of an object, it could be feasible to instead track the lifetime of an object with metadata associated at runtime with the object itself. This metadata could be propagated as hidden arguments/return values in function calls or goroutine invocations, and as a hidden per-object header when the object is sent via a channel.
```go
func (f *foo) bar(a1 *foo, a2 any) (r1 *foo)
// internally becomes
// func (f *foo) bar(amd foo_bar_arg_md, a1 *foo, a2 any) (r1 *foo, rmd foo_bar_ret_md)
// where foo_bar_arg_md and foo_bar_ret_md are packed bitfields for the metadata of
// arguments and return values, respectively.
// In this case foo_bar_arg_md would contain
//   bit 0 - f ownership
//   bit 1 - a1 ownership
//   bit 2 - a2 ownership
// and foo_bar_ret_md would contain
//   bit 0 - r1 ownership

ch := make(chan *foo, 2)
ch <- &foo{}
f := <-ch
// internally becomes
// ch := make(chan struct{ v *foo; md foo_ch_md }, 2)
// ch <- struct{ v *foo; md foo_ch_md }{&foo{}, foo_ch_md(1)}
// _t0 := <-ch
// f, fmd := _t0.v, _t0.md

a := &foo{}
go func() {
	_ = a
}()
// internally becomes
// a := &foo{}
// go func(md go1_md) {
// 	_ = a
// }(go1_md(1))
// where go1_md in this case would contain
//   bit 0 - a ownership
```
This metadata could be as simple as a single bit per heap object, where the bit signals that the reference being passed is - according to the compiler - the only live reference to that object, a.k.a. the object's ownership is being transferred to the receiver.
If a receiver sees that this "ownership bit" corresponding to an object is set, and the object does not further escape its scope, then `heapReusePut` can be safely invoked at the kill location of that object.
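A hypothetical illustration of the lowered form of such a receiver, in the same style as the examples above (`foo_consume_arg_md`, `use`, and `fooPoolID` are made up for the example, and the `heapReusePut` call would be injected by the compiler, not written by the user):

```go
type foo_consume_arg_md = uint8 // compiler-generated metadata bitfield (hypothetical)

// Lowered form of a method for which the compiler has proven that f does not
// escape. If the caller signalled that it passed the only live reference to f
// (bit 0 of amd set), f is fully dead at the end of the method and can be
// handed to the reuse pool.
func (f *foo) consume(amd foo_consume_arg_md) {
	f.use()
	if amd&1 != 0 { // ownership of f was transferred to this call
		heapReusePut(fooPoolID, unsafe.Pointer(f))
	}
}
```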
This metadata passing could be optimized if the compiler knows that some bits of an argument are always unused: that unused space could then store the metadata inline (e.g. with a mechanism similar to pointer tagging for pointers or interfaces, by stuffing the metadata bits into padding space in structs, etc.) instead of relying on an additional argument.
This could be extended further if the compiler can prove that e.g. all callers of a specific callee always/never set the ownership bit for a certain object: in that case the bit can be omitted entirely for that callee, and the callee can assume that the bit is constant 1 or 0, respectively. Another possible extension would be detecting at compile time receivers that cannot make use of this information (e.g. cgo functions, assembler functions that are not ABI-compatible, functions where an object unconditionally escapes to the heap), and transitively avoiding propagating the corresponding metadata.
The frequent bit manipulation required to efficiently propagate the metadata may benefit significantly (in terms of lowering the CPU overhead imposed) from improved codegen (e.g. using PDEP/PEXT/PSHUFB on AMD64, BDEP/BEXT on ARM, PDEP/PEXT on PowerPC, and falling back to bit twiddling hacks, optimized bit permutations, or similar peephole optimizations for very simple cases or whenever dedicated instructions are not available or do not yield good performance). In some cases the compiler, instead of assigning bits sequentially, may be able to choose the optimal index of each metadata bit to allow the compiler backend to generate simpler/faster code.
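As a trivial illustration of the kind of bit shuffling involved, consider a function that forwards some of its arguments (the bit layout and the `other` callee are made up; with less regular layouts the shift/mask below becomes a general bit permutation, which is where PDEP/PEXT-style codegen would help):

```go
// forward receives ownership bits for (f, a1, a2) in bits 0..2 of amd and
// forwards a1 and a2 to other, which expects their ownership bits in
// bits 0..1 of its own metadata word.
func forward(amd uint8, f *foo, a1 *foo, a2 any) {
	omd := (amd >> 1) & 0b11 // bit 1 -> bit 0 (a1), bit 2 -> bit 1 (a2)
	other(omd, a1, a2)
}
```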
As maintaining reuse pools imposes a small but non-zero overhead, it likely makes sense to only enable heap reuse on code paths where it is proven that the majority of allocations occur. On cold paths heap reuse could be completely skipped.
Without PGO, the compiler will likely use some heuristics to exclude some kill locations, but will nevertheless be likely to inject `heapReusePut` calls also on code paths that are not actually hot.
PGO could inform which code paths are actually cold, and the compiler could use this information to ignore the kill locations on these code paths. This could in turn cause a smaller number of types to become eligible for heap reuse, and therefore a smaller number of pools to be created.
This also applies to dynamic lifetime tracking.
Some of the pools internal to the runtime (e.g. `g` stacks, `g`s, `defer`s) could be removed, relying instead on the more general mechanism proposed here. Due to the complex lifetimes involved, the runtime may need to manually push objects into the pools when appropriate (as it happens today).
Other packages in the standard library may drop some of their internal pooling if in practice the proposed mechanism is good enough to cover most use cases.
Instead of a simple slice-based pool, different storage strategies may be investigated:
- Chunked storage: instead of reallocating the slice and copying the old contents, use a linked list of chunks: when needed allocate a new chunk and append it to the linked list
- Linked list with pointer stuffing: for allocations that are big enough (at least the size of a `uintptr`), store the pointer to the next entry in the list as a `uintptr` stored in the first bytes of the allocation. This should be safe, if implemented properly, at the very least for allocations of objects that contain pointers. The benefit is that in this case the heap reuse machinery itself would require no auxiliary allocations at all (e.g. for the slices containing the pointers to reusable objects). A minimal sketch of this approach follows the list.
The proposed mechanism is best-effort (meaning: it does not offer guarantees about whether an object is actually reused) as that allows a greater degree of freedom in implementing optimizations. To minimize contention, the global pools could e.g.:
- Use obstruction-free fast paths to detect pathological cases (e.g. a global pool is empty/full)
- Use a lock-free design, optionally dropping objects if contention is detected
- Only perform bulk operations on the global pool, to minimize the number of times the global pool needs to be accessed
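To make the last point concrete, a rough sketch of how a per-P cache could spill a whole batch to the global pool, touching the shared lock at most once per batch (the types, the batch size, and the drop-on-contention policy are all illustrative; assumes `sync` and `unsafe` are imported):

```go
const batchSize = 16

type perPCache struct {
	items []unsafe.Pointer // owned by a single P, no locking needed
}

type globalPool struct {
	mu    sync.Mutex
	items []unsafe.Pointer
}

// spill moves a batch of reusable objects from the per-P cache to the global
// pool; under contention the batch is simply dropped so hot paths never block.
func (c *perPCache) spill(g *globalPool) {
	if len(c.items) < batchSize {
		return
	}
	n := len(c.items) - batchSize
	batch := c.items[n:]
	c.items = c.items[:n]
	if !g.mu.TryLock() {
		clear(batch) // drop the references so the GC can reclaim the objects
		return
	}
	g.items = append(g.items, batch...)
	g.mu.Unlock()
	clear(batch) // the objects are now owned by the global pool
}
```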
Reusing the same mechanism that identifies an object's kill location (but ignoring, for the purposes of that identification, the finalizer/cleanup itself), finalizers/cleanups could be executed immediately at that point, instead of waiting for the garbage collector to detect that the object is unreachable and then triggering the finalizer/cleanup. Assuming the object remains unreachable after the finalizer/cleanup has completed, the object could then be immediately placed in the reuse pool.
Whether this change is warranted needs to be verified, as finalizers/cleanup operations are normally not so common (and normally not in hot code paths).
The caller/callee negotiation mechanism discussed in the dynamic lifetime tracking section could be extended also for other uses/optimizations, e.g.:
- A callee could expose a bit that informs callers that a certain argument is unused (in which case the argument could be omitted entirely, or simply be unset)
- A callee could expose a bit that informs callers that a certain argument is not modified (in which case the argument could be safely passed as reference, instead of copying it)
- A callee could expose a bit that informs callers that a certain argument is not retained and does not escape (in which case the argument could potentially become eligible for stack allocation instead of heap allocation)
and so on. As discussed previously, such bits could either be static (either because the calls are direct, or because they are indirect but the whole caller or callee set has the same value for a specific bit) or dynamic (in which case different code paths could be taken at runtime depending on the specific dynamic caller/callee combination).
Note that this extension is largely outside of the scope of the current proposal; it's mentioned here just to show that the proposed dynamic lifetime tracking support mechanism could be useful also beyond the specific heap reuse proposal.
The goal is for these examples to all cause at most O(1) allocations (playground).
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"runtime"
)

func main() {
	// The following should all report 0 allocations.
	bench(100_000, func() {
		v := new(int)
		go func() {
			*v = 42
		}()
	})

	ch := make(chan struct{})
	bench(100_000, func() {
		go func() {
			ch <- struct{}{}
		}()
		<-ch
	})

	bench(100_000, func() {
		v := new(int)
		go func() {
			*v = 42
			ch <- struct{}{}
		}()
		<-ch
	})

	var x *int
	bench(100_000, func() {
		x = new(int)
		*x = 42
	})

	bench(100_000, func() {
		v := new(bytes.Buffer)
		v.WriteString("hello world")
	})

	bench(1_000, func() {
		for i := 0; i < 100; i++ {
			v := new(bytes.Buffer)
			v.WriteString("hello world")
		}
	})

	bench(100_000, func() {
		var v []string
		json.Unmarshal([]byte(`["hello", "world"]`), &v)
	})
}

func bench(n int, fn func()) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	mallocs, bytes := m.Mallocs, m.TotalAlloc
	for range n {
		fn()
	}
	runtime.ReadMemStats(&m)
	mallocs, bytes = m.Mallocs-mallocs, m.TotalAlloc-bytes
	fmt.Printf("Bytes: %8.1f/iter\nMallocs: %8.1f/iter\n\n", float64(bytes)/float64(n), float64(mallocs)/float64(n))
}
```
This proposal does not add APIs. While the `heapReusePut` function may look functionally similar to the C function `free`, it is not meant to be called directly: normally only the compiler will inject calls to `heapReusePut`, and only when it can prove that it is safe to do so. With careful review, the runtime itself may also call it where appropriate (e.g. in the functions where the runtime already manually implements pooling).
- Guides
  - https://goperf.dev/01-common-patterns/#memory-management-efficiency
  - https://gist.github.com/CAFxX/e96e8a5c3841d152f16d266a1fe7f8bd
  - https://dev.to/saleh_rahimzadeh/zero-allocation-in-go-golang-237k
  - https://go101.org/optimizations/0.3-memory-allocations.html
  - https://medium.com/@aditimishra_541/best-practices-in-go-to-improve-performance-and-reduce-pressure-on-gos-garbage-collector-df10904e0244
- Articles
  - https://victoriametrics.com/blog/tsdb-performance-techniques-sync-pool/
  - https://blog.cloudflare.com/go-dont-collect-my-garbage/
  - https://leapcell.medium.com/optimizing-go-performance-with-sync-pool-and-escape-analysis-79f7e3879847
  - https://chris124567.github.io/2021-06-21-go-performance/
  - https://hackernoon.com/optimizing-memory-usage-in-golang-when-is-a-variable-allocated-to-the-heap
Footnotes

[^1]: see `gfget`/`gfput` in https://github.com/golang/go/blob/master/src/runtime/proc.go
[^2]: see `stackalloc`/`stackfree` in https://github.com/golang/go/blob/master/src/runtime/stack.go
[^3]: see `newdefer`/`popDefer` in https://github.com/golang/go/blob/master/src/runtime/panic.go
[^4]: Users will want to continue to use manual pooling in at least two different scenarios. The first one is if they want the guarantee that objects are reused, e.g. because they have determined that for their use case relying on the normal allocation/GC cycle and a best-effort heap reuse optimization is not acceptable. The second one is if the state of the object should be preserved during the roundtrip through the pool.
[^5]: Once the mechanism is enabled by default, the inclusion of a `GODEBUG` directive to turn heap reuse off process-wide for debugging and/or of a dedicated `runtime/debug.NoHeapReuse[T](obj *T)` to force the compiler/runtime to not reuse a specific object can be considered, if necessary.
to force the compiler/runtime to not reuse a specific object can be considered, if necessary. ↩