Reduce misleading `SLEEP` blocks by correlating work that crosses BEAM process boundaries, especially `GenServer.call/3`, while preserving the existing per-process stack profiler.
- Keep current per-process stack traces for compatibility.
- Attribute blocked caller time to downstream process work when possible.
- Represent cross-process activity as a trace tree/span graph, not as a fake merged call stack.
- Keep unattributed off-CPU time visible as fallback wait time instead of overemphasized `SLEEP` flame graph blocks.
- Build a small spike using `:seq_trace` around `GenServer.call/3`.
- Confirm that a root process can propagate a trace token across request/reply message flow.
- Verify we can reliably identify caller, callee, send time, receive time, reply time, and resume time.
- Abort this path if `:seq_trace` is too brittle or too expensive in practice.
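The spike can be sketched roughly as below, assuming a minimal echo server (`SeqTraceSpike.Echo` is invented for the probe). The key question is whether the `:trace_1` label shows up on both the request and the reply events delivered to the system tracer:

```elixir
# Spike sketch: does a seq_trace token set in the caller survive the
# GenServer.call/3 request/reply round trip?
defmodule SeqTraceSpike.Echo do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, nil, opts)

  @impl true
  def init(nil), do: {:ok, nil}

  @impl true
  def handle_call({:echo, msg}, _from, state), do: {:reply, msg, state}
end

defmodule SeqTraceSpike do
  def run do
    {:ok, server} = SeqTraceSpike.Echo.start_link()

    # A plain process that accumulates seq_trace system-tracer events.
    tracer = spawn(fn -> collect([]) end)
    :seq_trace.set_system_tracer(tracer)

    # Tag the caller; the token rides along on every send/receive from here on.
    :seq_trace.set_token(:label, :trace_1)
    :seq_trace.set_token(:send, true)
    :seq_trace.set_token(:receive, true)
    :seq_trace.set_token(:timestamp, true)

    :pong = GenServer.call(server, {:echo, :pong})

    # Clear this process's token before collecting results.
    :seq_trace.set_token([])

    send(tracer, {:dump, self()})

    receive do
      # Each event looks like {:seq_trace, label, {:send | :receive, serial,
      # from, to, msg}, timestamp}; seeing :trace_1 on both the request and
      # the reply proves the token propagated across the call.
      {:events, events} -> events
    end
  end

  defp collect(acc) do
    receive do
      {:dump, from} -> send(from, {:events, Enum.reverse(acc)})
      event -> collect([event | acc])
    end
  end
end
```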
- Keep existing `:call`, `:return_to`, and `:running` tracing.
- Add the minimum message/process tracing needed for correlation, likely `:send`, `:receive`, and `:procs`.
- Capture enough metadata to infer edges such as:
  - caller sends request
  - callee receives request
  - callee executes work
  - callee replies
  - caller resumes
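Those edges could be inferred from raw trace messages along these lines. The `{:"$gen_call", from, request}` shape is `GenServer`'s internal call protocol and can differ across OTP versions, so the patterns here are illustrative, not a stable contract:

```elixir
# Sketch: map raw trace messages (from :erlang.trace(pid, true,
# [:send, :receive, :timestamp])) to cross-process edges.
defmodule FlameOn.Client.EdgeInference do
  # Caller sends a GenServer call request to the callee.
  def infer({:trace_ts, caller, :send, {:"$gen_call", _from, _req}, callee, ts}),
    do: {:request_sent, caller, callee, ts}

  # Callee receives the request; the from tuple carries the caller pid.
  def infer({:trace_ts, callee, :receive, {:"$gen_call", {caller, _tag}, _req}, ts}),
    do: {:request_received, caller, callee, ts}

  # GenServer replies as {tag, reply}; the tag is a ref (an alias ref on
  # recent OTP), which is how a reply pairs with its pending call.
  def infer({:trace_ts, callee, :send, {tag, _reply}, caller, ts}) when is_reference(tag),
    do: {:reply_sent, callee, caller, ts}

  # Caller receives the reply and resumes.
  def infer({:trace_ts, caller, :receive, {tag, _reply}, ts}) when is_reference(tag),
    do: {:caller_resumed, caller, ts}

  def infer(_other), do: :ignore
end
```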
- Keep `FlameOn.Client.TraceSession` focused on a single process timeline.
- Add a new `TraceGraphSession` GenServer to own one logical request trace.
- Have `TraceGraphSession` manage:
  - root trace metadata
  - per-process trace sessions
  - caller/callee relationships
  - wait edges and timing
  - graph finalization and shipping
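A possible skeleton for that session process follows; the module layout, state fields, and cast messages are assumptions for illustration, not existing FlameOn code:

```elixir
# Sketch: one TraceGraphSession owns one logical request trace and stitches
# per-process sessions together.
defmodule FlameOn.Client.TraceGraphSession do
  use GenServer

  defstruct trace_id: nil,
            root_pid: nil,
            # pid => per-process trace session state (opaque here)
            sessions: %{},
            # callee pid => caller pid
            callers: %{},
            # {caller, callee, wait_started_at, wait_ended_at | nil}
            wait_edges: []

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts) do
    {:ok, %__MODULE__{trace_id: opts[:trace_id], root_pid: opts[:root_pid]}}
  end

  # Record that `caller` is blocked (e.g. in GenServer.call/3) on `callee`.
  @impl true
  def handle_cast({:wait_started, caller, callee, ts}, state) do
    {:noreply, %{state | wait_edges: [{caller, callee, ts, nil} | state.wait_edges]}}
  end

  # Close the open wait edge when the caller resumes.
  def handle_cast({:wait_ended, caller, ts}, state) do
    edges =
      Enum.map(state.wait_edges, fn
        {^caller, callee, started, nil} -> {caller, callee, started, ts}
        edge -> edge
      end)

    {:noreply, %{state | wait_edges: edges}}
  end
end
```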
- Limit the first implementation to `GenServer.call/3`-style request/reply behavior.
- When the root process blocks on another process, attach the callee work to the same trace id.
- Model the caller's blocked interval as explicit wait time on a known callee rather than anonymous `SLEEP`.
- Defer arbitrary message-passing patterns until the request/reply case is stable.
- Introduce a graph/span representation alongside collapsed stacks.
- Each process span should include fields such as `trace_id`, `span_id`, `parent_span_id`, `pid`, `started_at`, `ended_at`, `self_us`, `wait_us`, `waiting_on_pid`, and `children`.
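As a sketch, those fields could live in a struct like the following (the module name is a placeholder):

```elixir
# One possible span struct for the graph representation.
defmodule FlameOn.Client.ProcessSpan do
  @enforce_keys [:trace_id, :span_id, :pid]
  defstruct [
    :trace_id,
    :span_id,
    :parent_span_id,
    :pid,
    # monotonic timestamps bounding this process's participation
    :started_at,
    :ended_at,
    # microseconds spent executing on-CPU in this process
    self_us: 0,
    # microseconds spent blocked waiting on other processes
    wait_us: 0,
    waiting_on_pid: nil,
    # child spans for downstream processes this one called into
    children: []
  ]
end
```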
- Preserve current collapsed stack output so existing shippers and consumers do not break.
- Finalize each per-process stack as today.
- Let the graph session stitch those process-local timelines together.
- If a caller wait interval is fully explained by a traced callee, mark that interval as attributed wait.
- If no callee can be correlated, retain it as unattributed off-CPU wait.
- Do not pretend cross-process work is one continuous stack.
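One way to encode the attribution rule above, as a sketch: timestamps are assumed to be microseconds, and "fully explained" is simplified to "a traced callee span sits inside the wait interval" (a real implementation would also compare durations before claiming full attribution):

```elixir
# Sketch: classify a caller wait interval as attributed or unattributed.
defmodule FlameOn.Client.WaitAttribution do
  def classify({wait_start, wait_end}, callee_spans) do
    explaining =
      Enum.find(callee_spans, fn %{started_at: s, ended_at: e} ->
        # The callee's execution sits inside the caller's blocked interval.
        s >= wait_start and e <= wait_end
      end)

    case explaining do
      nil -> {:unattributed_wait, wait_end - wait_start}
      span -> {:attributed_wait, span.span_id, wait_end - wait_start}
    end
  end
end
```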
- Render linked process spans in the UI or agent-facing output.
- Change default analysis behavior to de-emphasize `SLEEP`:
  - hide synthetic sleep frames when fully attributed to downstream work
  - keep unattributed wait time visible as `WAITING` or `OFF_CPU`
  - optionally expose a toggle to include raw sleep frames
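The default rendering rule might look like the following sketch; the labels come from the list above, while the `:hidden` convention and function shape are invented:

```elixir
# Sketch: decide what label (if any) a wait interval gets in the flame graph.
defmodule FlameOn.Client.WaitDisplay do
  def frame_label(wait, include_raw_sleep? \\ false)

  # Raw sleep frames stay available behind a toggle.
  def frame_label(_wait, true), do: "SLEEP"

  # Fully attributed waits disappear from the flame graph; that time shows
  # up under the callee's span instead.
  def frame_label({:attributed_wait, _span_id, _us}, false), do: :hidden

  # Unattributed waits remain visible, just under a more honest name.
  def frame_label({:unattributed_wait, _us}, false), do: "WAITING"
end
```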
- Add tests before implementation for:
  - existing single-process traces remaining unchanged
  - `GenServer.call/3` correlation between caller wait and callee execution
  - nested hops like `A -> B -> C`
  - unattributed wait remaining visible
  - crashes or exits in downstream processes
  - threshold and sampling behavior across graph traces
- Add focused integration tests under `test/flame_on/client/` for both per-process and graph-level behavior.
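An integration test for the request/reply case might look like this; note that `TraceGraphSession.start_trace/1` and `finish_trace/1` are the proposed API shape, not functions that exist today, and `SlowEcho` is a test-support server invented here:

```elixir
# Sketch of a correlation test for the GenServer.call/3 case.
defmodule FlameOn.Client.TraceGraphSessionTest do
  use ExUnit.Case, async: false

  defmodule SlowEcho do
    use GenServer
    def start_link(_), do: GenServer.start_link(__MODULE__, nil)
    def init(nil), do: {:ok, nil}

    def handle_call(:work, _from, state) do
      # Deliberate blocking work so the caller accumulates visible wait time.
      Process.sleep(50)
      {:reply, :done, state}
    end
  end

  test "caller wait is attributed to callee execution for GenServer.call/3" do
    {:ok, server} = SlowEcho.start_link([])

    {:ok, session} = FlameOn.Client.TraceGraphSession.start_trace(root_pid: self())
    :done = GenServer.call(server, :work)
    graph = FlameOn.Client.TraceGraphSession.finish_trace(session)

    # The ~50ms block should be a wait edge on a known callee, not SLEEP.
    [wait_edge] = graph.wait_edges
    assert wait_edge.callee == server
    assert wait_edge.attributed?
  end
end
```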
- Make cross-process tracing opt-in via config.
- Limit max descendant processes per root trace.
- Limit trace lifetime and fanout.
- Continue sampling before enabling graph capture.
- Measure mailbox growth, trace volume, and overhead under concurrent load.
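The opt-in flag and limits could surface as application config, for example (all key names here are suggestions, not existing FlameOn options):

```elixir
# config/config.exs fragment (hypothetical keys)
config :flame_on,
  # cross-process graph capture stays off unless explicitly enabled
  cross_process_tracing: false,
  # cap the number of descendant processes attached to one root trace
  max_descendants_per_trace: 25,
  # abandon traces that outlive this window
  max_trace_lifetime_ms: 5_000,
  # sample before graph capture to bound overhead under load
  graph_sample_rate: 0.01
```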
- Update `README.md` to describe the distinction between:
  - per-process stack tracing
  - cross-process causal tracing
- Document that process boundaries produce linked spans, not continued stack frames.
- Explain how unattributed wait time is represented.
- Prototype `:seq_trace` correlation around `GenServer.call/3`.
- Prove sender/receiver stitching in tests.
- Introduce `TraceGraphSession` and a per-process session registry.
- Export graph data without changing current collapsed stack shipping.
- Rework `SLEEP` handling once attribution is trustworthy.
`:seq_trace` is the biggest uncertainty. It may be subtle to operate correctly across real-world libraries, and it may add enough complexity or overhead that explicit app-level correlation becomes the better fallback.
If VM-level correlation is not viable:
- keep the current per-process profiler
- propagate trace ids explicitly across known boundaries
- start child process traces under the same logical trace
- ship a stitched span graph built from app-level correlation instead of pure VM trace events
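The fallback can be sketched as below, with the trace id traveling inside the request itself; `FlameOn.TraceContext` is a hypothetical helper for joining and leaving a logical trace, and `Worker` is an invented example server:

```elixir
# Sketch: app-level trace id propagation across a known GenServer boundary,
# used only if VM-level (:seq_trace) correlation proves unworkable.
defmodule Worker do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(nil), do: {:ok, nil}

  # The caller threads its trace id through the request explicitly.
  def do_work(server, trace_id, args) do
    GenServer.call(server, {:do_work, trace_id, args})
  end

  @impl true
  def handle_call({:do_work, trace_id, args}, _from, state) do
    # Join the caller's logical trace before doing any work, so this
    # process's samples are stitched under the same trace id.
    FlameOn.TraceContext.join(trace_id)
    result = perform(args)
    FlameOn.TraceContext.leave()
    {:reply, result, state}
  end

  defp perform(args), do: args
end
```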