@bcardarella
Created March 18, 2026 15:31
Cross-Process Tracing Plan

Goal

Reduce misleading SLEEP blocks by correlating work that crosses BEAM process boundaries, especially GenServer.call/3, while preserving the existing per-process stack profiler.

Desired Outcome

  • Keep current per-process stack traces for compatibility.
  • Attribute blocked caller time to downstream process work when possible.
  • Represent cross-process activity as a trace tree/span graph, not as a fake merged call stack.
  • Keep unattributed off-CPU time visible as fallback wait time instead of overemphasized SLEEP flamegraph blocks.

Proposed Phases

1. Validate Correlation Approach

  • Build a small spike using :seq_trace around GenServer.call/3.
  • Confirm that a root process can propagate a trace token across request/reply message flow.
  • Verify we can reliably identify caller, callee, send time, receive time, reply time, and resume time.
  • Abort this path if :seq_trace is too brittle or too expensive in practice.
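A minimal spike could look like the sketch below. The `:seq_trace` calls are real OTP API; the module name and the `:worker` server are hypothetical stand-ins, and the tracer just prints events rather than building a graph.

```elixir
defmodule SeqTraceSpike do
  # Stamp the calling process with a seq_trace token, make one
  # GenServer.call, and watch the token ride the request and reply.
  def run do
    tracer = spawn(fn -> loop() end)
    :seq_trace.set_system_tracer(tracer)

    # The token propagates automatically on every send from this process.
    :seq_trace.set_token(:label, {:flame_on, make_ref()})
    :seq_trace.set_token(:send, true)
    :seq_trace.set_token(:receive, true)
    :seq_trace.set_token(:timestamp, true)

    result = GenServer.call(:worker, :ping)

    # Clear the local token so later sends are untraced.
    :seq_trace.set_token([])
    result
  end

  defp loop do
    receive do
      {:seq_trace, label, info, timestamp} ->
        # info is e.g. {:send, serial, from, to, message}
        IO.inspect({label, info, timestamp}, label: "seq_trace event")
        loop()
    end
  end
end
```

If the four expected events (caller send, callee receive, callee send of the reply, caller receive) arrive with the same label and usable timestamps, the correlation approach is viable.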

2. Expand Trace Inputs

  • Keep existing :call, :return_to, and :running tracing.
  • Add the minimum message/process tracing needed for correlation, likely :send, :receive, and :procs.
  • Capture enough metadata to infer edges such as:
    • caller sends request
    • callee receives request
    • callee executes work
    • callee replies
    • caller resumes
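Enabling the expanded flag set is a one-line change to the existing `:erlang.trace/3` call; a sketch, assuming `pid` and `tracer_pid` are already bound:

```elixir
# :call/:return_to/:running match the current profiler; :send, :receive,
# and :procs are the additions needed to infer the request/reply edges.
# :return_to only has effect in combination with :call.
flags = [:call, :return_to, :running, :send, :receive, :procs, :timestamp]
:erlang.trace(pid, true, [{:tracer, tracer_pid} | flags])
```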

3. Introduce A Graph-Level Session

  • Keep FlameOn.Client.TraceSession focused on a single process timeline.
  • Add a new TraceGraphSession GenServer to own one logical request trace.
  • Have TraceGraphSession manage:
    • root trace metadata
    • per-process trace sessions
    • caller/callee relationships
    • wait edges and timing
    • graph finalization and shipping
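A hypothetical skeleton for the proposed session, with state fields mirroring the responsibilities above (none of this exists yet; names are placeholders):

```elixir
defmodule FlameOn.Client.TraceGraphSession do
  use GenServer

  defstruct trace_id: nil,
            root_pid: nil,
            sessions: %{},     # pid => per-process trace session
            edges: [],         # caller/callee wait edges with timing
            finalized?: false

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    {:ok, %__MODULE__{trace_id: opts[:trace_id], root_pid: opts[:root_pid]}}
  end

  # Per-process sessions report correlation edges as they observe them.
  @impl true
  def handle_cast({:add_edge, edge}, state) do
    {:noreply, %{state | edges: [edge | state.edges]}}
  end
end
```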

4. Support GenServer Boundaries First

  • Limit the first implementation to GenServer.call/3-style request/reply behavior.
  • When the root process blocks on another process, attach the callee work to the same trace id.
  • Model the caller's blocked interval as explicit wait time on a known callee rather than anonymous SLEEP.
  • Defer arbitrary message-passing patterns until the request/reply case is stable.

5. Add A Cross-Process Data Model

  • Introduce a graph/span representation alongside collapsed stacks.
  • Each process span should include fields such as:
    • trace_id
    • span_id
    • parent_span_id
    • pid
    • started_at
    • ended_at
    • self_us
    • wait_us
    • waiting_on_pid
    • children
  • Preserve current collapsed stack output so existing shippers and consumers do not break.
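The span fields above could be captured as a struct along these lines (module name and defaults are illustrative):

```elixir
defmodule FlameOn.Client.Span do
  defstruct [
    :trace_id,
    :span_id,
    :parent_span_id,
    :pid,
    :started_at,        # monotonic timestamp
    :ended_at,
    self_us: 0,         # on-CPU time within this span, microseconds
    wait_us: 0,         # off-CPU time, attributed where possible
    waiting_on_pid: nil, # set when the wait is explained by a known callee
    children: []        # child spans, e.g. downstream GenServer work
  ]
end
```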

6. Finalization Rules

  • Finalize each per-process stack as today.
  • Let the graph session stitch those process-local timelines together.
  • If a caller wait interval is fully explained by a traced callee, mark that interval as attributed wait.
  • If no callee can be correlated, retain it as unattributed off-CPU wait.
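The attribution rule reduces to an interval check; a sketch, assuming spans carry `started_at`/`ended_at` on the same monotonic clock as the wait interval:

```elixir
defmodule Attribution do
  # A caller wait interval counts as attributed only when a traced callee
  # span covers it entirely; anything else stays unattributed off-CPU wait.
  def classify_wait({wait_start, wait_end} = interval, callee_spans) do
    case Enum.find(callee_spans, fn span ->
           span.started_at <= wait_start and span.ended_at >= wait_end
         end) do
      nil -> {:unattributed, interval}
      span -> {:attributed, interval, span.pid}
    end
  end
end
```

Partial overlaps are deliberately left unattributed here; splitting an interval across partially-covering callees can come later once the simple case is trustworthy.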

7. Rendering And Product Behavior

  • Do not pretend cross-process work is one continuous stack.
  • Render linked process spans in the UI or agent-facing output.
  • Change default analysis behavior to de-emphasize SLEEP:
    • hide synthetic sleep frames when fully attributed to downstream work
    • keep unattributed wait time visible as WAITING or OFF_CPU
    • optionally expose a toggle to include raw sleep frames

8. TDD Coverage

  • Add tests before implementation for:
    • existing single-process traces remaining unchanged
    • GenServer.call/3 correlation between caller wait and callee execution
    • nested hops like A -> B -> C
    • unattributed wait remaining visible
    • crashes or exits in downstream processes
    • threshold and sampling behavior across graph traces
  • Add focused integration tests under test/flame_on/client/ for both per-process and graph-level behavior.
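The core correlation test might read as follows. `SlowServer`, `start_graph_trace/1`, and `finalize_graph/1` are hypothetical fixtures and helpers, not existing FlameOn API:

```elixir
defmodule FlameOn.Client.TraceGraphSessionTest do
  use ExUnit.Case, async: false

  test "GenServer.call wait is attributed to callee work" do
    {:ok, callee} = GenServer.start_link(SlowServer, [])
    {:ok, graph} = start_graph_trace(self())

    GenServer.call(callee, {:work, 50})

    spans = finalize_graph(graph)
    caller_span = Enum.find(spans, &(&1.pid == self()))

    # The caller's blocked interval should point at the callee,
    # not show up as anonymous SLEEP.
    assert caller_span.waiting_on_pid == callee
    assert caller_span.wait_us > 0
  end
end
```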

9. Performance Guardrails

  • Make cross-process tracing opt-in via config.
  • Limit max descendant processes per root trace.
  • Limit trace lifetime and fanout.
  • Continue sampling before enabling graph capture.
  • Measure mailbox growth, trace volume, and overhead under concurrent load.
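A possible shape for the opt-in config, e.g. in `config/config.exs` (key names are illustrative, not an existing FlameOn API):

```elixir
config :flame_on,
  cross_process_tracing: [
    enabled: false,       # opt-in; off by default
    max_descendants: 25,  # cap traced processes per root trace
    max_trace_ms: 5_000,  # cap trace lifetime
    sample_rate: 0.01     # sample before enabling graph capture
  ]
```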

10. Documentation

  • Update README.md to describe the distinction between:
    • per-process stack tracing
    • cross-process causal tracing
  • Document that process boundaries produce linked spans, not continued stack frames.
  • Explain how unattributed wait time is represented.

Recommended Implementation Order

  1. Prototype :seq_trace correlation around GenServer.call/3.
  2. Prove sender/receiver stitching in tests.
  3. Introduce TraceGraphSession and per-process session registry.
  4. Export graph data without changing current collapsed stack shipping.
  5. Rework SLEEP handling once attribution is trustworthy.

Primary Risk

:seq_trace is the biggest uncertainty. It may be subtle to operate correctly across real-world libraries and may add enough complexity or overhead that explicit app-level correlation becomes a better fallback.

Fallback Strategy

If VM-level correlation is not viable:

  • keep the current per-process profiler
  • propagate trace ids explicitly across known boundaries
  • start child process traces under the same logical trace
  • ship a stitched span graph built from app-level correlation instead of pure VM trace events
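A sketch of what explicit propagation could look like at a known GenServer boundary; the envelope tuple, process-dictionary key, and module names are all hypothetical:

```elixir
defmodule TracedCall do
  # Caller side: thread the current trace id into the request tuple.
  def call(server, request, timeout \\ 5_000) do
    trace_id = Process.get(:flame_on_trace_id) || make_ref()
    GenServer.call(server, {:flame_on_traced, trace_id, request}, timeout)
  end
end

defmodule TracedServer do
  use GenServer

  def init(state), do: {:ok, state}

  # Callee side: unwrap the envelope, adopt the trace id, then handle the
  # real request. A real implementation would also start a child process
  # trace under the same logical trace here.
  def handle_call({:flame_on_traced, trace_id, request}, from, state) do
    Process.put(:flame_on_trace_id, trace_id)
    handle_call(request, from, state)
  end
end
```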