A minimal recipe for writing your own
CUDA communication kernel as an out-of-tree XLA FFI custom call, giving it
symmetric NCCL buffers, and reaching peers with the NCCL device API
(ncclGetLsaPointer), all built against a released jaxlib, with no XLA
source checkout and no recompile.
This is the simplified sibling of the in-tree manual recipe. The in-tree one had
to be built inside /opt/xla because it used XLA's internal FFI collective