Add torch.cuda.cudart().cudaProfilerStart() and torch.cuda.cudart().cudaProfilerStop() calls where profiling should start and stop.
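A minimal sketch of where those calls might go (the capture window, step numbers, and training loop are assumptions, not from the source; the import guard lets the sketch run even without torch or a GPU):

```python
# Hypothetical training loop that gates nsys capture to steps 5..7.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # sketch still runs without torch installed
    HAVE_CUDA = False

START_STEP, STOP_STEP = 5, 8  # assumed capture window

def profiler_action(step: int) -> str:
    """Which cudart profiler call (if any) fires at this step."""
    if step == START_STEP:
        return "start"
    if step == STOP_STEP:
        return "stop"
    return "none"

for step in range(10):
    action = profiler_action(step)
    if HAVE_CUDA and action == "start":
        torch.cuda.cudart().cudaProfilerStart()
    elif HAVE_CUDA and action == "stop":
        torch.cuda.cudart().cudaProfilerStop()
    # ... one training step here ...
```

With --capture-range=cudaProfilerApi, nsys only records between these two calls, which keeps the report small.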
Launch the profiler with:

CUDA_VISIBLE_DEVICES=0,1,2,3 nsys profile \
  -w true \
  -t cuda,nvtx,osrt,cudnn,cublas \
  -s cpu \
  --capture-range=cudaProfilerApi \
  --capture-range-end=stop \
  --cudabacktrace=true \
  --gpu-metrics-devices=cuda-visible \
  --gpu-metrics-set=gh100 \
  -x true \
  -f true \
  -o flux-schnell \
  -e SOME_ENV_VAR=123 \
  python flux-schnell.py
Add NVTX ranges with torch.cuda.nvtx.range(f"step_{self.step_count}") (a context manager) to label regions of the model.
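A sketch of how such a range might be used (the helper names and loop are hypothetical; the guard makes it a no-op on a CPU-only machine, since the NVTX bindings need a CUDA build):

```python
import contextlib

try:
    import torch
    NVTX_OK = torch.cuda.is_available()
except ImportError:
    NVTX_OK = False

def step_label(step_count: int) -> str:
    """Label shown for this step in the nsys timeline."""
    return f"step_{step_count}"

def step_range(step_count: int):
    """NVTX range wrapping one step, or a no-op without CUDA."""
    if NVTX_OK:
        return torch.cuda.nvtx.range(step_label(step_count))
    return contextlib.nullcontext()

for step_count in range(3):
    with step_range(step_count):
        pass  # ... forward/backward for this step ...
```

Because the range is a context manager, it closes correctly even if the step raises, which keeps the NVTX push/pop calls balanced.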
Starting to like this one better: it's easier to associate profiling results with general regions of the model. It spits out an enormous JSON trace, but you can view it in VS Code with the TensorBoard plugin, so no need to download it from the server. If you do download it, view the trace with chrome://tracing/. Tutorial here
# Do nothing for five steps, run the profiler without recording metrics on the
# next step, then collect metrics for the three steps after that. With repeat=0
# the cycle repeats until the profiler is stopped; the trace handler writes
# profiling data to disk each time a cycle's active phase completes, i.e. after
# every (wait + warmup + active) steps. Writing the trace may take a long time,
# but overhead while the profiler is running is minimal.
import torch

prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=5, warmup=1, active=3, repeat=0),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/mymodel'),
    record_shapes=True,
    with_stack=True)
prof.start()
for i in range(10):
    y = model(x)
    # tell the profiler a step has passed
    prof.step()
prof.stop()
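To reason about when the trace handler fires, the schedule arithmetic can be mirrored in plain Python (a sketch; the phase function is an assumption modeling wait=5, warmup=1, active=3 with repeat=0, no torch needed):

```python
def phase(step: int, wait: int = 5, warmup: int = 1, active: int = 3) -> str:
    """Phase of torch.profiler.schedule at a 0-indexed step,
    assuming repeat=0 (the cycle repeats until the profiler stops)."""
    pos = step % (wait + warmup + active)
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"

# One cycle spans 5 + 1 + 3 = 9 steps: steps 0-4 wait, step 5 warms up,
# steps 6-8 record, and the trace for the first cycle is written once the
# active phase ends. The 10-iteration loop above is just long enough to
# complete one full cycle.
```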