Solutions to problems with BERT training with tinygrad on AMD GPUs

Thank you to tiny corp for pointing out some problems running BERT training with Tinygrad on AMD GPUs in this Tweet. We had a few engineers at AMD take a look at the problem and they were quickly able to reproduce it.

What they found was an issue related to CWSR (compute wave save restore), which is a mechanism that allows our driver and firmware to preempt and reschedule long-running compute waves on our GPUs. The GFXv11 GPU line requires a workaround to set COMPUTE_PGM_RSRC1.PRIV=1 when dispatching a compute kernel. Normally this is handled by the AQL DISPATCH packet. However, since the Tinygrad implementation leverages a custom runtime, it requires this workaround in its PM4-based dispatch. This patch is specific to GFXv11 GPUs. Other GPUs do not require it and should not use this workaround. The following KFDTest patch can be used as a reference: https://github.com/ROCm/ROCT-Thunk-Interface/commit/507637ed5b82197eecbf483cdc1234939766549a
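As a rough illustration of the workaround above, a custom PM4 dispatch path could OR the PRIV bit into the COMPUTE_PGM_RSRC1 value it writes, and only on GFXv11. This is a minimal sketch, not the KFDTest or tinygrad code: the helper name and the `gfx_major` parameter are hypothetical, and the bit position (20) is an assumption that should be checked against the register headers for your ASIC.

```python
# Assumed position of the PRIV field within COMPUTE_PGM_RSRC1.
# Verify against the GFX11 register headers before relying on it.
PRIV_BIT = 20


def apply_gfx11_priv_workaround(rsrc1: int, gfx_major: int) -> int:
    """Return the COMPUTE_PGM_RSRC1 value to program for a PM4 dispatch.

    Hypothetical helper: sets PRIV=1 only for GFXv11 and leaves the
    value untouched for every other GPU generation.
    """
    if gfx_major == 11:
        return rsrc1 | (1 << PRIV_BIT)
    return rsrc1
```

Gating on the GFX major version keeps the change from leaking onto other GPUs, which should not use this workaround.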

While investigating this issue, we discovered another potential problem: Tinygrad is not setting the Control Stack size for user mode queues correctly (also related to CWSR). Applications written on top of HSA or HIP would not have this problem because AMD's ROCr runtime handles queue creation and related buffer allocations for them. You can find the calculations for the correct CWSR area and control stack size here: https://gitlab.freedesktop.org/agd5f/linux/-/blob/a1fc9f584c4aaf8bc1ebfa459fc57a3f26a290d8/drivers/gpu/drm/amd/amdkfd/kfd_queue.c#L417
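To give a feel for what a custom runtime has to compute, here is a hedged sketch of a page-aligned size calculation. It assumes the control stack scales with the number of waves and the workgroup context data scales with the number of CUs. Every parameter name and every number in the usage example is a placeholder, not a value taken from the kernel. The real constants, and any per-generation limits, are in the linked kfd_queue.c.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def cwsr_sizes(wave_count: int, ctl_stack_bytes_per_wave: int,
               header_bytes: int, cu_count: int,
               context_bytes_per_cu: int, page_size: int = 4096):
    """Return (control_stack_size, cwsr_area_size) in bytes.

    Illustrative only: the structure is an assumption and the inputs
    must come from the kernel source for the target GPU.
    """
    ctl_stack = align_up(header_bytes + wave_count * ctl_stack_bytes_per_wave,
                         page_size)
    wg_data = align_up(cu_count * context_bytes_per_cu, page_size)
    return ctl_stack, ctl_stack + wg_data


# Placeholder inputs, chosen only to exercise the arithmetic.
ctl_stack_size, cwsr_size = cwsr_sizes(
    wave_count=1024, ctl_stack_bytes_per_wave=12, header_bytes=64,
    cu_count=32, context_bytes_per_cu=0x10000)
```

A runtime that allocates its own queue buffers would pass sizes computed this way at queue creation instead of hard-coding them.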

The next ROCm release will include robustness improvements in KFD that will gracefully fail during user mode queue creation if the queue buffer pointers or sizes are missing or fail sanity checks. It will also prevent accidental unmapping or freeing of queue buffers while the queues exist. These patches have been in our public kernel code since July: https://lore.kernel.org/amd-gfx/[email protected]/T/

We're working on a full driver that will just have one system-wide kernel mode queue. I'm hoping that by not needing to support any graphics, multiuser stuff, or SVM memory, we can make something 100x simpler. (The end goal is to support multiuser at a higher abstraction level in tinygrad.) Hopefully this finally ends all instability, since everything at that point is statically scheduled. tinygrad/tinygrad#6923

tinygrad should never generate shaders with endless loops; the language isn't Turing complete and all programs must halt. So we don't need the ability to preempt; it's never used. If you kill the process, I'm fine with a full GPU reset. Though with the fixes you suggested (merged tinygrad/tinygrad@137ad55) we can test CWSR again.

Thanks for the pointer to the trap handler, the flow makes sense. When you say CP firmware, do you mean the MEC? I don't see any CP in `cat amdgpu_firmware_info` https://github.com/geohot/7900xtx

fxkamd/bert-tiny-amd.md

geohot commented Oct 9, 2024

fxkamd commented Oct 9, 2024

fxkamd commented Oct 17, 2024

FlorianHeigl commented Nov 11, 2024
