This gist documents what is, to our knowledge, the first attempt to use TileLang for custom kernel forging on RDNA 4 hardware (specifically the AMD Radeon RX 9070 XT, gfx1201). While TileLang is a powerful "Blacksmith's Kit" for AMD Instinct (CDNA) hardware, our testing found critical architectural barriers when targeting consumer RDNA 4 cards.
We successfully compiled and executed a custom "Buffer Copy" smoke test kernel on the RX 9070 XT using TileLang's JIT backend and ROCm 7.2.
Key Finding: The core TileLang compiler and ROCm JIT pipeline are functional for standard memory operations and non-matrix compute on RDNA 4.
Attempts to compile optimized Matrix Multiplication (GEMM) or Flash Attention kernels failed consistently.
TileLang's GEMM engine hardcodes a 64-thread wavefront (Wave64), the width used by AMD Instinct (CDNA) hardware.
- CDNA Hardware: Operates with 64 threads per wavefront (Wave64).
- RDNA 4 Hardware: Operates with 32 threads per wavefront (Wave32).
During layout inference, the TileLang compiler's inverse() function fails because it cannot automatically normalize a 64-thread index map onto 32-thread hardware. This results in a RuntimeError: Could not parse mapping as sum of iterators.
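The mismatch can be illustrated with a toy model in plain Python. This is a sketch, not TileLang's layout code: the 16x16 tile, the `wave64_map` formula, and the `covers_tile` helper are all assumptions made up for illustration. It shows that a thread-to-element map that covers a tile exactly once with 64 lanes stops being one-to-one when only 32 lanes exist, which we believe is analogous to the non-invertible mapping that layout inference rejects.

```python
from itertools import product

TILE_M, TILE_N = 16, 16  # assumed tile shape, for illustration only


def wave64_map(lane, local):
    """Toy Wave64 layout: 64 lanes x 4 values per lane -> (row, col)."""
    return 4 * (lane // 16) + local, lane % 16


def covers_tile(index_map, warp_size, values_per_lane):
    """Return True if every tile element is produced exactly once."""
    hits = [
        index_map(lane, local)
        for lane, local in product(range(warp_size), range(values_per_lane))
    ]
    tile = set(product(range(TILE_M), range(TILE_N)))
    return len(hits) == len(tile) and set(hits) == tile


print(covers_tile(wave64_map, 64, 4))  # True: 64 lanes cover the tile once
print(covers_tile(wave64_map, 32, 4))  # False: only half the tile is reached
print(covers_tile(wave64_map, 32, 8))  # False: rows 4..7 are hit twice
```

Giving each of the 32 lanes twice as many values does not rescue the Wave64 formula; the map itself has to be redefined for the narrower wavefront.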
The TileLang C++ backend currently lacks the mai-insts (Matrix Acceleration Instructions) feature mapping for the gfx1201 target. Even with manual re-routing to the standard MMA (Matrix Multiply-Accumulate) paths, the compiler fails to resolve the necessary hardware intrinsics.
We attempted several surgical overrides to bridge the gap:
- Target Detection: Added `TargetIsRDNA` to the utility layer to correctly identify `gfx10/11/12`.
- Warp Scaling: Injected dynamic warp size detection into the `MatrixCoreIntrinEmitter` to override the hardcoded 64-thread constant.
- Implementation Re-routing: Manually re-routed GEMM logic from the CDNA-exclusive `kMFMA` path to the standard `kMMA` path for RDNA targets.
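The decision logic behind these three overrides can be sketched in a few lines of Python. This is a hypothetical analogue, not the actual patch: the function names `target_is_rdna`, `warp_size_for`, and `gemm_path_for` are invented here, and treating every non-RDNA target as Wave64 with the `kMFMA` path is a simplifying assumption.

```python
def target_is_rdna(arch: str) -> bool:
    """Hypothetical analogue of the TargetIsRDNA check (gfx10/11/12)."""
    # Drop feature suffixes such as ":sramecc+:xnack-" before matching.
    base = arch.split(":")[0].lower()
    return base.startswith(("gfx10", "gfx11", "gfx12"))


def warp_size_for(arch: str) -> int:
    """Pick the wavefront width instead of assuming a constant 64."""
    return 32 if target_is_rdna(arch) else 64


def gemm_path_for(arch: str) -> str:
    """Route RDNA targets away from the CDNA-only kMFMA path."""
    return "kMMA" if target_is_rdna(arch) else "kMFMA"


for arch in ("gfx1201", "gfx942", "gfx90a:sramecc+:xnack-"):
    print(arch, warp_size_for(arch), gemm_path_for(arch))
```

Selecting a path this way only changes which code is reached; it does not make the required intrinsics available on that path.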
Result: While these patches moved the failure point further down the stack, the final compilation failed due to unresolved tir.ptx_ldmatrix symbols (NVIDIA-specific) and the aforementioned mai-insts limitation.
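One way to catch this class of failure earlier is to scan a textual dump of the lowered kernel for NVIDIA-specific intrinsic names before handing it to the ROCm toolchain. The sketch below is a minimal example of that idea; `find_ptx_intrinsics` is a hypothetical helper, and the `sample` string is hand-written stand-in text, not real compiler output.

```python
import re

# Matches intrinsic names in the NVIDIA-specific "tir.ptx_" namespace.
PTX_INTRINSIC = re.compile(r"\btir\.ptx_\w+")


def find_ptx_intrinsics(lowered_ir: str) -> list[str]:
    """Return the distinct tir.ptx_* names found in an IR text dump."""
    return sorted(set(PTX_INTRINSIC.findall(lowered_ir)))


# Hand-written stand-in text for illustration only.
sample = (
    'a_frag = call("tir.ptx_ldmatrix", smem_a)\n'
    'b_frag = call("tir.ptx_ldmatrix", smem_b)\n'
    'c_frag = call("tir.ptx_mma", a_frag, b_frag)\n'
    "out[i] = c_frag[i]\n"
)

print(find_ptx_intrinsics(sample))  # ['tir.ptx_ldmatrix', 'tir.ptx_mma']
```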
TileLang remains an excellent tool for writing custom non-matrix kernels on RDNA 4 today. However, for native matrix acceleration (essential for LLMs and Diffusion models), the following are required:
- Manual Index Map Forging: Engineers must manually define Wave32-compatible thread-to-matrix mappings.
- Backend Support: Official RDNA 4 / GFX1201 feature mapping in the TileLang C++ backend.
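As a rough picture of what a hand-defined Wave32 mapping involves, the sketch below writes a forward map and its explicit inverse for a 16x16 tile and checks that they round-trip. The formulas are a toy layout assumed for illustration; they are not the real RDNA 4 register layout, and the names `forward` and `reverse` are made up here.

```python
TILE_M, TILE_N = 16, 16  # assumed tile shape
WARP_SIZE = 32           # Wave32
VALUES_PER_LANE = TILE_M * TILE_N // WARP_SIZE  # 8 values held by each lane


def forward(lane, local):
    """(lane, local) -> (row, col) for a toy Wave32 layout."""
    return 8 * (lane // 16) + local, lane % 16


def reverse(row, col):
    """(row, col) -> (lane, local), the hand-written inverse of forward."""
    return 16 * (row // 8) + col, row % 8


# Round-trip check in both directions: the map must be a bijection.
for lane in range(WARP_SIZE):
    for local in range(VALUES_PER_LANE):
        assert reverse(*forward(lane, local)) == (lane, local)

for row in range(TILE_M):
    for col in range(TILE_N):
        assert forward(*reverse(row, col)) == (row, col)

print(forward(31, 7))  # (15, 15): the last lane's last value is the tile corner
```

Supplying the inverse by hand sidesteps the need for the compiler to derive it automatically, at the cost of maintaining one such pair per tile shape and data type.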
Documented by Project Apollo | March 2026