Skip to content

Instantly share code, notes, and snippets.

@apollo-mg
Created March 9, 2026 16:05
Show Gist options
  • Select an option

  • Save apollo-mg/e1e4fff87121e7f96b228ae394e69051 to your computer and use it in GitHub Desktop.

Select an option

Save apollo-mg/e1e4fff87121e7f96b228ae394e69051 to your computer and use it in GitHub Desktop.
Technical post-mortem: TileLang kernel forging failures on RDNA 4 (GFX1201)

RDNA 4 (GFX1201) Technical Post-Mortem: TileLang Kernel Forging & The Wave32 Barrier

Executive Summary

This gist documents the first known attempt to use TileLang for custom kernel forging on RDNA 4 hardware (specifically the AMD Radeon RX 9070 XT, gfx1201). While TileLang is a powerful "Blacksmith's Kit" for AMD Instinct (CDNA) hardware, our research reveals critical architectural barriers when targeting consumer RDNA 4 cards.

🏁 The Success: General Purpose Compute

We successfully compiled and executed a custom "Buffer Copy" smoke test kernel on the RX 9070 XT using TileLang's JIT backend and ROCm 7.2.

Key Finding: The core TileLang compiler and ROCm JIT pipeline are functional for standard memory operations and non-matrix compute on RDNA 4.

🚧 The Matrix Wall: Wave64 vs. Wave32

Attempts to compile optimized Matrix Multiplication (GEMM) or Flash Attention kernels failed consistently.

1. The Wave Size Mismatch

TileLang's GEMM engine is mathematically hardcoded for Wave64 (AMD Instinct/CDNA).

  • CDNA Hardware: Operates with 64 threads per warp (Wavefront).
  • RDNA 4 Hardware: Operates with 32 threads per warp (Wave32).

During layout inference, the TileLang compiler's inverse() function fails because it cannot automatically normalize a 64-thread index map onto 32-thread hardware. This results in a RuntimeError: Could not parse mapping as sum of iterators.

2. Missing Backend Flags

The TileLang C++ backend currently lacks the mai-insts (Matrix Acceleration Instructions) feature mapping for the gfx1201 target. Even with manual re-routing to standard MMA (Matrix Multiplication-Accumulation) paths, the compiler fails to resolve the necessary hardware intrinsics.

πŸ”§ Attempted Patches (Alpha Build Findings)

We attempted several surgical overrides to bridge the gap:

  • Target Detection: Added TargetIsRDNA to the utility layer to correctly identify gfx10/11/12.
  • Warp Scaling: Injected dynamic warp size detection into the MatrixCoreIntrinEmitter to override the hardcoded 64-thread constant.
  • Implementation Re-routing: Manually re-routed GEMM logic from the CDNA-exclusive kMFMA path to the standard kMMA path for RDNA targets.

Result: While these patches moved the failure point further down the stack, the final compilation failed due to unresolved tir.ptx_ldmatrix symbols (NVIDIA-specific) and the aforementioned mai-insts limitation.

πŸ›°οΈ Recommendation for the Community

TileLang remains an excellent tool for writing custom non-matrix kernels on RDNA 4 today. However, for native matrix acceleration (essential for LLMs and Diffusion models), the following are required:

  1. Manual Index Map Forging: Engineers must manually define Wave32-compatible thread-to-matrix mappings.
  2. Backend Support: Official RDNA 4 / GFX1201 feature mapping in the TileLang C++ backend.

Documented by Project Apollo | March 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment