@apollo-mg
Last active March 10, 2026 03:09
RDNA 4 (GFX1201) Poachers Reproduction Guide: Native High-Speed Vision

🛸 RDNA 4 (GFX1201) AI MASTER LIST

Last Updated: March 9, 2026 | Environment: ROCm 7.2 / Poachers Special Ed (PyTorch 2.9.1 / Triton 3.5.1)

🟒 1. THE "GREEN ZONE" (Verified Working Bare-Metal)

  • Flash Linear Attention (FLA): ASCENDED. Liberated from Docker; running bare-metal via Triton kernels.
  • 4-bit Resident Vision: CONFIRMED. Qwen 3.5 4B running in 4.7GB VRAM with ~27-40s prefill.
  • Dual-Core Residency: VERIFIED. Logic (DeepSeek-R1 14B @ 51 tok/s) and Vision (Qwen 3.5 4B) running simultaneously in 16GB VRAM.
  • Triton 3.5.1 + PyTorch 2.9.1: Stable native pairing for GFX1201.
  • Unsloth 4-bit Native: Works perfectly once vLLM/CUDA dependency checks are bypassed.

🟑 2. THE "YELLOW ZONE" (Functional Workarounds)

  • Prefill Latency: Currently 27-40s. Bottleneck identified in Triton prefill kernels; target is <5s.
  • FP8 Hardware Status: RESEARCHED. GFX1201 supports float8_e4m3fnuz natively, but Triton 3.5.1 lacks intrinsic legalization. 10x slowdown due to software emulation.
  • Frankenstein Build: Setup uses 24.04 container libraries on 22.04 host. OS migration to 24.04 planned.

🔴 3. THE "RED ZONE" (Confirmed Broken)

  • Native FP8 MatMul: PyTorch addmm and Triton kernels currently fail legalization/intrinsic mapping for GFX1201.
  • Native pip install causal-conv1d: Still blocked by hardcoded NVIDIA/NVCC checks.
  • vLLM Native Linking: ABI drift in PyTorch Nightly breaks binary extension loading (getCurrentHIPStream error).

πŸ—οΈ BUILD REPORT: "POACHERS SPECIAL ED"

Methodology: Sovereign Extraction & Infiltration

  1. Liberated optimized RDNA 4 wheels (Torch/Triton/Apex) from rocm/vllm-dev:rocm7.2_navi.
  2. Poached internal Triton kernels (fla, causal_conv1d) directly from container source.
  3. Engineered local shims to strip vllm and cuda dependencies.
  4. Nuclear Patch applied to Unsloth to ignore hardware gatekeeping.
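Step 3's shims can be approximated by registering stub modules before the poached kernels are imported, so that any `import vllm...` inside the liberated source resolves to a harmless placeholder instead of the broken binary package. A minimal sketch; the submodule and attribute names are hypothetical and depend on what the liberated fla source actually pulls from vllm:

```python
import sys
import types

# Register stub modules so 'import vllm.utils' resolves without the real
# (broken-linking) vllm package being installed. Attribute names below are
# placeholders for whatever the poached kernels actually import.
def install_stub(name):
    mod = types.ModuleType(name)
    sys.modules[name] = mod
    return mod

vllm = install_stub("vllm")
utils = install_stub("vllm.utils")
vllm.utils = utils
utils.direct_register_custom_op = lambda *a, **k: None  # no-op placeholder

import vllm.utils  # resolves to the stub, not a real install
print(vllm.utils.direct_register_custom_op("anything") is None)  # True
```

The stubs must be installed before the first import of the liberated kernel code, e.g. at the top of the launch script.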

Current Verdict: The RX 9070 XT is a fully functional, resident-capable AI workstation for Logic (14B) + Vision (4B) workflows.

πŸ΄β€β˜ οΈ RDNA 4 "POACHERS SPECIAL ED" REPRODUCTION GUIDE

Target Hardware: AMD Radeon RX 9070 XT (GFX1201)
Objective: Native, high-speed vision (Qwen 3.5-VL) without Docker or vLLM linkage errors.

πŸ› οΈ 1. PREREQUISITES

  • Host OS: Ubuntu 22.04 or 24.04
  • Python 3.12 installed on host.
  • ROCm 7.2 installed on host.
  • Docker installed (used only as a parts bin).

📦 2. THE HEIST (EXTRACTION)

AMD locks the best GFX1201 kernels inside specific containers. We will liberate them.

Pull the "Parts Bin" Image

```shell
docker pull rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0
```

Extract the Golden Wheels

Run a temporary container and copy out the pre-compiled RDNA 4 binaries:

```shell
docker run --name extractor -d rocm/vllm-dev:rocm7.2_navi_ubuntu24.04_py3.12_pytorch_2.9_vllm_0.14.0rc0 sleep infinity
mkdir -p ./liberated_wheels
docker cp extractor:/torch-2.9.1+rocm7.2.0.lw.git5bc97ba0-cp312-cp312-linux_x86_64.whl ./liberated_wheels/
docker cp extractor:/triton-3.5.1+rocm7.2.0.gita272dfa8-cp312-cp312-linux_x86_64.whl ./liberated_wheels/
docker cp extractor:/torchvision-0.24.0+rocm7.2.0.gitb919bd0c-cp312-cp312-linux_x86_64.whl ./liberated_wheels/
docker cp extractor:/torchaudio-2.9.0+rocm7.2.0.gite3c6ee2b-cp312-cp312-linux_x86_64.whl ./liberated_wheels/
docker cp extractor:/apex-1.9.0+rocm7.2.0.gite37ed124-cp312-cp312-linux_x86_64.whl ./liberated_wheels/
```
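Before installing, it is worth confirming that each liberated wheel actually targets this interpreter and platform. A small sketch that splits the standard wheel filename fields (`{name}-{version}-{python tag}-{abi tag}-{platform}.whl`), using two of the filenames above:

```python
def wheel_tags(filename):
    # Standard wheel naming: {name}-{version}-{pythontag}-{abitag}-{platform}.whl
    stem = filename[: -len(".whl")]
    name, version, py_tag, abi_tag, platform = stem.rsplit("-", 4)
    return {"name": name, "version": version, "python": py_tag,
            "abi": abi_tag, "platform": platform}

wheels = [
    "torch-2.9.1+rocm7.2.0.lw.git5bc97ba0-cp312-cp312-linux_x86_64.whl",
    "triton-3.5.1+rocm7.2.0.gita272dfa8-cp312-cp312-linux_x86_64.whl",
]
for w in wheels:
    tags = wheel_tags(w)
    # A mismatch here means the wheel will not install into the 3.12 venv.
    assert tags["python"] == "cp312" and tags["platform"] == "linux_x86_64", w
    print(tags["name"], tags["version"])
```

Any wheel that fails the cp312/linux_x86_64 check was copied from the wrong image variant.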

Extract the Kernel Source

Liberate the Triton-based Flash Linear Attention and Causal Conv1d kernels:

```shell
mkdir -p ./liberated_packages
docker cp extractor:/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/fla ./liberated_packages/
docker cp extractor:/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py ./liberated_packages/causal_conv1d_interface.py
docker rm -f extractor
```

πŸ—οΈ 3. BARE-METAL SETUP

Initialize Python 3.12 Environment

```shell
python3.12 -m venv venv_sovereign
source venv_sovereign/bin/activate
pip install ./liberated_wheels/*.whl
pip install unsloth unsloth_zoo bitsandbytes transformers==5.3.0 datasets==4.3.0
```
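After installing the wheels, a quick sanity check confirms that the liberated torch is a HIP build and that the RDNA 4 card is visible. A sketch that degrades gracefully when torch or a GPU is absent; `torch.version.hip` and `get_device_properties(0).gcnArchName` are the standard PyTorch/ROCm attributes for this:

```python
import importlib.util

# Post-install sanity check: is this torch a ROCm/HIP build, and does the
# visible GPU report the expected gfx1201 architecture?
def check_rocm_stack():
    if importlib.util.find_spec("torch") is None:
        return "torch not installed; skipping check"
    import torch
    if torch.version.hip is None:
        return "torch is not a ROCm/HIP build"
    if not torch.cuda.is_available():
        return "HIP build present, but no GPU visible"
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return f"HIP {torch.version.hip}, arch {arch}"

print(check_rocm_stack())
```

On a correctly assembled stack this should report the HIP version and an arch string containing gfx1201.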

Deploy and Patch Kernels

Copy the liberated source into your venv site-packages and apply the "Poachers Patch" to remove vLLM/CUDA dependencies.

The Causal Conv1d Shim: Create venv_sovereign/lib/python3.12/site-packages/causal_conv1d/__init__.py:

```python
from .causal_conv1d_interface import causal_conv1d_fn, causal_conv1d_update
```

The Nuclear Unsloth Patch: Edit unsloth/__init__.py and comment out any fix_vllm or patch_vllm calls to decouple from the broken binary extensions.
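Commenting those calls out by hand works, but it must be redone after every Unsloth upgrade. A small sketch that automates the same edit, demonstrated on a stand-in file rather than the real install (the fix_vllm/patch_vllm names are the ones mentioned above; treat the exact call sites as version-dependent):

```python
import re
import tempfile
import pathlib

# Comment out any line that invokes the vLLM patch hooks (fix_vllm /
# patch_vllm) in a Python source file. Call-site names vary by version.
PATTERN = re.compile(r"^(\s*)(.*\b(?:fix_vllm|patch_vllm)\b.*)$")

def neutralize_vllm_hooks(path):
    lines = path.read_text().splitlines()
    out = [PATTERN.sub(r"\1# \2  # poachers patch", line) for line in lines]
    path.write_text("\n".join(out) + "\n")

# Demo on a throwaway file standing in for unsloth/__init__.py:
demo = pathlib.Path(tempfile.mkdtemp()) / "__init__.py"
demo.write_text("import os\nfix_vllm()\nx = 1\n")
neutralize_vllm_hooks(demo)
print(demo.read_text())
```

Point `neutralize_vllm_hooks` at the venv's `unsloth/__init__.py` to apply the patch; re-run it after any reinstall.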

🚀 4. EXECUTION

Set the critical environment variables to unlock the Triton fast-path:

```shell
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
export HSA_OVERRIDE_GFX_VERSION=12.0.1
python your_vision_script.py
```
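A launcher can also verify the fast-path variables before importing anything heavy, so a missing export fails loudly instead of silently falling off the Triton path. A minimal pre-flight sketch using the two variables above:

```python
import os

# Pre-flight check: the Triton fast-path only engages with these set.
REQUIRED_ENV = {
    "FLASH_ATTENTION_TRITON_AMD_ENABLE": "TRUE",
    "HSA_OVERRIDE_GFX_VERSION": "12.0.1",
}

def preflight(env=os.environ):
    missing = {k: v for k, v in REQUIRED_ENV.items() if env.get(k) != v}
    if missing:
        raise RuntimeError(f"set these before launching: {missing}")
    return "fast-path environment OK"

os.environ.update(REQUIRED_ENV)  # or export them in the shell, as above
print(preflight())  # fast-path environment OK
```

Calling `preflight()` at the top of the vision script catches a forgotten export before model load time is wasted.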

Verdict: This setup provides ~4.7GB resident vision at 3.5x standard Torch speed on GFX1201 hardware.
