A practical, end-to-end guide: prerequisites → build (ROCm + Vulkan) → run → quantize → wire into the pi-mono coding agent.
Hardware context: AMD Ryzen AI Max+ 395 ("Strix Halo") with integrated Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs). 96 GB allocated as VRAM/GTT out of 128 GB unified memory. This is an APU with a unified memory architecture (UMA): CPU and GPU share the same physical RAM, which fundamentally changes how you should think about "VRAM" and `--no-mmap`.
gfx1151 needs a recent kernel. Omarchy is rolling Arch, so linux from the official repos should be fine (≥ 6.11 has solid Strix Halo support; ≥ 6.16 is better; the bleeding-edge 6.18+/6.19 fixes some VAE/CWSR crashes seen in image gen workloads but is not required for llama.cpp).
uname -r # check current kernel
sudo pacman -Syu # keep system fresh; rolling distros need this

If you ever see "checkerboard"/SDMA artifacts during long runs, the workaround is `export HSA_ENABLE_SDMA=0` before launching, but on recent kernels + recent ROCm you should not need it.
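The kernel thresholds above can be checked from a script rather than by eye. This is a minimal sketch, assuming only the major.minor part of the version matters; `kernel_at_least` is a hypothetical helper name, not something shipped by Arch or ROCm.

```shell
# kernel_at_least MIN [VERSION]
# Succeeds if VERSION (default: the running kernel) is >= MIN.
# Assumption: both strings look like MAJOR.MINOR[...]; only MAJOR.MINOR is compared.
kernel_at_least() (
    min=$1
    ver=${2:-$(uname -r)}
    # Split "6.17.4-arch1-1" into major=6 and minor=17 using parameter expansion.
    min_major=${min%%.*}; rest=${min#*.}; min_minor=${rest%%[!0-9]*}
    ver_major=${ver%%.*}; rest=${ver#*.}; ver_minor=${rest%%[!0-9]*}
    [ "$ver_major" -gt "$min_major" ] ||
        { [ "$ver_major" -eq "$min_major" ] && [ "$ver_minor" -ge "$min_minor" ]; }
)
```

Typical use would be `kernel_at_least 6.16 || echo "kernel older than 6.16"`, which reads the running kernel from `uname -r`.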
On Strix Halo, "VRAM" is really GTT (Graphics Translation Table) carved out of system RAM. Check what's actually visible to the GPU:
# After installing rocminfo (next section), or via sysfs now:
cat /sys/class/drm/card*/device/mem_info_vram_total 2>/dev/null
# More usefully, after ROCm is installed:
rocminfo | grep -A20 "gfx1151" | grep -E "Pool|Size"

If you allocated 96 GB in BIOS as a fixed UMA carveout, llama.cpp will report ~96 GiB as "ROCm device VRAM". If you used the dynamic GTT path, the kernel will let the GPU grow up to the GTT limit (often ~110–120 GB on a 128 GB machine).
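The sysfs counters are raw byte counts, which are hard to compare against "96 GB" at a glance. A small sketch that converts them to GiB follows; it assumes the amdgpu driver exposes `mem_info_vram_total` and `mem_info_gtt_total` under each card, and both function names are made up for this example.

```shell
# bytes_to_gib BYTES: print BYTES as GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

# report_amdgpu_mem: print the VRAM carveout and GTT limit for every card
# that exposes the amdgpu counters; cards without them are skipped silently.
report_amdgpu_mem() {
    for f in /sys/class/drm/card*/device/mem_info_vram_total \
             /sys/class/drm/card*/device/mem_info_gtt_total; do
        [ -r "$f" ] || continue
        printf '%s: %s GiB\n' "$f" "$(bytes_to_gib "$(cat "$f")")"
    done
}
```

Calling `report_amdgpu_mem` shows at a glance whether the machine is using a large fixed carveout (big VRAM total) or the dynamic path (small VRAM total, large GTT total).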
sudo pacman -S --needed git base-devel cmake ninja \
curl libcurl-gnutls openssl \
python python-pip python-virtualenv

Vulkan path (easiest, recommended starting point):
sudo pacman -S --needed vulkan-radeon vulkan-icd-loader vulkan-headers \
vulkan-tools spirv-headers spirv-tools shaderc glslang
vulkaninfo --summary # should list "AMD Radeon Graphics (RADV GFX1151)"

ROCm/HIP path: Arch's official `rocm-hip-sdk` package ships rocBLAS Tensile libraries that historically did not include gfx1151 kernels and did not include hipBLASLt for it. The reliable path on Arch is TheRock nightly, AMD's open build that ships native gfx1151 kernels.
sudo pacman -S rocwmma
sudo mkdir -p /opt/rocm
# Pick the latest gfx1151 nightly tarball from:
# https://github.com/ROCm/TheRock/releases
# Look for: therock-dist-linux-gfx1151-<version>.tar.gz
cd /tmp
wget https://github.com/ROCm/TheRock/releases/download/<TAG>/therock-dist-linux-gfx1151-<VERSION>.tar.gz
sudo tar -xzf therock-dist-linux-gfx1151-*.tar.gz -C /opt/rocm --strip-components=0

Add to ~/.zshrc / ~/.bashrc (omarchy uses bash by default, but check `echo $SHELL`):
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
# Strix Halo gotcha: PYTORCH_HIP_ALLOC_CONF=backend:malloc crashes things. Don't set it.

Verify:
rocminfo | grep -E "Name:|gfx"
# Expect: Name: gfx1151
# Marketing Name: AMD Radeon Graphics (or "Radeon 8060S Graphics")
rocm-smi

User permissions:
sudo usermod -aG render,video $USER
# log out / back in

git clone https://github.com/ggml-org/llama.cpp.git ~/src/llama.cpp
cd ~/src/llama.cpp

You'll build two separate build directories so you can A/B test the backends.
cmake -S . -B build-vulkan \
-G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_VULKAN=ON \
-DLLAMA_CURL=ON
cmake --build build-vulkan --config Release -j$(nproc)

Smoke test:
./build-vulkan/bin/llama-cli --list-devices
# Expect a Vulkan device line like:
# Vulkan0: AMD Radeon Graphics (RADV GFX1151) (uma:1, fp16:1, ...)

The flag set below is the known-good combination for gfx1151 (from the llama.cpp Strix Halo discussion thread). Each flag matters:

| Flag | Why |
|---|---|
| `GGML_HIP=ON` | Use ROCm/HIP backend |
| `AMDGPU_TARGETS=gfx1151` | Build kernels for your GPU. Don't rely on defaults. |
| `GGML_HIP_ROCWMMA_FATTN=ON` | rocWMMA flash attention: significant prompt-processing speedup |
| `GGML_HIP_NO_VMM=ON` | Critical on gfx1151. HIP's virtual memory manager is buggy on this GPU; without this flag you get unexplained model-load failures and stability issues. |

export ROCM_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
cmake -S . -B build-rocm \
-G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1151 \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGGML_HIP_NO_VMM=ON \
-DLLAMA_CURL=ON \
-DCMAKE_C_COMPILER=$ROCM_PATH/bin/amdclang \
-DCMAKE_CXX_COMPILER=$ROCM_PATH/bin/amdclang++
cmake --build build-rocm --config Release -j$(nproc)

Smoke test:
./build-rocm/bin/llama-cli --list-devices
# Expect:
# ROCm0: Radeon 8060S Graphics (gfx1151, ~96000 MiB)

If the build fails with errors about `hipblasDatatype_t` / `hipblasDiagType_t`, your HIP headers and llama.cpp are out of sync: pull the latest master of llama.cpp, then rebuild. This API was renamed, and old llama.cpp commits don't compile against newer HIP.
Easiest: llama-cli/llama-server can pull directly from Hugging Face with -hf. With 96 GB of VRAM you can comfortably run things up to ~70–120B at Q4-Q5, or 30B-class models at Q8 with huge context.
Good starter picks for a coding-agent workload:
hf download unsloth/Qwen3.6-35B-A3B-GGUF --include "*Q8_K_XL*" --local-dir ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF
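# ROCm (-m takes a .gguf file, not the download directory; for a split model pass the first shard)
./build-rocm/bin/llama-cli \
-m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF/<model-file>.gguf \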
-ngl 999 \
-c 32768 \
-fa 1 \
--temp 0.6 \
-cnv
# Vulkan (same flags work)
./build-vulkan/bin/llama-cli -m ... -ngl 999 -c 32768 -fa 1 -cnv

Key flags:

- `-ngl 999`: offload all layers to GPU (you have plenty of VRAM)
- `-c <N>`: context window in tokens. KV-cache scales linearly with this
- `-fa 1`: flash attention. With ROCm + rocWMMA this is a big win
- `-cnv`: conversation mode

`llama-server` is the OpenAI-compatible HTTP server. Coding agents talk to it.
./build-rocm/bin/llama-server \
-m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF/<model-file>.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 999 \
-c 65536 \
-fa 1 \
--parallel 1 \
--no-mmap \
--jinja

About these flags:

- `--no-mmap`: on a UMA system with abundant RAM, disabling mmap loads weights into the GTT pool faster and avoids page-fault stalls. If you're tight on memory, drop it and let the kernel page.
- `--parallel 1`: for a single coding agent, keep this at 1. Each parallel slot allocates its own KV cache, so `--parallel 4 -c 65536` means 4× 65k KV caches, which can exhaust VRAM fast.
- `--jinja`: use the model's embedded chat template properly, including tool-calling. Required for agents.
- `-c 65536`: Qwen3-Coder handles 256k natively, but KV-cache memory cost is real (`-ctk q8_0 -ctv q8_0` halves it if you need more headroom)
- The web UI is at `http://127.0.0.1:8080`, the OpenAI API is at `http://127.0.0.1:8080/v1`.

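To see why context size and cache type matter, the KV-cache footprint can be estimated as 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The sketch below does that arithmetic; `kv_cache_mib` is a hypothetical helper, and the shape used in the usage note (48 layers, 4 KV heads, head dimension 128) is an illustrative assumption rather than values read from any particular GGUF.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX [TYPE]
# Prints the estimated KV-cache size in MiB for one slot.
# TYPE is f16 (2 bytes per element, assumed default) or q8_0
# (blocks of 32 elements stored in 34 bytes: 32 int8 values plus an fp16 scale).
kv_cache_mib() (
    elems=$(( 2 * $1 * $2 * $3 * $4 ))   # factor 2: one K and one V tensor per layer
    case ${5:-f16} in
        f16)  bytes=$(( elems * 2 )) ;;
        q8_0) bytes=$(( elems / 32 * 34 )) ;;
        *)    echo "unknown cache type: $5" >&2; exit 1 ;;
    esac
    echo $(( bytes / 1048576 ))
)
```

With the assumed shape, `kv_cache_mib 48 4 128 65536` prints 6144 and `kv_cache_mib 48 4 128 65536 q8_0` prints 3264, which is the "roughly half" saving mentioned above. Doubling the context doubles both numbers.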
`llama-bench` is the right tool for your A/B test:
# ROCm
./build-rocm/bin/llama-bench \
-m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF/<model-file>.gguf \
-ngl 999 -fa 1 -p 512,2048 -n 128
# Vulkan
./build-vulkan/bin/llama-bench \
-m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF/<model-file>.gguf \
-ngl 999 -fa 1 -p 512,2048 -n 128

What to expect on gfx1151 from community results: ROCm with rocWMMA flash attention is generally ~30–50% faster on prompt processing (pp512) than Vulkan. Token generation (tg128) is closer between the two: Vulkan is sometimes within 5–10%, occasionally even matching ROCm depending on the model and llama.cpp build. Vulkan on gfx1151 has no flash-attention support yet, so `-fa 1` falls back to CPU on Vulkan for the attention path, which hurts long-context prompt processing specifically.
Concrete reference number from the upstream Strix Halo benchmarking thread on a recent build, Llama 2 7B Q4_0:
| Backend | pp512 (t/s) | tg128 (t/s) |
|---|---|---|
| ROCm + rocWMMA -fa 1 | ~1488 | ~50.4 |
| ROCm -fa 0 | ~1201 | ~46.0 |
Vulkan on the same chip lands roughly in the 600–900 pp range and 40–48 tg range depending on driver vintage. Bottom line: use ROCm for serious work, keep Vulkan as a fallback / for sanity checking driver issues.
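When comparing your own `llama-bench` runs, it helps to turn two tokens-per-second figures into a percentage. A minimal sketch, with `speedup_pct` as a hypothetical helper name:

```shell
# speedup_pct NEW BASE: print how much faster NEW is than BASE, in percent.
speedup_pct() {
    awk -v new="$1" -v base="$2" 'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}
```

Applied to the table above, `speedup_pct 1488 1201` gives 23.9 for prompt processing and `speedup_pct 50.4 46.0` gives 9.6 for token generation, so on that run flash attention helped prefill far more than decode.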
You only need this when you have a Hugging Face safetensors model (e.g. your own fine-tune) and no pre-made GGUF on the Hub. If a GGUF exists, just download it.
cd ~/src/llama.cpp
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

hf download Qwen/Qwen3-8B --local-dir ~/models/qwen3-8b-hf \
--include "*.safetensors" "*.json" "*.txt" "tokenizer*"

python convert_hf_to_gguf.py ~/models/qwen3-8b-hf \
--outtype bf16 \
--outfile ~/models/qwen3-8b-bf16.gguf

`--outtype` choices: f32, f16, bf16, q8_0, auto. For BF16-trained models (Qwen3, Llama 3.x) use bf16. Always quantize from a high-precision GGUF and never re-quantize an already-quantized GGUF: quality degrades sharply.
For anything below Q5, an importance matrix (imatrix) materially improves quality. Use a calibration text representative of your target domain (or download wiki.train.raw for a generic one):
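# --chunks caps how many chunks are processed (--chunk would instead skip that many chunks from the start)
./build-rocm/bin/llama-imatrix \
-m ~/models/qwen3-8b-bf16.gguf \
-f calibration.txt \
-o ~/models/qwen3-8b.imatrix \
--chunks 512 \
-ngl 999

./build-rocm/bin/llama-quantize \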
--imatrix ~/models/qwen3-8b.imatrix \
~/models/qwen3-8b-bf16.gguf \
~/models/qwen3-8b-Q4_K_M.gguf \
Q4_K_M $(nproc)

Picking a quantization (rough rule of thumb for your 96 GB):

| Quant | Bits/weight | Quality | When to use |
|---|---|---|---|
| `Q8_0` | 8.5 | Near-lossless | Small (≤8B) models or when quality-critical |
| `Q6_K` | 6.6 | Excellent | Default for "I have RAM, don't quantize hard" |
| `Q5_K_M` | 5.7 | Very good | Sweet spot for medium models |
| `Q4_K_M` | 4.8 | Good | Default for ≥30B; widely used |
| `Q3_K_M` | 3.9 | Acceptable | For 70B+ on tighter budgets |
| `IQ4_XS` / `IQ3_M` | ~4 / ~3.7 | Better than Q at same size, slower decode | Use when you really need to squeeze size |

With 96 GB of VRAM you almost never need to go below Q4_K_M. For Qwen3-Coder-30B-A3B you might as well use Q6_K or Q8_0 — it'll still run blazingly fast because of the MoE active-param count.
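The bits-per-weight column translates directly into file size: parameters × bits per weight ÷ 8 gives bytes. The sketch below does that conversion; `gguf_size_gib` is a hypothetical helper, and the result is only a rough estimate of the weights, ignoring metadata and the KV cache.

```shell
# gguf_size_gib PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the approximate weight size in GiB (weights only, no KV cache).
gguf_size_gib() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * 1e9 * bpw / 8 / 1073741824 }'
}
```

For example, `gguf_size_gib 30 4.8` prints 16.8 and `gguf_size_gib 70 4.8` prints 39.1, which shows how much of the 96 GB is left for context at Q4_K_M.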
./build-rocm/bin/llama-perplexity \
-m ~/models/qwen3-8b-Q4_K_M.gguf \
-f wiki.test.raw \
-ngl 999
# Compare ppl number against the bf16 baseline. Smaller delta = better quant.

pi-mono (the `pi` CLI) is badlogic/pi-mono, a terminal AI coding agent that talks to any OpenAI-compatible endpoint. Your llama-server is exactly that.
The agent ships as an npm package:
sudo pacman -S nodejs npm
npm install -g @mariozechner/pi-coding-agent
# binary is `pi`

Coding agents are sensitive to a few specifics. This is a tested config:
./build-rocm/bin/llama-server \
-m ~/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 999 \
-c 65536 \
-fa 1 \
--parallel 1 \
--no-mmap \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--alias qwen3-coder-30b

Notes specifically for agents:

- `--jinja` is required: without it, tool-call formatting breaks for most modern models. Some models (older Qwen variants in particular) ship a strict template that rejects messages where the system message isn't first; if you hit a 500 error, override with `--chat-template-file <path>` using the model maker's recommended template.
- `--alias` sets the model ID exposed by `/v1/models`. pi will list it under that name.
- `--cache-type-k/v q8_0` quantizes the KV cache. Halves its memory at negligible quality cost. Worth it for long-context coding sessions.

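Before pointing an agent at the server, it is worth confirming which model ID it actually advertises. This is a rough sketch that assumes the `/v1/models` response follows the OpenAI list format with an `"id"` field per model; `list_model_ids` is a hypothetical helper, and the grep/sed parsing is deliberately crude (a real JSON tool such as `jq` would be more robust).

```shell
# list_model_ids: read a /v1/models JSON response on stdin and print
# each "id" value on its own line.
list_model_ids() {
    grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' |
        sed 's/.*"\([^"]*\)"$/\1/'
}
```

With the server running, `curl -s http://127.0.0.1:8080/v1/models | list_model_ids` should print the value passed to `--alias`; that string is what the agent's model `id` has to match.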
Create or edit `~/.pi/agent/models.json`:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "qwen3-coder-30b",
          "name": "Qwen3-Coder 30B (local, ROCm)",
          "input": ["text"],
          "contextWindow": 65536,
          "maxTokens": 16384,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}

Why those compat flags: pi defaults to using OpenAI's `developer` message role and `reasoning_effort` parameter for reasoning-capable models. llama-server (and most non-OpenAI OpenAI-compat servers) don't understand them. Setting both to false makes pi send a normal system message and skip the reasoning-effort field.
cd ~/your/project
pi
# inside pi: /model → pick "qwen3-coder-30b"
# then just chat / ask it to make edits

pi supports skills (`/skill:<name>`), MCP servers, and subagents. For a starter pack:

git clone https://github.com/badlogic/pi-skills ~/.pi/agent/skills/pi-skills

- Smaller local models hallucinate APIs more than cloud frontier models. Pair with an MCP doc-fetch server (e.g. context7) so the model can pull current docs into context.
- Watch your KV cache. `-c 65536 --parallel 1` allocates one big cache; check `nvidia-smi`-style numbers via `rocm-smi` while pi runs to see real consumption.
- If the agent is getting stuck on edits, models tend to do better with hash-anchored / line-anchored edits than with diff-based ones. The `oh-my-pi` fork mentioned in the pi-mono ecosystem adds this behavior.

| Symptom | Likely cause | Fix |
|---|---|---|
| `llama-cli` crashes on model load with ROCm | HIP VMM bug on gfx1151 | Rebuild with `-DGGML_HIP_NO_VMM=ON` |
| ROCm build fails on `hipblasDatatype_t` | Old llama.cpp, new HIP headers | `git pull` llama.cpp master and rebuild |
| `rocminfo` doesn't list gfx1151 | Wrong ROCm package; user not in render/video groups | Use TheRock nightly, `usermod -aG render,video $USER`, relogin |
| Long-running jobs corrupt output ("checkerboard") | SDMA bug, older kernel | `export HSA_ENABLE_SDMA=0` or upgrade kernel to ≥6.18 |
| pi returns 500 on tool calls | Chat template doesn't accept pi's message order | Add `--jinja` to llama-server, or `--chat-template-file` with a fixed template |
| Vulkan very slow on long prompts | Vulkan FA falls back to CPU on gfx1151 | Switch this workload to ROCm; Vulkan FA support is improving but not there yet |
| OOM only on big prompts, not generation | `--ubatch-size` too high during prefill | Lower `--ubatch-size` (default 512 → try 256) |
| pi shows 0 models | `compat.supportsDeveloperRole`/`supportsReasoningEffort` mismatch | Set both to false in models.json |

Daily-driver coding agent (ROCm):
~/src/llama.cpp/build-rocm/bin/llama-server \
-m ~/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 999 -c 65536 -fa 1 --parallel 1 --no-mmap --jinja \
--cache-type-k q8_0 --cache-type-v q8_0 \
--alias qwen3-coder-30b

Big-model exploration (uses most of the 96 GB):
~/src/llama.cpp/build-rocm/bin/llama-cli \
-hf unsloth/Qwen3-235B-A22B-Instruct-GGUF:Q3_K_M \
-ngl 999 -c 16384 -fa 1 --no-mmap -cnv

Convert + quantize a HF model end-to-end:
hf download <user/repo> --local-dir ~/models/src
cd ~/src/llama.cpp && source .venv/bin/activate
python convert_hf_to_gguf.py ~/models/src --outtype bf16 --outfile ~/models/m-bf16.gguf
./build-rocm/bin/llama-imatrix -m ~/models/m-bf16.gguf -f calibration.txt -o ~/models/m.imatrix -ngl 999
./build-rocm/bin/llama-quantize --imatrix ~/models/m.imatrix \
~/models/m-bf16.gguf ~/models/m-Q4_K_M.gguf Q4_K_M $(nproc)

That's the full path: prerequisites → both backends built → models downloaded/converted/quantized → served → driving pi-mono. The biggest single thing to remember on Strix Halo is `-DGGML_HIP_NO_VMM=ON` on the ROCm build; without it you'll chase phantom problems for hours.