llama.cpp on Beelink GTR9 Pro (Ryzen AI Max+ 395 / Radeon 8060S, gfx1151) — Arch / Omarchy

A practical, end-to-end guide: prerequisites → build (ROCm + Vulkan) → run → quantize → wire into the pi-mono coding agent.

Hardware context: AMD Ryzen AI Max+ 395 ("Strix Halo") with integrated Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs). 96 GB allocated as VRAM/GTT out of 128 GB unified memory. This is an APU with a unified memory architecture (UMA) — CPU and GPU share the same physical RAM, which fundamentally changes how you should think about "VRAM" and --no-mmap.

1. Prerequisites

1.1 Kernel and amdgpu driver

gfx1151 needs a recent kernel. Omarchy is rolling Arch, so linux from the official repos should be fine (≥ 6.11 has solid Strix Halo support; ≥ 6.16 is better; the bleeding-edge 6.18+/6.19 fixes some VAE/CWSR crashes seen in image gen workloads but is not required for llama.cpp).

uname -r                      # check current kernel
sudo pacman -Syu              # keep system fresh; rolling distros need this

If you ever see "checkerboard"/SDMA artifacts during long runs, the workaround is export HSA_ENABLE_SDMA=0 before launching — but on recent kernels + recent ROCm you should not need it.

1.2 GTT / VRAM allocation (verify the 96 GB)

On Strix Halo, "VRAM" is really GTT (Graphics Translation Table) carved out of system RAM. Check what's actually visible to the GPU:

# After installing rocminfo (next section), or via sysfs now:
cat /sys/class/drm/card*/device/mem_info_vram_total 2>/dev/null
# More usefully, after ROCm is installed:
rocminfo | grep -A2 "gfx1151" -A20 | grep -E "Pool|Size"

If you allocated 96 GB in BIOS as a fixed UMA carveout, llama.cpp will report ~96 GiB as "ROCm device VRAM". If you used the dynamic GTT path, the kernel will let the GPU grow up to the GTT limit (often ~110–120 GB on a 128 GB machine).

1.3 Base build tools (both backends need these)

sudo pacman -S --needed git base-devel cmake ninja \
    curl libcurl-gnutls openssl \
    python python-pip python-virtualenv

1.4 Backend-specific prerequisites

Vulkan path (easiest, recommended starting point):

sudo pacman -S --needed vulkan-radeon vulkan-icd-loader vulkan-headers \
    vulkan-tools spirv-headers spirv-tools shaderc glslang
vulkaninfo --summary       # should list "AMD Radeon Graphics (RADV GFX1151)"

ROCm/HIP path: Arch's official rocm-hip-sdk package ships rocBLAS Tensile libraries that historically did not include gfx1151 kernels and did not include hipBLASLt for it. The reliable path on Arch is TheRock nightly — AMD's open build that ships native gfx1151 kernels.

sudo pacman -S rocwmma
sudo mkdir -p /opt/rocm
# Pick the latest gfx1151 nightly tarball from:
#   https://github.com/ROCm/TheRock/releases
# Look for: therock-dist-linux-gfx1151-<version>.tar.gz
cd /tmp
wget https://github.com/ROCm/TheRock/releases/download/<TAG>/therock-dist-linux-gfx1151-<VERSION>.tar.gz
sudo tar -xzf therock-dist-linux-gfx1151-*.tar.gz -C /opt/rocm --strip-components=0

Add to ~/.zshrc / ~/.bashrc (omarchy uses bash by default, but check echo $SHELL):

export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
# Strix Halo gotcha: PYTORCH_HIP_ALLOC_CONF=backend:malloc crashes things. Don't set it.

Verify:

rocminfo | grep -E "Name:|gfx"
# Expect:  Name: gfx1151
#          Marketing Name: AMD Radeon Graphics  (or "Radeon 8060S Graphics")
rocm-smi

User permissions:

sudo usermod -aG render,video $USER
# log out / back in

2. Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git ~/src/llama.cpp
cd ~/src/llama.cpp

You'll build two separate build directories so you can A/B test the backends.

2.1 Vulkan build

cmake -S . -B build-vulkan \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_VULKAN=ON \
    -DLLAMA_CURL=ON
cmake --build build-vulkan --config Release -j$(nproc)

Smoke test:

./build-vulkan/bin/llama-cli --list-devices
# Expect a Vulkan device line like:
# Vulkan0: AMD Radeon Graphics (RADV GFX1151) (uma:1, fp16:1, ...)

2.2 ROCm/HIP build

The flag set below is the known-good combination for gfx1151 (from the llama.cpp Strix Halo discussion thread). Each flag matters:

Flag	Why
`GGML_HIP=ON`	Use ROCm/HIP backend
`AMDGPU_TARGETS=gfx1151`	Build kernels for your GPU. Don't rely on defaults.
`GGML_HIP_ROCWMMA_FATTN=ON`	rocWMMA flash attention — significant prompt-processing speedup
`GGML_HIP_NO_VMM=ON`	Critical on gfx1151. HIP's virtual memory manager is buggy on this GPU; without this flag you get unexplained model-load failures and stability issues.

export ROCM_PATH=/opt/rocm
export PATH=$ROCM_PATH/bin:$PATH

cmake -S . -B build-rocm \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DGGML_HIP_NO_VMM=ON \
    -DLLAMA_CURL=ON \
    -DCMAKE_C_COMPILER=$ROCM_PATH/bin/amdclang \
    -DCMAKE_CXX_COMPILER=$ROCM_PATH/bin/amdclang++
cmake --build build-rocm --config Release -j$(nproc)

Smoke test:

./build-rocm/bin/llama-cli --list-devices
# Expect:
# ROCm0: Radeon 8060S Graphics (gfx1151, ~96000 MiB)

If the build fails with errors about hipblasDatatype_t / hipblasDiagType_t, your HIP headers and llama.cpp are out of sync — pull the latest master of llama.cpp, then re-build. This API was renamed and old llama.cpp commits don't compile against newer HIP.

3. Get a model and run it

3.1 Download a GGUF

Easiest: llama-cli/llama-server can pull directly from Hugging Face with -hf. With 96 GB of VRAM you can comfortably run things up to ~70–120B at Q4-Q5, or 30B-class models at Q8 with huge context.

Good starter picks for a coding-agent workload:

hf download unsloth/Qwen3.6-35B-A3B-GGUF --include "*Q8_K_XL*" --local-dir ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF

3.2 Run with `llama-cli` (interactive)

# ROCm
./build-rocm/bin/llama-cli \
    -m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF \
    -ngl 999 \
    -c 32768 \
    -fa 1 \
    --temp 0.6 \
    -cnv

# Vulkan (same flags work)
./build-vulkan/bin/llama-cli -m ... -ngl 999 -c 32768 -fa 1 -cnv

Key flags:

-ngl 999 — offload all layers to GPU (you have plenty of VRAM)
-c <N> — context window in tokens. KV-cache scales linearly with this
-fa 1 — flash attention. With ROCm + rocWMMA this is a big win
-cnv — conversation mode

3.3 Run with `llama-server` (the way you actually want to use it)

This is the OpenAI-compatible HTTP server. Coding agents talk to it.

./build-rocm/bin/llama-server \
    -m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF \
    --host 127.0.0.1 --port 8080 \
    -ngl 999 \
    -c 65536 \
    -fa 1 \
    --parallel 1 \
    --no-mmap \
    --jinja

About these flags:

--no-mmap — on a UMA system with abundant RAM, disabling mmap loads weights into the GTT pool faster and avoids page-fault stalls. If you're tight on memory, drop it and let the kernel page.
--parallel 1 — for a single coding agent, keep this at 1. Each parallel slot allocates its own KV cache, so --parallel 4 -c 65536 means 4× 65k KV caches, which can exhaust VRAM fast.
--jinja — use the model's embedded chat template properly, including tool-calling. Required for agents.
-c 65536 — Qwen3-Coder handles 256k natively, but KV-cache memory cost is real (-ctk q8_0 -ctv q8_0 halves it if you need more headroom)
The web UI is at http://127.0.0.1:8080, the OpenAI API is at http://127.0.0.1:8080/v1.

3.4 Benchmarking ROCm vs Vulkan

This is the right tool for your A/B test:

# ROCm
./build-rocm/bin/llama-bench \
    -m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF \
    -ngl 999 -fa 1 -p 512,2048 -n 128

# Vulkan
./build-vulkan/bin/llama-bench \
    -m ~/.cache/llama.cpp/unsloth_Qwen3.6-35B-A3B-GGUF \
    -ngl 999 -fa 1 -p 512,2048 -n 128

What to expect on gfx1151 from community results: ROCm with rocWMMA flash attention is generally ~30–50% faster on prompt processing (pp512) than Vulkan. Token generation (tg128) is closer between the two — Vulkan is sometimes within 5–10%, occasionally even matching ROCm depending on the model and llama.cpp build. Vulkan on gfx1151 has no flash-attention support yet, so -fa 1 falls back to CPU on Vulkan for the attention path, which hurts long-context prompt processing specifically.

Concrete reference number from the upstream Strix Halo benchmarking thread on a recent build, Llama 2 7B Q4_0:

Backend	pp512 (t/s)	tg128 (t/s)
ROCm + rocWMMA -fa 1	~1488	~50.4
ROCm -fa 0	~1201	~46.0

Vulkan on the same chip lands roughly in the 600–900 pp range and 40–48 tg range depending on driver vintage. Bottom line: use ROCm for serious work, keep Vulkan as a fallback / for sanity checking driver issues.

4. Quantizing a model

You only need this when you have a Hugging Face safetensors model (e.g. your own fine-tune) and no pre-made GGUF on the Hub. If a GGUF exists, just download it.

4.1 Setup

cd ~/src/llama.cpp
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

4.2 Download a HF model

hf download Qwen/Qwen3-8B --local-dir ~/models/qwen3-8b-hf \
    --include "*.safetensors" "*.json" "*.txt" "tokenizer*"

4.3 Convert HF → GGUF (high precision intermediate)

python convert_hf_to_gguf.py ~/models/qwen3-8b-hf \
    --outtype bf16 \
    --outfile ~/models/qwen3-8b-bf16.gguf

--outtype choices: f32, f16, bf16, q8_0, auto. For BF16-trained models (Qwen3, Llama 3.x) use bf16. Always quantize from a high-precision GGUF — never re-quantize an already-quantized GGUF. Quality degrades sharply.

4.4 (Highly recommended) Build an importance matrix

For anything below Q5, this materially improves quality. Use a calibration text representative of your target domain (or download wiki.train.raw for a generic one):

./build-rocm/bin/llama-imatrix \
    -m ~/models/qwen3-8b-bf16.gguf \
    -f calibration.txt \
    -o ~/models/qwen3-8b.imatrix \
    --chunk 512 \
    -ngl 999

4.5 Quantize

./build-rocm/bin/llama-quantize \
    --imatrix ~/models/qwen3-8b.imatrix \
    ~/models/qwen3-8b-bf16.gguf \
    ~/models/qwen3-8b-Q4_K_M.gguf \
    Q4_K_M $(nproc)

Picking a quantization (rough rule of thumb for your 96 GB):

Quant	Bits/weight	Quality	When to use
`Q8_0`	8.5	Near-lossless	Small (≤8B) models or when quality-critical
`Q6_K`	6.6	Excellent	Default for "I have RAM, don't quantize hard"
`Q5_K_M`	5.7	Very good	Sweet spot for medium models
`Q4_K_M`	4.8	Good	Default for ≥30B; widely used
`Q3_K_M`	3.9	Acceptable	For 70B+ on tighter budgets
`IQ4_XS / IQ3_M`	~4 / ~3.7	Better than Q at same size, slower decode	Use when you really need to squeeze size

With 96 GB of VRAM you almost never need to go below Q4_K_M. For Qwen3-Coder-30B-A3B you might as well use Q6_K or Q8_0 — it'll still run blazingly fast because of the MoE active-param count.

4.6 Verify the result

./build-rocm/bin/llama-perplexity \
    -m ~/models/qwen3-8b-Q4_K_M.gguf \
    -f wiki.test.raw \
    -ngl 999
# Compare ppl number against the bf16 baseline. Smaller delta = better quant.

5. Wire up the pi-mono coding agent

pi-mono (the pi CLI) is badlogic/pi-mono — a terminal AI coding agent that talks to any OpenAI-compatible endpoint. Your llama-server is exactly that.

5.1 Install pi

The agent ships as an npm package:

sudo pacman -S nodejs npm
npm install -g @badlogic/pi-mono
# binary is `pi`

5.2 Start `llama-server` with the right flags for an agent

Coding agents are sensitive to a few specifics. This is a tested config:

./build-rocm/bin/llama-server \
    -m ~/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
    --host 127.0.0.1 --port 8080 \
    -ngl 999 \
    -c 65536 \
    -fa 1 \
    --parallel 1 \
    --no-mmap \
    --jinja \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --alias qwen3-coder-30b

Notes specifically for agents:

--jinja is required — without it, tool-call formatting breaks for most modern models. Some models (older Qwen variants in particular) ship a strict template that rejects messages where the system message isn't first; if you hit a 500 error, override with --chat-template-file <path> using the model maker's recommended template.
--alias sets the model ID exposed by /v1/models. pi will list it under that name.
--cache-type-k/v q8_0 quantizes the KV cache. Halves its memory at negligible quality cost. Worth it for long-context coding sessions.

5.3 Configure pi to use it

Edit (create) ~/.pi/agent/models.json:

{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "compat": {
        "supportsDeveloperRole": false,
        "supportsReasoningEffort": false
      },
      "models": [
        {
          "id": "qwen3-coder-30b",
          "name": "Qwen3-Coder 30B (local, ROCm)",
          "input": ["text"],
          "contextWindow": 65536,
          "maxTokens": 16384,
          "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
        }
      ]
    }
  }
}

Why those compat flags: pi defaults to using OpenAI's developer message role and reasoning_effort parameter for reasoning-capable models. llama-server (and most non-OpenAI OpenAI-compat servers) don't understand them. Setting both to false makes pi send a normal system message and skip the reasoning-effort field.

5.4 Use it

cd ~/your/project
pi
# inside pi:  /model   → pick "qwen3-coder-30b"
#             then just chat / ask it to make edits

pi supports skills (/skill:<name>), MCP servers, and subagents. For a starter pack:

git clone https://github.com/badlogic/pi-skills ~/.pi/agent/skills/pi-skills

5.5 Tips for getting good results from a local coding agent

Smaller local models hallucinate APIs more than cloud frontier models. Pair with an MCP doc-fetch server (e.g. context7) so the model can pull current docs into context.
Watch your KV cache. -c 65536 --parallel 1 allocates one big cache; check nvidia-smi-style numbers via rocm-smi while pi runs to see real consumption.
If the agent is getting stuck on edits, models tend to do better with hash-anchored / line-anchored edits than with diff-based ones. The oh-my-pi fork mentioned in the pi-mono ecosystem adds this behavior.

6. Troubleshooting cheat sheet

Symptom	Likely cause	Fix
`llama-cli` crashes on model load with ROCm	HIP VMM bug on gfx1151	Rebuild with `-DGGML_HIP_NO_VMM=ON`
ROCm build fails on `hipblasDatatype_t`	Old llama.cpp, new HIP headers	`git pull` llama.cpp master and rebuild
`rocminfo` doesn't list gfx1151	Wrong ROCm package; user not in `render`/`video` groups	Use TheRock nightly, `usermod -aG render,video $USER`, relogin
Long-running jobs corrupt output ("checkerboard")	SDMA bug, older kernel	`export HSA_ENABLE_SDMA=0` or upgrade kernel to ≥6.18
pi returns 500 on tool calls	Chat template doesn't accept pi's message order	Add `--jinja` to `llama-server`, or `--chat-template-file` with a fixed template
Vulkan very slow on long prompts	Vulkan FA falls back to CPU on gfx1151	Switch this workload to ROCm; Vulkan FA support is improving but not there yet
OOM only on big prompts, not generation	`--ubatch-size` too high during prefill	Lower `--ubatch-size` (default 512 → try 256)
pi shows 0 models	`compat.supportsDeveloperRole`/`supportsReasoningEffort` mismatch	Set both to `false` in `models.json`

7. One-shot reference recipes

Daily-driver coding agent (ROCm):

~/src/llama.cpp/build-rocm/bin/llama-server \
  -m ~/models/qwen3-coder-30b/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 999 -c 65536 -fa 1 --parallel 1 --no-mmap --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --alias qwen3-coder-30b

Big-model exploration (uses most of the 96 GB):

~/src/llama.cpp/build-rocm/bin/llama-cli \
  -hf unsloth/Qwen3-235B-A22B-Instruct-GGUF:Q3_K_M \
  -ngl 999 -c 16384 -fa 1 --no-mmap -cnv

Convert + quantize a HF model end-to-end:

hf download <user/repo> --local-dir ~/models/src
cd ~/src/llama.cpp && source .venv/bin/activate
python convert_hf_to_gguf.py ~/models/src --outtype bf16 --outfile ~/models/m-bf16.gguf
./build-rocm/bin/llama-imatrix -m ~/models/m-bf16.gguf -f calibration.txt -o ~/models/m.imatrix -ngl 999
./build-rocm/bin/llama-quantize --imatrix ~/models/m.imatrix \
    ~/models/m-bf16.gguf ~/models/m-Q4_K_M.gguf Q4_K_M $(nproc)

That's the full path: prerequisites → both backends built → models downloaded/converted/quantized → served → driving pi-mono. The biggest single thing to remember on Strix Halo is -DGGML_HIP_NO_VMM=ON on the ROCm build — without it you'll chase phantom problems for hours.

nmeylan/llama.cpp-install-beelink-gtr9pro-arch.md

Select an option

No results found

Select an option

No results found

llama.cpp on Beelink GTR9 Pro (Ryzen AI Max+ 395 / Radeon 8060S, gfx1151) — Arch / Omarchy

1. Prerequisites

1.1 Kernel and amdgpu driver

1.2 GTT / VRAM allocation (verify the 96 GB)

1.3 Base build tools (both backends need these)

1.4 Backend-specific prerequisites

2. Build llama.cpp

2.1 Vulkan build

2.2 ROCm/HIP build

3. Get a model and run it

3.1 Download a GGUF

3.2 Run with `llama-cli` (interactive)

3.3 Run with `llama-server` (the way you actually want to use it)

3.4 Benchmarking ROCm vs Vulkan

4. Quantizing a model

4.1 Setup

4.2 Download a HF model

4.3 Convert HF → GGUF (high precision intermediate)

4.4 (Highly recommended) Build an importance matrix

4.5 Quantize

4.6 Verify the result

5. Wire up the pi-mono coding agent

5.1 Install pi

5.2 Start `llama-server` with the right flags for an agent

5.3 Configure pi to use it

5.4 Use it

5.5 Tips for getting good results from a local coding agent

6. Troubleshooting cheat sheet

7. One-shot reference recipes

nmeylan/llama.cpp-install-beelink-gtr9pro-arch.md

llama.cpp on Beelink GTR9 Pro (Ryzen AI Max+ 395 / Radeon 8060S, gfx1151) — Arch / Omarchy

1. Prerequisites

1.1 Kernel and amdgpu driver

1.2 GTT / VRAM allocation (verify the 96 GB)

1.3 Base build tools (both backends need these)

1.4 Backend-specific prerequisites

2. Build llama.cpp

2.1 Vulkan build

2.2 ROCm/HIP build

3. Get a model and run it

3.1 Download a GGUF

3.2 Run with llama-cli (interactive)

3.3 Run with llama-server (the way you actually want to use it)

3.4 Benchmarking ROCm vs Vulkan

4. Quantizing a model

4.1 Setup

4.2 Download a HF model

4.3 Convert HF → GGUF (high precision intermediate)

4.4 (Highly recommended) Build an importance matrix

4.5 Quantize

4.6 Verify the result

5. Wire up the pi-mono coding agent

5.1 Install pi

5.2 Start llama-server with the right flags for an agent

5.3 Configure pi to use it

5.4 Use it

5.5 Tips for getting good results from a local coding agent

6. Troubleshooting cheat sheet

7. One-shot reference recipes

3.2 Run with `llama-cli` (interactive)

3.3 Run with `llama-server` (the way you actually want to use it)

5.2 Start `llama-server` with the right flags for an agent