Skip to content

Instantly share code, notes, and snippets.

@Fyko
Last active May 2, 2026 01:36
Show Gist options
  • Select an option

  • Save Fyko/169a27c8b54be84a7135200ea4959f10 to your computer and use it in GitHub Desktop.

Select an option

Save Fyko/169a27c8b54be84a7135200ea4959f10 to your computer and use it in GitHub Desktop.
Local LLM on AMD: Step-by-step setup guide for Qwen 3.6 on RX 6700 XT

Local LLM on AMD — Step by Step

Hardware: 64GB RAM, Ryzen 9 7900X, Radeon RX 6700 XT (16GB VRAM) OS: Ubuntu 24.04 LTS Primary Models: Qwen 3.6 27B Q8_0 (agent) + GLM-4.7-Flash Q4 (coding specialist)


Step 1: Install Ubuntu 24.04 LTS

  1. Download ISO from ubuntu.com, flash to USB with BalenaEtcher or dd
  2. Boot, install normally — choose "Erase disk and install Ubuntu"
  3. After install, update everything:
sudo apt update && sudo apt upgrade -y
  1. Install build essentials:
sudo apt install -y git cmake build-essential curl wget python3-pip

Step 2: AMD GPU Drivers & Vulkan

Ubuntu 24.04 ships with Mesa RADV out of the box. Verify and configure:

# Verify your GPU is seen
lspci | grep -i vga

# Install Vulkan tools and latest Mesa
sudo apt install -y mesa-vulkan-drivers vulkan-tools

# Verify Vulkan sees your 6700 XT
vulkaninfo --summary | grep deviceName

You should see something like AMD RADV NAVI10. If you do, you're good.

Force RADV as the default (faster than AMDVLK for LLM inference):

echo 'export AMD_VULKAN_ICD=RADV' >> ~/.bashrc
source ~/.bashrc

Step 3: Build llama.cpp with Vulkan

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with Vulkan support
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify it sees your GPU
./build/bin/llama-cli --list-devices

You should see your RX 6700 XT listed with ~16GB.


Step 4: Download Models

Install huggingface-cli (supports resume on large downloads):

pip install huggingface_hub
mkdir -p ~/models

⭐ Qwen 3.6 27B Q8_0 (~28.6GB) — main agent model (near-lossless quality)

huggingface-cli download unsloth/Qwen3.6-27B-GGUF \
  --include "Qwen3.6-27B-Q8_0.gguf" \
  --local-dir ~/models

⭐ GLM-4.7-Flash UD-Q4_K_XL (~18GB) — coding specialist (fits fully on GPU)

huggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-UD-Q4_K_XL.gguf" \
  --local-dir ~/models

Optional: Qwen 2.5 Coder 32B Instruct Q5_K_M (~23GB) — code execution model

huggingface-cli download unsloth/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf" \
  --local-dir ~/models

Optional: Qwen 3.6 35B-A3B Q4_K_M (~22GB) — MoE speed pick

huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "Qwen3.6-35B-A3B-Q4_K_M.gguf" \
  --local-dir ~/models

Optional: Qwen3-Coder-Next Q4_K_M (~38-46GB) — purpose-built coding agent

huggingface-cli download bartowski/Qwen_Qwen3-Coder-Next-GGUF \
  --include "Qwen_Qwen3-Coder-Next-Q4_K_M.gguf" \
  --local-dir ~/models

Step 5: Run with Optimized Flags

Qwen 3.6 27B Q8_0 (main agent — quality over speed)

./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q8_0.gguf \
  -ngl 20 \
  -c 131072 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Tuning -ngl (GPU layers):

  • Start at 20 (Q8_0 is larger than Q4, so fewer layers fit on GPU)
  • If it OOMs, drop by 2-3
  • If you have VRAM headroom, bump up
  • The goal is max GPU layers without spilling

GLM-4.7-Flash Q4 (coding specialist — fits fully on GPU)

./build/bin/llama-server \
  -m ~/models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

GLM-4.7-Flash is a 30B MoE (3.6B active per token). At ~18GB Q4 it fits fully on your 16GB VRAM with a few layers offloaded to CPU RAM. Extremely fast for code tasks.

Recommended sampling for GLM-4.7 (from Z.AI):

  • Temperature: 1.0
  • Top-p: 0.95
  • Min-p: 0.01

Optional: Qwen 2.5 Coder 32B Q5_K_M (code execution — partial offload)

./build/bin/llama-server \
  -m ~/models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
  -ngl 20 \
  -c 32768 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Optional: Qwen 3.6 35B-A3B (MoE — faster inference, slightly less capable)

./build/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 25 \
  -c 131072 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Optional: Qwen3-Coder-Next (purpose-built coding agent — tight fit)

./build/bin/llama-server \
  -m ~/models/Qwen_Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 15 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

This is a tight fit at ~38-46GB. Start with lower -ngl and shorter context. Only 3B active per token so inference is fast despite the large download.


Step 6: Connect Your Tools

The server exposes an OpenAI-compatible API at http://localhost:8080/v1. Point anything at it:

Tool Config
OpenCode OPENAI_BASE_URL=http://localhost:8080/v1
Aider aider --model openai/qwen3.6-27b --openai-api-base http://localhost:8080/v1
Continue.dev OpenAI-compatible provider → localhost:8080/v1, any API key
Cursor OpenAI-compatible provider → localhost:8080/v1, any API key
Anything else Use the OpenAI-compatible provider pointing at localhost:8080/v1

Step 7 (Optional): LM Studio for Lighter Models

If you want a GUI for smaller models:

  1. Download from lmstudio.ai (Linux beta available)
  2. Settings → Backend → Vulkan
  3. Search & download models from the UI
  4. Chat directly in the app

Don't use LM Studio for the 27B Q8_0 or Coder-Next — you need the fine-grained -ngl and KV cache tuning that only llama-server gives you.


Quick Reference

What Command
Check GPU vulkaninfo --summary | grep deviceName
Check VRAM usage watch -n 1 cat /sys/class/drm/card*/device/mem_info_vram_used
Kill server Ctrl+C or kill $(lsof -t -i:8080)
API endpoint http://localhost:8080/v1
Test API curl http://localhost:8080/v1/models
Rebuild llama.cpp cd llama.cpp && git pull && cmake --build build --config Release -j$(nproc)

Why These Choices

Decision Reason
Ubuntu 24.04 ROCm is Linux-only; RADV Vulkan drivers optimized on Linux; most AMD LLM docs target Ubuntu
Vulkan over ROCm ROCm doesn't officially support consumer RDNA2 cards; Vulkan is faster for token generation on AMD anyway
RADV over AMDVLK Consistently faster for LLM inference in benchmarks
llama.cpp over Ollama Full control over -ngl, KV cache type, flash attention; Ollama forces thinking ON for Qwen 3.x
q4_0 KV cache The single biggest unlock — fits 131K context in your RAM vs FP16 KV cache which doesn't fit at all
Q8_0 for 27B Near-lossless quality. Since speed isn't the priority, push quality. 77.2% SWE-bench at full fidelity
GLM-4.7-Flash Q4 94.2% HumanEval, MoE (3.6B active), fits on GPU, blazing fast for code. Best coding specialist that fits
Q4_K_M / UD-Q4_K_XL Best quality/size ratio for quantized models

Model Comparison

Model Size SWE-bench V HumanEval FIM Active/Token Best For
Qwen 3.6 27B Q8_0 ~28.6GB 77.2% No 27B (dense) Main agent, complex reasoning, plan execution
GLM-4.7-Flash Q4 ~18GB 94.2% No 3.6B (MoE) Coding specialist, fast code generation
Qwen 2.5 Coder 32B Q5 ~23GB 92.7% Yes 32B (dense) Code execution, autocomplete, FIM
Qwen 3.6 35B-A3B Q4 ~22GB 73.4% No 3B (MoE) Fast chat, speed priority
Qwen3-Coder-Next Q4 ~38-46GB 70.6% No 3B (MoE) Purpose-built coding agent, 256K context

What NOT to Run

Model Why Not
GLM-5.1 (744B) Needs 8x H100s. ~452GB at Q4. Datacenter only.
DeepSeek V4-Pro (1.6T) Needs 16x H100s. ~432GB at Q4. Datacenter only.
DeepSeek V4-Flash (284B) ~80GB at Q4. Doesn't fit in 64GB RAM.
GLM-4.7 full (355B) ~216GB at Q4. Needs multi-GPU server. Use Flash instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment