Local LLM on AMD — Step by Step

Hardware: 64GB RAM, Ryzen 9 7900X, Radeon RX 6700 XT (16GB VRAM) OS: Ubuntu 24.04 LTS Primary Models: Qwen 3.6 27B Q8_0 (agent) + GLM-4.7-Flash Q4 (coding specialist)

Step 1: Install Ubuntu 24.04 LTS

Download ISO from ubuntu.com, flash to USB with BalenaEtcher or dd
Boot, install normally — choose "Erase disk and install Ubuntu"
After install, update everything:

sudo apt update && sudo apt upgrade -y

Install build essentials:

sudo apt install -y git cmake build-essential curl wget python3-pip

Step 2: AMD GPU Drivers & Vulkan

Ubuntu 24.04 ships with Mesa RADV out of the box. Verify and configure:

# Verify your GPU is seen
lspci | grep -i vga

# Install Vulkan tools and latest Mesa
sudo apt install -y mesa-vulkan-drivers vulkan-tools

# Verify Vulkan sees your 6700 XT
vulkaninfo --summary | grep deviceName

You should see something like AMD RADV NAVI10. If you do, you're good.

Force RADV as the default (faster than AMDVLK for LLM inference):

echo 'export AMD_VULKAN_ICD=RADV' >> ~/.bashrc
source ~/.bashrc

Step 3: Build llama.cpp with Vulkan

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with Vulkan support
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify it sees your GPU
./build/bin/llama-cli --list-devices

You should see your RX 6700 XT listed with ~16GB.

Step 4: Download Models

Install huggingface-cli (supports resume on large downloads):

pip install huggingface_hub
mkdir -p ~/models

⭐ Qwen 3.6 27B Q8_0 (~28.6GB) — main agent model (near-lossless quality)

huggingface-cli download unsloth/Qwen3.6-27B-GGUF \
  --include "Qwen3.6-27B-Q8_0.gguf" \
  --local-dir ~/models

⭐ GLM-4.7-Flash UD-Q4_K_XL (~18GB) — coding specialist (fits fully on GPU)

huggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
  --include "GLM-4.7-Flash-UD-Q4_K_XL.gguf" \
  --local-dir ~/models

Optional: Qwen 2.5 Coder 32B Instruct Q5_K_M (~23GB) — code execution model

huggingface-cli download unsloth/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf" \
  --local-dir ~/models

Optional: Qwen 3.6 35B-A3B Q4_K_M (~22GB) — MoE speed pick

huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "Qwen3.6-35B-A3B-Q4_K_M.gguf" \
  --local-dir ~/models

Optional: Qwen3-Coder-Next Q4_K_M (~38-46GB) — purpose-built coding agent

huggingface-cli download bartowski/Qwen_Qwen3-Coder-Next-GGUF \
  --include "Qwen_Qwen3-Coder-Next-Q4_K_M.gguf" \
  --local-dir ~/models

Step 5: Run with Optimized Flags

Qwen 3.6 27B Q8_0 (main agent — quality over speed)

./build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q8_0.gguf \
  -ngl 20 \
  -c 131072 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Tuning -ngl (GPU layers):

Start at 20 (Q8_0 is larger than Q4, so fewer layers fit on GPU)
If it OOMs, drop by 2-3
If you have VRAM headroom, bump up
The goal is max GPU layers without spilling

GLM-4.7-Flash Q4 (coding specialist — fits fully on GPU)

./build/bin/llama-server \
  -m ~/models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

GLM-4.7-Flash is a 30B MoE (3.6B active per token). At ~18GB Q4 it fits fully on your 16GB VRAM with a few layers offloaded to CPU RAM. Extremely fast for code tasks.

Recommended sampling for GLM-4.7 (from Z.AI):

Temperature: 1.0
Top-p: 0.95
Min-p: 0.01

Optional: Qwen 2.5 Coder 32B Q5_K_M (code execution — partial offload)

./build/bin/llama-server \
  -m ~/models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
  -ngl 20 \
  -c 32768 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Optional: Qwen 3.6 35B-A3B (MoE — faster inference, slightly less capable)

./build/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 25 \
  -c 131072 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

Optional: Qwen3-Coder-Next (purpose-built coding agent — tight fit)

./build/bin/llama-server \
  -m ~/models/Qwen_Qwen3-Coder-Next-Q4_K_M.gguf \
  -ngl 15 \
  -c 65536 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -np 1 \
  --host 0.0.0.0 \
  --port 8080

This is a tight fit at ~38-46GB. Start with lower -ngl and shorter context. Only 3B active per token so inference is fast despite the large download.

Step 6: Connect Your Tools

The server exposes an OpenAI-compatible API at http://localhost:8080/v1. Point anything at it:

Tool	Config
OpenCode	`OPENAI_BASE_URL=http://localhost:8080/v1`
Aider	`aider --model openai/qwen3.6-27b --openai-api-base http://localhost:8080/v1`
Continue.dev	OpenAI-compatible provider → `localhost:8080/v1`, any API key
Cursor	OpenAI-compatible provider → `localhost:8080/v1`, any API key
Anything else	Use the OpenAI-compatible provider pointing at `localhost:8080/v1`

Step 7 (Optional): LM Studio for Lighter Models

If you want a GUI for smaller models:

Download from lmstudio.ai (Linux beta available)
Settings → Backend → Vulkan
Search & download models from the UI
Chat directly in the app

Don't use LM Studio for the 27B Q8_0 or Coder-Next — you need the fine-grained -ngl and KV cache tuning that only llama-server gives you.

Quick Reference

What	Command
Check GPU	`vulkaninfo --summary \| grep deviceName`
Check VRAM usage	`watch -n 1 cat /sys/class/drm/card*/device/mem_info_vram_used`
Kill server	`Ctrl+C` or `kill $(lsof -t -i:8080)`
API endpoint	`http://localhost:8080/v1`
Test API	`curl http://localhost:8080/v1/models`
Rebuild llama.cpp	`cd llama.cpp && git pull && cmake --build build --config Release -j$(nproc)`

Why These Choices

Decision	Reason
Ubuntu 24.04	ROCm is Linux-only; RADV Vulkan drivers optimized on Linux; most AMD LLM docs target Ubuntu
Vulkan over ROCm	ROCm doesn't officially support consumer RDNA2 cards; Vulkan is faster for token generation on AMD anyway
RADV over AMDVLK	Consistently faster for LLM inference in benchmarks
llama.cpp over Ollama	Full control over `-ngl`, KV cache type, flash attention; Ollama forces thinking ON for Qwen 3.x
q4_0 KV cache	The single biggest unlock — fits 131K context in your RAM vs FP16 KV cache which doesn't fit at all
Q8_0 for 27B	Near-lossless quality. Since speed isn't the priority, push quality. 77.2% SWE-bench at full fidelity
GLM-4.7-Flash Q4	94.2% HumanEval, MoE (3.6B active), fits on GPU, blazing fast for code. Best coding specialist that fits
Q4_K_M / UD-Q4_K_XL	Best quality/size ratio for quantized models

Model Comparison

Model	Size	SWE-bench V	HumanEval	FIM	Active/Token	Best For
Qwen 3.6 27B Q8_0	~28.6GB	77.2%	—	No	27B (dense)	Main agent, complex reasoning, plan execution
GLM-4.7-Flash Q4	~18GB	—	94.2%	No	3.6B (MoE)	Coding specialist, fast code generation
Qwen 2.5 Coder 32B Q5	~23GB	—	92.7%	Yes	32B (dense)	Code execution, autocomplete, FIM
Qwen 3.6 35B-A3B Q4	~22GB	73.4%	—	No	3B (MoE)	Fast chat, speed priority
Qwen3-Coder-Next Q4	~38-46GB	70.6%	—	No	3B (MoE)	Purpose-built coding agent, 256K context

What NOT to Run

Model	Why Not
GLM-5.1 (744B)	Needs 8x H100s. ~452GB at Q4. Datacenter only.
DeepSeek V4-Pro (1.6T)	Needs 16x H100s. ~432GB at Q4. Datacenter only.
DeepSeek V4-Flash (284B)	~80GB at Q4. Doesn't fit in 64GB RAM.
GLM-4.7 full (355B)	~216GB at Q4. Needs multi-GPU server. Use Flash instead.

Fyko/local-llm-amd-setup.md

Select an option

No results found