Hardware: 64GB RAM, Ryzen 9 7900X, Radeon RX 6700 XT (16GB VRAM) OS: Ubuntu 24.04 LTS Primary Models: Qwen 3.6 27B Q8_0 (agent) + GLM-4.7-Flash Q4 (coding specialist)
- Download ISO from ubuntu.com, flash to USB with BalenaEtcher or
dd - Boot, install normally — choose "Erase disk and install Ubuntu"
- After install, update everything:
sudo apt update && sudo apt upgrade -y- Install build essentials:
sudo apt install -y git cmake build-essential curl wget python3-pipUbuntu 24.04 ships with Mesa RADV out of the box. Verify and configure:
# Verify your GPU is seen
lspci | grep -i vga
# Install Vulkan tools and latest Mesa
sudo apt install -y mesa-vulkan-drivers vulkan-tools
# Verify Vulkan sees your 6700 XT
vulkaninfo --summary | grep deviceNameYou should see something like AMD RADV NAVI10. If you do, you're good.
Force RADV as the default (faster than AMDVLK for LLM inference):
echo 'export AMD_VULKAN_ICD=RADV' >> ~/.bashrc
source ~/.bashrcgit clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with Vulkan support
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Verify it sees your GPU
./build/bin/llama-cli --list-devicesYou should see your RX 6700 XT listed with ~16GB.
Install huggingface-cli (supports resume on large downloads):
pip install huggingface_hub
mkdir -p ~/modelshuggingface-cli download unsloth/Qwen3.6-27B-GGUF \
--include "Qwen3.6-27B-Q8_0.gguf" \
--local-dir ~/modelshuggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
--include "GLM-4.7-Flash-UD-Q4_K_XL.gguf" \
--local-dir ~/modelshuggingface-cli download unsloth/Qwen2.5-Coder-32B-Instruct-GGUF \
--include "Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf" \
--local-dir ~/modelshuggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
--include "Qwen3.6-35B-A3B-Q4_K_M.gguf" \
--local-dir ~/modelshuggingface-cli download bartowski/Qwen_Qwen3-Coder-Next-GGUF \
--include "Qwen_Qwen3-Coder-Next-Q4_K_M.gguf" \
--local-dir ~/models./build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q8_0.gguf \
-ngl 20 \
-c 131072 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
--host 0.0.0.0 \
--port 8080Tuning -ngl (GPU layers):
- Start at 20 (Q8_0 is larger than Q4, so fewer layers fit on GPU)
- If it OOMs, drop by 2-3
- If you have VRAM headroom, bump up
- The goal is max GPU layers without spilling
./build/bin/llama-server \
-m ~/models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
-ngl 99 \
-c 65536 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
--host 0.0.0.0 \
--port 8080GLM-4.7-Flash is a 30B MoE (3.6B active per token). At ~18GB Q4 it fits fully on your 16GB VRAM with a few layers offloaded to CPU RAM. Extremely fast for code tasks.
Recommended sampling for GLM-4.7 (from Z.AI):
- Temperature: 1.0
- Top-p: 0.95
- Min-p: 0.01
./build/bin/llama-server \
-m ~/models/Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf \
-ngl 20 \
-c 32768 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
--host 0.0.0.0 \
--port 8080./build/bin/llama-server \
-m ~/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
-ngl 25 \
-c 131072 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
--host 0.0.0.0 \
--port 8080./build/bin/llama-server \
-m ~/models/Qwen_Qwen3-Coder-Next-Q4_K_M.gguf \
-ngl 15 \
-c 65536 \
-fa on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
--host 0.0.0.0 \
--port 8080This is a tight fit at ~38-46GB. Start with lower -ngl and shorter context. Only 3B active per token so inference is fast despite the large download.
The server exposes an OpenAI-compatible API at http://localhost:8080/v1. Point anything at it:
| Tool | Config |
|---|---|
| OpenCode | OPENAI_BASE_URL=http://localhost:8080/v1 |
| Aider | aider --model openai/qwen3.6-27b --openai-api-base http://localhost:8080/v1 |
| Continue.dev | OpenAI-compatible provider → localhost:8080/v1, any API key |
| Cursor | OpenAI-compatible provider → localhost:8080/v1, any API key |
| Anything else | Use the OpenAI-compatible provider pointing at localhost:8080/v1 |
If you want a GUI for smaller models:
- Download from lmstudio.ai (Linux beta available)
- Settings → Backend → Vulkan
- Search & download models from the UI
- Chat directly in the app
Don't use LM Studio for the 27B Q8_0 or Coder-Next — you need the fine-grained -ngl and KV cache tuning that only llama-server gives you.
| What | Command |
|---|---|
| Check GPU | vulkaninfo --summary | grep deviceName |
| Check VRAM usage | watch -n 1 cat /sys/class/drm/card*/device/mem_info_vram_used |
| Kill server | Ctrl+C or kill $(lsof -t -i:8080) |
| API endpoint | http://localhost:8080/v1 |
| Test API | curl http://localhost:8080/v1/models |
| Rebuild llama.cpp | cd llama.cpp && git pull && cmake --build build --config Release -j$(nproc) |
| Decision | Reason |
|---|---|
| Ubuntu 24.04 | ROCm is Linux-only; RADV Vulkan drivers optimized on Linux; most AMD LLM docs target Ubuntu |
| Vulkan over ROCm | ROCm doesn't officially support consumer RDNA2 cards; Vulkan is faster for token generation on AMD anyway |
| RADV over AMDVLK | Consistently faster for LLM inference in benchmarks |
| llama.cpp over Ollama | Full control over -ngl, KV cache type, flash attention; Ollama forces thinking ON for Qwen 3.x |
| q4_0 KV cache | The single biggest unlock — fits 131K context in your RAM vs FP16 KV cache which doesn't fit at all |
| Q8_0 for 27B | Near-lossless quality. Since speed isn't the priority, push quality. 77.2% SWE-bench at full fidelity |
| GLM-4.7-Flash Q4 | 94.2% HumanEval, MoE (3.6B active), fits on GPU, blazing fast for code. Best coding specialist that fits |
| Q4_K_M / UD-Q4_K_XL | Best quality/size ratio for quantized models |
| Model | Size | SWE-bench V | HumanEval | FIM | Active/Token | Best For |
|---|---|---|---|---|---|---|
| Qwen 3.6 27B Q8_0 | ~28.6GB | 77.2% | — | No | 27B (dense) | Main agent, complex reasoning, plan execution |
| GLM-4.7-Flash Q4 | ~18GB | — | 94.2% | No | 3.6B (MoE) | Coding specialist, fast code generation |
| Qwen 2.5 Coder 32B Q5 | ~23GB | — | 92.7% | Yes | 32B (dense) | Code execution, autocomplete, FIM |
| Qwen 3.6 35B-A3B Q4 | ~22GB | 73.4% | — | No | 3B (MoE) | Fast chat, speed priority |
| Qwen3-Coder-Next Q4 | ~38-46GB | 70.6% | — | No | 3B (MoE) | Purpose-built coding agent, 256K context |
| Model | Why Not |
|---|---|
| GLM-5.1 (744B) | Needs 8x H100s. ~452GB at Q4. Datacenter only. |
| DeepSeek V4-Pro (1.6T) | Needs 16x H100s. ~432GB at Q4. Datacenter only. |
| DeepSeek V4-Flash (284B) | ~80GB at Q4. Doesn't fit in 64GB RAM. |
| GLM-4.7 full (355B) | ~216GB at Q4. Needs multi-GPU server. Use Flash instead. |