- `docker/model-runner:mtp` (image `7b6f81c6dc4b`) has the MTP-patched llama.cpp baked in (`FROM llama-rocm:full`). Retag it as `:latest`, because `docker model status`/`run` auto-pull and clobber `:latest`:

      docker tag docker/model-runner:mtp docker/model-runner:latest

- Pulled the GGUF into the runner volume (one-time, via the managed runner):

      docker model pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ2_XXS

  Stored at `/models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf` in the `docker-model-runner-models` volume.

- The managed `docker model run` SIGSEGVs on this model. Crashes traced via `dmesg` to a GPF in `libamdhip64.so` when llama.cpp enumerates both `ROCm0` (GPU) and `ROCm1` (CPU-as-ROCm). It also crashes during warmup with `n_parallel=4`.

- Bypass the `docker model` CLI entirely and run `/app/llama-server` directly from the patched image:

      docker rm -f docker-model-runner llama-mtp 2>/dev/null
      docker run -d --name llama-mtp \
        --device /dev/dri --device /dev/kfd \
        -e HIP_VISIBLE_DEVICES=0 -e ROCR_VISIBLE_DEVICES=0 \
        -v docker-model-runner-models:/models \
        -p 127.0.0.1:12434:12434 \
        --entrypoint /app/llama-server \
        docker/model-runner:mtp \
        -m /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf \
        --host 0.0.0.0 --port 12434 \
        -c 131072 \
        -np 1 \
        -ngl 999 \
        --device ROCm0 \
        -fa on \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --spec-type draft-mtp \
        --spec-draft-n-max 3 \
        --reasoning-budget 0 \
        --no-mmproj

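Key flags: `HIP_VISIBLE_DEVICES=0` plus `--device ROCm0` (avoids the CPU-as-ROCm GPF), `-np 1` (the default of 4 OOMs during slot init), `-ngl 999` (all layers on the GPU). `--no-warmup` (skips the warmup pass) and `--jinja` (enables the tool-call template) can be appended to the same command; neither appears in the invocation above.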
You can also add `"chat_template_kwargs": {"enable_thinking": false}` at the top level of the request object for fast mode (thinking disabled).
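
Before sending requests it can help to wait until the model has finished loading. The sketch below polls llama-server's `/health` endpoint on the port published above; the function name `wait_until_ready` and the timeout values are illustrative choices, not taken from these notes.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 120.0, poll_s: float = 1.0) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused (container still starting) or a non-200
            # answer such as 503 while the model is loading: retry.
            pass
        time.sleep(poll_s)
    return False
```

Calling `wait_until_ready("http://127.0.0.1:12434")` returns `True` once the server answers, or `False` if it never comes up within the timeout (for example after a crash during load).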

    curl -s http://127.0.0.1:12434/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "qwen3.6-mtp",
        "messages": [
          {"role": "user", "content": "Whats the weather in Portland, OR right now?"}
        ],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["celsius","fahrenheit"]}
              },
              "required": ["city"]
            }
          }
        }],
        "tool_choice": "auto",
        "max_tokens": 256
      }'

The response carries `choices[0].message.tool_calls[*]`, each with a `function.name` and JSON-encoded `function.arguments`. `reasoning_content` holds the chain-of-thought when thinking is enabled (use `/no_think` in the user message or `"chat_template_kwargs": {"enable_thinking": false}` to suppress it).
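
A minimal Python sketch of the same request shape and of reading the tool calls back out. The helper names `build_payload` and `extract_tool_calls` are made up for illustration, and the `sample` response is hand-written to match the fields described above; it is not captured server output.

```python
import json


def build_payload(user_text: str, tools: list, thinking: bool = False) -> dict:
    """Chat-completions body; chat_template_kwargs sits at the top level."""
    return {
        "model": "qwen3.6-mtp",
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "tool_choice": "auto",
        "max_tokens": 256,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from choices[0].message.tool_calls."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # function.arguments arrives as a JSON-encoded string, so decode it.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hypothetical response, hand-written for illustration only.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"Portland\"}"},
            }],
        }
    }]
}
print(extract_tool_calls(sample))
```

The payload from `build_payload` can be serialized with `json.dumps` and sent as the body of a POST to `/v1/chat/completions`, exactly like the `-d` argument of the curl command.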
opencode provider config pointing at the local server:

    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "llama.cpp": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "llama-server (local)",
          "options": {
            "baseURL": "http://127.0.0.1:12434/v1"
          },
          "models": {
            "qwen3.6-mtp": {
              "name": "Qwen3.6-35B-A3B-MTP-GGUF:UD-Q2_K_XL",
              "limit": {
                "context": 131072,
                "output": 65536
              }
            }
          }
        }
      }
    }

$ rocm-smi --showmeminfo vram -f -t -p -u | python -u ~/Downloads/parse_gpu_mem.py
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status
GPU VRAM Usage (GB)
===================================
GPU[0]: 13.96884 GB
GPU[1]: 0.01552 GB
0.26.816.397 I srv params_from_: Chat format: peg-native
0.26.816.528 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.971 (> 0.100 thold), f_keep = 0.660
0.26.816.554 I reasoning-budget: activated, budget=0 tokens
0.26.816.555 I reasoning-budget: budget=0, forcing immediately
0.26.816.555 I reasoning-budget: forced sequence complete, done
0.26.816.578 I slot launch_slot_: id 0 | task 8 | processing task, is_child = 0
0.26.816.586 W slot update_slots: id 0 | task 8 | n_past = 33, slot.prompt.tokens.size() = 50, seq_id = 0, pos_min = 49, n_swa = 0
0.26.816.586 W slot update_slots: id 0 | task 8 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
0.28.155.348 I slot print_timing: id 0 | task 8 | n_decoded = 100, tg = 83.36 t/s
0.31.175.069 I slot print_timing: id 0 | task 8 | n_decoded = 377, tg = 89.35 t/s
0.34.195.697 I slot print_timing: id 0 | task 8 | n_decoded = 668, tg = 92.27 t/s
0.37.196.084 I slot print_timing: id 0 | task 8 | n_decoded = 980, tg = 95.70 t/s
0.40.202.946 I slot print_timing: id 0 | task 8 | n_decoded = 1269, tg = 95.79 t/s
0.43.219.627 I slot print_timing: id 0 | task 8 | n_decoded = 1581, tg = 97.21 t/s
0.46.223.644 I slot print_timing: id 0 | task 8 | n_decoded = 1927, tg = 100.01 t/s
0.49.247.218 I slot print_timing: id 0 | task 8 | n_decoded = 2260, tg = 101.38 t/s
0.52.271.680 I slot print_timing: id 0 | task 8 | n_decoded = 2612, tg = 103.18 t/s
0.52.569.121 I slot print_timing: id 0 | task 8 |
prompt eval time = 139.10 ms / 34 tokens ( 4.09 ms per token, 244.43 tokens per second)
eval time = 25613.34 ms / 2644 tokens ( 9.69 ms per token, 103.23 tokens per second)
total time = 25752.43 ms / 2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted / 2787 generated)
0.52.569.131 I statistics draft-mtp: #calls(b,g,a) = 2 934 934, #gen drafts = 934, #acc drafts = 726, #gen tokens = 2802, #acc tokens = 1726, dur(b,g,a) = 0.002, 5476.322, 0.454 ms
0.52.569.142 I slot release: id 0 | task 8 | stop processing: n_tokens = 2680, truncated = 0
0.52.569.152 I srv update_slots: all slots are idle
prompt eval time = 193.53 ms / 156 tokens ( 1.24 ms per token, 806.06 tokens per second)
eval time = 6483.50 ms / 600 tokens ( 10.81 ms per token, 92.54 tokens per second)
total time = 6677.03 ms / 756 tokens
draft acceptance rate = 0.48968 ( 356 accepted / 727 generated)
10.04.504.857 I statistics draft-mtp: #calls(b,g,a) = 29 2268 2268, #gen drafts = 2268, #acc drafts = 1729, #gen tokens = 6798, #acc tokens = 4088, dur(b,g,a) = 0.025, 13521.682, 0.750 ms
10.04.504.879 I slot release: id 0 | task 2106 | stop processing: n_tokens = 755, truncated = 0
10.04.504.888 I srv update_slots: all slots are idle
11.08.751.030 I srv params_from_: Chat format: peg-native
11.08.751.203 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 624008135
11.08.751.205 I srv get_availabl: updating prompt cache
11.08.751.375 W srv prompt_save: - saving prompt with length 755, total state size = 71.460 MiB (draft: 0.798 MiB)
11.08.800.661 I srv load: - looking for better prompt, base f_keep = 0.004, sim = 0.009
11.08.800.668 I srv update: - cache state: 8 prompts, 1275.261 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est)
11.08.800.669 I srv update: - prompt 0x55cd6c40fc30: 609 tokens, checkpoints: 1, 133.021 MiB
11.08.800.669 I srv update: - prompt 0x55cd6bf4fa70: 674 tokens, checkpoints: 1, 133.831 MiB
11.08.800.669 I srv update: - prompt 0x55cd6daed2d0: 824 tokens, checkpoints: 2, 198.645 MiB
11.08.800.670 I srv update: - prompt 0x55cd6c2ac720: 774 tokens, checkpoints: 2, 197.931 MiB
11.08.800.670 I srv update: - prompt 0x55cd6c0d06e0: 999 tokens, checkpoints: 1, 137.485 MiB
11.08.800.670 I srv update: - prompt 0x55cd6c26dea0: 838 tokens, checkpoints: 1, 135.472 MiB
11.08.800.671 I srv update: - prompt 0x55cd6e8d4b60: 1299 tokens, checkpoints: 2, 204.441 MiB
11.08.800.671 I srv update: - prompt 0x55cd6c4329c0: 755 tokens, checkpoints: 1, 134.433 MiB
11.08.800.671 I srv get_availabl: prompt cache update took 49.47 ms
11.08.800.786 I reasoning-budget: activated, budget=0 tokens
11.08.800.787 I reasoning-budget: budget=0, forcing immediately
11.08.800.787 I reasoning-budget: forced sequence complete, done
11.08.800.832 I slot launch_slot_: id 0 | task 2352 | processing task, is_child = 0
11.08.800.841 W slot update_slots: id 0 | task 2352 | n_past = 3, slot.prompt.tokens.size() = 755, seq_id = 0, pos_min = 754, n_swa = 0
11.08.800.842 I slot update_slots: id 0 | task 2352 | Checking checkpoint with [151, 151] against 3...
11.08.800.842 W slot update_slots: id 0 | task 2352 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
11.08.800.844 W slot update_slots: id 0 | task 2352 | erased invalidated context checkpoint (pos_min = 151, pos_max = 151, n_tokens = 152, n_swa = 0, pos_next = 0, size = 62.974 MiB)
11.09.084.647 I slot create_check: id 0 | task 2352 | created context checkpoint 1 of 32 (pos_min = 318, pos_max = 318, n_tokens = 319, size = 63.150 MiB)
11.10.084.873 I slot print_timing: id 0 | task 2352 | n_decoded = 103, tg = 106.18 t/s
11.10.409.721 I slot print_timing: id 0 | task 2352 |
prompt eval time = 313.90 ms / 323 tokens ( 0.97 ms per token, 1028.97 tokens per second)
eval time = 1294.87 ms / 149 tokens ( 8.69 ms per token, 115.07 tokens per second)
total time = 1608.78 ms / 472 tokens
draft acceptance rate = 0.75362 ( 104 accepted / 138 generated)
11.10.409.731 I statistics draft-mtp: #calls(b,g,a) = 30 2314 2314, #gen drafts = 2314, #acc drafts = 1770, #gen tokens = 6936, #acc tokens = 4192, dur(b,g,a) = 0.026, 13793.810, 0.761 ms
11.10.409.762 I slot release: id 0 | task 2352 | stop processing: n_tokens = 473, truncated = 0
11.10.409.776 I srv update_slots: all slots are idle
11.10.424.475 I srv params_from_: Chat format: peg-native
11.10.424.685 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.920 (> 0.100 thold), f_keep = 1.000
11.10.424.776 I reasoning-budget: activated, budget=0 tokens
11.10.424.777 I reasoning-budget: budget=0, forcing immediately
11.10.424.777 I reasoning-budget: forced sequence complete, done
11.10.424.808 I slot launch_slot_: id 0 | task 2401 | processing task, is_child = 0
11.10.448.353 I slot create_check: id 0 | task 2401 | created context checkpoint 2 of 32 (pos_min = 472, pos_max = 472, n_tokens = 473, size = 63.313 MiB)
11.13.105.549 I slot print_timing: id 0 | task 2401 | n_decoded = 219, tg = 84.99 t/s
11.13.254.472 I slot print_timing: id 0 | task 2401 |
prompt eval time = 103.95 ms / 41 tokens ( 2.54 ms per token, 394.42 tokens per second)
eval time = 2725.57 ms / 237 tokens ( 11.50 ms per token, 86.95 tokens per second)
total time = 2829.52 ms / 278 tokens
draft acceptance rate = 0.43689 ( 135 accepted / 309 generated)
11.13.254.490 I statistics draft-mtp: #calls(b,g,a) = 31 2417 2417, #gen drafts = 2417, #acc drafts = 1835, #gen tokens = 7245, #acc tokens = 4327, dur(b,g,a) = 0.027, 14386.349, 0.790 ms
11.13.254.536 I slot release: id 0 | task 2401 | stop processing: n_tokens = 752, truncated = 0
11.13.254.548 I srv update_slots: all slots are idle
11.13.268.009 I srv params_from_: Chat format: peg-native
11.13.268.207 I slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.948 (> 0.100 thold), f_keep = 1.000
11.13.268.290 I reasoning-budget: activated, budget=0 tokens
11.13.268.291 I reasoning-budget: budget=0, forcing immediately
11.13.268.291 I reasoning-budget: forced sequence complete, done
11.13.268.324 I slot launch_slot_: id 0 | task 2507 | processing task, is_child = 0
11.13.291.376 I slot create_check: id 0 | task 2507 | created context checkpoint 3 of 32 (pos_min = 751, pos_max = 751, n_tokens = 752, size = 63.608 MiB)
11.16.120.773 I slot print_timing: id 0 | task 2507 | n_decoded = 260, tg = 94.93 t/s
11.19.124.921 I slot print_timing: id 0 | task 2507 | n_decoded = 591, tg = 102.91 t/s
11.22.135.395 I slot print_timing: id 0 | task 2507 | n_decoded = 889, tg = 101.56 t/s
11.22.298.426 I slot print_timing: id 0 | task 2507 |
prompt eval time = 113.42 ms / 41 tokens ( 2.77 ms per token, 361.48 tokens per second)
eval time = 8916.57 ms / 908 tokens ( 9.82 ms per token, 101.83 tokens per second)
total time = 9029.99 ms / 949 tokens
draft acceptance rate = 0.58032 ( 578 accepted / 996 generated)
11.22.298.436 I statistics draft-mtp: #calls(b,g,a) = 32 2749 2749, #gen drafts = 2749, #acc drafts = 2095, #gen tokens = 8241, #acc tokens = 4905, dur(b,g,a) = 0.028, 16270.681, 0.885 ms
11.22.298.477 I slot release: id 0 | task 2507 | stop processing: n_tokens = 1703, truncated = 0
11.22.298.487 I srv update_slots: all slots are idle
11.22.312.676 I srv params_from_: Chat format: peg-native
11.22.312.883 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 701801734
11.22.312.885 I srv get_availabl: updating prompt cache
11.22.313.081 W srv prompt_save: - saving prompt with length 1703, total state size = 82.316 MiB (draft: 1.800 MiB)
11.22.408.043 I srv load: - looking for better prompt, base f_keep = 0.002, sim = 0.007
11.22.408.052 I srv update: - cache state: 9 prompts, 1547.648 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est)
11.22.408.053 I srv update: - prompt 0x55cd6c40fc30: 609 tokens, checkpoints: 1, 133.021 MiB
11.22.408.053 I srv update: - prompt 0x55cd6bf4fa70: 674 tokens, checkpoints: 1, 133.831 MiB
11.22.408.054 I srv update: - prompt 0x55cd6daed2d0: 824 tokens, checkpoints: 2, 198.645 MiB
11.22.408.054 I srv update: - prompt 0x55cd6c2ac720: 774 tokens, checkpoints: 2, 197.931 MiB
11.22.408.054 I srv update: - prompt 0x55cd6c0d06e0: 999 tokens, checkpoints: 1, 137.485 MiB
11.22.408.054 I srv update: - prompt 0x55cd6c26dea0: 838 tokens, checkpoints: 1, 135.472 MiB
11.22.408.055 I srv update: - prompt 0x55cd6e8d4b60: 1299 tokens, checkpoints: 2, 204.441 MiB
11.22.408.055 I srv update: - prompt 0x55cd6c4329c0: 755 tokens, checkpoints: 1, 134.433 MiB
11.22.408.055 I srv update: - prompt 0x55cd68cc55f0: 1703 tokens, checkpoints: 3, 272.387 MiB
11.22.408.056 I srv get_availabl: prompt cache update took 95.17 ms
11.22.408.280 I reasoning-budget: activated, budget=0 tokens
11.22.408.281 I reasoning-budget: budget=0, forcing immediately
11.22.408.281 I reasoning-budget: forced sequence complete, done
11.22.408.312 I slot launch_slot_: id 0 | task 2842 | processing task, is_child = 0
11.22.408.320 W slot update_slots: id 0 | task 2842 | n_past = 3, slot.prompt.tokens.size() = 1703, seq_id = 0, pos_min = 1702, n_swa = 0
11.22.408.321 I slot update_slots: id 0 | task 2842 | Checking checkpoint with [751, 751] against 3...
11.22.408.321 I slot update_slots: id 0 | task 2842 | Checking checkpoint with [472, 472] against 3...
11.22.408.321 I slot update_slots: id 0 | task 2842 | Checking checkpoint with [318, 318] against 3...
11.22.408.321 W slot update_slots: id 0 | task 2842 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
11.22.408.322 W slot update_slots: id 0 | task 2842 | erased invalidated context checkpoint (pos_min = 318, pos_max = 318, n_tokens = 319, n_swa = 0, pos_next = 0, size = 63.150 MiB)
11.22.412.336 W slot update_slots: id 0 | task 2842 | erased invalidated context checkpoint (pos_min = 472, pos_max = 472, n_tokens = 473, n_swa = 0, pos_next = 0, size = 63.313 MiB)
11.22.417.256 W slot update_slots: id 0 | task 2842 | erased invalidated context checkpoint (pos_min = 751, pos_max = 751, n_tokens = 752, n_swa = 0, pos_next = 0, size = 63.608 MiB)
11.22.735.713 I slot create_check: id 0 | task 2842 | created context checkpoint 1 of 32 (pos_min = 452, pos_max = 452, n_tokens = 453, size = 63.292 MiB)
11.25.147.821 I slot print_timing: id 0 | task 2842 | n_decoded = 231, tg = 96.93 t/s
11.28.174.211 I slot print_timing: id 0 | task 2842 | n_decoded = 524, tg = 96.87 t/s
11.31.192.125 I slot print_timing: id 0 | task 2842 | n_decoded = 893, tg = 105.96 t/s
11.34.207.969 I slot print_timing: id 0 | task 2842 | n_decoded = 1260, tg = 110.11 t/s
11.37.215.102 I slot print_timing: id 0 | task 2842 | n_decoded = 1615, tg = 111.76 t/s
11.40.234.039 I slot print_timing: id 0 | task 2842 | n_decoded = 1967, tg = 112.60 t/s
11.40.657.281 I slot print_timing: id 0 | task 2842 |
prompt eval time = 356.31 ms / 457 tokens ( 0.78 ms per token, 1282.58 tokens per second)
eval time = 17892.53 ms / 2025 tokens ( 8.84 ms per token, 113.18 tokens per second)
total time = 18248.84 ms / 2482 tokens
draft acceptance rate = 0.69250 ( 1367 accepted / 1974 generated)
11.40.657.295 I statistics draft-mtp: #calls(b,g,a) = 33 3407 3407, #gen drafts = 3407, #acc drafts = 2634, #gen tokens = 10215, #acc tokens = 6272, dur(b,g,a) = 0.029, 19972.023, 1.077 ms
11.40.657.341 I slot release: id 0 | task 2842 | stop processing: n_tokens = 2482, truncated = 0
11.40.657.357 I srv update_slots: all slots are idle
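
The per-request throughput and draft acceptance figures in the log can be recomputed from the raw counts. The sketch below parses two of the timing blocks above (tasks 8 and 2352) and derives tokens per second and the acceptance ratio. The regular expressions are written against the line shapes seen in this log only and may need adjusting for other llama.cpp builds.

```python
import re

EVAL_RE = re.compile(r"^\s*eval time =\s*([\d.]+) ms /\s*(\d+) tokens")
ACCEPT_RE = re.compile(
    r"draft acceptance rate = [\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)"
)

# Lines copied from the log above (tasks 8 and 2352).
LOG = """\
prompt eval time = 139.10 ms / 34 tokens ( 4.09 ms per token, 244.43 tokens per second)
eval time = 25613.34 ms / 2644 tokens ( 9.69 ms per token, 103.23 tokens per second)
total time = 25752.43 ms / 2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted / 2787 generated)
prompt eval time = 313.90 ms / 323 tokens ( 0.97 ms per token, 1028.97 tokens per second)
eval time = 1294.87 ms / 149 tokens ( 8.69 ms per token, 115.07 tokens per second)
total time = 1608.78 ms / 472 tokens
draft acceptance rate = 0.75362 ( 104 accepted / 138 generated)
"""


def summarize(log_text: str) -> list:
    """Pair each 'eval time' line with the acceptance line that follows it."""
    rows, pending = [], None
    for line in log_text.splitlines():
        m = EVAL_RE.search(line)  # anchored, so 'prompt eval time' is skipped
        if m:
            ms, tokens = float(m.group(1)), int(m.group(2))
            pending = {"tokens": tokens, "tok_per_s": tokens / (ms / 1000.0)}
            continue
        m = ACCEPT_RE.search(line)
        if m and pending is not None:
            accepted, generated = int(m.group(1)), int(m.group(2))
            pending.update(accepted=accepted, generated=generated,
                           acceptance=accepted / generated)
            rows.append(pending)
            pending = None
    return rows


rows = summarize(LOG)
for row in rows:
    print(f"{row['tokens']:5d} tok  {row['tok_per_s']:7.2f} t/s  "
          f"acceptance {row['acceptance']:.5f}")
```

Feeding the whole server log through `summarize` gives one row per completed request, which makes it easier to compare generation speed against acceptance rate across requests.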