@johnandersen777
Last active May 16, 2026 20:16
RX 9070 XT hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q2_K_XL https://asciinema.org/a/1067473

Qwen3.6-35B-A3B-MTP local on RX 9070 XT

ggml-org/llama.cpp#22673

What worked

  1. docker/model-runner:mtp (image 7b6f81c6dc4b) has the MTP-patched llama.cpp baked in (FROM llama-rocm:full). Retag it as :latest because docker model status/run auto-pull and clobber :latest:

    docker tag docker/model-runner:mtp docker/model-runner:latest
    
  2. Pulled GGUF into the runner volume (one-time, via the managed runner):

    docker model pull hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-IQ2_XXS
    

    Stored at /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf in the docker-model-runner-models volume.

  3. The managed docker model run SIGSEGVs on this model. dmesg traces the crash to a general protection fault (GPF) in libamdhip64.so, hit when llama.cpp enumerates both ROCm0 (GPU) and ROCm1 (CPU-as-ROCm). It also crashes during warmup with n_parallel=4.

  4. Bypass the docker model CLI entirely and run /app/llama-server directly from the patched image:

    docker rm -f docker-model-runner llama-mtp 2>/dev/null
    docker run -d --name llama-mtp \
      --device /dev/dri --device /dev/kfd \
      -e HIP_VISIBLE_DEVICES=0 -e ROCR_VISIBLE_DEVICES=0 \
      -v docker-model-runner-models:/models \
      -p 127.0.0.1:12434:12434 \
      --entrypoint /app/llama-server \
      docker/model-runner:mtp \
      -m /models/bundles/sha256/60b929136fc442800ef3cc2b200e026419c6b30b704c2ae7bf4b4a31957dde72/model/model.gguf \
      --host 0.0.0.0 --port 12434 \
      -c 131072 \
      -np 1 \
      -ngl 999 \
      --device ROCm0 \
      -fa on \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --reasoning-budget 0 \
      --no-mmproj
    

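Key flags: HIP_VISIBLE_DEVICES=0 plus --device ROCm0 (avoids the CPU-as-ROCm GPF), -np 1 (the default of 4 OOMs during slot init), and -ngl 999 (all layers on GPU). Two more are worth knowing but are not in the command above: --no-warmup (skips the warmup run that crashed in step 3) and --jinja (enables the tool-call chat template).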
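
Because the container starts detached (docker run -d), the first request can arrive before the model has finished loading. A minimal sketch of a readiness poll follows, assuming llama-server's /health endpoint answers HTTP 200 once the model is loaded and a non-200 status while it is still loading. The helper names probe_health and wait_until_ready are hypothetical, and the port is the one published in the command above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumption: same host/port as the -p 127.0.0.1:12434:12434 mapping above.
HEALTH_URL = "http://127.0.0.1:12434/health"


def probe_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True when the server answers HTTP 200 on its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status while loading.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = probe_health,
    attempts: int = 60,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final failed attempt
    return False
```

Calling wait_until_ready() before the first chat request gives the model up to about two minutes to load with these defaults. The probe and sleep parameters are injectable so the loop can be exercised without a running server.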

Example: chat completion with tool call

To skip thinking for faster responses, add "chat_template_kwargs": {"enable_thinking": false} at the top level of the request object.

curl -s http://127.0.0.1:12434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.6-mtp",
    "messages": [
      {"role": "user", "content": "Whats the weather in Portland, OR right now?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city":  {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius","fahrenheit"]}
          },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 256
  }'
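
The same request body can be assembled in Python, which makes the top-level placement of chat_template_kwargs explicit. This is a sketch only: build_chat_request and WEATHER_TOOL are hypothetical names, and the field values mirror the curl example above.

```python
import json
from typing import Any, Optional

# Same tool definition as in the curl example above.
WEATHER_TOOL: dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def build_chat_request(
    prompt: str,
    *,
    tools: Optional[list[dict[str, Any]]] = None,
    thinking: bool = True,
    max_tokens: int = 256,
) -> dict[str, Any]:
    """Build a /v1/chat/completions body; optionally disable thinking."""
    body: dict[str, Any] = {
        "model": "qwen3.6-mtp",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if tools:
        body["tools"] = tools
        body["tool_choice"] = "auto"
    if not thinking:
        # Sits next to "messages", not inside a message or a tool.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


request_json = json.dumps(
    build_chat_request("What is the weather in Portland, OR?", tools=[WEATHER_TOOL], thinking=False),
    indent=2,
)
```

The resulting request_json string can be passed to curl with -d or posted with any HTTP client.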

The response carries choices[0].message.tool_calls[*], each with function.name and function.arguments (a JSON-encoded string). When thinking is enabled, reasoning_content holds the chain of thought; to suppress it, put /no_think in the user message or set "chat_template_kwargs": {"enable_thinking": false}.
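
A short sketch of reading those fields from a decoded response follows. The sample dictionary is hand-written to illustrate the shape described above, not captured server output, and extract_tool_calls is a hypothetical helper.

```python
import json
from typing import Any


def extract_tool_calls(response: dict[str, Any]) -> list[tuple[str, str, dict[str, Any]]]:
    """Return (call_id, function_name, decoded_arguments) for each tool call."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # function.arguments arrives as a JSON string, so decode it here.
        args = json.loads(fn.get("arguments") or "{}")
        calls.append((call.get("id", ""), fn["name"], args))
    return calls


# Illustrative response shape only (values are made up).
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"city\": \"Portland\", \"units\": \"fahrenheit\"}",
                        },
                    }
                ],
            }
        }
    ]
}

for call_id, name, args in extract_tool_calls(sample):
    print(call_id, name, args)
```

A message without tool calls yields an empty list, so the caller can treat that as the signal to read message.content instead.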

Opencode

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:12434/v1"
      },
      "models": {
        "qwen3.6-mtp": {
          "name": "Qwen3.6-35B-A3B-MTP-GGUF:UD-Q2_K_XL",
          "limit": {
            "context": 131072,
            "output": 65536
          }
        }
      }
    }
  }
}

Stats and Logs

$ rocm-smi --showmeminfo vram -f -t -p -u | python -u ~/Downloads/parse_gpu_mem.py
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status

GPU VRAM Usage (GB)
===================================
GPU[0]: 13.96884 GB
GPU[1]: 0.01552 GB
0.26.816.397 I srv  params_from_: Chat format: peg-native
0.26.816.528 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.971 (> 0.100 thold), f_keep = 0.660
0.26.816.554 I reasoning-budget: activated, budget=0 tokens
0.26.816.555 I reasoning-budget: budget=0, forcing immediately
0.26.816.555 I reasoning-budget: forced sequence complete, done
0.26.816.578 I slot launch_slot_: id  0 | task 8 | processing task, is_child = 0
0.26.816.586 W slot update_slots: id  0 | task 8 | n_past = 33, slot.prompt.tokens.size() = 50, seq_id = 0, pos_min = 49, n_swa = 0
0.26.816.586 W slot update_slots: id  0 | task 8 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
0.28.155.348 I slot print_timing: id  0 | task 8 | n_decoded =    100, tg =  83.36 t/s
0.31.175.069 I slot print_timing: id  0 | task 8 | n_decoded =    377, tg =  89.35 t/s
0.34.195.697 I slot print_timing: id  0 | task 8 | n_decoded =    668, tg =  92.27 t/s
0.37.196.084 I slot print_timing: id  0 | task 8 | n_decoded =    980, tg =  95.70 t/s
0.40.202.946 I slot print_timing: id  0 | task 8 | n_decoded =   1269, tg =  95.79 t/s
0.43.219.627 I slot print_timing: id  0 | task 8 | n_decoded =   1581, tg =  97.21 t/s
0.46.223.644 I slot print_timing: id  0 | task 8 | n_decoded =   1927, tg = 100.01 t/s
0.49.247.218 I slot print_timing: id  0 | task 8 | n_decoded =   2260, tg = 101.38 t/s
0.52.271.680 I slot print_timing: id  0 | task 8 | n_decoded =   2612, tg = 103.18 t/s
0.52.569.121 I slot print_timing: id  0 | task 8 |
prompt eval time =     139.10 ms /    34 tokens (    4.09 ms per token,   244.43 tokens per second)
       eval time =   25613.34 ms /  2644 tokens (    9.69 ms per token,   103.23 tokens per second)
      total time =   25752.43 ms /  2678 tokens
draft acceptance rate = 0.61607 ( 1717 accepted /  2787 generated)
0.52.569.131 I statistics draft-mtp: #calls(b,g,a) = 2 934 934, #gen drafts = 934, #acc drafts = 726, #gen tokens = 2802, #acc tokens = 1726, dur(b,g,a) = 0.002, 5476.322, 0.454 ms
0.52.569.142 I slot      release: id  0 | task 8 | stop processing: n_tokens = 2680, truncated = 0
0.52.569.152 I srv  update_slots: all slots are idle
prompt eval time =     193.53 ms /   156 tokens (    1.24 ms per token,   806.06 tokens per second)
       eval time =    6483.50 ms /   600 tokens (   10.81 ms per token,    92.54 tokens per second)
      total time =    6677.03 ms /   756 tokens
draft acceptance rate = 0.48968 (  356 accepted /   727 generated)
10.04.504.857 I statistics draft-mtp: #calls(b,g,a) = 29 2268 2268, #gen drafts = 2268, #acc drafts = 1729, #gen tokens = 6798, #acc tokens = 4088, dur(b,g,a) = 0.025, 13521.682, 0.750 ms
10.04.504.879 I slot      release: id  0 | task 2106 | stop processing: n_tokens = 755, truncated = 0
10.04.504.888 I srv  update_slots: all slots are idle
11.08.751.030 I srv  params_from_: Chat format: peg-native
11.08.751.203 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 624008135
11.08.751.205 I srv  get_availabl: updating prompt cache
11.08.751.375 W srv   prompt_save:  - saving prompt with length 755, total state size = 71.460 MiB (draft: 0.798 MiB)
11.08.800.661 I srv          load:  - looking for better prompt, base f_keep = 0.004, sim = 0.009
11.08.800.668 I srv        update:  - cache state: 8 prompts, 1275.261 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est)
11.08.800.669 I srv        update:    - prompt 0x55cd6c40fc30:     609 tokens, checkpoints:  1,   133.021 MiB
11.08.800.669 I srv        update:    - prompt 0x55cd6bf4fa70:     674 tokens, checkpoints:  1,   133.831 MiB
11.08.800.669 I srv        update:    - prompt 0x55cd6daed2d0:     824 tokens, checkpoints:  2,   198.645 MiB
11.08.800.670 I srv        update:    - prompt 0x55cd6c2ac720:     774 tokens, checkpoints:  2,   197.931 MiB
11.08.800.670 I srv        update:    - prompt 0x55cd6c0d06e0:     999 tokens, checkpoints:  1,   137.485 MiB
11.08.800.670 I srv        update:    - prompt 0x55cd6c26dea0:     838 tokens, checkpoints:  1,   135.472 MiB
11.08.800.671 I srv        update:    - prompt 0x55cd6e8d4b60:    1299 tokens, checkpoints:  2,   204.441 MiB
11.08.800.671 I srv        update:    - prompt 0x55cd6c4329c0:     755 tokens, checkpoints:  1,   134.433 MiB
11.08.800.671 I srv  get_availabl: prompt cache update took 49.47 ms
11.08.800.786 I reasoning-budget: activated, budget=0 tokens
11.08.800.787 I reasoning-budget: budget=0, forcing immediately
11.08.800.787 I reasoning-budget: forced sequence complete, done
11.08.800.832 I slot launch_slot_: id  0 | task 2352 | processing task, is_child = 0
11.08.800.841 W slot update_slots: id  0 | task 2352 | n_past = 3, slot.prompt.tokens.size() = 755, seq_id = 0, pos_min = 754, n_swa = 0
11.08.800.842 I slot update_slots: id  0 | task 2352 | Checking checkpoint with [151, 151] against 3...
11.08.800.842 W slot update_slots: id  0 | task 2352 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
11.08.800.844 W slot update_slots: id  0 | task 2352 | erased invalidated context checkpoint (pos_min = 151, pos_max = 151, n_tokens = 152, n_swa = 0, pos_next = 0, size = 62.974 MiB)
11.09.084.647 I slot create_check: id  0 | task 2352 | created context checkpoint 1 of 32 (pos_min = 318, pos_max = 318, n_tokens = 319, size = 63.150 MiB)
11.10.084.873 I slot print_timing: id  0 | task 2352 | n_decoded =    103, tg = 106.18 t/s
11.10.409.721 I slot print_timing: id  0 | task 2352 | 
prompt eval time =     313.90 ms /   323 tokens (    0.97 ms per token,  1028.97 tokens per second)
       eval time =    1294.87 ms /   149 tokens (    8.69 ms per token,   115.07 tokens per second)
      total time =    1608.78 ms /   472 tokens
draft acceptance rate = 0.75362 (  104 accepted /   138 generated)
11.10.409.731 I statistics draft-mtp: #calls(b,g,a) = 30 2314 2314, #gen drafts = 2314, #acc drafts = 1770, #gen tokens = 6936, #acc tokens = 4192, dur(b,g,a) = 0.026, 13793.810, 0.761 ms
11.10.409.762 I slot      release: id  0 | task 2352 | stop processing: n_tokens = 473, truncated = 0
11.10.409.776 I srv  update_slots: all slots are idle
11.10.424.475 I srv  params_from_: Chat format: peg-native
11.10.424.685 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.920 (> 0.100 thold), f_keep = 1.000
11.10.424.776 I reasoning-budget: activated, budget=0 tokens
11.10.424.777 I reasoning-budget: budget=0, forcing immediately
11.10.424.777 I reasoning-budget: forced sequence complete, done
11.10.424.808 I slot launch_slot_: id  0 | task 2401 | processing task, is_child = 0
11.10.448.353 I slot create_check: id  0 | task 2401 | created context checkpoint 2 of 32 (pos_min = 472, pos_max = 472, n_tokens = 473, size = 63.313 MiB)
11.13.105.549 I slot print_timing: id  0 | task 2401 | n_decoded =    219, tg =  84.99 t/s
11.13.254.472 I slot print_timing: id  0 | task 2401 | 
prompt eval time =     103.95 ms /    41 tokens (    2.54 ms per token,   394.42 tokens per second)
       eval time =    2725.57 ms /   237 tokens (   11.50 ms per token,    86.95 tokens per second)
      total time =    2829.52 ms /   278 tokens
draft acceptance rate = 0.43689 (  135 accepted /   309 generated)
11.13.254.490 I statistics draft-mtp: #calls(b,g,a) = 31 2417 2417, #gen drafts = 2417, #acc drafts = 1835, #gen tokens = 7245, #acc tokens = 4327, dur(b,g,a) = 0.027, 14386.349, 0.790 ms
11.13.254.536 I slot      release: id  0 | task 2401 | stop processing: n_tokens = 752, truncated = 0
11.13.254.548 I srv  update_slots: all slots are idle
11.13.268.009 I srv  params_from_: Chat format: peg-native
11.13.268.207 I slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.948 (> 0.100 thold), f_keep = 1.000
11.13.268.290 I reasoning-budget: activated, budget=0 tokens
11.13.268.291 I reasoning-budget: budget=0, forcing immediately
11.13.268.291 I reasoning-budget: forced sequence complete, done
11.13.268.324 I slot launch_slot_: id  0 | task 2507 | processing task, is_child = 0
11.13.291.376 I slot create_check: id  0 | task 2507 | created context checkpoint 3 of 32 (pos_min = 751, pos_max = 751, n_tokens = 752, size = 63.608 MiB)
11.16.120.773 I slot print_timing: id  0 | task 2507 | n_decoded =    260, tg =  94.93 t/s
11.19.124.921 I slot print_timing: id  0 | task 2507 | n_decoded =    591, tg = 102.91 t/s
11.22.135.395 I slot print_timing: id  0 | task 2507 | n_decoded =    889, tg = 101.56 t/s
11.22.298.426 I slot print_timing: id  0 | task 2507 | 
prompt eval time =     113.42 ms /    41 tokens (    2.77 ms per token,   361.48 tokens per second)
       eval time =    8916.57 ms /   908 tokens (    9.82 ms per token,   101.83 tokens per second)
      total time =    9029.99 ms /   949 tokens
draft acceptance rate = 0.58032 (  578 accepted /   996 generated)
11.22.298.436 I statistics draft-mtp: #calls(b,g,a) = 32 2749 2749, #gen drafts = 2749, #acc drafts = 2095, #gen tokens = 8241, #acc tokens = 4905, dur(b,g,a) = 0.028, 16270.681, 0.885 ms
11.22.298.477 I slot      release: id  0 | task 2507 | stop processing: n_tokens = 1703, truncated = 0
11.22.298.487 I srv  update_slots: all slots are idle
11.22.312.676 I srv  params_from_: Chat format: peg-native
11.22.312.883 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 701801734
11.22.312.885 I srv  get_availabl: updating prompt cache
11.22.313.081 W srv   prompt_save:  - saving prompt with length 1703, total state size = 82.316 MiB (draft: 1.800 MiB)
11.22.408.043 I srv          load:  - looking for better prompt, base f_keep = 0.002, sim = 0.007
11.22.408.052 I srv        update:  - cache state: 9 prompts, 1547.648 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est)
11.22.408.053 I srv        update:    - prompt 0x55cd6c40fc30:     609 tokens, checkpoints:  1,   133.021 MiB
11.22.408.053 I srv        update:    - prompt 0x55cd6bf4fa70:     674 tokens, checkpoints:  1,   133.831 MiB
11.22.408.054 I srv        update:    - prompt 0x55cd6daed2d0:     824 tokens, checkpoints:  2,   198.645 MiB
11.22.408.054 I srv        update:    - prompt 0x55cd6c2ac720:     774 tokens, checkpoints:  2,   197.931 MiB
11.22.408.054 I srv        update:    - prompt 0x55cd6c0d06e0:     999 tokens, checkpoints:  1,   137.485 MiB
11.22.408.054 I srv        update:    - prompt 0x55cd6c26dea0:     838 tokens, checkpoints:  1,   135.472 MiB
11.22.408.055 I srv        update:    - prompt 0x55cd6e8d4b60:    1299 tokens, checkpoints:  2,   204.441 MiB
11.22.408.055 I srv        update:    - prompt 0x55cd6c4329c0:     755 tokens, checkpoints:  1,   134.433 MiB
11.22.408.055 I srv        update:    - prompt 0x55cd68cc55f0:    1703 tokens, checkpoints:  3,   272.387 MiB
11.22.408.056 I srv  get_availabl: prompt cache update took 95.17 ms
11.22.408.280 I reasoning-budget: activated, budget=0 tokens
11.22.408.281 I reasoning-budget: budget=0, forcing immediately
11.22.408.281 I reasoning-budget: forced sequence complete, done
11.22.408.312 I slot launch_slot_: id  0 | task 2842 | processing task, is_child = 0
11.22.408.320 W slot update_slots: id  0 | task 2842 | n_past = 3, slot.prompt.tokens.size() = 1703, seq_id = 0, pos_min = 1702, n_swa = 0
11.22.408.321 I slot update_slots: id  0 | task 2842 | Checking checkpoint with [751, 751] against 3...
11.22.408.321 I slot update_slots: id  0 | task 2842 | Checking checkpoint with [472, 472] against 3...
11.22.408.321 I slot update_slots: id  0 | task 2842 | Checking checkpoint with [318, 318] against 3...
11.22.408.321 W slot update_slots: id  0 | task 2842 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
11.22.408.322 W slot update_slots: id  0 | task 2842 | erased invalidated context checkpoint (pos_min = 318, pos_max = 318, n_tokens = 319, n_swa = 0, pos_next = 0, size = 63.150 MiB)
11.22.412.336 W slot update_slots: id  0 | task 2842 | erased invalidated context checkpoint (pos_min = 472, pos_max = 472, n_tokens = 473, n_swa = 0, pos_next = 0, size = 63.313 MiB)
11.22.417.256 W slot update_slots: id  0 | task 2842 | erased invalidated context checkpoint (pos_min = 751, pos_max = 751, n_tokens = 752, n_swa = 0, pos_next = 0, size = 63.608 MiB)
11.22.735.713 I slot create_check: id  0 | task 2842 | created context checkpoint 1 of 32 (pos_min = 452, pos_max = 452, n_tokens = 453, size = 63.292 MiB)
11.25.147.821 I slot print_timing: id  0 | task 2842 | n_decoded =    231, tg =  96.93 t/s
11.28.174.211 I slot print_timing: id  0 | task 2842 | n_decoded =    524, tg =  96.87 t/s
11.31.192.125 I slot print_timing: id  0 | task 2842 | n_decoded =    893, tg = 105.96 t/s
11.34.207.969 I slot print_timing: id  0 | task 2842 | n_decoded =   1260, tg = 110.11 t/s
11.37.215.102 I slot print_timing: id  0 | task 2842 | n_decoded =   1615, tg = 111.76 t/s
11.40.234.039 I slot print_timing: id  0 | task 2842 | n_decoded =   1967, tg = 112.60 t/s
11.40.657.281 I slot print_timing: id  0 | task 2842 | 
prompt eval time =     356.31 ms /   457 tokens (    0.78 ms per token,  1282.58 tokens per second)
       eval time =   17892.53 ms /  2025 tokens (    8.84 ms per token,   113.18 tokens per second)
      total time =   18248.84 ms /  2482 tokens
draft acceptance rate = 0.69250 ( 1367 accepted /  1974 generated)
11.40.657.295 I statistics draft-mtp: #calls(b,g,a) = 33 3407 3407, #gen drafts = 3407, #acc drafts = 2634, #gen tokens = 10215, #acc tokens = 6272, dur(b,g,a) = 0.029, 19972.023, 1.077 ms
11.40.657.341 I slot      release: id  0 | task 2842 | stop processing: n_tokens = 2482, truncated = 0
11.40.657.357 I srv  update_slots: all slots are idle
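
The per-request timing and draft acceptance lines above can be aggregated with a small parser. This is a sketch that assumes the exact line layout shown in these logs; the regular expressions and the summarize helper are hypothetical and may need adjusting for other llama.cpp builds. The sample text is copied from the first and last requests above.

```python
import re
from dataclasses import dataclass

ACCEPT_RE = re.compile(
    r"draft acceptance rate = [\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)"
)
# Anchored at line start so "prompt eval time" lines are not counted.
EVAL_RE = re.compile(r"^\s*eval time =\s*([\d.]+) ms /\s*(\d+) tokens", re.MULTILINE)


@dataclass
class DecodeStats:
    accepted: int        # draft tokens accepted by the main model
    drafted: int         # draft tokens generated by the MTP head
    decoded_tokens: int  # tokens produced during eval
    eval_ms: float       # total eval (decode) time in milliseconds

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def tokens_per_second(self) -> float:
        return self.decoded_tokens / (self.eval_ms / 1000.0) if self.eval_ms else 0.0


def summarize(log_text: str) -> DecodeStats:
    """Sum acceptance and decode-time figures over every request in the log."""
    accepted = drafted = tokens = 0
    eval_ms = 0.0
    for acc, gen in ACCEPT_RE.findall(log_text):
        accepted += int(acc)
        drafted += int(gen)
    for ms, n in EVAL_RE.findall(log_text):
        eval_ms += float(ms)
        tokens += int(n)
    return DecodeStats(accepted, drafted, tokens, eval_ms)


SAMPLE_LOG = """\
prompt eval time =     139.10 ms /    34 tokens (    4.09 ms per token,   244.43 tokens per second)
       eval time =   25613.34 ms /  2644 tokens (    9.69 ms per token,   103.23 tokens per second)
draft acceptance rate = 0.61607 ( 1717 accepted /  2787 generated)
prompt eval time =     356.31 ms /   457 tokens (    0.78 ms per token,  1282.58 tokens per second)
       eval time =   17892.53 ms /  2025 tokens (    8.84 ms per token,   113.18 tokens per second)
draft acceptance rate = 0.69250 ( 1367 accepted /  1974 generated)
"""

stats = summarize(SAMPLE_LOG)
print(f"acceptance {stats.acceptance_rate:.3f}, decode {stats.tokens_per_second:.1f} t/s")
```

Summing counts before dividing weights each request by its length, which is usually a fairer summary than averaging the per-request rates.
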

qwen_example.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "openai>=1.54",
# ]
# ///
"""Async chat with tool calls against local Qwen3.6-MTP via llama-server.

Run: ./qwen_example.py
"""
import asyncio
import json
import random

from openai import AsyncOpenAI

BASE_URL = "http://127.0.0.1:12434/v1"
MODEL = "qwen3.6-mtp"

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "fahrenheit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]


def fake_weather(city: str, units: str = "fahrenheit") -> dict:
    # Stand-in for a real weather lookup: returns a random temperature.
    temp = random.randint(45, 85) if units == "fahrenheit" else random.randint(7, 30)
    return {"city": city, "temp": temp, "units": units, "conditions": "partly cloudy"}


TOOL_IMPL = {"get_weather": lambda **kw: fake_weather(**kw)}


async def run(prompt: str) -> str:
    client = AsyncOpenAI(base_url=BASE_URL, api_key="not-needed")
    messages = [
        {"role": "system", "content": "You are concise. Use tools when needed. /no_think"},
        {"role": "user", "content": prompt},
    ]
    # Allow up to 4 round trips: each one either returns text or requests tools.
    for _ in range(4):
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            max_tokens=512,
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        )
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            return msg.content or ""
        # Run each requested tool and feed its result back as a "tool" message.
        for call in msg.tool_calls:
            name = call.function.name
            args = json.loads(call.function.arguments or "{}")
            print(f"→ tool {name}({args})")
            result = TOOL_IMPL[name](**args)
            print(f"← {result}")
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(result),
                }
            )
    return "(max tool-call iterations exceeded)"


async def main() -> None:
    answer = await run("What's the weather in Portland, OR right now? Use fahrenheit.")
    print("\n=== assistant ===")
    print(answer)


if __name__ == "__main__":
    asyncio.run(main())
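
The script above looks up TOOL_IMPL[name] directly, so an unknown tool name or malformed arguments from the model would raise and end the loop. One possible hardening, sketched here under the assumption that the model can recover when it is shown an error as the tool result, is to return the failure as JSON instead. dispatch_tool is a hypothetical helper, not part of the original script.

```python
import json
from typing import Any, Callable, Optional


def dispatch_tool(
    registry: dict[str, Callable[..., Any]],
    name: str,
    raw_arguments: Optional[str],
) -> str:
    """Run a registered tool and return a JSON string for a role="tool" message."""
    impl = registry.get(name)
    if impl is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid JSON arguments: {exc.msg}"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    try:
        return json.dumps(impl(**args))
    except TypeError as exc:
        # Typically a missing or unexpected keyword argument; note this also
        # catches TypeErrors raised inside the tool itself.
        return json.dumps({"error": f"bad arguments for {name}: {exc}"})
```

In the loop, the tool message content would then become dispatch_tool(TOOL_IMPL, call.function.name, call.function.arguments), leaving the rest of the flow unchanged.
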

qwen_json.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "snoop",
#     "openai>=1.54",
#     "pydantic>=2.7",
# ]
# ///
"""Qwen3.6-MTP: tool calls + Pydantic structured output in one async flow.

Flow:
  1. Model calls get_weather tool (function calling).
  2. We return tool result.
  3. Final turn forces JSON schema response matching TripPlan pydantic model.

Run: ./qwen_json.py
"""
import sys
import asyncio
import json
import random

import snoop
from openai import AsyncOpenAI
from pydantic import BaseModel, Field

BASE_URL = "http://127.0.0.1:12434/v1"
MODEL = "qwen3.6-mtp"
NO_THINK = {"chat_template_kwargs": {"enable_thinking": False}}


class Activity(BaseModel):
    name: str
    indoor: bool
    duration_hours: float = Field(ge=0.5, le=12)


class TripPlan(BaseModel):
    city: str
    temp_f: int
    conditions: str
    summary: str
    activities: list[Activity] = Field(min_length=2, max_length=4)


TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]


def fake_weather(city: str, units: str = "fahrenheit") -> dict:
    # Stand-in for a real weather lookup: returns a random temperature.
    temp = random.randint(45, 85) if units == "fahrenheit" else random.randint(7, 30)
    return {"city": city, "temp": temp, "units": units, "conditions": "light rain"}


TOOL_IMPL = {"get_weather": lambda **kw: fake_weather(**kw)}


async def plan_trip(city: str) -> TripPlan:
    client = AsyncOpenAI(base_url=BASE_URL, api_key="not-needed")
    messages = [
        {
            "role": "system",
            "content": (
                "You plan day trips. First call get_weather, then produce a "
                "TripPlan JSON object matching the schema exactly."
            ),
        },
        {"role": "user", "content": f"Plan a day in {city}. Use fahrenheit."},
    ]
    # Phase 1: let the model call tools until it answers without a tool call.
    for _ in range(4):
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            # max_tokens=512,
            # extra_body=NO_THINK,
        )
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            break
        for call in msg.tool_calls:
            name = call.function.name
            args = json.loads(call.function.arguments or "{}")
            print(f"→ tool {name}({args})", file=sys.stderr)
            result = TOOL_IMPL[name](**args)
            print(f"← {result}", file=sys.stderr)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(result),
                }
            )
    # Phase 2: request the final answer constrained to the TripPlan JSON schema.
    messages.append(
        {
            "role": "user",
            "content": "Now emit the final TripPlan JSON. JSON only, no prose.",
        }
    )
    final = await client.chat.completions.create(
        model=MODEL,
        messages=messages,
        # max_tokens=600,
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "TripPlan",
                "strict": True,
                "schema": TripPlan.model_json_schema(),
            },
        },
        # extra_body=NO_THINK,
    )
    raw = final.choices[0].message.content or "{}"
    try:
        return TripPlan.model_validate_json(raw)
    except Exception as error:
        # Dump the error and the full response for debugging, then re-raise.
        snoop.pp("error", error, final)
        raise


async def main() -> None:
    plan = await plan_trip("Portland, OR")
    print(plan.model_dump_json(indent=2))


if __name__ == "__main__":
    asyncio.run(main())