```bash
git fetch origin matthen/ultravox-sft
git checkout matthen/ultravox-sft
```

| Checkpoint | Description |
|---|---|
| `s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v1/checkpoints/final/` | v1: with `<think>` tags, llm_score=74.24 |
| `s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v2-no-think/checkpoints/final/` | v2: no think tags, eval TBD |
```bash
aws s3 cp --recursive \
  s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v1/checkpoints/final/ \
  var/models/ultravox-v0_6-unfrozen-v1/
```

The saved checkpoint has `text_model_id` set, which causes vLLM to re-download the base LLaMA weights and overwrite our fine-tuned weights. You must null it out:
```bash
python -c "
import json
with open('var/models/ultravox-v0_6-unfrozen-v1/config.json') as f:
    config = json.load(f)
config['text_model_id'] = None
config['audio_model_id'] = None
with open('var/models/ultravox-v0_6-unfrozen-v1/config.json', 'w') as f:
    json.dump(config, f, indent=2)
"
```

(This fix is automated in `checkpointing.py` on the branch for future training runs.)
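For scripting the same fix across several downloaded checkpoints, the one-liner can be wrapped in a small helper. This is an illustrative sketch, not the actual `checkpointing.py` logic; the function name and signature are assumptions:

```python
import json
from pathlib import Path


def strip_remote_model_ids(checkpoint_dir: str) -> dict:
    """Null out text_model_id/audio_model_id in a checkpoint's config.json
    so vLLM loads the local fine-tuned weights instead of re-downloading
    the base models. Returns the patched config for inspection."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    for key in ("text_model_id", "audio_model_id"):
        config[key] = None
    config_path.write_text(json.dumps(config, indent=2))
    return config
```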
The v1 checkpoint was trained before the enable_think_tags flag existed. Add it:
```bash
python -c "
import json
with open('var/models/ultravox-v0_6-unfrozen-v1/polywhirl.json') as f:
    config = json.load(f)
config['lm_formatter']['enable_think_tags'] = True
with open('var/models/ultravox-v0_6-unfrozen-v1/polywhirl.json', 'w') as f:
    json.dump(config, f, indent=2)
"
```

(The v2 checkpoint doesn't need this — it was trained with the updated formatter.)
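Before launching an eval it can be worth sanity-checking that both manual patches above were applied. A minimal sketch (the function and its checks are assumptions based on the two fixes described here, not part of the polywhirl codebase):

```python
import json
from pathlib import Path


def check_checkpoint_ready(checkpoint_dir: str) -> list:
    """Return a list of problems with a downloaded checkpoint that would
    break vLLM serving or think-tag formatting; empty means ready."""
    root = Path(checkpoint_dir)
    problems = []

    config = json.loads((root / "config.json").read_text())
    if config.get("text_model_id") is not None:
        problems.append("config.json: text_model_id must be null")
    if config.get("audio_model_id") is not None:
        problems.append("config.json: audio_model_id must be null")

    polywhirl = json.loads((root / "polywhirl.json").read_text())
    if "enable_think_tags" not in polywhirl.get("lm_formatter", {}):
        problems.append("polywhirl.json: lm_formatter.enable_think_tags missing")
    return problems
```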
```bash
DATASET=data/audio_eval/examples.v1.3892a6d1.jsonl
OUTPUT_DIR=s3://polyai-temp-training-only/data/polywhirl_eval_results/audio/v1.3892a6d1
FNAME=${OUTPUT_DIR}/ultravox-v0_6-unfrozen-v1

uv run -m polywhirl.evaluation.generate \
  --dataset ${DATASET} \
  --media_root var/audio_data/ \
  --output ${FNAME}.gen.jsonl \
  --model.type POLYWHIRL_VLLM \
  --model.polywhirl_vllm.path var/models/ultravox-v0_6-unfrozen-v1 \
  --model.polywhirl_vllm.temperature 0.0 \
  --model.polywhirl_vllm.limit_audios_per_prompt 20 \
  --model.polywhirl_vllm.batch_size 512 \
  --enable_ood true \
  --default_response_language en-US \
  --allow_overwrite true

uv run -m polywhirl.evaluation.run_judge \
  --generations ${FNAME}.gen.jsonl \
  --output ${FNAME}.eval.jsonl \
  --cache_dir ~/.cache/polywhirl/ \
  --api_judge.version V3_GEMINI \
  --api_judge.type GEMINI \
  --api_judge.name gemini-3.1-pro-preview \
  --api_judge.parallelism 40 \
  --api_judge.temperature 0.1 \
  --api_judge.use_audio true \
  --api_judge.thinking_level LOW \
  --api_judge.use_annotations false \
  --api_judge.use_known_bad_annotations true \
  --api_judge.api_timeout_seconds 300 \
  --api_judge.remove_text_if_tool true \
  --naturalness_judge null \
  --allow_overwrite true

uv run -m polywhirl.evaluation.show_results --input ${FNAME}.eval.jsonl
```

To verify a checkpoint works with audio:
```bash
uv run python -m polywhirl.training.ultravox.test_ultravox \
  --model var/models/ultravox-v0_6-unfrozen-v1
```

This loads the model via HF transformers and generates a response from a test audio file.
| Metric | Qwen Omni baseline | Ultravox v1 |
|---|---|---|
| llm_score | 77.10 | 74.24 |
| error_rate | 0.08% | 0.08% |
| style_score | 87.23 | 87.66 |
| WER | 6.49 | 8.16 |
| tool_accuracy | 98.79 | 97.58 |
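To slice results beyond the aggregate metrics above, the `.eval.jsonl` output can be read directly. A rough sketch, assuming each line is a JSON record carrying a numeric `llm_score` field (the actual record schema may differ):

```python
import json


def mean_llm_score(eval_jsonl_path: str) -> float:
    """Average the llm_score field over eval records (assumed schema)."""
    scores = []
    with open(eval_jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if "llm_score" in record:
                scores.append(record["llm_score"])
    return sum(scores) / len(scores)
```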
See polywhirl/audio/experiments/2026-03-30-ultravox-sft.md for full details.