```bash
git fetch origin matthen/ultravox-sft
git checkout matthen/ultravox-sft
```

| Checkpoint | Description |
|---|---|
| `s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v1/checkpoints/final/` | v1: with `<think>` tags, llm_score=74.24 |
| `s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v2-no-think/checkpoints/final/` | v2: no think tags, eval TBD |
```bash
aws s3 cp --recursive \
  s3://polyai-temp-training-only/experiments/audio/ultravox-sft-test/ultravox-v0_6-full-sft-unfrozen-v1/checkpoints/final/ \
  var/models/ultravox-v0_6-unfrozen-v1/
```

The saved checkpoint has `text_model_id` set, which causes vLLM to re-download the base LLaMA weights and overwrite our fine-tuned weights. You must null it out:
```bash
python -c "
import json
with open('var/models/ultravox-v0_6-unfrozen-v1/config.json') as f:
    config = json.load(f)
config['text_model_id'] = None
config['audio_model_id'] = None
with open('var/models/ultravox-v0_6-unfrozen-v1/config.json', 'w') as f:
    json.dump(config, f, indent=2)
"
```

(This fix is automated in `checkpointing.py` on the branch for future training runs.)
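For scripting the same fix across several downloaded checkpoints, the one-liner can be wrapped in a small helper. This is an illustrative sketch, not the actual `checkpointing.py` logic; the function name and signature are assumptions:

```python
import json
from pathlib import Path


def strip_remote_model_ids(checkpoint_dir: str) -> dict:
    """Null out text_model_id/audio_model_id in a checkpoint's config.json
    so vLLM loads the local fine-tuned weights instead of re-downloading
    the base models. Returns the patched config for inspection."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    for key in ("text_model_id", "audio_model_id"):
        config[key] = None
    config_path.write_text(json.dumps(config, indent=2))
    return config
```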
The v1 checkpoint was trained before the enable_think_tags flag existed. Add it:
```bash
python -c "
import json
with open('var/models/ultravox-v0_6-unfrozen-v1/polywhirl.json') as f:
    config = json.load(f)
config['lm_formatter']['enable_think_tags'] = True
with open('var/models/ultravox-v0_6-unfrozen-v1/polywhirl.json', 'w') as f:
    json.dump(config, f, indent=2)
"
```

(The v2 checkpoint doesn't need this — it was trained with the updated formatter.)
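Before launching an eval it can be worth sanity-checking that both manual patches above were applied. A minimal sketch (the function and its checks are assumptions based on the two fixes described here, not part of the polywhirl codebase):

```python
import json
from pathlib import Path


def check_checkpoint_ready(checkpoint_dir: str) -> list:
    """Return a list of problems with a downloaded checkpoint that would
    break vLLM serving or think-tag formatting; empty means ready."""
    root = Path(checkpoint_dir)
    problems = []

    config = json.loads((root / "config.json").read_text())
    if config.get("text_model_id") is not None:
        problems.append("config.json: text_model_id must be null")
    if config.get("audio_model_id") is not None:
        problems.append("config.json: audio_model_id must be null")

    polywhirl = json.loads((root / "polywhirl.json").read_text())
    if "enable_think_tags" not in polywhirl.get("lm_formatter", {}):
        problems.append("polywhirl.json: lm_formatter.enable_think_tags missing")
    return problems
```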
```bash
DATASET=data/audio_eval/examples.v1.3892a6d1.jsonl
OUTPUT_DIR=s3://polyai-temp-training-only/data/polywhirl_eval_results/audio/v1.3892a6d1
FNAME=${OUTPUT_DIR}/ultravox-v0_6-unfrozen-v1

uv run -m polywhirl.evaluation.generate \
  --dataset ${DATASET} \
  --media_root var/audio_data/ \
  --output ${FNAME}.gen.jsonl \
  --model.type POLYWHIRL_VLLM \
  --model.polywhirl_vllm.path var/models/ultravox-v0_6-unfrozen-v1 \
  --model.polywhirl_vllm.temperature 0.0 \
  --model.polywhirl_vllm.limit_audios_per_prompt 20 \
  --model.polywhirl_vllm.batch_size 512 \
  --enable_ood true \
  --default_response_language en-US \
  --allow_overwrite true

uv run -m polywhirl.evaluation.run_judge \
  --generations ${FNAME}.gen.jsonl \
  --output ${FNAME}.eval.jsonl \
  --cache_dir ~/.cache/polywhirl/ \
  --api_judge.version V3_GEMINI \
  --api_judge.type GEMINI \
  --api_judge.name gemini-3.1-pro-preview \
  --api_judge.parallelism 40 \
  --api_judge.temperature 0.1 \
  --api_judge.use_audio true \
  --api_judge.thinking_level LOW \
  --api_judge.use_annotations false \
  --api_judge.use_known_bad_annotations true \
  --api_judge.api_timeout_seconds 300 \
  --api_judge.remove_text_if_tool true \
  --naturalness_judge null \
  --allow_overwrite true

uv run -m polywhirl.evaluation.show_results --input ${FNAME}.eval.jsonl
```

To verify a checkpoint works with audio:
```bash
uv run python -m polywhirl.training.ultravox.test_ultravox \
  --model var/models/ultravox-v0_6-unfrozen-v1
```

This loads the model via HF transformers and generates a response from a test audio file.
| Metric | Qwen Omni baseline | Ultravox v1 |
|---|---|---|
| llm_score | 77.10 | 74.24 |
| error_rate | 0.08% | 0.08% |
| style_score | 87.23 | 87.66 |
| WER | 6.49 | 8.16 |
| tool_accuracy | 98.79 | 97.58 |
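To slice results beyond the aggregate metrics above, the `.eval.jsonl` output can be read directly. A rough sketch, assuming each line is a JSON record carrying a numeric `llm_score` field (the actual record schema may differ):

```python
import json


def mean_llm_score(eval_jsonl_path: str) -> float:
    """Average the llm_score field over eval records (assumed schema)."""
    scores = []
    with open(eval_jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if "llm_score" in record:
                scores.append(record["llm_score"])
    return sum(scores) / len(scores)
```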
See polywhirl/audio/experiments/2026-03-30-ultravox-sft.md for full details.