A trick for getting better eval signal from thinking models, with a fixed token budget.
Standard eval for concept ablation is teacher-forced: you feed the model a prefix like "My choice: **" and read the logprobs for Yes vs No. That's fast, but you only measure the effect on one token. The model never gets to reason under ablation, so you miss whether ablation actually changes the chain of thought.
Full on-policy generation (let the model write freely, parse the answer) captures everything but is slow, and parsing Yes/No from free text is fragile.
Let the model think for a bit, then force it to answer. Three steps:
- Generate a short reasoning trace (32 tokens) under ablation, greedy decoding
- Append a fixed suffix: `\nI should answer now.\n</think>\nMy choice: **`
- One forward pass on the whole sequence, read logprobs at the final position
The output is a logratio: log P(Yes) - log P(No), where each side sums tokenizer variants of "Yes"/"No" via logsumexp. Because the score is a continuous logratio rather than a hard label, you can compute calibrated uncertainties.
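The aggregation step is small enough to show directly. A minimal pure-stdlib sketch (the helper names are mine, not from any library):

```python
import math

def logsumexp(xs):
    # numerically stable log(sum(exp(x) for x in xs))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def yes_no_logratio(logprobs, yes_ids, no_ids):
    # logprobs: log-softmaxed logits at the final position, indexed by token id
    # yes_ids / no_ids: token-id variants of "Yes" / "No"
    log_yes = logsumexp([logprobs[i] for i in yes_ids])
    log_no = logsumexp([logprobs[i] for i in no_ids])
    return log_yes - log_no
```

With P(" Yes") = 0.5, P("Yes") = 0.1, P(" No") = 0.2, the logratio is log(0.6/0.2) = log 3 ≈ 1.10.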
Here's what the token sequence looks like:
<|im_start|>assistant
<think> <-- chat template adds this
Thinking Process:
1. Analyze the Request:
* Role: Main person... <-- 32 tokens of on-policy reasoning
I should answer now. <-- appended
</think>
My choice: ** <-- appended, score here
def guided_eval(model, prompt_ids, n_think=32):
    # prompt_ids ends at "<think>\n" (from the chat template)
    # tokenize / cat / log_softmax / logsumexp: torch-style helpers
    # ── 1. On-policy thinking (ablated) ──
    ids = model.generate(prompt_ids, max_new_tokens=n_think, do_sample=False)
    # ── 2. Force transition to answer ──
    suffix = tokenize("\nI should answer now.\n</think>\nMy choice: **")
    ids = cat([ids, suffix])
    # ── 3. Score final position ──
    logits = model(ids).logits[:, -1, :]     # shape: (batch, vocab)
    logp = log_softmax(logits, dim=-1)
    p_yes = logsumexp(logp[:, yes_ids], dim=-1)
    p_no = logsumexp(logp[:, no_ids], dim=-1)
    # pmass = exp(p_yes) + exp(p_no) should be > 0.5;
    # if not, the model isn't predicting Yes/No
    return p_yes - p_no                      # logratio

Teacher-forced is fine if you just want "does ablation flip the answer?" Guided CoT is better when you care about how ablation changes the reasoning path, because 32 tokens is enough for the chain of thought to diverge before scoring.
In practice, guided logratios correlate with teacher-forced (same sign, similar magnitude) but with more variance. That variance is from the reasoning trace, and it's informative.
I've used this across several projects and it gives better uncertainty estimates than teacher-forced, since you get proper logratios from a model that actually reasoned about the question.
Cost: 32 think tokens + 13 suffix tokens + 1 forward pass per item. For a 1360-item sweep at 8 prompts each, that's ~10K short generations instead of ~10K long ones for full on-policy.
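The arithmetic behind that estimate, spelled out:

```python
items, prompts_per_item = 1360, 8
n_think, n_suffix = 32, 13

runs = items * prompts_per_item    # short generations in the sweep (~10K)
new_tokens = n_think + n_suffix    # decoded + appended tokens per run,
                                   # plus one full forward pass to score
```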
pmass is exp(p_yes) + exp(p_no), the total probability on Yes/No. If it's below 0.5, the model isn't confidently choosing either option after "My choice: **".
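That validity filter is one line; a sketch with the 0.5 threshold from above (the function name is mine):

```python
import math

def pmass_ok(p_yes, p_no, threshold=0.5):
    # total probability mass on Yes/No at the scored position;
    # below threshold, the logratio isn't measuring a real Yes/No choice
    pmass = math.exp(p_yes) + math.exp(p_no)
    return pmass, pmass > threshold
```

For example, p_yes = log 0.6 and p_no = log 0.3 gives pmass 0.9, which passes.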
Things to check:
- The `</think>` token must be the special token (ID 248069 for Qwen3.5), not the raw string. Tokenizers handle this differently.
- If `n_think` is too low, the model hasn't finished a coherent thought and the forced suffix confuses it. Try bumping to 64.
- The model needs to support `<think>` blocks in its chat template.
- Make sure the prompt ends at the chat template's generation point. For Qwen3.5 that's `<|im_start|>assistant\n<think>\n`. Don't add your own `<think>` tag on top.
- Large ablation coefficients (|c| > 2) can make generation incoherent. Look at the thinking trace first when debugging.
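The first check is cheap to automate. A sketch assuming an HF-style `tokenizer.encode` (the helper name is mine):

```python
def single_think_token_id(tokenizer, tag="</think>"):
    # the closing tag must encode to exactly one (special) token id;
    # if it splits into several, the suffix contains a raw "</think>"
    # string instead of the real end-of-thinking token
    ids = tokenizer.encode(tag, add_special_tokens=False)
    if len(ids) != 1:
        raise ValueError(f"{tag!r} encodes to {len(ids)} tokens: {ids}")
    return ids[0]
```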
model.generate() allocates KV cache, which uses more memory than a single forward pass. I use bs=4 for guided mode vs bs=16 for teacher-forced.
If the trace is too short, reasoning doesn't have room to diverge: at 4 tokens there's barely any thinking, so the score converges to teacher-forced. Start at 32, try 64-128 for more divergence.
Use greedy decoding (do_sample=False) for the thinking trace so measurements are deterministic across runs.
The "I should answer now" suffix gives the model a natural transition into answering. Without it, a bare </think> appears abruptly and the model doesn't handle the context switch as cleanly.
I force the `</think>` rather than letting the model generate its own: it might never produce one, and a fixed token budget keeps runs comparable.
logsumexp over multiple Yes/No token IDs because tokenizers can encode "Yes" as "Yes", "yes", " Yes", etc. Summing captures the full decision mass.
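Collecting those variant IDs up front is a few lines. A sketch, again assuming an HF-style `encode` (the helper name is mine); only single-token encodings are kept, since multi-token variants can't be read off one position:

```python
def surface_variant_ids(tokenizer, word):
    # collect single-token ids for common surface forms of an answer
    # word: "Yes", "yes", " Yes", " yes"
    ids = set()
    for v in (word, word.lower(), " " + word, " " + word.lower()):
        toks = tokenizer.encode(v, add_special_tokens=False)
        if len(toks) == 1:        # skip multi-token encodings
            ids.add(toks[0])
    return sorted(ids)
```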
I'll note