@wassname
Last active April 5, 2026 05:35
Guided CoT Evaluation: Hybrid Teacher-Forced + On-Policy Reasoning


A trick for getting better eval signal from thinking models, with a fixed token budget.

The problem

Standard eval for concept ablation is teacher-forced: you feed the model a prefix like "My choice: **" and read the logprobs for Yes vs No. That's fast, but you only measure the effect on one token. The model never gets to reason under ablation, so you miss whether ablation actually changes the chain of thought.

Full on-policy generation (let the model write freely, parse the answer) captures everything but is slow, and parsing Yes/No from free text is fragile.

The trick

Let the model think for a bit, then force it to answer. Three steps:

  1. Generate a short reasoning trace (32 tokens) under ablation, greedy decoding
  2. Append a fixed suffix: \nI should answer now.\n</think>\nMy choice: **
  3. One forward pass on the whole sequence, read logprobs at the final position

The output is a logratio: log P(Yes) - log P(No), with each side summed over the tokenizer variants of "Yes"/"No" via logsumexp. Because the score is a continuous logratio rather than a hard label, you can also compute calibrated uncertainties.
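As a minimal sketch of that scoring arithmetic (the logprob values below are illustrative, not from a real model):

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of logprobs.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical final-position logprobs for the tokenizer variants
# of each answer ("Yes", " Yes", "yes" / "No", " No", "no").
yes_logprobs = [-0.8, -2.5, -3.0]
no_logprobs = [-1.6, -3.2, -4.0]

logratio = logsumexp(yes_logprobs) - logsumexp(no_logprobs)
# Positive logratio means the model leans Yes.
```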

Here's what the token sequence looks like:

<|im_start|>assistant
<think>                              <-- chat template adds this
Thinking Process:
1. Analyze the Request:
   * Role: Main person...            <-- 32 tokens of on-policy reasoning
I should answer now.                  <-- appended
</think>
My choice: **                         <-- appended, score here

Pseudocode

def guided_eval(model, prompt, n_think=32):
    # prompt ends at "<think>\n" (from chat template)

    # ── 1. On-policy thinking (ablated) ──
    ids = model.generate(prompt, max_new=n_think, greedy=True)

    # ── 2. Force transition to answer ──
    suffix = tokenize("\nI should answer now.\n</think>\nMy choice: **")
    ids = cat([ids, suffix])

    # ── 3. Score final position ──
    logits = model(ids).logits[:, -1, :]   # ℓ ∈ ℝ^V
    p = log_softmax(logits)

    p_yes = logsumexp(p[yes_ids])
    p_no = logsumexp(p[no_ids])

    # pmass = exp(p_yes) + exp(p_no) should be > 0.5
    # if not, the model isn't predicting Yes/No

    return p_yes - p_no                 # logratio
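To check the scoring math end to end without a model, here's a self-contained sketch over a toy vocabulary (the ids and logit values are made up for illustration):

```python
import math

def log_softmax(logits):
    # Stable log-softmax over a plain list of logits.
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Toy 6-token vocab: ids 0-1 are "Yes" variants, ids 2-3 are "No" variants.
yes_ids, no_ids = [0, 1], [2, 3]
final_logits = [4.0, 1.5, 2.0, 0.5, -1.0, -1.0]   # illustrative values

p = log_softmax(final_logits)
p_yes = logsumexp([p[i] for i in yes_ids])
p_no = logsumexp([p[i] for i in no_ids])

pmass = math.exp(p_yes) + math.exp(p_no)   # sanity check: should be > 0.5
logratio = p_yes - p_no
```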

When to use it

Teacher-forced is fine if you just want "does ablation flip the answer?" Guided CoT is better when you care about how ablation changes the reasoning path, because 32 tokens is enough for the chain of thought to diverge before scoring.

In practice, guided logratios correlate with teacher-forced (same sign, similar magnitude) but with more variance. That variance is from the reasoning trace, and it's informative.

I've used this across several projects and it gives better uncertainty estimates than teacher-forced, since you get proper logratios from a model that actually reasoned about the question.

Cost: 32 think tokens + 13 suffix tokens + 1 forward pass per item. For a 1360-item sweep at 8 prompts each, that's ~11K short generations instead of ~11K long ones for full on-policy.
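A quick check of the sweep arithmetic above (numbers taken straight from the text):

```python
# Per-item token budget in guided mode.
n_think, n_suffix = 32, 13            # generated + appended tokens
items, prompts_each = 1360, 8

generations = items * prompts_each    # number of short generations
think_tokens = generations * n_think  # total on-policy tokens generated
```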

Troubleshooting

pmass < 0.5

pmass is exp(p_yes) + exp(p_no), the total probability on Yes/No. If it's below 0.5, the model isn't confidently choosing either option after "My choice: **".
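A small diagnostic helper for this check (a sketch; the `p_yes`/`p_no` values below are illustrative):

```python
import math

def check_answer_mass(p_yes, p_no, threshold=0.5):
    """Total probability on Yes/No variants; flag when it's too low to trust."""
    pmass = math.exp(p_yes) + math.exp(p_no)
    return pmass, pmass >= threshold

# E.g. 35% of mass on Yes variants, 10% on No variants -> pmass = 0.45, flagged.
pmass, ok = check_answer_mass(math.log(0.35), math.log(0.10))
```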

Things to check:

  • The </think> token must be the special token (ID 248069 for Qwen3.5), not the raw string. Tokenizers handle this differently.
  • If n_think is too low, the model hasn't finished a coherent thought and the forced suffix confuses it. Try bumping to 64.
  • The model needs to support <think> blocks in its chat template.

Incoherent thinking trace

  • Make sure the prompt ends at the chat template's generation point. For Qwen3.5 that's <|im_start|>assistant\n<think>\n. Don't add your own <think> tag on top.
  • Large ablation coefficients (|c| > 2) can make generation incoherent. Look at the thinking trace first when debugging.

OOM

model.generate() allocates KV cache, which uses more memory than a single forward pass. I use bs=4 for guided mode vs bs=16 for teacher-forced.

Logratios identical to teacher-forced

The trace is too short for reasoning to diverge. At 4 tokens there's barely any thinking, so the score converges to teacher-forced. Start at 32, try 64-128 for more divergence.

Design notes

Greedy decoding (do_sample=False) is used for the thinking trace so measurements are deterministic across runs.

The "I should answer now" suffix gives the model a natural transition into answering. Without it, a bare </think> appears abruptly and the model doesn't handle the context switch as cleanly.

The model doesn't generate its own </think> because it might never produce one, and a fixed token budget keeps runs comparable.

logsumexp over multiple Yes/No token IDs because tokenizers can encode "Yes" as "Yes", "yes", " Yes", etc. Summing captures the full decision mass.
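One way to collect those variant token ids, sketched against a toy encode function standing in for a real tokenizer (the vocabulary below is hypothetical):

```python
def variant_ids(encode, word):
    """Single-token ids for casing/leading-space variants of `word`."""
    ids = set()
    for v in (word, word.lower(), " " + word, " " + word.lower()):
        toks = encode(v)
        if len(toks) == 1:   # keep only variants that encode to one token
            ids.add(toks[0])
    return sorted(ids)

# Toy vocabulary; a real tokenizer's encode() would replace this lambda.
vocab = {"Yes": 5, "yes": 6, " Yes": 7, " yes": 8}
ids = variant_ids(lambda s: [vocab[s]] if s in vocab else [0, 0], "Yes")
```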

wassname commented Apr 5, 2026

I'll note:

  • this works well and leads to much faster research
  • you get much less variance thanks to the logprobs, so you can eval on fewer rollouts
  • it correlates well with a full eval, so I treat it as a quick proxy for dev, and it also gives more informative error bars.
