A trick for getting better eval signal from thinking models, with a fixed token budget.
Standard eval for concept ablation is teacher-forced: you feed the model a prefix like "My choice: **" and read the logprobs for Yes vs No. That's fast, but you only measure the effect on one token. The model never gets to reason under ablation, so you miss whether ablation actually changes the chain of thought.
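The teacher-forced readout can be sketched in a few lines. This is a minimal toy, not the real harness: `toy_model` is a stand-in for the (ablated) model's forward pass, returning next-token logits after the forced prefix, and the word-level vocab is purely illustrative.

```python
import math

# Toy vocabulary; a real tokenizer would map "Yes"/"No" to token ids.
VOCAB = {"Yes": 0, "No": 1, "Maybe": 2}

def toy_model(prompt):
    # Placeholder: a real model returns logits over its vocabulary,
    # conditioned on the forced prefix ending in 'My choice: **'.
    return [2.0, 0.5, -1.0]

def teacher_forced_eval(model, question):
    prompt = question + "\nMy choice: **"  # forced prefix, no generation
    logits = model(prompt)
    # Log-softmax over the single next-token distribution.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    lp_yes = logits[VOCAB["Yes"]] - log_z
    lp_no = logits[VOCAB["No"]] - log_z
    return lp_yes, lp_no

lp_yes, lp_no = teacher_forced_eval(toy_model, "Is the sky blue?")
```

One forward pass per question, which is why this is fast; but everything upstream of that single token (the chain of thought) is frozen by the prefix.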
Full on-policy generation (let the model write freely, parse the answer) captures everything but is slow, and parsing Yes/No from free text is fragile.
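To make the fragility concrete, here is a sketch of the parsing step in on-policy eval. The regex heuristics are assumptions, not a reference implementation; real outputs break them in exactly the ways the post warns about.

```python
import re

def parse_yes_no(text):
    # Prefer an explicitly marked answer like '**Yes**'.
    m = re.search(r"\*\*(Yes|No)\*\*", text, re.IGNORECASE)
    if m:
        return m.group(1).capitalize()
    # Fall back to the last bare Yes/No anywhere in the text.
    hits = re.findall(r"\b(Yes|No)\b", text, re.IGNORECASE)
    return hits[-1].capitalize() if hits else None

parse_yes_no("Let me think... My choice: **Yes**")  # → "Yes"
# Fragile case: "No doubt" matches the fallback pattern too, and the
# heuristic of taking the *last* hit is what rescues it here.
parse_yes_no("No doubt the answer is yes")          # → "Yes"
```

Each failure mode (hedged answers, "Yes and no", restated questions) needs another special case, which is the maintenance cost the free-text approach carries.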