
Paper: https://arxiv.org/abs/2605.08346
Work done w/Minh Vu, @HongliZhan, @liraymond96, & Manish Bhattarai.

TRACT stays stable under both FORCE and REMOVE since it scores the reasoning body, not the endpoint. It also stacks well: fusing TRACT with existing detectors adds +5 to +20 AUC points on average across all 5 models.

We ran this across 4 benchmarks and 5 models. Some detectors swing 20+ AUC points just from changing or removing answer cues, even though the reasoning is untouched.

Here's what we did, using two interventions. FORCE: replace the final answer with the ground truth. REMOVE: delete the answer step entirely.
Same reasoning body both times. A trace-faithful detector should remain informative under both.
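
To make the two interventions concrete, here is a minimal sketch. The `Trace` container, the `render` helper, and the "Final answer:" formatting are illustrative assumptions, not the paper's code; the only thing taken from the thread is that FORCE swaps in the ground truth, REMOVE drops the answer step, and the reasoning body is left untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Trace:
    """Hypothetical container: a reasoning body plus an optional answer step."""
    reasoning: str          # the reasoning body, never modified below
    answer: Optional[str]   # final answer step; None means no answer step


def force(trace: Trace, ground_truth: str) -> Trace:
    """FORCE: replace the final answer with the ground truth."""
    return replace(trace, answer=ground_truth)


def remove(trace: Trace) -> Trace:
    """REMOVE: delete the answer step entirely."""
    return replace(trace, answer=None)


def render(trace: Trace) -> str:
    """Turn a trace back into the text a detector would score (assumed format)."""
    if trace.answer is None:
        return trace.reasoning
    return f"{trace.reasoning}\nFinal answer: {trace.answer}"


if __name__ == "__main__":
    original = Trace(reasoning="2 + 2 is 4, then add 1.", answer="6")
    print(render(force(original, "5")))
    print(render(remove(original)))
```

Scoring `render(original)`, `render(force(...))`, and `render(remove(...))` with the same detector and comparing the resulting AUCs is one way to check how much that detector leans on the answer cue.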

So we asked: what does the reasoning itself look like when it's going wrong? It wanders, hedges, grows uneven, or diverges across samples. We built TRACT to pick up on these trajectory patterns as a lightweight text-only score.
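
As a rough illustration of what text-only trajectory signals could look like, here is a toy sketch. These are not TRACT's actual features: the hedge word list, the function names, and the choice of coefficient of variation and Jaccard distance are all assumptions made for this example.

```python
import re
from statistics import mean, pstdev

# Hypothetical hedge lexicon, chosen only for illustration.
HEDGES = {"maybe", "perhaps", "possibly", "might", "unsure", "wait", "hmm"}


def tokens(text):
    """Lowercased alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z']+", text.lower())


def hedge_rate(steps):
    """Fraction of tokens across all steps that are hedge words."""
    words = [w for s in steps for w in tokens(s)]
    return sum(w in HEDGES for w in words) / len(words) if words else 0.0


def length_unevenness(steps):
    """Coefficient of variation of step lengths (0.0 when all steps match)."""
    lengths = [len(tokens(s)) for s in steps]
    if not lengths:
        return 0.0
    m = mean(lengths)
    return pstdev(lengths) / m if m else 0.0


def divergence(samples):
    """Mean pairwise Jaccard distance between token sets of sampled traces."""
    sets = [set(tokens(" ".join(steps))) for steps in samples]
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    if not pairs:
        return 0.0
    return mean(1 - len(a & b) / len(a | b) if a | b else 0.0 for a, b in pairs)


if __name__ == "__main__":
    steps = ["Maybe x is 3.", "Wait, perhaps x is 4 because the sum is larger."]
    print(hedge_rate(steps), length_unevenness(steps))
    print(divergence([["x is 3"], ["x is 4 or 5"]]))
```

Each function reads only the reasoning steps, so its output is unchanged when the answer step is forced or removed.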