
Yoav Gur Arieh finds that most methods for detecting unfaithful chain-of-thought perform near chance

Their BonaFide benchmark generates ground-truth faithfulness labels.

Original post

Can we tell when LLMs are being unfaithful in their chains of thought? We evaluated 8 methods claiming to do this, and found that most perform near chance! But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

8:16 AM · May 26, 2026

We evaluated CoT faithfulness evaluations & released BonaFide so you can test yours too!!

3:33 PM · May 26, 2026 · 1K Views

Monitoring whether what LLMs say faithfully reflects their internal reasoning is increasingly important for safety and trust.

*BonaFide* is a first step towards bridging verbalized and latent reasoning in LLMs -- check it out!

Proud of this work by my student @GurYoav with @anmarasovic!

4:26 PM · May 26, 2026 · 67 Views