Yoav Gur Arieh finds most detectors of unfaithful chain-of-thought perform near random chance

Their BonaFide benchmark generates ground-truth faithfulness labels.

Original post

Yoav Gur Arieh @GurYoav

Can we tell when LLMs are being unfaithful in their chains of thought? We evaluated 8 methods claiming to do this, and found that most perform near chance! But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

8:16 AM · May 26, 2026

QUOTE POST

Ana Marasović @anmarasovic

We evaluated CoT faithfulness evaluations & released BonaFide so you can test yours too!!

Yoav Gur Arieh @GurYoav

Can we tell when LLMs are being unfaithful in their chains of thought? We evaluated 8 methods claiming to do this, and found that most perform near chance! But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

3:16 PM · May 26, 2026 · 8.6K Views

3:33 PM · May 26, 2026 · 1.2K Views

QUOTE POST

Mor Geva @MEGAMOR2

Monitoring whether what LLMs say faithfully reflects their internal reasoning is increasingly important for safety and trust.

*BonaFide* is a first step towards bridging verbalized and latent reasoning in LLMs -- check it out!

Proud of this work by my student @GurYoav with @anmarasovic!

4:26 PM · May 26, 2026 · 67 Views
