/Tech34d ago

Yoav Gur Arieh finds most unfaithful chain-of-thought detectors perform near random chance

Their BonaFide benchmark generates ground-truth faithfulness labels.

7155289515.1K

#602

Original post

Ana Marasović#602

Yoav Gur Arieh@GurYoav

Can we tell when LLMs are being unfaithful in their chains of thought?

We evaluated 8 methods claiming to do this, and found that most perform near chance!

But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

8:16 AM · May 26, 2026 · 13.1K Views

Sentiment

Sentiment building, check back later.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Posts from X

Most Activity

VIEWS2KBOOKMARKS9LIKES14RETWEETS4

Ana Marasović@anmarasovic

We evaluated CoT faithfulness evaluations & released 𝐁𝐨𝐧𝐚𝐅𝐢𝐝𝐞 so you can test yours too!!

Yoav Gur Arieh@GurYoav

Can we tell when LLMs are being unfaithful in their chains of thought?

We evaluated 8 methods claiming to do this, and found that most perform near chance!

But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

34d2K149

REPLIES1

Yoav Gur Arieh@GurYoav

Without reliable faithfulness metrics, we can't know when to trust LLMs' reasoning. BonaFide lets us measure those metrics, and build better ones.

Joint work with @anmarasovic and @megamor2

📄 https://arxiv.org/pdf/2605.25052 💻 https://github.com/yoavgur/BonaFide https://huggingface.co/collections/yoavgurarieh/bonafide

34d294

Yoav Gur Arieh@GurYoav

LLMs have been shown to be unfaithful in their CoTs, eg they'll cheat on a task and omit it from their thinking. This makes monitoring them difficult!

To address this, CoT faithfulness metrics were introduced. But it's remained unknown whether these metrics actually work.

34d383

Yoav Gur Arieh@GurYoav

Evaluating them requires ground-truth faithfulness labels, hard to obtain since LLMs' reasoning isn't observable.

Our approach: design tasks where the output tells us which steps the model took. If those steps appear in the CoT, they're faithful. If not, the CoT is unfaithful.

34d283

Yoav Gur Arieh@GurYoav

For example, if I ask "Who painted Starry Night?" and hint at an implausible answer (eg Da Vinci), then if the model answers according to the hint, we know it must have used it.

Thus an ack of the hint would be faithful, while an omission or misattribution would be unfaithful.

34d263

Yoav Gur Arieh@GurYoav

We also add context to a finding that reasoning models are more faithful than non-reasoning ones. We find this holds only for unfaithfulness by omission (not mentioning a step), which is superseded by unfaithfulness by commission (eg misattribution, lying about tool use).

34d243

Yoav Gur Arieh@GurYoav

Our analyses also show that metrics disagree, and that many are very skewed.

Ones that work by evaluating the importance of a step tend to overestimate unfaithfulness, while others that work by seeing if the CoT contains the info required for reaching the answer underestimate it

34d233

Yoav Gur Arieh@GurYoav

Using this and other methods, we create BonaFide, a dataset of 3k labeled CoTs, which we use to evaluate faithfulness metrics, finding that most perform near chance!

We also show that many metrics are ill-suited for real-time deployment, taking >1k seconds to run per example.

34d233

Yoav Gur Arieh@GurYoav

Could be interesting for people who are working on monitoring / faithfulness! @KeremZaman3 @mtutek @tomekkorbak @bobabowen @yanda_chen_ @milesaturpin @peterbhase @FabienDRoger

34d304

Mor Geva@megamor2

Monitoring whether what LLMs say faithfully reflects their internal reasoning is increasingly important for safety and trust

*BonaFide* is a first step towards bridging verbalized and latent reasoning in LLMs -- check it out!

Proud of this work by my student @GurYoav with @anmarasovic!

Yoav Gur Arieh@GurYoav

Can we tell when LLMs are being unfaithful in their chains of thought?

We evaluated 8 methods claiming to do this, and found that most perform near chance!

But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?

34d6720