Yoav Gur Arieh finds most detectors of unfaithful chain-of-thought perform near random chance
Their BonaFide benchmark generates ground-truth faithfulness labels.
We evaluated CoT faithfulness evaluation methods & released BonaFide so you can test yours too!!
Can we tell when LLMs are being unfaithful in their chains of thought? We evaluated 8 methods claiming to do this, and found that most perform near chance! But evaluating this requires us to have ground-truth labels for CoT faithfulness. How can we obtain these?
Monitoring whether what LLMs say faithfully reflects their internal reasoning is increasingly important for safety and trust.
*BonaFide* is a first step towards bridging verbalized and latent reasoning in LLMs -- check it out!
Proud of this work by my student @GurYoav with @anmarasovic!