LisanBench data shows Anthropic's Claude Opus 4.8 cut "lazy investigation" failures to zero, down from 91% in Opus 4.5
The benchmark measures model failure rates on reasoning traps
Many users praised Claude Opus 4.8's zero lazy investigation rate as a breakthrough improvement in reliability, while others dismissed the claim after seeing persistent underperformance in their own tests.
8 comments with sentiment.