1h ago

LisanBench data shows Anthropic's Claude Opus 4.8 eliminated "lazy investigation" failures, down from 91% in Opus 4.5

The benchmark measures model failure rates on reasoning traps

Original post

Interesting. Opus 4.8 should be dramatically less lazy than every other version of Claude

5:12 PM · May 28, 2026 · 505 Views