Anthropic found a cure for laziness
LisanBench data shows Anthropic's Claude Opus 4.8 eliminated "lazy investigation" failures, down from a 91% failure rate in Opus 4.5
The benchmark measures model failure rates on reasoning traps
Users are reacting to claims that Claude Opus 4.8 eliminates lazy investigation errors. Supporters praise the clear improvements in proactivity over 4.7, while critics call the results lies or say the model is still unreliable.
Opus 4.8 the least lazy model ever?

@scaling01 Opus 4.7 is way more lazy than Opus 4.6.
What methodology is this? Opus refuses to check the documentation or notes or confirm its work. It 100% trusts its confabulations and refuses to be reasoned with. More and more like Gemini every iteration.
@scaling01 hot dayum
Interesting. Opus 4.8 should be dramatically less lazy than every other version of Claude
The real question is if I ask Opus 4.8 to look at the data, will they use a regex?
laziness this has been my #1 complaint about opus and ... wow?

@Teknium Getting great performance out of 4.8

@Teknium its token maxing basically.

@bearlyprofit thats not enough money in the pile sir xD

@scaling01 Would love to see stats for GPT-5.5 next to these.

@scaling01 @akarlin This is bullshit. Opus 4.7 is very lazy

@Teknium Opus 4.5 be like

@Teknium in other words

@scaling01 @thesaraharminta They fed the model Adderall?

@scaling01 One of the biggest reasons I tried codex at the end of last year was that claude was way too myopic to ship production quality code
If 4.8 is really much less lazy and has stronger reasoning skills, I'd imagine churn will be a lot less because psychosis will be achieved easier

@scaling01 This is a lie lol
The amount of times Opus 4.7 finds a negative data point and just goes "well that seals the deal, this avenue of investigation is dead" is insane, I have to keep pushing it constantly
0.25% my ass

@nembal Ah thats a good point. they do seem to be pushing their models to use more than seems necessary.
Same tasks opus 4.7 definitely used a significantly larger amt of tokens than 4.6 for me

@Sushilk91 @scaling01 For sure 😂

@scaling01 imagine internally benchmaxxing your own model
like wtf

@scaling01 they must have removed me from the training data