/Tech · 32d ago

LisanBench data shows Anthropic's Claude Opus 4.8 eliminated "lazy investigation" failures, down from 91% in Opus 4.5

The benchmark measures model failure rates on reasoning traps

#331

Original post

Lisan al Gaib@scaling01 · #1215 in Tech

Anthropic found a cure for laziness

9:58 AM · May 28, 2026 · 196.7K Views

Sentiment

Users are reacting to claims that Claude Opus 4.8 eliminates lazy investigation errors. Positive reactions praise clear improvements in proactivity over 4.7, while negative ones call the results lies or say the model is still unreliable.

Pos: 53.8%

Neg: 46.2%

16 comments with sentiment.

Posts from X

Most Activity

Views: 30.1K · Bookmarks: 26 · Likes: 471 · Retweets: 8 · Replies: 36

Teknium 🪽@Teknium

Opus 4.8 the least lazy model ever?

Lisan al Gaib@scaling01

Anthropic found a cure for laziness

32d · 30.1K views · 471 likes · 26 bookmarks

Matt Parrott@MatthewParrott

@scaling01 Opus 4.7 is way more lazy than Opus 4.6.

What methodology is this? Opus refuses to check the documentation or notes or confirm its work. It 100% trusts its confabulations and refuses to be reasoned with. More and more like Gemini every iteration.

32d3.1K48

Teknium 🪽@Teknium

@scaling01 hot dayum

Lisan al Gaib@scaling01

Anthropic found a cure for laziness

32d2.8K441

wh@nrehiew_

Interesting. Opus 4.8 should be dramatically less lazy than every other version of Claude

32d1.2K203

xlr8harder@xlr8harder

The real question is if I ask Opus 4.8 to look at the data, will they use a regex?

Lisan al Gaib@scaling01

Anthropic found a cure for laziness

32d889202

Siqi Chen@blader

laziness this has been my #1 complaint about opus and ... wow?

32d1.4K200

embw_l0x@embw_l0x

@Teknium Getting great performance out of 4.8

32d36520

Balázs Némethi@nembal

@Teknium its token maxing basically.

32d43851

Teknium 🪽@Teknium

@bearlyprofit thats not enough money in the pile sir xD

32d31512

Rifrafgiraffe@rifrafgiraffe

@scaling01 Would love to see stats for GPT-5.5 next to these.

32d1.2K12

UBERSOY@UBERSOY1

@scaling01 @akarlin This is bullshit. Opus 4.7 is very lazy

32d29951

Hermes Agent Tips@HermesAgentTips

@Teknium Opus 4.5 be like

32d36941

bearly profitable@bearlyprofit

@Teknium in other words

32d3687

Luke Mlody@lmlody

@scaling01 @thesaraharminta They fed the model Adderall?

32d4493

Matt Stallone@mattstallone

@scaling01 One of the biggest reasons I tried codex at the end of last year was that claude was way too myopic to ship production quality code

If 4.8 is really much less lazy and has stronger reasoning skills, I'd imagine churn will be a lot less because psychosis will be achieved easier

32d2.4K8

Daniel Blanco@dani43321

@scaling01 This is a lie lol

The amount of times Opus 4.7 finds a negative data point and just goes "well that seals the deal, this avenue of investigation is dead" is insane, I have to keep pushing it constantly

0.25% my ass

32d4905

Teknium 🪽@Teknium

@nembal Ah thats a good point. they do seem to be pushing their models to use more than seems necessary.

Same tasks opus 4.7 definitely used a significantly larger amt of tokens than 4.6 for me

32d3475