we might have a GPT-5.2-xhigh situation on our hands
Opus 4.8 low thinks almost as much as Opus 4.6 high
Extra-high effort yields a 70% pass rate.
Supporters celebrate Claude Opus 4.8 matching prior benchmark peaks at lower reasoning effort, while critics object to higher costs and a continued reliance on burning more tokens rather than on efficiency gains.
this looks much better
okay might just be the benchmark
PB is seemingly close to being solved, so it was in fact an elicitation (and money) issue
Sadly they don't specify the harness for PB in the system card, while they do for some other benches

@scaling01 GPT-5.5 pulled the same compression last month.
quarter the tokens, same horizon. Opus 4.8 low eating 4.6 high's lunch means the reasoning budget is now a knob, not a tier.

@scaling01 "max" seems to be regressing vs. "x-high", crazy.
@scaling01 the default is iso-compute to 4.7 for coding tasks 🙏

@scaling01 Anthropic's solution has always been to burn more tokens instead of making smarter models

@scaling01 It's the new SOTA on a few of our benchmarks-

@scaling01 Even more expensive as a consequence 🤦🏻‍♂️ I was really hoping Anthropic could match the token-efficiency gains OpenAI achieved with 5.5

@scaling01 cuz Opus 4.5~4.8 are small models, like 1T~2T

@scaling01 max below xhigh ? tf ?

@scaling01 This feels like something I can’t afford. Praying the price is improved.

@scaling01 Mine almost never triggers thinking

@scaling01 I was hoping the opposite would happen, given they have such a strong model to distill from.
But I guess enterprise cares more about raw intelligence.

@scaling01 The inference chain lengths blurring between model tiers could compress the premium pricing delta—curious if Anthropic’s token economics adjust to match.

@scaling01 bruh they went the opposite way from what people were demanding

@DaBrown95 @scaling01 maybe they will do that with opus 5? sonnet 3.7 was also thinking a lot but v4 did only a little thinking

@xeophon i guess we can just assume it was $ if it isn't specified

@Britoisinsane @scaling01 gpt-5.5 is small too? smaller than opus