Anthropic releases Claude Opus 4.8, which outperforms GPT-5.5 on senior-level engineering and writing tasks in early evaluations

VIEWS 256.6K · BOOKMARKS 811 · RETWEETS 137 · REPLIES 131

BREAKING:

Anthropic just dropped Opus 4.8—and it is a MONSTER

We've been testing for about a week @every and our verdict is they could've just called it Opus 5, it's that good.

Here's our vibe check:

- Beats GPT-5.5 on Senior Engineer bench. On our toughest benchmark Opus 4.8 scores a 63—a hair higher than GPT-5.5's score of 62, and a full 30 points higher than Opus 4.7. It tackled a ground-up rewrite of a production codebase, and actually built something that works.

HOWEVER: Coding performance varied a lot at different reasoning levels. We recommend using it on xhigh for best results.

- Incredibly good writer. Opus 4.8 scored a 79.6 on our writing benchmark—measuring models on real-world writing tasks we do all of the time like essay writing, promo email writing, and more. It beats GPT-5.5 by 6 points. It produces well-written prose with fewer "AI-isms". It's also very good at writing in your voice given the right context.

HOWEVER: Writing performance also varied with reasoning levels. Medium reasoning had higher incidence of AI-isms—we found best results with high.

- Beast at knowledge work. Opus 4.8 is very good at general knowledge work tasks like report creation, research and more. It produced the best PowerPoint one-shot we've ever seen on our deck generation benchmark.

- Emotionally intelligent, willing to question the frame. I've also found it to be quite good at talking through psychological or interpersonal issues. It has a high EQ, and it's also good at not glazing and helping to expand your perspective. Its thought process feels extremely rich and dynamic.

THE BAD: These days a model is only as good as its harness, and Codex is still a far superior harness to the Claude Desktop app. This has kept me using Codex + GPT-5.5 as my daily driver, but I am flipping back and forth a lot more between Codex and Claude.

Anthropic is back baby!

Read the rest on @every: https://every.to/vibe-check/opus-4-8-vibecheck

32d · 256.6K · 1.7K · 811

LIKES 1.7K

Chubby♨️@kimmonismus

Opus 4.8 is clearly a strong model, but my impression is that Anthropic is increasingly playing catch-up with OpenAI rather than setting the pace.

It feels like GPT-5.5 has shifted the benchmark again, and if OpenAI keeps this trajectory, GPT-5.6 could very plausibly become the stronger overall model.

Initial testing says 4.8 is good-ish

32d144.7K1.7K128

Theo - t3.gg@theo

Opus 4.8 is here. It's pretty good. Is Anthropic back on top?

31d84.9K701195

Bindu Reddy@bindureddy

🚨 Opus 4.8 Still Trails Behind GPT 5.5 And Is A Very Incremental Release

Opus 4.8 barely inches past 4.7 on benchmarks but lags behind GPT 5.5 considerably!!

Anthropic may be stalling a bit given its last two releases. OpenAI has a huge opening with GPT 5.6 coming soon

Live On ChatLLM!!

32d34.6K41564

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

…this is quite crushing. People unsubscribing from Ant will crawl back. Things are beginning to go very fast. And yes it feels a bit more "AGI-like" than 4.6 immediately (4.7 was a tool anyway and not appreciably sharper). 5.6 next, then what?

Vals AI@ValsAI

Anthropic just dropped another powerhouse model, Opus 4.8 and it’s the new SOTA on the Vals Index (70.2%) and Vals Multimodal (70.7%). Full results below.

32d22.4K24854

Lisan al Gaib@scaling01

Vals AI@ValsAI

Anthropic just dropped another powerhouse model, Opus 4.8 and it’s the new SOTA on the Vals Index (70.2%) and Vals Multimodal (70.7%). Full results below.

32d14.4K22917

Dan Shipper 📧@danshipper

@every Read our full vibe check on @every: https://every.to/vibe-check/opus-4-8-vibecheck

32d7.8K1810

FLORA ©@floraai

Opus 4.8 is live in FLORA. The @every said it better than we could.

Monster mode, in the canvas.

32d7.7K1310

Amol Avasare@TheAmolAvasare

Benchmarks are great, but IMO the behavior change is a much bigger deal. Plans before it edits, recovers from its own errors, and finds creative ways around obstacles instead of stalling.

Feels much more like a senior engineer than 4.7, and better at long-horizon work.

32d3.1K633

Felix Thöni@f_thoeniX

@kimmonismus you have to try out the new dynamic workflows in Claude Code. That is the real power of Opus 4.8. It's been out for about 4 hours and I have already burned 300 mil tokens and implemented 3 new features that would have taken me 4-5 days before. This thing is insane.

32d5.4K78

Eric Bahn 💛@ericbahn

Good news for those of us white-knuckling through Opus 4.7...

32d3.8K106

Ben Pielstick@BenPielstick

@kimmonismus Remember Anthropic is trying not to advance the state of technology and is instead trying to ideologically control the direction of AI by keeping up as little as possible while maintaining relevancy. If they wanted to maximally accelerate they could.

32d7.3K352

Dan Shipper 📧@danshipper

@every Watch on YouTube: https://www.youtube.com/watch?v=kXYDUozZAhk&feature=youtu.be

32d2.3K56

Alex Albert@alexalbert__

@danshipper @every thanks for testing Dan!

32d5K331

Lee Knowlton@leeknowlton

Getting @danshipper out of Codex is no simple feat!

Stoked to see Anthropic come back strong.

32d2.8K83

orlie@sunglassesface

@kimmonismus it's a lot of that cat and mouse game. in 6 months tables might turn again. it's so annoying to keep switching models (my credit card thinks i have adhd)

32d5.1K231

Amol Avasare@TheAmolAvasare

Also shipping today:
• Fast mode for Opus 4.8 – 2.5x faster, 3x cheaper vs 4.7
• Dynamic workflows in CC – hundreds of parallel subagents in one session, and verifies its own work
• Effort controls on http://claude.ai – choose how much effort Claude puts into a query

32d2.5K231

Ben Pielstick@BenPielstick

@BayernHahn @kimmonismus This is a bigger underlying issue. Anthropic has a very different approach from OpenAI.

https://www.youtube.com/watch?v=JMYspR42HFM&t=1714s

32d1.4K23

Chubby♨️@kimmonismus

@gitcommute good-ish means it's good, but not so sure if it makes up for the bad 4.7 release

32d3K8