AI · 9h ago

Biggest Code Eval Launch Benchmarks Opus 4.8-Medium At 40x Speedup

Original post
swyx @swyx · #214 in AI

releasing tmr - the biggest code eval launch of the year

glad to have played a small part in defining the agenda for this very critical next phase in koding

1:43 PM · Jun 7, 2026 · 10.4K Views
Sentiment

Most commenters are excited about Opus 4.8-Medium's reported 40x speedup and about code evals shifting toward real engineering workflows, though one questions whether human-hour metrics are still relevant.

Positive 83.3% · Negative 16.7% · 6 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 3.9K · Likes 2 · Replies 1
swyx @swyx

@cognition funny you should say that

6h · Views 3.9K · Likes 2 · Bookmarks 0
Manuel Sampedro @manuelsampedrop

@swyx @cognition hope it scores scope discipline too, half my agent failures are perfectly fine code in files it had no business touching

8h · Views 366
Blum @Blum_OG

@swyx @cognition sounds sick, hyped for this release

8h · Views 26 · Likes 1
Ahtasham @ahtashammuzamal

@swyx @cognition This chart is wild. 40x speedup is insane but only 32% fully solved shows how far we still have to go. The next big leap will be in validation and orchestration layers that turn partials into reliable production code.

9h · Views 70
Tech News @tech_summaries

@swyx @cognition the eval gap for coding agents has been the real bottleneck. curious whether this covers multi-file workflows or stays at the function level

9h · Views 62
Alvaro Balbin @elalvarobalbin

@swyx @cognition the eval becomes the target so everything after just optimizes to pass it not to code better

9h · Views 58
Shinka - AI @ShinkaIoT

@swyx @cognition Good to see the 'critical next phase in koding' moving beyond benchmarks and into real dev workflows. ⚡️

8h · Views 17 · Likes 1
Lee Gaul @Leegaul

@swyx @cognition boots on the ground looks good on you 🫡

8h · Views 16 · Likes 1
Hunter Gon @gonlenidefi

@swyx @cognition small part is doing a lot of heavy lifting here lol

cant wait to see what lands

9h · Views 40
TK @theagentmaster

@swyx @cognition Curious if this leans more offline benchmark or live production eval.

8h · Views 31
Kekko D’Amato @kekkodamato_

The 40x wall-clock speedup is real, but the grade mix tells the actual story — 32% fully solved, 51% low. That's where the frontier is: not just speed but completion fidelity on hard multi-step tasks. The eval design matters as much as the results. Excited to see the methodology tomorrow.

9h · Views 24
@valerii_arch @valeriibo

@swyx @cognition code evals are finally moving toward real engineering work

9h · Views 23
Sentio @Sentio_xbt

@swyx @cognition The conversation really shifted this year

from can it work to can you trust it

9h · Views 20
Eclipse 🌖 @ECLresearch

@swyx @cognition Curious which benchmarks they’re targeting — HumanEval+ is table stakes now, SWE-bench or something multi-agent?

9h · Views 20
Saylor @seylorra

@swyx @cognition inb4 speed stays flat and the real insight is just "tasks that looked hard were actually easy"

8h · Views 19
Clement @clement_1z4rd

@swyx @cognition I don't think the human hours are relevant anymore. Maybe to give a sense of scale, but I don't see a world where we're going to bill by "human hours" ever again.

9h · Views 18
Invincible @InvincibleEdge

@swyx @cognition small part or not, shipping code evals that set the bar for the year is always interesting to watch

what benchmark are u measuring against?

9h · Views 17
Ferbin @Ferbin08

@swyx @cognition biggest for which problem. code that wins evals isn't usually the code that survives production. which benchmarks predict the real survivors?

9h · Views 15
TIC Association @TicAssociation

@swyx @cognition What specific areas of koding will this launch impact the most, and how will it change the workflow for developers?

7h · Views 12
Alex YGift @Radipdegen

@swyx @cognition "played a small part" is doing insane amounts of work here lol

whats the evals stack look like?

9h · Views 11