Many users are excited about Opus 4.8-Medium's 40x code eval speedup because it shifts benchmarks toward real engineering workflows, though one commenter questions the relevance of human-hour metrics.
@cognition funny you should say that

@swyx @cognition hope it scores scope discipline too, half my agent failures are perfectly fine code in files it had no business touching

@swyx @cognition sounds sick, hyped for this release

@swyx @cognition This chart is wild. 40x speedup is insane but only 32% fully solved shows how far we still have to go. The next big leap will be in validation and orchestration layers that turn partials into reliable production code.

@swyx @cognition the eval gap for coding agents has been the real bottleneck. curious whether this covers multi-file workflows or stays at the function level

@swyx @cognition the eval becomes the target so everything after just optimizes to pass it not to code better

@swyx @cognition Good to see the 'critical next phase in koding' moving beyond benchmarks and into real dev workflows. ⚡️

@swyx @cognition boots on the ground looks good on you 🫡

@swyx @cognition small part is doing a lot of heavy lifting here lol
cant wait to see what lands

@swyx @cognition Curious if this leans more offline benchmark or live production eval.

The 40x wall-clock speedup is real, but the grade mix tells the actual story — 32% fully solved, 51% low. That's where the frontier is: not just speed but completion fidelity on hard multi-step tasks. The eval design matters as much as the results. Excited to see the methodology tomorrow.

@swyx @cognition code evals are finally moving toward real engineering work

@swyx @cognition The conversation really shifted this year
from can it work to can you trust it

@swyx @cognition Curious which benchmarks they’re targeting — HumanEval+ is table stakes now, SWE-bench or something multi-agent?

@swyx @cognition inb4 speed stays flat and the real insight is just "tasks that looked hard were actually easy"

@swyx @cognition I don't think the human hours are relevant anymore. Maybe to give a sense of scale, but I don't see a world where we're going to bill by "human hours" ever again.

@swyx @cognition small part or not, shipping code evals that set the bar for the year is always interesting to watch
what benchmark are u measuring against?

@swyx @cognition biggest for which problem. code that wins evals isn't usually the code that survives production. which benchmarks predict the real survivors?

@swyx @cognition What specific areas of koding will this launch impact the most, and how will it change the workflow for developers?

@swyx @cognition "played a small part" is doing insane amounts of work here lol
whats the evals stack look like?