Most AI coding benchmarks miss what actually matters: how models perform at the application layer.
Introducing ViBench, an open-source benchmark for evaluating agents on end-to-end web application development.
Opus 4.8 led the leaderboard with an 87.8% score.
Many users call Replit's ViBench a needed real-world benchmark for testing how AI builds full web apps like actual users, while some dismiss it as irrelevant amid saturated benchmarks or criticize Replit's pricing and competitiveness.
Benchmarks place GPT 5.5 as the best model on SWE, but is it the best at making apps end-to-end?
Turns out Opus 4.8 continues to be the king of vibe coding on both price & performance.
Introducing ViBench: the first benchmark for app creation based on real-world tasks
SWE benchmarks don’t necessarily capture app building capabilities. ViBench does.

For more details, check our paper accepted at ACM CAIS 2026: https://dl.acm.org/doi/10.1145/3786335.3813162
and follow the project lead @Lambda_freak

Paper: https://dl.acm.org/doi/10.1145/3786335.3813162
Website: http://vibench.ai

ViBench is built from real, anonymized Replit apps turned into PRDs.
It measures both 0-to-1 builds as well as how agents navigate existing codebases.
We have been using it at Replit for months to test end-to-end feature correctness of apps built from plain-language prompts.

Instead of unit tests on a fixed surface, ViBench mimics a vibe coder, poking at the live app to catch issues.
The evaluator runs on our App Testing stack: Playwright in a notebook, plus computer use when scripts fall short.
https://replit.com/blog/automated-self-testing
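
To illustrate the "scripts first, computer use when scripts fall short" idea described above, here is a minimal sketch of one way such a fallback could be wired. It is not Replit's evaluator: the `Check` type, `evaluate_feature`, and the returned dictionary are hypothetical names chosen for illustration, and the two checks are plain callables rather than real Playwright or computer-use calls.

```python
from typing import Callable, Optional

# Hypothetical check signature: returns True/False for pass/fail,
# or None when the check could not reach a verdict.
Check = Callable[[str], Optional[bool]]


def evaluate_feature(app_url: str, scripted: Check, computer_use: Check) -> dict:
    """Try the scripted check first; fall back to computer use if it gives no verdict."""
    try:
        verdict = scripted(app_url)
    except Exception:
        # The script itself broke (for example, an expected element was missing),
        # so treat it as "no verdict" rather than as a failed feature.
        verdict = None

    if verdict is not None:
        return {"passed": verdict, "method": "scripted"}

    # Assumption: the fallback always produces a pass/fail answer.
    fallback = computer_use(app_url)
    return {"passed": bool(fallback), "method": "computer_use"}
```

With stub checks, a script that raises or returns `None` hands the decision to the fallback, while a script that returns a verdict is trusted as-is.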

ViBench is an open, living benchmark → we keep adding harder test cases as frontier models get more powerful
It goes beyond 0-to-1 and feature extension → we already use it to study task decomposition and merging
Ready to use or contribute? → http://vibench.ai

@amasad Not quite first but close! Pretty consistent with what we see on Vibe Code Bench 😊
finally some numbers that actually help people building real things
was so tired of benchmarks based on balls bouncing around in a box 😭

@amasad Is it really vibe coding if you are measuring end to end correctness from spec?

@amasad vibench sounds useful but tbh we all know what benchmark matters most at the end of the day
random twitter reviews

@butaji exactly, that’s why we offer Lite/Economy/Power modes on Replit Agent — our users can pick the tradeoff they prefer!

@amasad That matches my experience. I generally prefer Opus 4.8 over GPT 5.5 for app and web development, especially for design. However, I found two critical things: 1. The $20 ChatGPT plan offers more usage. 2. In terminal-based workloads, ChatGPT misses less.

@amasad I was surprised by GLM's ranking. Is there a plan to develop a language model of Replit's own?

@amasad Cost per working app is going to be the metric that reshapes developer tooling valuations.

Interesting! Then (Success rate % / USD) tells us that
>> Cheap models win hard on value. Flash and the minis deliver way more working apps per dollar than the flagship frontier models.
Top by _weighted value_: 1) Gemini 3.5 Flash: 73 (62.2% at $0.85) 2) GPT-5.4 Mini: 65 (60.8% at $0.93) 3) Kimi K2.6: 60 (60.8% at $1.01)
That said, I use MiniMax on exactly this principle
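
The value figures quoted above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the three success rates and dollar amounts from the comment; the helper `value_score` is a hypothetical name, and it assumes the score is simply the success rate in percent divided by the quoted cost, rounded to the nearest integer. What the dollar figure covers (per task, per app, or otherwise) is not stated in the comment.

```python
# Figures quoted in the comment: (success rate in %, cost in USD as quoted).
results = {
    "Gemini 3.5 Flash": (62.2, 0.85),
    "GPT-5.4 Mini": (60.8, 0.93),
    "Kimi K2.6": (60.8, 1.01),
}


def value_score(success_pct: float, cost_usd: float) -> float:
    """Success-rate percentage points per dollar (assumed definition)."""
    return success_pct / cost_usd


# Rank models from highest to lowest rounded value score.
ranking = sorted(
    ((name, round(value_score(pct, usd))) for name, (pct, usd) in results.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, score in ranking:
    print(f"{name}: {score}")
```

Under that assumption the rounded scores come out to 73, 65, and 60, matching the numbers in the comment.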

@pirroh You should test the new MiniMax M3 on it

@amasad Doesn't the paper point to an uncomfortable state - beyond the hype - of vibe coding?

@pirroh So incredibly important that the models and agents are benchmarked for how well they build

@amasad nice https://runtimewire.com/article/vibench-aims-to-rank-ai-models-by-app-building-not-just-coding-tests (made with replit)