Tech · 18d ago

Replit president Michele Catasta launches ViBench, an open-source benchmark evaluating AI agents on end-to-end web application development

Opus 4.8 led the leaderboard with an 87.8% score.

Original post

Michele Catasta@pirroh · #1933 in Tech

Most AI coding benchmarks miss what actually matters: how models perform at the application layer.

Introducing ViBench, an open-source benchmark for evaluating agents on end-to-end web application development.

10:50 AM · Jun 2, 2026 · 19.5K Views

Sentiment

Many users call Replit's ViBench a needed real-world benchmark for testing how AI builds full web apps like actual users, while some dismiss it as irrelevant amid saturated benchmarks or criticize Replit's pricing and competitiveness.

Positive: 77.3%

Negative: 22.7%

13 comments with sentiment.

Posts from X

Most Activity

Views 59.6K · Bookmarks 172 · Likes 773 · Retweets 60 · Replies 97

Amjad Masad@amasad

Benchmarks place GPT 5.5 as the best model on SWE, but is it the best at making apps end-to-end?

Turns out Opus 4.8 continues to be the king of vibe coding on both price & performance.

Introducing ViBench: the first benchmark for app creation based on real world tasks

17d · 59.6K views · 773 likes · 172 bookmarks

Amjad Masad@amasad

SWE benchmarks don’t necessarily capture app building capabilities. ViBench does.

(Quoting Michele Catasta@pirroh's original ViBench announcement, shown above.)

18d8.3K5114

Michele Catasta@pirroh

For more details, check our paper accepted at ACM CAIS 2026: https://dl.acm.org/doi/10.1145/3786335.3813162

and follow the project lead @Lambda_freak

18d2.3K184

Amjad Masad@amasad

Paper: https://dl.acm.org/doi/10.1145/3786335.3813162

Website: http://vibench.ai

17d1.8K93

Michele Catasta@pirroh

ViBench is built from real, anonymized Replit apps turned into PRDs.

It measures both 0-to-1 builds as well as how agents navigate existing codebases.

We have been using it at Replit for months to test end-to-end feature correctness of apps built from plain-language prompts.

18d516201

Michele Catasta@pirroh

Instead of unit tests on a fixed surface, ViBench mimics a vibe coder, poking at the live app to catch issues.

The evaluator runs on our App Testing stack: Playwright in a notebook, plus computer use when scripts fall short.

https://replit.com/blog/automated-self-testing

18d381161

Michele Catasta@pirroh

ViBench is an open, living benchmark → we keep adding harder test cases as frontier models get more powerful

It goes beyond 0-to-1 and feature extension → we already use it to study task decomposition and merging

Ready to use or contribute? → http://vibench.ai

18d295122

Rayan Krishnan@RayanKrishnan

@amasad Not quite first but close! Pretty consistent with what we see on Vibe Code Bench 😊

17d1357

Nazran F.@nazranf_

finally some numbers that actually help people building real things

was so tired of benchmarks based on balls bouncing around in a box 😭

(Quoting Michele Catasta@pirroh's original ViBench announcement, shown above.)

18d44941

Lukas Bug@BugLukas

@amasad Is it really vibe coding if you are measuring end to end correctness from spec?

17d1322

Blissy@BlissyOnX

@amasad vibench sounds useful but tbh we all know what benchmark matters most at the end of the day

random twitter reviews

17d1301

Michele Catasta@pirroh

@butaji exactly, that’s why we offer Lite/Economy/Power modes on Replit Agent — our users can pick the tradeoff they prefer!

18d1191

Mohanad Dwekat@mohanaddwekat

@amasad That matches my experience. I generally prefer Opus 4.8 over GPT 5.5 for app and web development, especially for design. However, I found two critical things: 1. The $20 ChatGPT plan offers more usage. 2. In terminal-based workloads, ChatGPT misses less.

16d681

Ahmed الصيادي@alsayadii

@amasad I was surprised by GLM's ranking. Is there a plan to develop a language model specific to Replit?

17d133

Max Avery@realMaxAvery

@amasad Cost per working app is going to be the metric that reshapes developer tooling valuations.

17d125

Vitaly Baum@butaji

Interesting! Then (Success rate % / USD) tells us that

>> Cheap models win hard on value. Flash and the minis deliver way more working apps per dollar than the flagship frontier models.

Top by _weighted value_:
1) Gemini 3.5 Flash: 73 (62.2% at $0.85)
2) GPT-5.4 Mini: 65 (60.8% at $0.93)
3) Kimi K2.6: 60 (60.8% at $1.01)

I use MiniMax on exactly this principle.

18d261
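The value ranking in the post above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes the dollar figure is the cost attached to each model's quoted success rate, and the names `results`, `value_score`, and `rank_by_value` are hypothetical helpers, not part of ViBench.

```python
# Success rate (%) and cost (USD) exactly as quoted in the post above.
# Treating the dollar figure as a per-task cost is an assumption.
results = {
    "Gemini 3.5 Flash": (62.2, 0.85),
    "GPT-5.4 Mini": (60.8, 0.93),
    "Kimi K2.6": (60.8, 1.01),
}


def value_score(success_pct, cost_usd):
    """Success-rate percentage points per dollar (Success rate % / USD)."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return success_pct / cost_usd


def rank_by_value(model_results):
    """Return (model, score) pairs sorted from best to worst value."""
    scored = [
        (name, value_score(pct, cost))
        for name, (pct, cost) in model_results.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank_by_value(results):
    print(f"{name}: {score:.0f}")
```

Rounded to whole numbers, the scores come out to 73, 65, and 60, matching the figures in the post. A plain ratio like this rewards cheap models heavily, so it says nothing about whether a given success rate is high enough for a particular use.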

Nevv🗿@NevvDevv

@pirroh You should test the new MiniMax M3 on it

18d60

Abhishek Baxi@baxiabhishek

@amasad Doesn't the paper point to an uncomfortable state - beyond the hype - of vibe coding?

17d106

Nick Co 😎@nickco

@pirroh So incredibly important that the models and agents are benchmarked for how well they build

18d1142

RuntimeWire 🏴‍☠️@runtimewire

@amasad nice https://runtimewire.com/article/vibench-aims-to-rank-ai-models-by-app-building-not-just-coding-tests (made with replit)

17d592