Most AI coding benchmarks miss what actually matters: how models perform at the application layer.
Introducing ViBench, an open-source benchmark for evaluating agents on end-to-end web application development.
Opus 4.8 led the leaderboard with an 87.8% score.
SWE benchmarks don’t necessarily capture app-building capabilities. ViBench does.
Many users praise ViBench for realistically testing AI-built web apps through user-like interactions rather than just code patches.