How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet?
Check out our new work, Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks.
Paper: https://odysseys-website.pages.dev/
Leaderboard: https://odysseys-website.pages.dev/leaderboard
We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals.
Even the best of the leading models we evaluate achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find that efficiency remains extremely low (1.15%), highlighting a key bottleneck.
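For intuition, here is a minimal sketch of how a rubric score and a per-step efficiency could be computed. It is an illustration under stated assumptions, not the paper's implementation: it assumes a rubric is a checklist of equally weighted pass/fail items and that Trajectory Efficiency is the rubric score divided by the number of agent steps. The names RubricItem, rubric_score, and trajectory_efficiency, and the example rubric items, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One checklist entry for a task (hypothetical structure)."""
    description: str
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubric items satisfied, assuming equal weights."""
    if not items:
        raise ValueError("rubric must contain at least one item")
    return sum(item.satisfied for item in items) / len(items)


def trajectory_efficiency(score: float, num_steps: int) -> float:
    """Rubric score earned per agent step (assumed definition)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return score / num_steps


# Illustrative rubric for the PhD-program task; values are made up.
items = [
    RubricItem("Sheet lists 25 programs", True),
    RubricItem("Each program has an ML/AI faculty member named", True),
    RubricItem("Each entry notes whether the faculty member is recruiting", False),
    RubricItem("Results are in a structured sheet", True),
]

score = rubric_score(items)                  # 3 of 4 items satisfied
efficiency = trajectory_efficiency(score, 50)
print(f"rubric score: {score:.2%}, efficiency per step: {efficiency:.2%}")
```

Under these assumptions, a binary pass/fail grader would mark this run as a failure, while the rubric gives partial credit for the three satisfied items.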
Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation.
See a more detailed thread by @kohjingyu.


