Researchers release Odysseys benchmark with 200 long-horizon web tasks

Researchers including Russ Salakhutdinov, UPMC professor in CMU’s Machine Learning Department, have released Odysseys, a benchmark of approximately 200 realistic long-horizon web tasks derived from users' browsing histories. The tasks span multiple websites, many take hours to complete, and agents are evaluated on the live Internet. The best frontier model achieves a 44.5% success rate, and trajectory efficiency is low at 1.15%.
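The original post describes Trajectory Efficiency as "rubric score per step". A minimal Python sketch of that idea follows. It assumes the rubric score is the fraction of rubric items an agent satisfies and that efficiency is that score divided by the number of steps taken. The class and field names are hypothetical, and the paper's exact definition (weighting, averaging across tasks, normalization) may differ.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent run on one task (hypothetical structure for illustration)."""
    rubric_items_met: int    # rubric criteria the agent satisfied
    rubric_items_total: int  # rubric criteria defined for the task
    steps: int               # number of actions the agent took


def rubric_score(t: Trajectory) -> float:
    """Fraction of rubric items satisfied, in [0, 1] (assumed scoring rule)."""
    if t.rubric_items_total <= 0:
        raise ValueError("a task needs at least one rubric item")
    return t.rubric_items_met / t.rubric_items_total


def binary_pass(t: Trajectory) -> bool:
    """Strict pass/fail: succeeds only if every rubric item is met."""
    return t.rubric_items_met == t.rubric_items_total


def trajectory_efficiency(t: Trajectory) -> float:
    """Rubric score earned per step taken (assumed reading of the metric)."""
    if t.steps <= 0:
        raise ValueError("a trajectory needs at least one step")
    return rubric_score(t) / t.steps


# Illustrative numbers only, not results from the paper.
run = Trajectory(rubric_items_met=8, rubric_items_total=10, steps=40)
print(f"rubric score: {rubric_score(run):.2f}")
print(f"binary pass:  {binary_pass(run)}")
print(f"efficiency:   {trajectory_efficiency(run):.2%} per step")
```

In this made-up run the agent gets partial credit under the rubric while failing a strict pass/fail check, which illustrates why a graded score can carry more signal than a binary outcome on long tasks.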

Original post
Russ Salakhutdinov @rsalakhu · #39 in AI

How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet?

Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Paper: https://odysseys-website.pages.dev/ Leaderboard: https://odysseys-website.pages.dev/leaderboard

We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals.

Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck.

Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation.

See a more detailed thread by @kohjingyu.

9:08 AM · Apr 29, 2026 · 20.9K Views
Sentiment

One of the three replies expresses negative sentiment, criticizing the agents' low 1.15% efficiency and questioning the practicality of automating tasks like cafe research. No clear positive reactions are present.

Pos: 0.0%
Neg: 100.0%
1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 11.7K · Bookmarks 34 · Likes 44 · Replies 2

How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying in-person, 6-week, on-campus university engineering summer programs for high school students and compiling the results into a structured sheet?

Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks:

Paper: https://arxiv.org/abs/2604.24964 Website: https://odysseys-website.pages.dev/ Leaderboard: https://odysseys-website.pages.dev/leaderboard

We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals.

Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck.

Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation.

See a more detailed thread by @JangLawrenceK.

37d · Views 11.7K · Likes 44 · Bookmarks 34
Retweets 8

38d · Views 20.9K · Likes 67 · Bookmarks 54
Daniel Fried @dan_fried

How successfully -- and efficiently! -- can agents carry out long-horizon tasks on the web? We built a benchmark of ~200 multi-site tasks, based on people's real browsing history. Many of them take hours to solve.

Paper: https://odysseys-website.pages.dev/

Led by @JangLawrenceK and @kohjingyu, with @rsalakhu

38d · Views 9K · Likes 41 · Bookmarks 25
Thomas Tao @Thomas_Tao_1

@rsalakhu I keep seeing them fail on constraint memory, not page navigation. They can browse fine. They lose track of on-campus, 6-week, in-person, then write a confident sheet anyway.

37d · Views 28
GTechne @ai4urbanlife

@dan_fried @JangLawrenceK 52 minutes to research cafes is cool but that 1.15% efficiency metric 😬 We're burning GPU cycles to automate Yelp.

37d · Views 27
validate.qa @Validate_QA

@rsalakhu long-horizon web tasks like program hunting. how'd the models score on accuracy

37d