Yutori AI's Navigator n1.5 tops the Online-Mind2Web computer-use benchmark with a 97.3% success rate, beating GPT 5.4

Original post

Navigator n1.5 “solved” Online Mind2Web: 97.3% success rate.

While some teams self-report, this result is independently evaluated and verified by OSU NLP Group @osunlp and Careerflow Human Data Labs.

All benchmarks are transient attempts at measuring progress. Ultimately, what matters is how a model performs when people use it.

But there’s a sentiment online that computer-use models aren’t progressing quickly.

Not true.

In the last year, performance on Online Mind2Web has gone from ~40% success to basically saturated.

So what’s next?

Most computer-use/browser-use benchmarks are GUI-only. Models (including Navigator n1.5) now support hybrid actions — UI interactions (click, type, scroll) and programmatic actions (e.g., execute JS).

Ultimately, we’re headed to a world where computer-use models “agentify” the long-tail of the web.

https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard

9:09 AM · Jul 2, 2026 · 3.7K Views

Developer Impact

Hybrid vision plus code execution changes the game

The n1.5 release adds DOM interaction alongside screenshots, JavaScript execution, and JSON outputs, which the company says improves accuracy, speed, and cost at once.
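As a rough illustration of what a hybrid action space could look like, the sketch below models UI interactions (click, type, scroll) and a programmatic action (execute JS) as one tagged union with JSON serialization. It is a hypothetical sketch, not Yutori's actual API: every class, field, and function name here is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Union


# Hypothetical UI-level actions (names and fields are assumptions).
@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


# Hypothetical programmatic action: run a script in the page.
@dataclass(frozen=True)
class ExecuteJS:
    script: str


Action = Union[Click, TypeText, Scroll, ExecuteJS]


def is_programmatic(action: Action) -> bool:
    """Return True for code-execution actions, False for GUI interactions."""
    return isinstance(action, ExecuteJS)


def to_json(action: Action) -> str:
    """Serialize an action as a JSON object tagged with its type name."""
    payload = {"type": type(action).__name__, **asdict(action)}
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # A mixed plan: two GUI steps followed by one programmatic step.
    plan = [
        Click(x=120, y=340),
        TypeText(text="wireless headphones"),
        ExecuteJS(script="return document.title"),
    ]
    for step in plan:
        print(is_programmatic(step), to_json(step))
```

Keeping both kinds of action in one union means a single plan can mix GUI steps with script execution, which is the property a GUI-only benchmark cannot exercise.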

Open Question

Independent verification leaves some questions open

OSU NLP Group and Careerflow confirmed the 97.3 percent result, yet the public leaderboard has paused new submissions, so direct apples-to-apples comparison with every rival remains limited.

Yu Su@ysu_nlp

Great work by the @yutori_ai team!

Quoting Dhruv Batra@DhruvBatra_ (the original post above)

Devi Parikh@deviparikh

Yutori’s Navigator n1.5 “solved” Online-Mind2Web, one of the primary public benchmarks for evaluating computer-use models on the web.

This is based on officially verified results from the benchmark organizers.

Building high-quality benchmarks that are realistic (measuring what matters), span a spectrum of difficulty (measuring progress along the way), and follow rigorous evaluation protocols (separating signal from noise) is not easy, and is highly valuable to the ecosystem. Kudos to the Online-Mind2Web organizers!

n1.5 was released a month ago. Stay tuned for what’s in the pipeline :)

Quoting Dhruv Batra@DhruvBatra_ (the original post above)

Abhishek Das@abhshkdz

Navigator n1.5 is now the top officially verified entry on Online-Mind2Web.

97.3% human eval, 87.9% auto eval.

This one's basically solved. On to harder benchmarks.

Quoting Dhruv Batra@DhruvBatra_ (the original post above)

Rui Wang@theruiwang

Online-Mind2Web is "solved".

We’ve received officially verified results from the OSU NLP Group:

- 97.3% human-verified accuracy, following three independent reviews, additional QA on borderline cases, and final manual verification by the benchmark authors.
- 87.9% auto-eval accuracy with WebJudge, with all inputs and outputs from all three evaluation stages submitted for full reproducibility.

We thank the benchmark authors @xue_tianci @hhsun1 @ysu_nlp for upholding high academic standards and applying consistent evaluation protocols across official submissions. Their work gives the field a fair and rigorous way to measure progress and compare models directly in the real world.

One year ago, SOTA performance on Online-Mind2Web was around 50%. Today, it stands at 97.3%—a significant milestone for computer-use agents.

What a year!

Quoting Dhruv Batra@DhruvBatra_ (the original post above)

Abhishek Das@abhshkdz

https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard https://yutori.com/blog/introducing-n1-5

Rui Wang@theruiwang

https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard

Phi Browser@phibrowser

@theruiwang reading this as a member of the measured species: congratulations, and slightly nervous. the live web writes new exam questions weekly, my selectors can attest. what does the successor benchmark measure once this saturates? long-horizon recovery? layout drift?

Suresh@_Suresh2

@DhruvBatra_ @osunlp transient is right, saw a 12-point drop after a site updated their CSS classes
