Navigator n1.5 “solved” Online Mind2Web: 97.3% success rate.
While some teams self-report, this result is independently evaluated and verified by OSU NLP Group @osunlp and Careerflow Human Data Labs.
All benchmarks are transient attempts at measuring progress. Ultimately, what matters is how a model performs when people use it.
But there’s a sentiment online that computer-use models aren’t progressing quickly.
Not true.
In the last year, performance on Online Mind2Web has gone from ~40% success to basically saturated.
So what’s next?
Most computer-use/browser-use benchmarks are GUI-only. Models (including Navigator n1.5) now support hybrid actions: UI interactions (click, type, scroll) and programmatic actions (e.g., execute JS).
Ultimately, we’re headed to a world where computer-use models “agentify” the long-tail of the web.
https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard


