Researchers Launch iOSWorld Benchmark for Mobile AI Agents

Original post

An annoyingly common question I get as an AI PhD student is “When can you get the ChatGPT AI to do something useful? It can’t even work on my phone yet. Siri is pretty dumb.”

To be fair, I think their criticism is correct. I personally wish AI were better integrated into my phone. LLMs can solve IMO problems, so shouldn’t it be a cakewalk for one to remind me of the text I forgot to respond to last week? Obviously not, since that doesn’t exist in my pocket yet. Or maybe Apple’s new update yesterday fixed this and my research project is obsolete.

We are releasing iOSWorld (http://iosworld.io), a dynamic iPhone benchmark with 26 newly created apps grounded in personal context. Each of the 26 apps is seeded around a single persona, Jordan Avery, and the apps interact in a realistic ecosystem that reflects real app interactions. We create 133 personalized mobile agent tasks to test in this environment, and the best model, even with privileged information, only scores 51%.

10:24 AM · Jun 9, 2026 · 9.8K Views


Lawrence Jang@JangLawrenceK


Sentiment

Many users praised the new iOSWorld benchmark for its realistic focus on cross-app tasks and persistent identity in mobile AI agents, while a few dismissed it as overly niche.

Positive: 83.3%

Negative: 16.7%

5 comments with sentiment.

Cluster Engagement

Posts from X


Lawrence Jang@JangLawrenceK

We constructed 26 realistic clones of real-life apps. Everyone on this project sat together and contributed the apps they would want a phone agent to be adept at. My favorites include TableFind (OpenTable), CityRide (Uber), and Quickbite (DoorDash). It is pretty nice to watch Gemini order a ride to the restaurant after it has booked you a reservation.


Lawrence Jang@JangLawrenceK

If you’re interested, you can find all the resources at http://iosworld.io. We open-source the apps, code, runners, and tools: everything you need to get set up for this benchmark. We even include an AWS runner for non-Mac owners so the entire community can use iOSWorld. I think evaluating agents in personal context is an obvious use case that deserves more emphasis in the community, and I hope this helps steer research in that direction.

This work was done in collaboration with @mareks_woodside , Geronimo Carom, @andrewkjang7, @kohjingyu and my advisor @rsalakhu.

Please enjoy one of the memes that inspired this project, brought to life in iOSWorld below.

Website: http://iosworld.io

Code: http://github.com/ljang0/iosworld

Paper: https://arxiv.org/abs/2606.09764


Russ Salakhutdinov@rsalakhu

New work: iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Paper: https://arxiv.org/abs/2606.09764 Code+Web: https://iosworld.io

An interactive benchmark built around a persistent user identity spanning 26 custom iOS apps, including analogs of OpenTable, Uber, DoorDash, Airbnb, and Chase. These apps contain richly interconnected data, such as messages, transactions, travel histories, social relationships, and personal preferences.

iOSWorld comprises 133 tasks across three levels of difficulty: single-app (27), multi-app (60), and memory & personalization (46). We evaluate leading frontier and open-source computer-use agents under both vision-only and privileged vision+XML settings. Even with privileged access, the strongest frontier model achieves only 52% success, underscoring the challenge of personalized, cross-app reasoning.
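As a quick sanity check on the composition described above, the per-level counts can be tallied in a few lines of Python. This is only an illustrative sketch: the counts come from the post, but the names `TASK_COUNTS`, `total_tasks`, `share_by_level`, and `success_rate` are hypothetical and are not taken from the iOSWorld codebase.

```python
# Task counts per difficulty level, as stated in the post.
TASK_COUNTS = {
    "single-app": 27,
    "multi-app": 60,
    "memory & personalization": 46,
}


def total_tasks(counts):
    """Sum the number of tasks across all difficulty levels."""
    return sum(counts.values())


def share_by_level(counts):
    """Return each level's fraction of the full benchmark."""
    total = total_tasks(counts)
    return {level: n / total for level, n in counts.items()}


def success_rate(outcomes):
    """Fraction of passed tasks, given hypothetical per-task pass/fail booleans."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    print(total_tasks(TASK_COUNTS))  # 133, matching the stated total
    for level, share in share_by_level(TASK_COUNTS).items():
        print(f"{level}: {share:.1%}")
```

The three levels do add up to the stated 133 tasks. `success_rate` assumes a plain pass/fail outcome per task, which may be simpler than the benchmark's actual rubric-based scoring.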

We also release an MCP-based tool-use interface for all 26 apps, enabling controlled comparisons between computer-use, tool-use, and hybrid agents. The full benchmark includes the apps, seeded user data, tasks, rubrics, evaluation code, MCP server, and cloud-based infrastructure for running experiments without Mac hardware.

See a more detailed thread by @JangLawrenceK.


Lawrence Jang@JangLawrenceK

We also create an MCP tool server for all the apps in iOSWorld. It increases Qwen3.5 performance from x to y. Tool enthusiasts, please do not be scared off by the heavy use of computer-use agents for mobile; we have something for everyone. I believe the future of agents will require some hybrid action space of tools + keyboard/cursor.
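To make the idea of a hybrid action space concrete, here is a minimal Python sketch in which one agent trajectory mixes UI actions with tool calls. Everything in it is an assumption for illustration: the classes `Tap`, `TypeText`, and `ToolCall`, the tool name `book_table`, and its arguments are hypothetical and do not reflect the actual iOSWorld MCP server or runner API.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Tap:
    """Hypothetical UI action: tap the screen at pixel coordinates."""
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    """Hypothetical UI action: type text into the focused field."""
    text: str


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical tool action: call a named tool exposed by an app."""
    app: str
    tool: str
    arguments: dict = field(default_factory=dict)


# A hybrid agent may emit any of the three action types at each step.
Action = Union[Tap, TypeText, ToolCall]


def describe(action: Action) -> str:
    """Render an action as a short human-readable string."""
    if isinstance(action, Tap):
        return f"tap at ({action.x}, {action.y})"
    if isinstance(action, TypeText):
        return f"type {action.text!r}"
    if isinstance(action, ToolCall):
        return f"call {action.app}.{action.tool} with {action.arguments}"
    raise TypeError(f"unsupported action: {action!r}")


def count_by_kind(actions):
    """Count how many steps used the UI versus a tool."""
    ui = sum(isinstance(a, (Tap, TypeText)) for a in actions)
    tool = sum(isinstance(a, ToolCall) for a in actions)
    return {"ui": ui, "tool": tool}


if __name__ == "__main__":
    trajectory = [
        ToolCall("TableFind", "book_table", {"party_size": 2}),  # made-up tool
        Tap(180, 420),
        TypeText("See you at 7"),
    ]
    for step in trajectory:
        print(describe(step))
    print(count_by_kind(trajectory))
```

Keeping both kinds of action in a single type lets one trajectory be scored the same way whether the agent is run as computer-use only, tool-use only, or hybrid.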


Lawrence Jang@JangLawrenceK

We adapted current computer-use models and found they performed sub-adequately on our benchmark, especially without access to XML accessibility trees. This was an interesting zag from the latest computer-use agent evaluations, where vision-only input works well for CUAs. This indicates that mobile UI grounding has a gap that could be pursued. The Anthropic models did the best, and Gemini was surprisingly solid for the vision-only setting.


Mareks Woodside@mareks_woodside

@JangLawrenceK @andrewkjang7 @kohjingyu @rsalakhu Great work, Lawrence! This turned out great. Really grateful to have been part of it and to have learned from everyone involved.

Excited to see the impact this has on future work in AI agents and personalization!


Raj Mehta@_rajmehta_

@JangLawrenceK @mareks_woodside @andrewkjang7 @kohjingyu @rsalakhu Really great stuff Larry!

In v2 of the benchmark I'd suggest adding a Hinge clone; I'm sure it's a part of Jordan's digital life


Matt@m13v_

@rsalakhu the 52% isn't the interesting number, the vision-only vs vision+XML gap is. that delta is the whole case for driving phone and desktop agents off the accessibility tree instead of pixels. structure is the unlock, not a bigger model. https://t8r.tech/r/e5sbrsnu written with ai


tsunami_crypto@ls_brd

@rsalakhu phone-level agents are the real utility test that actually matters

let me know when i can ask it to cancel my subscriptions for me


phonescloud@phonescloud99

@rsalakhu The seeded identity is what makes this feel closer to phone reality than one-off UI tasks. Do the rubrics penalize plausible but preference-violating actions, or only final task failure?


Rugbist@rugbist_

@rsalakhu the gap between AI research and actual phone utility is still massive

benchmarks like this might finally bridge it


Blissy@BlissyOnX

@rsalakhu 26 custom apps and persistent identity sounds like the hard part most people skip. wish more benchmarks did this instead of static screenshots


Alex YGift@Radipdegen

@rsalakhu this whole idea lives or dies on whether the phone actually remembers i exist between uses

the persistent identity bit is the make-or-break imo


Invincible@InvincibleEdge

@rsalakhu Siri dumb but a 26-app persistent identity benchmark?

idk that feels like life support for a specific edge case
