/Tech · 1d ago

Researchers release iOSWorld, a native iOS simulator benchmark that evaluates computer use agents across 26 interconnected apps

The environment evaluates agents on 133 personalized tasks.

725285.7K

#43

Original post

Russ Salakhutdinov#43

Lawrence Jang@JangLawrenceK

An annoyingly common question I get as an AI PhD student is “When can you get the ChatGPT AI to do something useful? It can’t even work on my phone yet. Siri is pretty dumb.”

To be fair, I think their criticism is correct. I personally wished that AI was better integrated on my phone. LLMs can solve IMO problems, so shouldn’t it be a cakewalk for it to remind me of the text I forgot to respond to last week? Obviously not, since it doesn’t exist in my pocket yet. Or maybe Apple’s new update yesterday fixed this and my research project is obsolete.

We are releasing iOSWorld (http://iosworld.io), a dynamic iPhone benchmark with 26 newly created apps grounded in personal context. Each of the 26 apps is centrally seeded around one persona, Jordan Avery, and the apps interact together in a realistic ecosystem that reflects real app interactions. We create 133 personalized mobile agent tasks to test in this environment, and the best model, even with privileged information, only scores 51%.

10:24 AM · Jun 9, 2026 · 10.3K Views

Sentiment

Most commenters praise the iOSWorld benchmark for testing AI agents across 26 apps with a persistent identity, calling it a realistic measure of utility, while some dismiss it as overly narrow.

Pos

80.0%

Neg

20.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 4.8K · Bookmarks 10 · Likes 21 · Retweets 3 · Replies 7

Russ Salakhutdinov@rsalakhu

New work: iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Paper: https://arxiv.org/abs/2606.09764 Code+Web: https://iosworld.io

An interactive benchmark built around a persistent user identity spanning 26 custom iOS apps, including analogs of OpenTable, Uber, DoorDash, AirBnB, Chase. These apps contain richly interconnected data, such as messages, transactions, travel histories, social relationships, and personal preferences.

iOSWorld comprises 133 tasks across three levels of difficulty: single-app (27), multi-app (60), and memory & personalization (46). We evaluate leading frontier and open-source computer-use agents under both vision-only and privileged vision+XML settings. Even with privileged access, the strongest frontier model achieves only 52% success, underscoring the challenge of personalized, cross-app reasoning.

We also release an MCP-based tool-use interface for all 26 apps, enabling controlled comparisons between computer-use, tool-use, and hybrid agents. The full benchmark includes the apps, seeded user data, tasks, rubrics, evaluation code, MCP server, and cloud-based infrastructure for running experiments without Mac hardware.

See a more detailed thread by @JangLawrenceK.

1d · 4.8K views · 21 likes · 10 bookmarks

Jing Yu Koh@kohjingyu

A lot of computer use work focuses on desktop environments, but automating tasks on the phone also has very high potential upside! iOSWorld is a fun new benchmark created by Lawrence to measure CUAs on how well they do on mobile + personalized tasks.

1d3.3K91

Matt@m13v_

@rsalakhu the 52% isn't the interesting number, the vision-only vs vision+XML gap is. that delta is the whole case for driving phone and desktop agents off the accessibility tree instead of pixels. structure is the unlock, not a bigger model. https://t8r.tech/r/e5sbrsnu written with ai

1d50

tsunami_crypto@ls_brd

@rsalakhu phone-level agents are the real utility test that actually matters

let me know when i can ask it to cancel my subscriptions for me

1d47

phonescloud@phonescloud99

@rsalakhu The seeded identity is what makes this feel closer to phone reality than one-off UI tasks. Do the rubrics penalize plausible but preference-violating actions, or only final task failure?

22h31

Rugbist@rugbist_

@rsalakhu the gap between AI research and actual phone utility is still massive

benchmarks like this might finally bridge it

1d20

Blissy@BlissyOnX

@rsalakhu 26 custom apps and persistent identity sounds like the hard part most people skip. wish more benchmarks did this instead of static screenshots

1d16

Alex YGift@Radipdegen

@rsalakhu this whole idea lives or dies on whether the phone actually remembers i exist between uses

the persistent identity bit is the make-or-break imo

1d13

Invincible@InvincibleEdge

@rsalakhu Siri dumb but a 26-app persistent identity benchmark?

idk that feels like life support for a specific edge case

1d12