/Tech · 43d ago

Google’s Gemma 4 E2B runs fully on-device on the iPhone 17 Pro at roughly 40 tokens per second using MLX optimization, supporting a 128K context window and offline reasoning with image understanding.

AI Judge changed title after evaluation, original title: "Google's Gemma 4 E2B model executes fully on-device on the iPhone 17 Pro using MLX optimization, delivering 40 tokens per second inference with 128K context and state-of-the-art coding and math performance"

Video demo shows real-time offline generation of a Paris itinerary.

Original post

Rohan Paul @rohanpaul_ai · #1257 in Tech

So many possibilities for on-device small models.

Here @adrgrondin is running Google’s Gemma 4 E2B on iPhone 17 Pro. ~40 tk/s with MLX optimized for Apple Silicon. SOTA coding & math on mobile with 128K context. Fully offline with thinking mode.

6:53 AM · May 17, 2026 · 34.2K Views

Sentiment

Users are excited about Gemma 4 E2B running on-device on the iPhone 17 Pro because it allows fully offline reasoning with privacy, zero latency, and no API costs for agents and real-time tasks, though a minority push back, questioning the SOTA claim and the answer quality in real apps.

Positive: 90.0%

Negative: 10.0%

17 comments with sentiment.

Posts from X

Carlos E. Perez @IntuitMachine

However, are open-source models only good for chat?

43d

Farzam Hejazi @FarzamHejazi

@rohanpaul_ai @adrgrondin What breaks first with a pocket reasoning model: memory, tools, battery, or the user asking it to become a whole department?

43d

FluSCIM Lucas Gonzalez @lucasgonzalez

@IntuitMachine Can you get it to behave? Does it obey Asimov's 3 laws of robotics?

43d

Michał Piszczek @cdiamond

@rohanpaul_ai @adrgrondin 40 tk/s is the number that decides which UX patterns are viable. thinking mode offline changes the compliance story for enterprise mobile

43d

Aaryan Kakad @aaryan_kakad

@rohanpaul_ai @adrgrondin yes, there are a lot of opportunities

i am building something very interesting around on device small models for iphones.

zaya - your second private AI brain.

check this article out to know more about it:

43d

Shinka - AI @ShinkaIoT

@rohanpaul_ai @adrgrondin Offline reasoning at 40 tokens a second is the exact unlock needed for personal agents that don't have to ping a server for every single thought. ⚡️

43d

Leon Brooks @leonbrooks79

@rohanpaul_ai @adrgrondin When I first met Gemma 4 E2B, I was shocked by the speed. However, when I put it in my app, the answers from Gemma were not very good. Local LLMs still aren’t performing well.

43d

krish @genaikid

@rohanpaul_ai @adrgrondin does this also have the web search feature?

43d

Max For AI @MaxForAI

@rohanpaul_ai @adrgrondin Obviously, this can do a lot of things.

43d

Vansh Choudhary @vnsh02

@rohanpaul_ai @adrgrondin It’s not SOTA but yeah still very impressive.

43d

꧁🐬 ⭑⭒ 🐬 ⭑⭒ 🐬꧂ @PumkinMajiic

@rohanpaul_ai @adrgrondin I think iOS 27 will offer even more choices!

43d

iamkun @iamkunhello

@rohanpaul_ai @adrgrondin And it's running locally, which means zero latency for real-time use cases like voice assistants. I'll be able to code on the plane soon!

43d

Xiayi Sun @Sherry83044277

@rohanpaul_ai @adrgrondin On-device models at that speed make private, low-latency assistants feel much more practical, especially for workflows that can tolerate smaller context and narrower reasoning.

43d
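
Several replies tie the ~40 tokens per second figure to which UX patterns are viable. As a rough back-of-the-envelope sketch of what that rate means for wait times, the snippet below divides a response length by the decode rate. The response sizes are hypothetical values picked for illustration, and the estimate assumes a steady decode rate and ignores prompt processing, thinking tokens, and thermal throttling, none of which the posts quantify.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float = 40.0) -> float:
    """Seconds needed to decode num_tokens at a steady rate.

    Assumption: a constant decode rate; prompt processing time is ignored.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response sizes, chosen only for illustration.
RESPONSE_SIZES = {"short reply": 40, "paragraph": 200, "long answer": 1000}


def latency_table(tokens_per_second: float = 40.0) -> dict[str, float]:
    """Map each illustrative response size to its estimated decode time."""
    return {
        label: generation_seconds(n, tokens_per_second)
        for label, n in RESPONSE_SIZES.items()
    }


if __name__ == "__main__":
    for label, seconds in latency_table().items():
        print(f"{label}: {seconds:.1f} s")
```

Under these assumptions a short reply streams in about a second, while a 1,000-token answer takes on the order of 25 seconds, which may be why commenters treat the rate as the deciding number for interactive use.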