So many possibilities for on-device small models.
Here @adrgrondin is running Google’s Gemma 4 E2B on iPhone 17 Pro. ~40 tk/s with MLX optimized for Apple Silicon. SOTA coding & math on mobile with 128K context. Fully offline with thinking mode.
AI Judge changed title after evaluation, original title: "Google's Gemma 4 E2B model executes fully on-device on the iPhone 17 Pro using MLX optimization, delivering 40 tokens per second inference with 128K context and state-of-the-art coding and math performance"
Video demo shows real-time offline generation of a Paris itinerary.
Users are excited about Gemma 4 E2B running on-device on the iPhone 17 Pro because it allows fully offline reasoning with privacy, zero latency, and no API costs for agents and real-time tasks, though some dismiss it.

However, are open-source models only good for chat?

@rohanpaul_ai @adrgrondin What breaks first with a pocket reasoning model: memory, tools, battery, or the user asking it to become a whole department?

@IntuitMachine Can you get it to behave? Does it obey Asimov's 3 laws of robotics?

@rohanpaul_ai @adrgrondin 40 tk/s is the number that decides which UX patterns are viable. thinking mode offline changes the compliance story for enterprise mobile

@rohanpaul_ai @adrgrondin yes, there are a lot of opportunities
i am building something very interesting around on device small models for iphones.
zaya - your second private AI brain.
check this article out to know more about it:

@rohanpaul_ai @adrgrondin Offline reasoning at 40 tokens a second is the exact unlock needed for personal agents that don't have to ping a server for every single thought. ⚡️

@rohanpaul_ai @adrgrondin When I first met Gemma 4 E2B, I was shocked by the speed. However, when I put it in my app, the answers from Gemma were not that good. Local LLMs still aren’t performing well.

@rohanpaul_ai @adrgrondin does this also have the web search feature?

@rohanpaul_ai @adrgrondin Obviously, this can do a lot of things.

@rohanpaul_ai @adrgrondin It’s not SOTA but yeah still very impressive.

@rohanpaul_ai @adrgrondin I think iOS 27 will offer even more choices!

@rohanpaul_ai @adrgrondin And it's running locally, which means zero latency for real-time use cases like voice assistants. I can code on the plane soon!

@rohanpaul_ai @adrgrondin On-device models at that speed make private, low-latency assistants feel much more practical, especially for workflows that can tolerate smaller context and narrower reasoning.

@rohanpaul_ai @adrgrondin 128K context fully offline on a phone is something I didn't think we'd see this soon 🤯

@rohanpaul_ai @adrgrondin My iPhone 15 got hot like hell, never used again.

@IntuitMachine No cloud. No latency. Full reasoning on device. The API bill just got a competitor 👀

@rohanpaul_ai @adrgrondin The current small models are absolutely terrible. They fail even the weakest benchmarks. Be serious

@rohanpaul_ai @adrgrondin But it takes up close to 2-3 GB on a mobile phone?

@rohanpaul_ai @adrgrondin Wait, are we really coding like that on our phones now?! 🤯📱