/AI20h ago

Liquid LFM2.5-8B-A1B Beats OpenAI Gpt-Oss-20B On Local Tool Calling

--0--
Quote posts
Reposts
Original post
Rohan Paul@rohanpaul_ai#1032inAI

atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook Pro M5 Max, 64GB.

Liquid’s much smaller LFM2.5-8B-A1B beat gpt-oss-20b by finishing every required tool call, cutting runtime by more than half, and using 4.8GB RAM instead of 11GB.

The task was not normal chat, because the model had to plan a trip by calling outside tools for 3 weather checks, 2 currency conversions, 1 email, and 1 reminder.

The striking part is that LFM2.5-8B-A1B is much smaller in active compute, yet it hit every required call at 266tok/s, while gpt-oss-20b used 11GB RAM, made only 3/7 calls, and ran at 146tok/s.

Now, tool calling is a control problem before it is a language problem.

The model has to preserve a checklist across context, decide when language should stop and action should begin, and resist the temptation to answer as if partial completion were enough.

A smaller mixture-of-experts model with only a fraction of its parameters active can win if its training shaped those control habits more sharply than a larger model’s general fluency did.

1:31 PM · May 30, 2026 · 8.6K Views
Sentiment
Sentiment unavailable for this story.
Cluster Engagement
-
Views
-
Comments
-
Reposts
-
Bookmarks
Expand data
Posts from X
Most Activity
Most ActivityTimeline
RETWEETS7
Rohan Paul@rohanpaul_ai

atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook Pro M5 Max, 64GB.

Liquid’s much smaller LFM2.5-8B-A1B beat gpt-oss-20b by finishing every required tool call, cutting runtime by more than half, and using 4.8GB RAM instead of 11GB.

The task was not normal chat, because the model had to plan a trip by calling outside tools for 3 weather checks, 2 currency conversions, 1 email, and 1 reminder.

The striking part is that LFM2.5-8B-A1B is much smaller in active compute, yet it hit every required call at 266tok/s, while gpt-oss-20b used 11GB RAM, made only 3/7 calls, and ran at 146tok/s.

Now, tool calling is a control problem before it is a language problem.

The model has to preserve a checklist across context, decide when language should stop and action should begin, and resist the temptation to answer as if partial completion were enough.

A smaller mixture-of-experts model with only a fraction of its parameters active can win if its training shaped those control habits more sharply than a larger model’s general fluency did.

20hViews 8.6KLikes 54Bookmarks 28