atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook Pro M5 Max, 64GB.
Liquid’s much smaller LFM2.5-8B-A1B beat gpt-oss-20b by finishing every required tool call, cutting runtime by more than half, and using 4.8GB RAM instead of 11GB.
The task was not normal chat, because the model had to plan a trip by calling outside tools for 3 weather checks, 2 currency conversions, 1 email, and 1 reminder.
The striking part is that LFM2.5-8B-A1B is much smaller in active compute, yet it hit every required call at 266tok/s, while gpt-oss-20b used 11GB RAM, made only 3/7 calls, and ran at 146tok/s.
Now, tool calling is a control problem before it is a language problem.
The model has to preserve a checklist across context, decide when language should stop and action should begin, and resist the temptation to answer as if partial completion were enough.
A smaller mixture-of-experts model with only a fraction of its parameters active can win if its training shaped those control habits more sharply than a larger model’s general fluency did.