
Modal Labs Hits 1033 Tokens Per Second In LLM Inference Test

Original post
Charles 🎉 Frye @charles_irl

1k tps is the new baseline ig

https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245

5:18 PM · Jun 9, 2026 · 2.4K Views
Sentiment

Users are impressed by Modal Labs hitting 1033 tokens per second in LLM inference because it was achieved efficiently with a smaller model on far fewer and slower GPUs than typical setups.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)
Posts from X

this is a smaller model, 35B-A3B rather than 1T

but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 16x out of the ~30x model gap diff. could probably close w custom dflash

not bad for some demo code :)


2h · Views 675 · Likes 3 · Bookmarks 1
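The arithmetic in the post can be checked with a short sketch. It assumes the "~30x model gap" is measured in total parameters (1T vs 35B), which the post does not state, and it takes the post's rough 2x factors for GPU speed and precision at face value. The variable names are illustrative only.

```python
# Factors quoted in the post; each is the post's own rough estimate.
gpu_count_factor = 8 / 2   # eight GPUs vs two -> 4x fewer
gpu_speed_factor = 2.0     # post's claim: H100 is ~2x slower than B200
precision_factor = 2.0     # FP8 vs FP4 -> 2x higher precision

# Combined hardware/precision handicap of the smaller setup.
handicap = gpu_count_factor * gpu_speed_factor * precision_factor

# Assumption: the model gap is the ratio of total parameter counts.
model_gap = 1e12 / 35e9

# Portion of the model gap not explained by the handicap.
remaining = model_gap / handicap

print(f"handicap: {handicap:.0f}x")
print(f"model gap: {model_gap:.1f}x")
print(f"remaining gap: {remaining:.2f}x")
```

Under these assumptions the handicap is the 16x the post mentions, and a little under 2x of the model gap is left over, which is the part the post suggests a custom drafter could close.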
Doğaç @dogacel0

@charles_irl Isn't speculation depth 8 too aggressive for such a model? I thought that for a big MoE model a depth of ~4 is better. What is the acceptance length on SPEED-bench or MT-Bench?

2h · Views 40
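The trade-off behind this question can be sketched with the standard simplified model of speculative decoding, in which each drafted token is accepted independently with probability alpha. That independence is an assumption and real acceptance rates vary by position and workload. The alpha values below are hypothetical, not measurements from the linked PR, and `expected_tokens_per_step` is an illustrative helper rather than part of any library.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step.

    Simplified model: each of `depth` drafted tokens is accepted
    independently with probability `alpha`, and the verifier always
    contributes one token, giving (1 - alpha**(depth + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the verifier's token.
        return float(depth + 1)
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, for illustration only.
for alpha in (0.6, 0.8, 0.9):
    d4 = expected_tokens_per_step(alpha, 4)
    d8 = expected_tokens_per_step(alpha, 8)
    print(f"alpha={alpha}: depth 4 -> {d4:.2f}, depth 8 -> {d8:.2f}")
```

In this model the extra tokens gained by going from depth 4 to depth 8 shrink as alpha falls, while the drafting and verification cost of the deeper run still has to be paid. That is why a measured acceptance length matters for choosing the depth.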