
Modal Labs Hits 1033 Tokens Per Second In LLM Inference Test

Original post
Charles 🎉 Frye @charles_irl · #927 in Tech

1k tps is the new baseline ig

https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245

5:18 PM · Jun 9, 2026 · 3.6K Views
Sentiment

Users are impressed that Modal Labs reached 1033 tokens per second in LLM inference with a smaller model (35B-A3B) on fewer, slower GPUs and at higher precision than the 1T-model setup it is compared against: two H100s at FP8 rather than eight B200s at FP4.

Positive: 100.0%
Negative: 0.0%
1 comment with sentiment.
Cluster Engagement
Posts from X

this is a smaller model, 35B-A3B rather than 1T

but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 16x out of the ~30x model gap diff. could probably close w custom dflash

not bad for some demo code :)

1k tps is the new baseline ig

https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245

19h · Views 985 · Likes 5 · Bookmarks 1
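The arithmetic in the post can be checked with a rough sketch. It takes the post's three factors at face value (4x fewer GPUs, 2x slower GPUs, 2x higher precision) and assumes the "~30x" model gap is the ratio of total parameter counts, 1T to 35B. The variable names are illustrative and do not come from the linked pull request.

```python
# Factors quoted in the post, comparing this run with the 1T-model setup.
gpu_count_factor = 8 / 2   # two GPUs instead of eight
gpu_speed_factor = 2.0     # the post's claim: H100 is about 2x slower than B200
precision_factor = 8 / 4   # FP8 instead of FP4, i.e. 2x the bits per weight

# Combined handicap of the smaller setup, treating the factors as multiplicative.
hardware_handicap = gpu_count_factor * gpu_speed_factor * precision_factor

# Assumption: the model gap is measured in total parameters (1T vs 35B).
model_gap = 1e12 / 35e9

# Portion of the model gap not explained by the three factors above.
remaining_gap = model_gap / hardware_handicap

print(f"hardware handicap: {hardware_handicap:.0f}x")
print(f"model gap:         {model_gap:.1f}x")
print(f"remaining gap:     {remaining_gap:.2f}x")
```

Under these assumptions the three factors multiply to the 16x in the post, and a bit under 2x of the model gap remains, which is the part the post suggests a custom kernel might close.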
Doğaç @dogacel0

@charles_irl Isn't speculation depth 8 too aggressive for such a model? I thought for a big MoE model depth ~4 is better. What is the acceptance length in SPEED-bench or MT-Bench?

19h · Views 40
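The depth question in the reply can be made concrete with a simplified model. The sketch below assumes every drafted token is accepted independently with the same probability alpha. Real acceptance rates vary by position and prompt, so this is an idealization. Under that assumption, the expected number of tokens emitted per verification step at draft depth k is the geometric sum (1 - alpha^(k+1)) / (1 - alpha). The function name and the alpha values are hypothetical. They are not measurements from the linked pull request or from any benchmark.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the `depth` drafted tokens is accepted independently
    with probability `alpha`, and that the verifier always contributes
    one token of its own (the correction or the bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if alpha == 1.0:
        return float(depth + 1)  # every draft token accepted, plus one
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


# Illustrative acceptance probabilities only.
for alpha in (0.6, 0.8, 0.9):
    d4 = expected_tokens_per_step(alpha, 4)
    d8 = expected_tokens_per_step(alpha, 8)
    print(f"alpha={alpha}: depth 4 -> {d4:.2f}, depth 8 -> {d8:.2f}, gain {d8 / d4:.2f}x")
```

In this model the gain from depth 8 over depth 4 shrinks as alpha falls, while the drafting and verification cost of the extra tokens does not. That is why a measured acceptance length matters for choosing the depth.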