Tech · 19h ago

Modal Labs Hits 1033 Tokens Per Second In LLM Inference Test

Original post

Charles 🎉 Frye (@charles_irl) · #927 in Tech

1k tps is the new baseline ig

https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245

5:18 PM · Jun 9, 2026 · 3.6K Views

Sentiment

Users are impressed that Modal Labs reached 1033 tokens per second in LLM inference with a smaller model (35B-A3B rather than 1T) running on two H100 GPUs at FP8, rather than on eight B200 GPUs at FP4.

Pos

100.0%

Neg

0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 985 · Bookmarks: 1 · Likes: 5

Charles 🎉 Frye (@charles_irl)

this is a smaller model, 35B-A3B rather than 1T

but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 16x out of the ~30x model gap diff. could probably close w custom dflash

not bad for some demo code :)
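The arithmetic in the post above can be checked with a quick back-of-envelope sketch. The three handicap factors are taken directly from the post; the assumption that the "~30x model gap" refers to total parameter count (1T versus 35B) is ours, not stated in the thread.

```python
# Back-of-envelope check of the factors quoted in the post.
gpu_count_factor = 8 / 2   # eight GPUs vs. two
gpu_speed_factor = 2       # B200 taken as ~2x an H100, per the post
precision_factor = 2       # FP4 vs. FP8, per the post

hardware_handicap = gpu_count_factor * gpu_speed_factor * precision_factor

# ASSUMPTION: the model gap is measured in total parameters (1T vs. 35B).
model_gap = 1_000e9 / 35e9

# Whatever the hardware handicap does not account for.
remaining_gap = model_gap / hardware_handicap

print(f"hardware handicap: {hardware_handicap:.0f}x")
print(f"model gap:         {model_gap:.1f}x")
print(f"remaining gap:     {remaining_gap:.2f}x")
```

Under that reading, the hardware and precision differences cover 16x of a roughly 29x gap, leaving a bit under 2x, which is the part the post suggests might be closed.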

Doğaç@dogacel0

@charles_irl Isn't speculation depth 8 too aggressive for such a model? I thought depth ~4 is better for a big MoE model. What is the acceptance length on SPEED-bench or MT-Bench?
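For context on the depth question, here is a toy sketch of how acceptance length scales with speculation depth. It assumes a simple linear draft chain in which each drafted token is accepted independently with probability `alpha`; that model, the function name, and the sample `alpha` value are illustrative assumptions, not the setup used in the linked PR.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step.

    ASSUMPTION: a linear draft of `depth` tokens, each accepted
    independently with probability `alpha`; the verifier always
    contributes one token of its own.
    """
    if alpha == 1.0:
        return float(depth + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^depth
    return (1 - alpha ** (depth + 1)) / (1 - alpha)


# Hypothetical acceptance rate, chosen only for illustration.
alpha = 0.5
for depth in (4, 8):
    print(depth, expected_tokens_per_step(alpha, depth))
```

In this toy model, going from depth 4 to depth 8 adds little when `alpha` is low, since later draft tokens are rarely reached. Whether depth 8 pays off in practice depends on the measured acceptance rate and the cost of drafting, which is what the question is asking for.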

19h ago