/AI · 3h ago

Modal Labs Hits 1033 Tokens Per Second In LLM Inference Test

#848

Original post

Charles 🎉 Frye · @charles_irl · #848 in AI

1k tps is the new baseline ig

https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245

5:18 PM · Jun 9, 2026 · 2.4K Views

Sentiment

Users are impressed by Modal Labs hitting 1033 tokens per second in LLM inference because the result came from a smaller model (35B-A3B rather than 1T) running on two H100 GPUs at FP8, rather than eight B200 GPUs at FP4.

Pos

100.0%

Neg

0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 675 · Bookmarks 1 · Likes 3

Charles 🎉 Frye · @charles_irl

this is a smaller model, 35B-A3B rather than 1T

but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 16x out of the ~30x model gap diff. could probably close w custom dflash

not bad for some demo code :)
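
The back-of-the-envelope arithmetic in the post above can be checked with a short sketch. The three factors (4x, 2x, 2x) are taken directly from the post. Reading "1T" and "35B" as total parameter counts is an assumption, as are all variable names.

```python
# Factors as stated in the post (not independently measured).
gpu_count_factor = 8 / 2   # two GPUs instead of eight
gpu_speed_factor = 2.0     # post's claim: H100 is about 2x slower than B200
precision_factor = 2.0     # FP8 instead of FP4

# Combined hardware/precision handicap claimed in the post.
resource_factor = gpu_count_factor * gpu_speed_factor * precision_factor

# Assumption: "1T" and "35B" refer to total parameter counts.
model_gap = 1e12 / 35e9

# Part of the model-size gap not covered by the resource factor.
remainder = model_gap / resource_factor

print(f"resource factor: {resource_factor:.0f}x")
print(f"model size gap:  {model_gap:.1f}x")
print(f"remaining gap:   {remainder:.2f}x")
```

Under these assumptions the three factors multiply to the 16x quoted in the post, and the parameter ratio comes out a little under the "~30x" it mentions.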

2h · 675 views · 3 likes · 1 bookmark

Doğaç@dogacel0

@charles_irl Isn't speculation depth 8 too aggressive for such model? I thought for a big MoE model depth ~4 is better. What is the acceptance length in SPEED-bench or MT-Bench?

2h40
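
The trade-off behind the question above (speculation depth versus acceptance length) can be illustrated with the standard idealized model of speculative decoding. This is a minimal sketch, assuming each drafted token is accepted independently with the same probability `alpha`. Real acceptance rates vary by position and workload, and nothing here reflects measurements from the linked pull request. The function name and the sample `alpha` values are hypothetical.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step.

    Idealized model: each of `depth` drafted tokens is accepted
    independently with probability `alpha`, and the verifier always
    contributes one extra token (the correction or bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(depth + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^depth
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


# Compare depth 4 and depth 8 for a few hypothetical acceptance rates.
for alpha in (0.6, 0.8, 0.9):
    d4 = expected_tokens_per_step(alpha, 4)
    d8 = expected_tokens_per_step(alpha, 8)
    print(f"alpha={alpha}: depth 4 -> {d4:.2f}, depth 8 -> {d8:.2f}")
```

In this model the gain from going deeper shrinks quickly when `alpha` is low, which is why a measured acceptance length is the number that settles whether depth 8 pays for its extra drafting cost.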