1k tps is the new baseline ig
https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245
Users are impressed by Modal Labs hitting 1033 tokens per second in LLM inference because it was achieved efficiently with a smaller model on far fewer and slower GPUs than typical setups.
this is a smaller model, 35B-A3B rather than 1T
but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 4 x 2 x 2 = 16x out of the ~30x model size gap. could probably close the rest w custom dflash
not bad for some demo code :)
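A quick back-of-envelope check of that arithmetic, as a rough sketch. The three 2x/4x factors are the ones quoted in the thread; treating "1T" as 1000B total parameters and comparing total (not active) parameter counts is an assumption made here for illustration.

```python
# Handicap factors as stated in the thread (rough figures, not measurements).
gpu_count_factor = 8 / 2   # eight GPUs vs two
gpu_speed_factor = 2.0     # B200 vs H100, quoted as ~2x
precision_factor = 2.0     # FP4 vs FP8, quoted as ~2x

hardware_factor = gpu_count_factor * gpu_speed_factor * precision_factor

# Assumption: compare total parameter counts, 1T (~1000B) vs 35B.
model_gap = 1000 / 35

# What is left of the model size advantage after the hardware handicap.
remaining = model_gap / hardware_factor

print(f"hardware handicap: {hardware_factor:.0f}x")
print(f"model size gap:    {model_gap:.1f}x")
print(f"left to close:     {remaining:.2f}x")
```

Under these assumptions the unexplained part of the gap comes out a bit under 2x, which is the part the thread suggests a custom drafter might close.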

@charles_irl Isn't speculation depth 8 too aggressive for such a model? I thought depth ~4 works better for a big MoE model. What is the acceptance length on SPEED-bench or MT-Bench?
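For intuition on the depth question, here is a toy model of acceptance length versus speculation depth. It assumes plain chain drafting where each drafted token is accepted independently with probability `alpha`, which gives the standard expected tokens per verification step of (1 - alpha^(depth+1)) / (1 - alpha). The `alpha` values below are made up for illustration and are not measurements from the demo, and the drafter in the PR may not behave like this simple model.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Toy assumption: each of `depth` drafted tokens is accepted independently
    with probability `alpha`; the step always yields one token from the target.
    """
    if alpha >= 1.0:
        return float(depth + 1)
    return (1 - alpha ** (depth + 1)) / (1 - alpha)


# Hypothetical acceptance rates, purely illustrative.
for alpha in (0.6, 0.8, 0.9):
    d4 = expected_tokens_per_step(alpha, 4)
    d8 = expected_tokens_per_step(alpha, 8)
    print(f"alpha={alpha}: depth 4 -> {d4:.2f} tokens/step, depth 8 -> {d8:.2f} tokens/step")
```

In this model, going from depth 4 to 8 adds little when acceptance is low and more when it is high, so whether depth 8 is too aggressive depends on the measured acceptance length and on the extra draft and verification cost, which this sketch does not include.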