1k tps is the new baseline ig
https://github.com/modal-labs/modal-examples/pull/1586#discussion_r3384608245
Users viewed Modal Labs' 1033 tokens-per-second inference result favorably, noting that it used a smaller model on fewer, slower GPUs yet still performed competitively.
this is a smaller model, 35B-A3B rather than 1T
but it uses 4x fewer GPUs (two not eight) that are 2x slower (H100 not B200) and at a 2x higher precision (FP8 not FP4). that's 16x of the ~30x model size gap. could probably close the rest w custom dflash
not bad for some demo code :)
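
The arithmetic in the post can be checked in a few lines. This is only a rough sketch: the three 2x/4x factors are taken as stated in the post, and reading the "~30x" gap as the ratio of total parameter counts (1T vs 35B) is an assumption, not something the post spells out.

```python
# Factors quoted in the post: how much more hardware the 1T setup had.
gpu_count_factor = 8 / 2   # eight GPUs vs two
gpu_speed_factor = 2       # B200 vs H100, "2x" as stated in the post
precision_factor = 2       # FP4 vs FP8, "2x" as stated in the post

hardware_factor = gpu_count_factor * gpu_speed_factor * precision_factor

# Assumption: the "~30x" gap is the ratio of total parameter counts.
model_gap = 1000 / 35      # 1T vs 35B total parameters

# Whatever is left after dividing out the hardware advantage.
remaining_gap = model_gap / hardware_factor

print(f"hardware factor: {hardware_factor:.0f}x")
print(f"model gap:       {model_gap:.1f}x")
print(f"left to close:   {remaining_gap:.2f}x")
```

Under those assumptions the hardware factors multiply to 16x, leaving a bit under 2x of the gap unaccounted for.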

@charles_irl Isn't speculation depth 8 too aggressive for such a model? I thought depth ~4 works better for a big MoE model. What is the acceptance length on SPEED-bench or MT-Bench?
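
For context on the depth question, here is a minimal sketch of the usual textbook model of speculative decoding. It assumes each drafted token is accepted independently with the same probability `alpha` and that one draft step costs a fixed fraction `draft_cost` of a target forward pass. Both function names and all numbers below are hypothetical illustrations, not measurements from the Modal example or from any benchmark.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verify step, assuming i.i.d. acceptance.

    Counts the accepted draft tokens plus the one token the target model
    always contributes: 1 + alpha + alpha^2 + ... + alpha^depth.
    """
    if alpha == 1.0:
        return float(depth + 1)
    return (1 - alpha ** (depth + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, depth: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding under the simplified cost model.

    One step costs `depth` draft passes plus one target pass, measured in
    units of a target pass.
    """
    step_cost = depth * draft_cost + 1
    return expected_tokens_per_step(alpha, depth) / step_cost


if __name__ == "__main__":
    # Hypothetical acceptance rates and draft cost, for illustration only.
    for alpha in (0.6, 0.8):
        for depth in (4, 8):
            tokens = expected_tokens_per_step(alpha, depth)
            speedup = estimated_speedup(alpha, depth, draft_cost=0.1)
            print(f"alpha={alpha} depth={depth} "
                  f"tokens/step={tokens:.2f} speedup={speedup:.2f}")
```

In this model, going deeper only pays off when acceptance is high enough that the extra draft passes are not wasted, which is why the measured acceptance length matters for choosing between depth 4 and depth 8.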