Entropix creator _xjdr argues provider tokens-per-second claims are unreproducible without standardized reporting context
Omitted metrics include time-to-first-token (TTFT) latency and tokens per GPU per second.
Users see higher tokens-per-second rates as a useful reference point for judging AI model speed and context handling on devices like Macs.
@_xjdr tok/s? at what tokens/gpu/s and at what ttft and ...
when people or providers say 'this model gets <x> tok/s' i honestly have no idea what they mean (and as such can never reproduce / verify said claims)

@_xjdr Cards will always have that 🫡 but yeah it’s like “okay is it special kernels? Spec decode? Quantized and not telling us? Congrats, I think?”

I mean, one with higher tokens per second is speedy, and it gets through context quicker as well lol. It’s a good reference point: if it runs 50 tok/s on a MacBook M2 and I have the M1, I know it’s POSSIBLE I can get similar, but since it’s an older gen I can assume 30-40 tok/s and decide if it’s worth downloading
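
To make the missing context from the thread concrete, here is a minimal sketch of a report that keeps a tok/s figure together with TTFT and tokens/gpu/s. The class name `ThroughputReport`, its fields, and the example numbers are all hypothetical. They are not taken from any provider or benchmark, and the per-GPU figure assumes every request in the batch behaves like the one measured.

```python
from dataclasses import dataclass


@dataclass
class ThroughputReport:
    """Hypothetical record of one generation run (illustrative names only)."""
    request_start_s: float   # time the request was sent
    first_token_s: float     # time the first output token arrived
    last_token_s: float      # time the last output token arrived
    output_tokens: int       # tokens generated for this request
    num_gpus: int = 1        # GPUs serving the model
    batch_size: int = 1      # concurrent requests during the run

    @property
    def ttft_s(self) -> float:
        # Time to first token: prompt processing plus any queueing delay.
        return self.first_token_s - self.request_start_s

    @property
    def decode_tok_per_s(self) -> float:
        # Per-request decode rate; the first token is excluded because
        # its cost is already counted in TTFT.
        decode_time = self.last_token_s - self.first_token_s
        if decode_time <= 0:
            raise ValueError("need at least two tokens with distinct timestamps")
        return (self.output_tokens - 1) / decode_time

    @property
    def end_to_end_tok_per_s(self) -> float:
        # Rate a user experiences when TTFT is included.
        return self.output_tokens / (self.last_token_s - self.request_start_s)

    @property
    def tok_per_gpu_per_s(self) -> float:
        # Aggregate throughput per GPU, assuming (simplification) that all
        # batch_size requests decode at the same rate as this one.
        return self.decode_tok_per_s * self.batch_size / self.num_gpus


# Made-up numbers, purely for illustration.
report = ThroughputReport(
    request_start_s=0.0,
    first_token_s=0.5,
    last_token_s=10.5,
    output_tokens=501,
    num_gpus=8,
    batch_size=16,
)
print(f"TTFT: {report.ttft_s:.2f} s")
print(f"decode: {report.decode_tok_per_s:.1f} tok/s per request")
print(f"end to end: {report.end_to_end_tok_per_s:.1f} tok/s per request")
print(f"throughput: {report.tok_per_gpu_per_s:.1f} tok/gpu/s")
```

The same run yields several different "tok/s" numbers depending on whether TTFT is included and whether the rate is per request or per GPU, which is why a bare figure is hard to reproduce. The sketch still leaves out the factors raised in the reply (special kernels, speculative decoding, quantization), which would also need to be stated alongside the numbers.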