A little perspective: RL as a field spent 10 years making algorithms slower and slower. If you look at the original ALE, it actually can sim a few thousand frames per second per core. If you look at some of the last big env releases before a ton of people moved over to LLMs, you'll find several at dozens to hundreds of steps per second with such bad engineering that they don't even scale with vectorization.
The field did this exactly because they presumed they would have to train directly in the real world. In reality, what we got out of this is a bunch of brittle off-pol and model-based algorithms that burn a ton of compute and don't work outside of the benchmarks shown in the original pubs. There's a clear gap between on-pol and other methods. You don't simply switch and scale up compute to save data. You have to spend a TON more compute to match the perf of on-pol, and then you spend even more compute to gain in sample efficiency.
Our whole core realization with PufferLib is that we can write good sims for a lot of problems 10000x faster. Good doesn't even mean accurate. It means accurate enough with domain randomization and other tricks that our agents can implicitly sysid their current setting and act robustly. So far, this has worked across several different industries. I'd love to give examples here, but this is unfortunately where exact client details get confidential. We need to be better about negotiating publicity, and we're starting to do that as Puffer gets bigger.
Another major flaw with slower and slower algorithms is that the core research loop also gets slower and slower. We sim mazes and 2048 at 10+m steps per second. Big deal right, those are easy. Wrong: algorithmic improvements on those envs have consistently predicted performance improvement on every single env in our test suite. Without this, we wouldn't have been able to release so many core breakthroughs in the last 2 years with a grand total of ~20 GPUs. We ran 20,000 experiments on ~12 of them in the 3 weeks leading up to Puffer 4 launch. At traditional speeds, it would have taken Google scale compute and an infra team.
So no, we're not going to step the real world at 20m sps, but assuming that matters (or at least that it is the only thing that matters) is where the field went wrong. /rant.





