Tech · 3d ago

Neural MMO creator Joseph Suarez argues reinforcement learning research has spent a decade building slower, poorly scaling simulation environments

PufferLib's bet: simulators that run roughly 10,000x faster, kept accurate enough through domain randomization.

Original post
Joseph Suarez 🐡 @jsuarez · #1371 in Tech

A little perspective: RL as a field spent 10 years making algorithms slower and slower. If you look at the original ALE, it actually can sim a few thousand frames per second per core. If you look at some of the last big env releases before a ton of people moved over to LLMs, you'll find several at dozens to hundreds of steps per second with such bad engineering that they don't even scale with vectorization.
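For illustration (not from the original post): a minimal sketch of how one might measure aggregate steps per second as the number of environment copies grows. `CounterEnv` and `measure_sps` are hypothetical names, and the environment is a trivial counter. Because the copies are stepped one after another in a Python loop, total throughput should stay roughly flat as copies are added, which is the kind of non-scaling the post complains about.

```python
import time


class CounterEnv:
    """Hypothetical toy environment: the state is a single integer counter."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        done = self.state >= 100
        if done:
            self.state = 0  # auto-reset so the benchmark loop never stalls
        return self.state, 1.0, done


def measure_sps(num_envs, steps_per_env=10_000):
    """Return total environment steps per second across num_envs copies."""
    envs = [CounterEnv() for _ in range(num_envs)]
    for env in envs:
        env.reset()
    start = time.perf_counter()
    for _ in range(steps_per_env):
        for env in envs:  # serial stepping: no real vectorization here
            env.step(1)
    elapsed = time.perf_counter() - start
    return num_envs * steps_per_env / elapsed


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} envs: {measure_sps(n):,.0f} steps/sec")
```

Comparing the same measurement against a batched or multi-core implementation is one way to check whether an environment actually benefits from vectorization.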

The field did this exactly because they presumed they would have to train directly in the real world. In reality, what we got out of this is a bunch of brittle off-pol and model-based algorithms that burn a ton of compute and don't work outside of the benchmarks shown in the original pubs. There's a clear gap between on-pol and other methods. You don't simply switch and scale up compute to save data. You have to spend a TON more compute to match the perf of on-pol, and then you spend even more compute to gain in sample efficiency.

Our whole core realization with PufferLib is that we can write good sims for a lot of problems 10000x faster. Good doesn't even mean accurate. It means accurate enough with domain randomization and other tricks that our agents can implicitly sysid their current setting and act robustly. So far, this has worked across several different industries. I'd love to give examples here, but this is unfortunately where exact client details get confidential. We need to be better about negotiating publicity, and we're starting to do that as Puffer gets bigger.
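For illustration (not from the original post): a minimal sketch of domain randomization on a toy problem. Everything here is an assumption made for the example: the `RandomizedCartEnv` class, the three physics parameters, and their ranges are hypothetical and are not PufferLib's API or any client setup. The point is only that the physics are resampled on every reset and hidden from the observation, so a policy can succeed only by inferring the current setting from how the state responds to its actions.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    mass: float
    friction: float
    motor_gain: float


def sample_params(rng):
    """Draw one randomized 'world'. The ranges are made-up placeholders."""
    return PhysicsParams(
        mass=rng.uniform(0.8, 1.2),
        friction=rng.uniform(0.05, 0.3),
        motor_gain=rng.uniform(0.7, 1.3),
    )


class RandomizedCartEnv:
    """Hypothetical 1-D cart that should be driven to the origin.

    The physics parameters change every episode and are never observed.
    """

    def __init__(self, seed=0, dt=0.05):
        self.rng = random.Random(seed)
        self.dt = dt
        self.reset()

    def reset(self):
        self.params = sample_params(self.rng)  # new hidden physics per episode
        self.pos = self.rng.uniform(-1.0, 1.0)
        self.vel = 0.0
        return (self.pos, self.vel)

    def step(self, action):
        p = self.params
        force = p.motor_gain * max(-1.0, min(1.0, action))  # clip the action
        accel = (force - p.friction * self.vel) / p.mass
        self.vel += accel * self.dt
        self.pos += self.vel * self.dt
        reward = -abs(self.pos)  # closer to the origin is better
        return (self.pos, self.vel), reward
```

A recurrent policy trained across many such episodes sees the same action produce different accelerations, which is the signal it would need for the implicit system identification the post describes.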

Another major flaw with slower and slower algorithms is that the core research loop also gets slower and slower. We sim mazes and 2048 at 10+m steps per second. Big deal right, those are easy. Wrong: algorithmic improvements on those envs have consistently predicted performance improvement on every single env in our test suite. Without this, we wouldn't have been able to release so many core breakthroughs in the last 2 years with a grand total of ~20 GPUs. We ran 20,000 experiments on ~12 of them in the 3 weeks leading up to Puffer 4 launch. At traditional speeds, it would have taken Google scale compute and an infra team.
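For illustration (not from the original post): a back-of-the-envelope check of the experiment budget quoted above. The assumptions are loud ones: "~12" GPUs is treated as exactly 12, "3 weeks" as 21 days, the GPUs are assumed fully utilized, simulation is assumed to be the only bottleneck, and the 10,000x slowdown is borrowed from the post's "10000x faster" figure rather than measured.

```python
GPUS = 12             # "~12" in the post, treated as exact here
DAYS = 21             # "3 weeks"
EXPERIMENTS = 20_000
SLOWDOWN = 10_000     # assumed slowdown at "traditional speeds"


def gpu_minutes(gpus, days):
    """Total GPU-minutes available at full utilization."""
    return gpus * days * 24 * 60


def minutes_per_experiment(gpus, days, experiments):
    """Average GPU-minutes each experiment can consume."""
    return gpu_minutes(gpus, days) / experiments


def gpus_needed_when_slower(gpus, slowdown):
    """GPUs needed to finish in the same wall-clock time if every step is slower."""
    return gpus * slowdown


if __name__ == "__main__":
    print(gpu_minutes(GPUS, DAYS))                                  # 362880
    print(f"{minutes_per_experiment(GPUS, DAYS, EXPERIMENTS):.1f}")  # 18.1
    print(gpus_needed_when_slower(GPUS, SLOWDOWN))                  # 120000
```

Under these assumptions each experiment averages about 18 GPU-minutes, and running the same number of steps 10,000x slower in the same three weeks would take on the order of 120,000 GPUs.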

So no, we're not going to step the real world at 20m sps, but assuming that matters (or at least that it is the only thing that matters) is where the field went wrong. /rant.

8:16 AM · Jun 7, 2026 · 48.5K Views
Sentiment

Positive comments praise PufferLib's reported real-world results in finance, gaming, and other fields, while negative comments target the brittle off-policy and model-based algorithms the post criticizes.

Pos 66.7% · Neg 33.3%
3 comments with sentiment.
Cluster Engagement
Posts from X
Spencer Cheng @spenccheng

Most new clients don’t think we can do sim to real on their complex problem. They buy hoping 30% of what we pitched is true.

The “oh shit, it actually worked” moment is a fun call we get to have. These calls turn into “imagine if we could do X” conversations.

3d · Views 6.5K · Likes 47 · Bookmarks 25
RETWEETS 29

Original post · 3d · Views 48.5K · Likes 425 · Bookmarks 267
REPLIES 2

We're not primarily robotics. PufferLib has been used professionally in finance, commerce, gaming, defense, and animation. Has worked great. The world model stuff straight up doesn't work. Some surrounding lit is borderline scientific fraud. I believe it works better where you have a good amount of static data to train it, like in robotics. But in our case, we simply won't need to scale up, because we can hit the throughputs of the largest RL projects from OAI/DM on a single node, and it seems like the models can be much smaller than previously expected.

3d · Views 732 · Likes 11 · Bookmarks 5
An Eevee @rw_eevee

Appreciate your perspective and yes, this is a largely accurate history of the field. We’ve been all-in on fast sim + DR + RNN-style networks for the past ~2 years at least. It works great and we’ve gotten some spectacular results.

But sadly some problems have refused to yield to sim as easily as locomotion. That’s why every company suddenly declared world models the new hype. It’s supposed to be a totally general AI-based sim. You can at least put it on thousands of GPUs to get 20 million sps if you want.

Will it work? idk maybe. VLAs kinda remain a joke. Hand-crafted sims aren’t accurate or diverse enough yet. Offline RL has never worked once in history. But you have to put your eggs in some basket 🤷‍♂️

3d · Views 845 · Likes 6 · Bookmarks 1
kache @yacineMTB

@jsuarez yes

1d · Views 1K · Likes 1 · Bookmarks 2

@rw_eevee If you have your own fast RNN pipeline and you're still on Adam, try Muon and retune. It was a major boost. Our other things are harder to integrate. PufferNet rocks but needs kernels.

3d · Views 295 · Likes 6 · Bookmarks 1
An Eevee @rw_eevee

@jsuarez Love it. You’re doing great work man 🫡

3d · Views 287 · Likes 2
Spencer Cheng @spenccheng

@EmanueleUngaro_ We just write our own env physics for specific problems with DR and a couple env tricks. Results transfer fairly well.

@yacineMTB is playing with Puffer and Mujoco Warp and getting some fun results too.

3d · Views 36 · Likes 2
Spencer Cheng @spenccheng

@EmanueleUngaro_ @yacineMTB @finlay_sanders and Sam wrote the drone env as an example and have sim-to-real working there.

3d · Views 30 · Likes 2
Emanuele @EmanueleUngaro_

@spenccheng what do you guys use for physics that pairs nicely with pufferlib? Mujoco?

3d · Views 55
You Jiacheng @YouJiacheng

@jsuarez I wonder what DR Puffer uses in different domains. It's really interesting: how do you make DR good enough…

2d · Views 534 · Likes 2 · Bookmarks 0
Sciumo @SciumoInc

@jsuarez “In reality, what we got out of this is a bunch of brittle off-pol and model-based algorithms that burn a ton of compute and don't work”

A bit verbose for a tshirt, but still worthy

3d · Views 292 · Likes 1
gfodor.id @gfodor

@jsuarez @rw_eevee as a total RL outsider, but someone who just started dabbling with pufferlib - what's the best technique for leveraging the existing VR content ecosystem for this? i have been working on OpenXR for a while now and can't help but think there's a huge overhang from this.

3d · Views 90 · Likes 1
Emanuele @EmanueleUngaro_

@spenccheng @yacineMTB @finlay_sanders so in this field people just write their own simulation engine from scratch every time? Because you can make it super barebones and more efficient?

3d · Views 19