Neural MMO creator Joseph Suarez argues reinforcement learning research has spent a decade building slower, poorly scaling simulation environments


Most new clients don’t think we can do sim to real on their complex problem. They buy hoping 30% of what we pitched is true.

The “oh shit, it actually worked” moment is a fun call we get to have. These calls turn into “imagine if we could do X” conversations.

Joseph Suarez 🐡@jsuarez

A little perspective: RL as a field spent 10 years making algorithms slower and slower. If you look at the original ALE, it actually can sim a few thousand frames per second per core. If you look at some of the last big env releases before a ton of people moved over to LLMs, you'll find several at dozens to hundreds of steps per second with such bad engineering that they don't even scale with vectorization.

The field did this exactly because they presumed they would have to train directly in the real world. In reality, what we got out of this is a bunch of brittle off-pol and model-based algorithms that burn a ton of compute and don't work outside of the benchmarks shown in the original pubs. There's a clear gap between on-pol and other methods. You don't simply switch and scale up compute to save data. You have to spend a TON more compute to match the perf of on-pol, and then you spend even more compute to gain in sample efficiency.

Our whole core realization with PufferLib is that we can write good sims for a lot of problems 10000x faster. Good doesn't even mean accurate. It means accurate enough with domain randomization and other tricks that our agents can implicitly sysid their current setting and act robustly. So far, this has worked across several different industries. I'd love to give examples here, but this is unfortunately where exact client details get confidential. We need to be better about negotiating publicity, and we're starting to do that as Puffer gets bigger.

Another major flaw with slower and slower algorithms is that the core research loop also gets slower and slower. We sim mazes and 2048 at 10+m steps per second. Big deal right, those are easy. Wrong: algorithmic improvements on those envs have consistently predicted performance improvement on every single env in our test suite. Without this, we wouldn't have been able to release so many core breakthroughs in the last 2 years with a grand total of ~20 GPUs. We ran 20,000 experiments on ~12 of them in the 3 weeks leading up to Puffer 4 launch. At traditional speeds, it would have taken Google scale compute and an infra team.

So no, we're not going to step the real world at 20m sps, but assuming that matters (or at least that it is the only thing that matters) is where the field went wrong. /rant.
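The experiment numbers in the post can be turned into a rough per-run budget. The sketch below only does arithmetic on the quoted figures (~12 GPUs, 3 weeks, 20,000 experiments). The comparison against a slow simulator rests on assumptions that are not in the post: that each run occupies one GPU, that it is entirely simulation-bound at 10M steps per second, and that a "traditional" env runs at 100 steps per second. The function name `experiment_budget` is made up for illustration.

```python
def experiment_budget(gpus, days, experiments, fast_sps, slow_sps):
    """Rough per-experiment budget from aggregate figures."""
    gpu_hours = gpus * days * 24
    minutes_per_experiment = gpu_hours * 60 / experiments
    # Assumption: a run is purely sim-bound and uses its whole time slice.
    steps_per_experiment = fast_sps * minutes_per_experiment * 60
    # Hours needed to collect the same number of steps on the slow sim.
    slow_hours_per_experiment = steps_per_experiment / slow_sps / 3600
    return {
        "gpu_hours": gpu_hours,
        "minutes_per_experiment": minutes_per_experiment,
        "steps_per_experiment": steps_per_experiment,
        "slow_hours_per_experiment": slow_hours_per_experiment,
    }


# Quoted figures: ~12 GPUs, 3 weeks, 20,000 experiments, 10+m steps per second.
# The 100 steps/second value is an assumed point inside "dozens to hundreds".
budget = experiment_budget(
    gpus=12, days=21, experiments=20_000, fast_sps=10_000_000, slow_sps=100
)
for name, value in budget.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions each experiment gets roughly 18 GPU-minutes, and the same step count on the slow simulator would take on the order of tens of thousands of hours per run. Real runs also spend time on learning updates, so treat this as an illustration of scale rather than a measurement.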


Joseph Suarez 🐡@jsuarez

We're not primarily robotics. PufferLib has been used professionally in finance, commerce, gaming, defense, and animation. Has worked great. The world model stuff straight up doesn't work. Some surrounding lit is borderline scientific fraud. I believe it works better where you have a good amount of static data to train it like in robotics. But in our case, we simply won't need to scale up because we can hit the throughputs of the largest RL projects from OAI/DM on a single node, and it seems like the models can be much smaller than previously expected


An Eevee@rw_eevee

Appreciate your perspective and yes, this is a largely accurate history of the field. We’ve been all-in on fast sim + DR + RNN-style networks for the past ~2 years at least. It works great and we’ve gotten some spectacular results.

But sadly some problems have refused to yield to sim as easily as locomotion. That’s why every company suddenly declared world models the new hype. It’s supposed to be a totally general AI-based sim. You can at least put it on thousands of GPUs to get 20 million sps if you want.

Will it work? idk maybe. VLAs kinda remain a joke. Hand-crafted sims aren’t accurate or diverse enough yet. Offline RL has never worked once in history. But you have to put your eggs in some basket 🤷‍♂️


kache@yacineMTB

@jsuarez yes


Joseph Suarez 🐡@jsuarez

@rw_eevee If you have your own fast rnn pipeline, try Muon and retune if you're still on Adam. Was a major boost. Our other things are harder to integrate. PufferNet rocks but needs kernels


An Eevee@rw_eevee

@jsuarez Love it. You’re doing great work man 🫡


Spencer Cheng@spenccheng

@EmanueleUngaro_ We just write our own env physics for specific problems with DR and a couple env tricks. Results transfer fairly well.

@yacineMTB is playing with Puffer and Mujoco Warp and getting some fun results too.
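As a rough illustration of "own env physics with DR", here is a minimal sketch of a batched, hand-written simulator with per-environment domain randomization. It is not PufferLib's API and not the physics anyone in this thread uses: the class name `RandomizedPointMass`, the 1-D point-mass dynamics, and the parameter ranges are all hypothetical. The idea it shows is that mass and drag are resampled per environment and hidden from the observation, so a recurrent policy would have to infer them from how the state responds to its actions.

```python
import numpy as np


class RandomizedPointMass:
    """Batch of 1-D point masses, each with its own hidden mass and drag."""

    def __init__(self, num_envs, seed=0, dt=0.02):
        self.num_envs = num_envs
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        n = self.num_envs
        # Domain randomization: physical parameters are resampled per env.
        # The ranges are illustrative, not tuned for any real system.
        self.mass = self.rng.uniform(0.5, 2.0, size=n)
        self.drag = self.rng.uniform(0.0, 0.5, size=n)
        self.pos = self.rng.uniform(-1.0, 1.0, size=n)
        self.vel = np.zeros(n)
        return self._obs()

    def _obs(self):
        # Mass and drag are deliberately left out of the observation.
        return np.stack([self.pos, self.vel], axis=1)

    def step(self, force):
        # One vectorized update steps every environment at once.
        force = np.clip(force, -1.0, 1.0)
        acc = (force - self.drag * self.vel) / self.mass
        self.vel = self.vel + acc * self.dt  # semi-implicit Euler
        self.pos = self.pos + self.vel * self.dt
        reward = -np.abs(self.pos)  # stay near the origin
        return self._obs(), reward


env = RandomizedPointMass(num_envs=4096)
obs = env.reset()
obs, reward = env.step(np.ones(env.num_envs))
print(obs.shape, reward.shape)
```

The same force produces a different velocity in each environment because the masses differ, which is the signal a policy can use to identify its current setting.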


Spencer Cheng@spenccheng

@EmanueleUngaro_ @yacineMTB @finlay_sanders and Sam wrote the drone env as an example and have sim-to-real working there.


Emanuele@EmanueleUngaro_

@spenccheng What do you guys use for physics that pairs nicely with PufferLib? Mujoco?


You Jiacheng@YouJiacheng

@jsuarez I wonder what DR Puffer uses in different domains. It's really interesting. How do you make DR good enough…


Sciumo@SciumoInc

@jsuarez “In reality, what we got out of this is a bunch of brittle off-pol and model-based algorithms that burn a ton of compute and don't work”

A bit verbose for a tshirt, but still worthy


gfodor.id@gfodor

@jsuarez @rw_eevee as a total RL outsider, but someone who just started dabbling with pufferlib - what's the best technique for leveraging the existing VR content ecosystem for this? i have been working on OpenXR for a while now and can't help but think there's a huge overhang from this.


Emanuele@EmanueleUngaro_

@spenccheng @yacineMTB @finlay_sanders So in this field people just write their own simulation engine from scratch every time? Because you can make it super barebones and more efficient?
