/Tech1h ago

Builder Trains 1M-Parameter RL Policy With Pufferlib on 4090 GPUs

Original post

kache@yacineMTB · #199 in Tech

I own a few GPUs, 4090s. I'm training relatively small models, puffer mingru. The policy I'm showing off is ~1m params. You set up an environment and a reward function. It's a bit of an art; here, you see the top right chart representing the reward. This is the training signal
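To make "an environment and a reward function" concrete, here is a minimal sketch in plain Python. It is not the author's task and does not use Pufferlib's API; `PointReachEnv`, its shaped reward, and the two hand-written policies are hypothetical stand-ins chosen only to show the shape of the loop.

```python
import random


class PointReachEnv:
    """Hypothetical toy task: push a 1-D point toward a fixed target."""

    def __init__(self, target=1.0, max_steps=50, seed=0):
        self.target = target
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.pos = 0.0
        self.t = 0

    def reset(self):
        # Start somewhere left of the target so there is something to learn.
        self.pos = self.rng.uniform(-1.0, 0.0)
        self.t = 0
        return self.pos

    def reward(self, pos, action):
        # Shaped reward: negative distance to target minus a small action cost.
        return -abs(self.target - pos) - 0.01 * abs(action)

    def step(self, action):
        action = max(-0.1, min(0.1, action))  # clip to the allowed range
        self.pos += action
        self.t += 1
        done = self.t >= self.max_steps
        return self.pos, self.reward(self.pos, action), done


def rollout(env, policy):
    """Run one episode and return the summed reward (the training signal)."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, r, done = env.step(policy(obs))
        total += r
    return total


def toward_target(obs):
    return 0.1 if obs < 1.0 else -0.1


def do_nothing(obs):
    return 0.0
```

A policy that moves toward the target collects a higher return than one that stands still, which is the kind of curve a reward chart tracks during training.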

kache@yacineMTB

I solved this by blasting the task in RL. Each dot here is an individual experiment with its own set of hyperparameters, trained in pufferPPO. Pufferlib is the fastest, by wallclock, RL training loop I've found. X axis is wallclock, Y axis is "score"
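The "one dot per experiment" picture can be sketched as a plain random search. This is an assumption-laden illustration: the hyperparameter names and ranges are made up, `run_experiment` is a fake stand-in for a real training run, and random sampling is swapped in for whatever search the author actually used.

```python
import math
import random


def sample_hypers(rng):
    """Draw one hypothetical hyperparameter set (log-uniform where sensible)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -2),
        "gamma": 1 - 10 ** rng.uniform(-3, -1),
        "ent_coef": 10 ** rng.uniform(-4, -1),
    }


def run_experiment(hypers, rng):
    """Stand-in for a training run: returns (wallclock seconds, score).

    The score is synthetic and peaks near learning_rate = 1e-3.
    """
    wallclock_s = rng.uniform(60, 600)
    score = -abs(math.log10(hypers["learning_rate"]) + 3.0) + rng.gauss(0, 0.1)
    return wallclock_s, score


def sweep(n, seed=0):
    """Run n independent experiments; each result is one dot on the chart."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        hypers = sample_hypers(rng)
        wallclock_s, score = run_experiment(hypers, rng)
        results.append(
            {"hypers": hypers, "wallclock_s": wallclock_s, "score": score}
        )
    return results


def best(results):
    return max(results, key=lambda r: r["score"])
```

Plotting `wallclock_s` against `score` for every entry of `sweep(n)` gives the scatter described in the post.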

6:08 PM · Jun 8, 2026 · 2K Views

Sentiment

Positive users praise the 18M SPS RL training speeds on 4090 GPUs as far beyond prior literature and promising for real robotics ML, while negative users harshly criticize projects for not using Pufferlib.

Pos

50.0%

Neg

50.0%

8 comments with sentiment.

Cluster Engagement

Posts from X

VIEWS 1.9K

kache@yacineMTB

real machine learning for robotics hasn't been tried. no one has thought carefully about what the simulator does. what the distribution of real life is. where the bottlenecks are and where the shortcuts are

puffer is going to make RL faster and faster. The only limit is the env!

BOOKMARKS 9 · LIKES 49 · RETWEETS 1 · REPLIES 4

kache@yacineMTB

So once the model started scoring well enough that it learned the whipping behavior, but struggled to keep it up longer than 10 seconds, I increased and randomized the episode length per episode

Just a dumb trick I found experimentally to make these little rnns behave better
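The episode-length trick can be sketched as a small helper that draws a fresh horizon on every reset, so episodes no longer all end at the same step. `RandomHorizon` and its bounds are hypothetical; the post does not say which distribution or range was used, so a uniform integer draw is assumed here.

```python
import random


class RandomHorizon:
    """Hypothetical helper: sample a new episode length at each reset."""

    def __init__(self, min_steps, max_steps, seed=0):
        if not 0 < min_steps <= max_steps:
            raise ValueError("need 0 < min_steps <= max_steps")
        self.min_steps = min_steps
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.horizon = max_steps
        self.t = 0

    def reset(self):
        # Assumed: uniform over [min_steps, max_steps], inclusive.
        self.horizon = self.rng.randint(self.min_steps, self.max_steps)
        self.t = 0
        return self.horizon

    def step(self):
        """Advance one step; return True when the episode should end."""
        self.t += 1
        return self.t >= self.horizon


def episode_lengths(horizon, n_episodes):
    """Run n_episodes and record how many steps each one lasted."""
    lengths = []
    for _ in range(n_episodes):
        horizon.reset()
        steps = 1
        while not horizon.step():
            steps += 1
        lengths.append(steps)
    return lengths
```

An environment would call `reset()` alongside its own reset and use the return value of `step()` as its truncation flag.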

kache@yacineMTB

The thing that finally made this work was grabbing one of the top scoring hypers on the higher compute runs picked by the GP - and tweaking the task ever so slightly. One thing I've noticed about these models is that if episodes end at the same time, they get... lazy

kache@yacineMTB

That kind of gets me to how or why this is possible in the first place. This trains at 18m SPS on some configs with mujoco - I'm using mujoco warp.

I used APIC (API capture) to capture the cudagraph of the task, and make it callable from C. Speed is of utmost importance

kache@yacineMTB

You learn by experimenting. Shaping reward, helping it along to have the right behaviour, figuring out what it can and can't learn. These models have surprised me, being trained in RL. If you just hold them right.. you can make them do remarkable things

kache@yacineMTB

18m steps per second is ridiculously fast compared to what is in the literature. I saw 90k sps mentioned as fast today. That's so slow...

People are doing VLA shaped dead ends for robotics because they just don't have the software infra for RL

I ran 3.6k experiments for this!
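For scale, the two throughput figures quoted above can be compared directly. The sketch below uses only the 18M and 90k steps-per-second numbers from the post; the one-billion-step budget is an assumed round number for illustration, not something the author states.

```python
def speedup(fast_sps, slow_sps):
    """Ratio between two steps-per-second figures."""
    return fast_sps / slow_sps


def seconds_for(steps, sps):
    """Wallclock seconds needed to collect `steps` environment steps."""
    return steps / sps


FAST_SPS = 18_000_000   # "18m steps per second" from the post
SLOW_SPS = 90_000       # "90k sps" from the post
BUDGET = 1_000_000_000  # assumed step budget, for illustration only

ratio = speedup(FAST_SPS, SLOW_SPS)
fast_minutes = seconds_for(BUDGET, FAST_SPS) / 60
slow_hours = seconds_for(BUDGET, SLOW_SPS) / 3600
```

Under that assumed budget, the faster loop finishes in about a minute while the slower one takes roughly three hours, a 200x gap per run that compounds across thousands of experiments.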

Dan Advantage@DanAdvantage

@yacineMTB no pufferlib, -1000000000000000 points

Dan Advantage@DanAdvantage

@yacineMTB @jsuarez constellation making the rounds
