/Tech · 20h ago

Systems engineer Yacine uses PufferLib to train a reinforcement learning agent to balance a six-segment jointed pendulum cartpole

Training ran at roughly 10-20M steps per second on a single GPU, with 18M steps per second reported on some MuJoCo Warp configurations.


#229

Original post

kache@yacineMTB · #229 in Tech

I solved this by blasting the task in RL. Each dot here is an individual experiment with its own set of hyperparameters, trained in pufferPPO. PufferLib is the fastest RL training loop, by wallclock, that I've found. The X axis is wallclock time; the Y axis is "score".

kache@yacineMTB

This took me since last Thursday to solve. I had it solved by this morning. I'm only posting it this late in the evening because I had to learn Blender.

http://mechanize.work/apply

Rest of the thread is how I solved it (it was the dumbest way possible)

5:59 PM · Jun 8, 2026 · 19.3K Views
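
The sweep described in the post (one dot per experiment, each with its own hyperparameters) can be pictured with a minimal random-search sketch. This is not PufferLib's sweep code, and the thread later mentions a GP picking configurations, which is not reproduced here. The parameter names, the ranges, and the `sample_config` and `make_sweep` helpers are all assumptions made for illustration.

```python
import math
import random

# Hypothetical search space: name -> (low, high, log_scale).
# Names and ranges are illustrative assumptions, not values from the thread.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-2, True),
    "gamma": (0.9, 0.999, False),
    "entropy_coef": (1e-4, 1e-1, True),
}


def sample_config(rng):
    """Draw one hyperparameter set; log-scale entries are sampled log-uniformly."""
    config = {}
    for name, (low, high, log_scale) in SEARCH_SPACE.items():
        if log_scale:
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            config[name] = rng.uniform(low, high)
    return config


def make_sweep(num_experiments, seed=0):
    """Return a reproducible list of independent experiment configs."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(num_experiments)]
```

Each returned config would correspond to one dot on the wallclock-versus-score chart once it has been trained and scored.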


Sentiment

Users are praising the builder's hands-on RL experiments, trained with PufferLib, for solving a complex multi-pendulum control task, celebrating the volume of experimentation and the technical results.

Pos

100.0%

Neg

0.0%

16 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 15.8K · Bookmarks 58 · Likes 260 · Retweets 6 · Replies 7

Joseph Suarez 🐡@jsuarez

Trained with PufferLib! One GPU is all you need when training runs 10-20m steps/second

kache@yacineMTB

behold. THE WORLD'S FIRST SIX PENDULUM CARTPOLE SOLVE. Including a sponsor!

To solve this task, I built an environment to train an AI. This is what mechanize does, but for larger AIs. Apply! Salaries are up on their page

Thank you to mechanize for sponsoring!

7h · 15.8K views · 260 likes · 58 bookmarks

kache@yacineMTB

real machine learning for robotics hasn't been tried. no one has thought carefully about what the simulator does. what the distribution of real life is. where the bottlenecks are and where the shortcuts are

puffer is going to make RL faster and faster. The only limit is the env!

20h9.6K19921

kache@yacineMTB

So once the model started scoring well enough that it learned the whipping behavior, but struggled to keep it up longer than 10 seconds, I increased and randomized the episode length per episode

Just a dumb trick I found experimentally to make these little rnns behave better

20h8.5K16622
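
The trick in the post above (longer episodes whose length is re-drawn every episode) can be sketched as a per-environment step budget. This is a minimal illustration, not the author's environment code: `EpisodeTimer`, `sample_episode_length`, the uniform distribution, and the `max_scale` factor are assumptions.

```python
import random


def sample_episode_length(rng, base_steps, max_scale=3.0):
    """Pick a horizon uniformly in [base_steps, base_steps * max_scale]."""
    return rng.randint(base_steps, int(base_steps * max_scale))


class EpisodeTimer:
    """Per-environment step budget that is re-drawn at every reset."""

    def __init__(self, num_envs, base_steps, seed=0):
        self.rng = random.Random(seed)
        self.base_steps = base_steps
        self.limits = [
            sample_episode_length(self.rng, base_steps) for _ in range(num_envs)
        ]
        self.steps = [0] * num_envs

    def step(self):
        """Advance every env by one step; return indices that just timed out."""
        done = []
        for i in range(len(self.steps)):
            self.steps[i] += 1
            if self.steps[i] >= self.limits[i]:
                done.append(i)
                # Reset the counter and draw a fresh horizon for the next episode.
                self.steps[i] = 0
                self.limits[i] = sample_episode_length(self.rng, self.base_steps)
        return done
```

Because each environment draws its own horizon, resets drift apart instead of all landing on the same step, which is the staggering effect mentioned in the replies.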

kache@yacineMTB

I own a few GPUs, 4090s. I'm training relatively small models, puffer mingru. The policy I'm showing off is ~1m params. You set up an environment and a reward function. It's a bit of an art; here, you see the top right chart representing the reward. This is the training signal

20h9.7K18214
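
To make "an environment and a reward function" concrete, here is a minimal sketch of a shaped reward for a multi-segment pendulum on a cart. It is not the reward used in the thread, which is not given. The function name, the angle convention (0 radians means a segment points straight up), the track half-width, and the 0.1 penalty weight are all assumptions.

```python
import math


def upright_reward(joint_angles, cart_position, track_half_width=2.0):
    """Hypothetical shaped reward for a multi-segment cartpole.

    joint_angles: absolute angle of each segment from vertical, in radians
                  (0 means the segment points straight up).
    cart_position: cart offset from the track center, same units as
                   track_half_width.
    """
    # Mean cosine is 1.0 when every segment is upright, -1.0 when all hang down.
    uprightness = sum(math.cos(a) for a in joint_angles) / len(joint_angles)
    # Small quadratic penalty that discourages drifting toward the track ends.
    centering_penalty = 0.1 * (cart_position / track_half_width) ** 2
    return uprightness - centering_penalty
```

With six upright segments and a centered cart this returns 1.0, and the value falls as segments tip over or the cart drifts, which is the kind of training signal the post describes.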

kache@yacineMTB

I have some time now (I'm looking for a job, that's actually how I closed the mechanize sponsorship deal 🤪). So I'm going to spend the rest of the week standing up existing robotics simulators w/ fast RL for others

20h7.9K17511

kache@yacineMTB

That kind of gets me to how or why this is possible in the first place. This trains at 18m SPS on some configs with mujoco - I'm using mujoco warp.

I used APIC (API capture) to capture the cudagraph of the task, and make it callable from C. Speed is of utmost importance

20h8.1K12217

kache@yacineMTB

The thing that finally made this work was grabbing one of the top scoring hypers on the higher compute runs picked by the GP - and tweaking the task ever so slightly. One thing I've noticed about these models is that if episodes end at the same time, they get... lazy

20h8.9K15611

kache@yacineMTB

You learn by experimenting. Shaping reward, helping it along to have the right behaviour, figuring out what it can and can't learn. These models have surprised me, being trained in RL. If you just hold them right.. you can make them do remarkable things

20h7.8K1357

kache@yacineMTB

18m steps per second is ridiculously fast compared to what is in the literature. I saw 90k sps mentioned as fast today. That's so slow...

People are doing VLA shaped dead ends for robotics because they just don't have the software infra for RL

I ran 3.6k experiments for this!

20h7.3K1284
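
The gap between the two throughput figures in the post above is easy to work out. The 18M and 90k steps-per-second values come from the thread; the one-billion-step budget is a hypothetical number chosen only to make the wallclock difference tangible.

```python
def wallclock_seconds(total_steps, steps_per_second):
    """Time needed to collect total_steps at a constant throughput."""
    return total_steps / steps_per_second


FAST_SPS = 18_000_000        # throughput quoted in the thread
SLOW_SPS = 90_000            # figure the thread cites as "fast" elsewhere
STEP_BUDGET = 1_000_000_000  # hypothetical per-experiment step budget

speedup = FAST_SPS / SLOW_SPS                            # 200.0
fast_seconds = wallclock_seconds(STEP_BUDGET, FAST_SPS)  # about 55.6 seconds
slow_seconds = wallclock_seconds(STEP_BUDGET, SLOW_SPS)  # about 3.09 hours
```

Under that assumed budget, a run that takes under a minute at 18M steps per second would take roughly three hours at 90k, a 200x difference that matters when thousands of experiments are involved.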

kache@yacineMTB

greatness cannot be planned

7h4.4K621

kache@yacineMTB

@jsuarez more robotics stuff soon

Joseph Suarez 🐡@jsuarez

Trained with PufferLib! One GPU is all you need when training runs 10-20m steps/second

7h1K370

kache@yacineMTB

@RichardSSutton @yoavgo hello professor. check out my bandwagon. it has 6 pendulums and it is very bitter

kache@yacineMTB

behold. THE WORLD'S FIRST SIX PENDULUM CARTPOLE SOLVE. Including a sponsor!

To solve this task, I built an environment to train an AI. This is what mechanize does, but for larger AIs. Apply! Salaries are up on their page

Thank you to mechanize for sponsoring!

5h1.2K281

Joseph Suarez 🐡@jsuarez

@Laz4rz you don't need to autoresearch the squared env. It trains in about a second with any reasonable hypers. PROTEIN >> llm guessing for hparam sweeps as well

7h224103

Lazarz@Laz4rz

@jsuarez also pufferlib btw

7h49062

Joseph Suarez 🐡@jsuarez

right but that problem is so easy that you can solve it with anything. So is cartpole. The minimum problem that has some signal is probably breakout. It trains in around 4 seconds. See if you can get it to solve in 2 w/ changes to the arch/alg, then see if those changes transfer to other envs!

7h15931

kache@yacineMTB

@CarMarinkovic I've gotten drones to do backflips just waiting to be able to post it

20h33313

Carlos Marinkovic@CarMarinkovic

@yacineMTB Will you expound on what happened to the indoor drone swarm? Even just a superficial update? I've been rooting for you, as many have, and I'm curious.

20h3724

Stone Tao@Stone_Tao

@yacineMTB lmao both of these tricks are ones I suggested (longer episodes + randomize episode lengths/stagger resets)

20h3011

Dan Advantage@DanAdvantage

@yacineMTB no pufferlib, -1000000000000000 points

20h211

kache@yacineMTB

@CyborgLavery i have no idea lmfao

19h61