Tech · 1d ago

Systems engineer Yacine claims the world's first reinforcement learning solve of a six-segment pendulum cartpole

The project required building a custom simulation environment.

Original post
kache@yacineMTB

behold. THE WORLDS FIRST SIX PENDULUM CARTPOLE SOLVE. Including a sponsor!

To solve this task, I built an environment to train an AI. This is what mechanize does, but for larger AIs. Apply! Salaries are up on their page

Thank you to mechanize for sponsoring!

5:50 PM · Jun 8, 2026 · 392.1K Views
Sentiment

Many users congratulated the builder on the AI-driven six-pendulum cartpole solve, calling it an impressive technical achievement. Some dismissed it as fake or unnecessary, arguing that the AI rather than the human did the work.

Pos: 84.3%
Neg: 15.7%
136 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity: Views 55.2K · Likes 822 · Retweets 9 · Replies 55
kache@yacineMTB

i actually can't believe i was the first person to solve 6 pendulum cartpole that's crazy

1d · Views 55.2K · Likes 822 · Bookmarks 62
kache@yacineMTB

I solved this by blasting the task in RL. Each dot here is an individual experiment with its own set of hyperparameters, trained in pufferPPO. Pufferlib is the fastest, by wallclock, RL training loop I've found. X axis is wallclock, Y axis is "score"

1d · Views 20K · Likes 385 · Bookmarks 113
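For readers unfamiliar with sweeps, the "each dot is an experiment" idea can be sketched as follows. This is a minimal illustration using plain random search, not the GP-guided sweep the thread mentions elsewhere. The hyperparameter names, their ranges, and `toy_score` are all assumptions made for the example; a real run would replace `toy_score` with an actual training job that reports a score.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical PPO-style configuration (names and ranges are assumed)."""
    return {
        # Learning rates are usually searched on a log scale.
        "learning_rate": 10 ** rng.uniform(-5, -2),
        "gamma": rng.uniform(0.9, 0.999),
        "hidden_size": rng.choice([64, 128, 256]),
    }


def toy_score(config):
    """Stand-in for a real training run: best when learning_rate is near 1e-3."""
    return -abs(math.log10(config["learning_rate"]) + 3.0)


def random_sweep(n_trials, seed=0):
    """Run n_trials independent experiments and return (score, config) pairs."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_trials)]
    return [(toy_score(c), c) for c in configs]


results = random_sweep(50)
best_score, best_config = max(results, key=lambda pair: pair[0])
```

Each `(score, config)` pair corresponds to one dot on a score-versus-wallclock plot; a guided method would use earlier results to choose later configurations instead of sampling blindly.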
kache@yacineMTB

AI expert (unemployed)

1d · Views 27.5K · Likes 732 · Bookmarks 26

Trained with PufferLib! One GPU is all you need when training runs 10-20m steps/second

1d · Views 19.1K · Likes 311 · Bookmarks 63
kache@yacineMTB

This took me since last Thursday to solve. I had it solved by this morning. I'm only posting it this late in the evening because i had to learn blender.

http://mechanize.work/apply

Rest of the thread is how I solved it (it was the dumbest way possible)

1d · Views 24.8K · Likes 276 · Bookmarks 41
kache@yacineMTB

So once the model started scoring well enough that it learned the whipping behavior, but struggled to keep it up longer than 10 seconds, I increased and randomized the episode length per episode

Just a dumb trick I found experimentally to make these little rnns behave better

1d · Views 10.7K · Likes 196 · Bookmarks 31
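The episode-length trick described in this post can be sketched roughly as below. This is a guess at the general shape, not the author's code: the wrapper name `RandomHorizon`, the `reset()`/`step()` interface, and the 500 to 2000 step range are all hypothetical. The point is only that each episode gets its own randomly drawn time limit, so episodes stop ending at the same step.

```python
import random


class RandomHorizon:
    """Wraps a hypothetical env whose step(action) returns (obs, reward, done).

    A new time limit is sampled at every reset, so episodes end at varying steps.
    """

    def __init__(self, env, min_steps, max_steps, seed=None):
        self.env = env
        self.min_steps = min_steps
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.horizon = max_steps
        self.t = 0

    def reset(self):
        # Draw this episode's length uniformly from [min_steps, max_steps].
        self.horizon = self.rng.randint(self.min_steps, self.max_steps)
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.t += 1
        # End the episode if the env says so or the sampled horizon is reached.
        return obs, reward, done or self.t >= self.horizon


class CountingEnv:
    """Toy env that never terminates on its own; only used to exercise the wrapper."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, 1.0, False


def episode_lengths(n_episodes, seed=0):
    """Roll out n_episodes and record how many steps each one lasted."""
    env = RandomHorizon(CountingEnv(), min_steps=500, max_steps=2000, seed=seed)
    lengths = []
    for _ in range(n_episodes):
        env.reset()
        steps, done = 0, False
        while not done:
            _, _, done = env.step(0)
            steps += 1
        lengths.append(steps)
    return lengths
```

Calling `episode_lengths(20)` yields a list of differing lengths within the configured range, which is the property the post credits with keeping the small RNN policies from getting "lazy".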
kache@yacineMTB

real machine learning for robotics hasn't been tried. no one has thought carefully about what the simulator does. what the distribution of real life is. where the bottlenecks are and where the shortcuts are

puffer is going to make RL faster and faster. The only limit is the env!

1d · Views 9.6K · Likes 199 · Bookmarks 21
kache@yacineMTB

I own a few GPUs, 4090s. I'm training relatively small models, puffer mingru. The policy I'm showing off is ~1m params. You set up an environment and a reward function. It's a bit of an art; here, you see the top right chart representing the reward. This is the training signal

1d · Views 12.3K · Likes 214 · Bookmarks 20
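The thread does not show the actual reward function, so the following is only an illustrative sketch of what a reward for a multi-link cartpole might look like. The angle convention (0 radians means a link points straight up), the `x_limit` of 2.4, and the 0.9/0.1 weighting are all assumptions chosen for the example.

```python
import math


def upright_reward(angles, cart_x, x_limit=2.4):
    """Hypothetical shaped reward for an N-link pendulum on a cart.

    angles: link angles in radians, measured from vertical-up (0 = upright).
    cart_x: cart position; x_limit is an assumed track half-width.
    """
    # Mean cosine is 1.0 when every link is upright and -1.0 when all hang down.
    upright = sum(math.cos(a) for a in angles) / len(angles)
    # Small bonus for keeping the cart near the center of the track.
    centering = 1.0 - min(abs(cart_x) / x_limit, 1.0)
    return 0.9 * upright + 0.1 * centering
```

With six upright links and a centered cart this returns about 1.0; with all links hanging down it returns about -0.8. Choosing and tuning terms like these is the "art" the post refers to.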
kache@yacineMTB

That kind of gets me to how or why this is possible in the first place. This trains at 18m SPS on some configs with mujoco - I'm using mujoco warp.

I used APIC (API capture) to capture the cudagraph of the task, and make it callable from C. Speed is of utmost importance

1d · Views 10.1K · Likes 140 · Bookmarks 21
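"SPS" here means environment steps per second summed across all parallel environments. A rough way to measure it is sketched below with a toy pure-Python batch step. `step_batch` and `measure_sps` are hypothetical names, and this toy will be orders of magnitude slower than a GPU simulator; it only shows how the number is computed.

```python
import time


def step_batch(states, action):
    """Toy batched 'environment': one trivial update per parallel env."""
    return [s + action for s in states]


def measure_sps(num_envs=1024, num_iters=200):
    """Return (steps per second, total steps) for the toy batched env."""
    states = [0.0] * num_envs
    start = time.perf_counter()
    for _ in range(num_iters):
        states = step_batch(states, 1.0)
    elapsed = time.perf_counter() - start
    # Every iteration advances all parallel envs by one step.
    total_steps = num_envs * num_iters
    return total_steps / elapsed, total_steps
```

The same bookkeeping applies to a real setup: count the steps taken across all environments and divide by wallclock time.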
kache@yacineMTB

The thing that finally made this work was grabbing one of the top scoring hypers on the higher compute runs picked by the GP - and tweaking the task ever so slightly. One thing I've noticed about these models is that if episodes end at the same time, they get.. lazy

1d · Views 11.2K · Likes 180 · Bookmarks 13
kache@yacineMTB

I have some time now (i'm looking for a job, that's actually how I closed the mechanize sponsorship deal 🤪). So I'm going to spend the rest of the week standing up existing robotics simulators w/ fast RL for others

1d · Views 7.9K · Likes 175 · Bookmarks 11
kache@yacineMTB

You learn by experimenting. Shaping reward, helping it along to have the right behaviour, figuring out what it can and can't learn. These models have surprised me, being trained in RL. If you just hold them right.. you can make them do remarkable things

1d · Views 9.7K · Likes 153 · Bookmarks 9
gfodor.id@gfodor

@yacineMTB At this point it’s like taking credit for your kid’s accomplishments. The computer figures the stuff out now

1d · Views 17.6K · Likes 132 · Bookmarks 8
kache@yacineMTB

18m steps per second is ridiculously fast compared to what is in the literature. I saw 90k sps mentioned as fast today. That's so slow...

People are doing VLA shaped dead ends for robotics because they just don't have the software infra for RL

I ran 3.6k experiments for this!

1d · Views 7.3K · Likes 128 · Bookmarks 4
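To put the two throughput figures from this post side by side, here is a small worked calculation. The 18m and 90k steps-per-second numbers come from the post; the one-billion-step training budget is an assumption picked purely for illustration.

```python
FAST_SPS = 18_000_000  # "18m steps per second" from the post
SLOW_SPS = 90_000      # "90k sps" cited in the post as a figure called fast

speedup = FAST_SPS / SLOW_SPS  # 200x


def wallclock_seconds(total_steps, sps):
    """Time needed to simulate total_steps at a given throughput."""
    return total_steps / sps


BUDGET = 1_000_000_000  # assumed budget of one billion steps, for illustration

fast_seconds = wallclock_seconds(BUDGET, FAST_SPS)      # about 56 seconds
slow_hours = wallclock_seconds(BUDGET, SLOW_SPS) / 3600  # about 3.1 hours
```

Under that assumed budget, a run that takes under a minute at the faster rate takes roughly three hours at the slower one, which is why thousands of experiments become practical.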
kache@yacineMTB

greatness cannot be planned

1d · Views 5.7K · Likes 73 · Bookmarks 3
kache@yacineMTB

i mean as far as i can tell i was

1d · Views 8.5K · Likes 83 · Bookmarks 0

@yacineMTB next challenge level balance in 3d. maybe it is nearly as easy. but maybe not.

1d · Views 1.1K · Likes 25 · Bookmarks 5
kache@yacineMTB

@jsuarez more robotics stuff soon

1d · Views 1.1K · Likes 39 · Bookmarks 0

@Laz4rz you don't need to autoresearch the squared env. It trains in about a second with any reasonable hypers. PROTEIN >> llm guessing for hparam sweeps as well

1d · Views 224 · Likes 10 · Bookmarks 3
Lazarz@Laz4rz

@jsuarez also pufferlib btw

1d · Views 490 · Likes 6 · Bookmarks 2