I trained this with PufferLib. PufferLib does absurdly fast RL training. Like, absurdly, absurdly fast. You're often limited only by your env speed.
There's something called MuJoCo Warp. It's a joint project between NVIDIA and Google (judging by the commits coming from both orgs).
I just trained cartpole in MuJoCo at 18 million steps per second. This policy learned in **less than 3 seconds**.
The rollout policy batch size was 8192 agents.
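
For a rough sense of scale, here's a back-of-the-envelope sketch of what those numbers imply. It assumes the 18 million steps per second is the aggregate across all 8192 agents (the post doesn't say either way), and it treats 3 seconds as an upper bound on training time. The function and variable names are made up for illustration.

```python
# Figures quoted in the post.
STEPS_PER_SECOND = 18_000_000   # assumed to be the total across all agents
NUM_AGENTS = 8192               # rollout policy batch size
TRAIN_SECONDS_UPPER = 3         # "less than 3 seconds", so an upper bound


def throughput_breakdown(steps_per_second, num_agents, seconds):
    """Split an aggregate step rate into per-agent and total figures."""
    per_agent_rate = steps_per_second / num_agents   # steps/s for one agent
    total_steps = steps_per_second * seconds         # env steps in the window
    steps_per_agent = total_steps / num_agents       # steps seen by one agent
    return per_agent_rate, total_steps, steps_per_agent


per_agent_rate, total_steps, steps_per_agent = throughput_breakdown(
    STEPS_PER_SECOND, NUM_AGENTS, TRAIN_SECONDS_UPPER
)

print(f"per-agent rate:  {per_agent_rate:,.0f} steps/s")
print(f"total steps:     at most {total_steps:,}")
print(f"steps per agent: at most {steps_per_agent:,.0f}")
```

Under that assumption, each agent steps at roughly 2,200 steps per second, and the whole run used at most 54 million env steps, or about 6,600 steps per agent.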