I solved this by blasting the task with RL. Each dot here is an individual experiment with its own set of hyperparameters, trained in pufferPPO. PufferLib is the fastest RL training loop I've found, measured by wallclock. X axis is wallclock, Y axis is "score".
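For anyone who wants the shape of that plot in code, here is a minimal sketch of a sweep where every run gets its own hyperparameters and produces one (wallclock, score) point. Everything in it is an assumption for illustration: plain random search, the hyperparameter names and ranges, and a toy `train_and_score` stand-in with a made-up objective. It does not use or imitate PufferLib's API, and it is not the actual setup behind the plot.

```python
import random
import time


def sample_hyperparameters(rng):
    """Draw one hypothetical PPO-style config (names and ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -2),  # log-uniform in [1e-5, 1e-2]
        "gamma": rng.uniform(0.9, 0.999),
        "clip_coef": rng.uniform(0.05, 0.3),
    }


def train_and_score(config, rng):
    """Placeholder for a real training run.

    Returns a made-up score from a toy objective so the sketch runs;
    a real version would train a policy and return its evaluation score.
    """
    return -abs(config["learning_rate"] - 3e-4) * 1000 + rng.gauss(0, 0.01)


def run_sweep(num_experiments, seed=0):
    """Run independent experiments and record one point per experiment."""
    rng = random.Random(seed)
    points = []
    for _ in range(num_experiments):
        config = sample_hyperparameters(rng)
        start = time.perf_counter()
        score = train_and_score(config, rng)
        wallclock = time.perf_counter() - start
        # One dot on the plot: x = wallclock, y = score.
        points.append({"config": config, "wallclock": wallclock, "score": score})
    return points


points = run_sweep(20)
best = max(points, key=lambda p: p["score"])
print(best["config"], best["score"])
```

Swapping the placeholder for a real training call is the only part that matters; the loop around it stays this dumb.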
I've been working on this since last Thursday and had it solved by this morning. I'm only posting it this late in the evening because I had to learn Blender.
http://mechanize.work/apply
Rest of the thread is how I solved it (it was the dumbest way possible)





