MIT CSAIL's Alex Zhang open-sources a sandbox-free recursive language model training harness built on prime-rl
The companion 30B model is hosted on Hugging Face.
can’t wait for the releases alex is planning for this summer. in the meantime, he’s open-sourcing some RL code for RLMs and a small recursive MoE model
Introducing a minimal training harness built on prime-rl and verifiers, so you can now train your own RLMs without sandboxes! It's all available in the `training/` folder of the RLM GitHub repo. We train RLM-Qwen3-30B-A3B-v0.1 with RL on a separate split of environments (OOLONG-Spam, BC+ split), which greatly improves performance across the long-context tasks evaluated in the original RLM paper. Training took a day on an 8xA100 node using prime-rl; the code and model are open-source and available on GitHub and Hugging Face.
The training harness trains directly around the inference code used in the RLM repo, so anything trained in it should carry over to, and be usable in, the inference engine.
RLM repo: https://github.com/alexzhang13/rlm
RLM-Qwen3-30B-A3B-v0.1: https://huggingface.co/mit-oasys/rlm-qwen3-30b-a3b-v0.1
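To make the "train around the inference code" point concrete, here is a minimal Python sketch of that design: a single rollout function that both the inference path and the training path call, so whatever the policy learns is exercised through the same scaffold at serving time. Everything in it is an assumption for illustration. `rollout`, `infer`, `collect_episode`, `Trajectory`, and `stub_model` are hypothetical names, the split-in-half recursion is a toy stand-in, and none of it is the RLM repo's, prime-rl's, or verifiers' actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: a "model" is any function from a prompt to a completion.
Model = Callable[[str], str]


@dataclass
class Trajectory:
    """Record of every model call made during one rollout."""
    prompts: List[str] = field(default_factory=list)
    completions: List[str] = field(default_factory=list)


def rollout(model: Model, context: str, query: str,
            chunk_size: int, traj: Trajectory) -> str:
    """Shared scaffold (toy): recurse on halves of the context until a piece
    fits in chunk_size, then ask the model to merge the two sub-answers."""
    if len(context) <= chunk_size:
        prompt = f"{query}\n---\n{context}"
    else:
        mid = len(context) // 2
        left = rollout(model, context[:mid], query, chunk_size, traj)
        right = rollout(model, context[mid:], query, chunk_size, traj)
        prompt = f"{query}\n---\n{left}\n{right}"
    completion = model(prompt)
    traj.prompts.append(prompt)
    traj.completions.append(completion)
    return completion


def infer(model: Model, context: str, query: str, chunk_size: int = 4) -> str:
    """Inference path: run the scaffold and return only the final answer."""
    return rollout(model, context, query, chunk_size, Trajectory())


def collect_episode(model: Model, context: str, query: str,
                    reward_fn: Callable[[str], float],
                    chunk_size: int = 4) -> Tuple[Trajectory, float]:
    """Training path: same scaffold, but keep the trajectory and score it."""
    traj = Trajectory()
    answer = rollout(model, context, query, chunk_size, traj)
    return traj, reward_fn(answer)


def stub_model(prompt: str) -> str:
    """Stand-in policy: sums sub-answers if the body is all integers,
    otherwise counts the letter 'x' in the body."""
    body = prompt.split("\n---\n", 1)[1]
    lines = body.split("\n")
    if all(line.isdigit() for line in lines):
        return str(sum(int(line) for line in lines))
    return str(body.count("x"))


if __name__ == "__main__":
    ctx = "axbxxcdxeexxfgxh"
    print(infer(stub_model, ctx, "count x"))
    traj, reward = collect_episode(
        stub_model, ctx, "count x", lambda a: 1.0 if a == "7" else 0.0
    )
    print(len(traj.completions), reward)
```

In a real setup the stub would be replaced by calls to the policy being trained, and the recorded trajectory plus reward would go to the RL trainer. The only point of the sketch is that `infer` and `collect_episode` go through the same `rollout`.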
Worth shouting out other works that have introduced RLM training harnesses, such as @askalphaxiv's wonderful implementation using @NovaSkyAI's SkyRL library!
Training RLMs will lead to serious gains across nearly all tasks (especially long-horizon), and for smaller OSS models it is now easier than ever to do. Stay tuned for more infra that scales to even larger models :)