MIT CSAIL's Alex Zhang open-sources a sandbox-free recursive language model training harness built on prime-rl

The companion 30B model is hosted on Hugging Face.

Original post

Introducing a minimal training harness built on prime-rl and verifiers, so you can now train your own RLMs without sandboxes! All available in the `training/` folder in the RLM GitHub repo! We train RLM-Qwen3-30B-A3B-v0.1, using RL on a separate split of environments (OOLONG-Spam, BC+ split) to greatly improve performance across the board on long-context tasks evaluated in the original RLM paper. We trained for a day on an 8xA100 using prime-rl; code and model are open-source and available on GitHub / Huggingface.

6:52 AM · May 27, 2026

can’t wait for the releases alex is planning for this summer. in the meantime, he’s open-sourcing some RL code for RLMs and a small recursive MoE model

Quoted post: alex zhang (@a1zhang) · 1:52 PM · May 27, 2026 · 15.1K Views
2:35 PM · May 27, 2026 · 5.1K Views

The training harness directly trains around the inference code used in the RLM repo. So anything trained in it should directly translate to and be usable in the inference engine.

RLM repo: https://github.com/alexzhang13/rlm
RLM-Qwen3-30B-A3B-v0.1: https://huggingface.co/mit-oasys/rlm-qwen3-30b-a3b-v0.1

1:52 PM · May 27, 2026 · 1.8K Views
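
One way to read the point about the harness training around the inference code: if training rollouts and served requests both go through the same function, a policy trained against it sees the same prompts and control flow at inference time. The sketch below is purely illustrative. `rlm_step`, `collect_rollout`, `serve`, and `stub_model` are hypothetical names, not the API of the RLM repo, prime-rl, or verifiers, and the chunk-and-combine loop is a stand-in rather than the actual RLM algorithm.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a "model" is any callable from prompt text to completion text.
Model = Callable[[str], str]


def rlm_step(model: Model, query: str, context: str,
             chunk_size: int = 20) -> Tuple[str, List[str]]:
    """Shared inference path (illustrative stand-in, not the real RLM logic).

    Calls the model once per context chunk, then once more to combine the
    partial answers. Returns the final answer and every prompt issued.
    """
    prompts: List[str] = []
    partials: List[str] = []
    for start in range(0, len(context), chunk_size):
        prompt = f"Q: {query}\nContext: {context[start:start + chunk_size]}"
        prompts.append(prompt)
        partials.append(model(prompt))
    final_prompt = f"Q: {query}\nPartial answers: {' | '.join(partials)}"
    prompts.append(final_prompt)
    return model(final_prompt), prompts


def collect_rollout(model: Model, query: str, context: str,
                    expected: str) -> Dict[str, object]:
    """Training side: run the shared path and attach a scalar reward."""
    answer, prompts = rlm_step(model, query, context)
    reward = 1.0 if answer == expected else 0.0
    return {"prompts": prompts, "answer": answer, "reward": reward}


def serve(model: Model, query: str, context: str) -> str:
    """Inference side: run the same shared path and keep only the answer."""
    answer, _ = rlm_step(model, query, context)
    return answer


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call: deterministically reports the prompt length.
    return str(len(prompt))


if __name__ == "__main__":
    long_context = "x" * 50
    rollout = collect_rollout(stub_model, "q", long_context, expected="unknown")
    # Both sides produce the same answer because they share rlm_step.
    print(rollout["answer"] == serve(stub_model, "q", long_context))
```

Because `collect_rollout` and `serve` differ only in what they do with the result, any change to prompting or control flow in the shared function reaches training and inference together. That is the property the post describes, under the assumptions stated above.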

Worth shouting out other works that have introduced RLM training harnesses, such as @askalphaxiv's wonderful implementation using @NovaSkyAI's SkyRL library!

Training RLMs will lead to serious gains across nearly all tasks (especially long-horizon), and for smaller OSS models it is now easier than ever to do. Stay tuned for more infra that scales to even larger models :)

1:52 PM · May 27, 2026 · 1.3K Views