MIT CSAIL's Alex Zhang open-sources a sandbox-free recursive language model training harness built on prime-rl

can’t wait for the releases alex is planning for this summer. in the meantime, he’s open-sourcing some RL code for RLMs and a small recursive MoE model

alex zhang@a1zhang

Introducing a minimal training harness built on prime-rl and verifiers, so you can now train your own RLMs without sandboxes! All available in the `training/` folder in the RLM GitHub repo!

We train RLM-Qwen3-30B-A3B-v0.1, using RL on a separate split of environments (OOLONG-Spam, BC+ split) to greatly improve performance across the board on long-context tasks evaluated in the original RLM paper.

We trained for a day on an 8xA100 using prime-rl; code and model are open-source and available on GitHub / Huggingface.

alex zhang@a1zhang

The training harness directly trains around the inference code used in the RLM repo. So anything trained in it should directly translate to and be usable in the inference engine.

RLM repo: https://github.com/alexzhang13/rlm
RLM-Qwen3-30B-A3B-v0.1: https://huggingface.co/mit-oasys/rlm-qwen3-30b-a3b-v0.1
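
To make the "trains around the inference code" point concrete, here is a toy sketch, not the repo's actual code: a single recursive rollout function is called both at inference time and inside the reward used for training, so whatever the policy learns is exercised on the same code path it will later run on. Every name here (`rlm_rollout`, `stub_lm`, `reward`) is hypothetical and is not the API of the RLM repo, prime-rl, or verifiers; the "model" is a deterministic stub that counts lines containing "spam".

```python
from typing import Callable, List

# Hypothetical stand-in for a model call: (instruction, items) -> text answer.
LM = Callable[[str, List[str]], str]


def rlm_rollout(lm: LM, query: str, lines: List[str], max_lines: int = 4) -> str:
    """Toy recursive rollout (an assumption, not the RLM repo's algorithm).

    Small contexts are answered directly; larger ones are split in half,
    each half is answered by a recursive call, and the partial answers
    are merged by one more model call.
    """
    if len(lines) <= max_lines:
        return lm(query, lines)
    mid = len(lines) // 2
    partials = [
        rlm_rollout(lm, query, lines[:mid], max_lines),
        rlm_rollout(lm, query, lines[mid:], max_lines),
    ]
    return lm("combine", partials)


def stub_lm(instruction: str, items: List[str]) -> str:
    """Deterministic fake model so the sketch runs without any weights."""
    if instruction == "combine":
        return str(sum(int(x) for x in items))
    return str(sum("spam" in line for line in items))


def reward(lm: LM, query: str, lines: List[str], answer: str) -> float:
    """Training-side scoring that reuses the exact inference rollout."""
    return 1.0 if rlm_rollout(lm, query, lines) == answer else 0.0


if __name__ == "__main__":
    context = ["spam offer"] * 3 + ["normal mail"] * 7
    print(rlm_rollout(stub_lm, "count spam lines", context))
    print(reward(stub_lm, "count spam lines", context, "3"))
```

In a real setup the stub would be replaced by the policy being trained, and the reward would come from the environment's own scoring rather than an exact string match.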

alex zhang@a1zhang

Worth shouting out other works that have introduced RLM training harnesses, such as @askalphaxiv's wonderful implementation using @NovaSkyAI's SkyRL library!

Training RLMs will lead to serious gains across nearly all tasks (especially long-horizon), and for smaller OSS models it is now easier than ever to do. Stay tuned for more infra that scales to even larger models :)

Strata@ChainZenit

@lateinteraction rl code for RLMs sounds niche af

Alex UGift@Radipdegen

@lateinteraction open sourcing the helper libs is the right play imo. makes the project real instead of just a blog post.

Steven Collard@stalmico

@lateinteraction recursive moe sounds interesting for memory efficiency
