Introducing a minimal training harness built on prime-rl and verifiers, so you can now train your own RLMs without sandboxes! All available in the `training/` folder in the RLM GitHub repo!
We train RLM-Qwen3-30B-A3B-v0.1, using RL on a separate split of environments (OOLONG-Spam, BC+ split) to greatly improve performance across the board on long-context tasks evaluated in the original RLM paper.
We trained for a day on an 8xA100 using prime-rl; code and model are open-source and available on GitHub / Huggingface.


