TMax: An open RL recipe for terminal agents
I’m very excited to get to share a new RL paper today that I got to have a small part in – a type of paper I suspect we’ll see much more of in the future. The key is that RL research is very different today, in mid-2026, than what most observers have in their context. The average conception of an RL paper is grounded in the RLVR revolution of early 2025, where many people could use vanilla RLVR libraries to hillclimb on math benchmarks. Crucially, this style of math work could be done on base models or fairly stably on already trained models. With agents, the tasks of focus are very hard, requiring complex tool-use, harnesses where the model automatically manages its history, and much more training to make smaller eval improvements. We’re shifting from a renaissance of RL study to rapidly needing to improve its empirical rigor and common community engagements.
TMax is the best open data for hillclimbing on frontier terminal tasks. It’s been validated with rigorous experiments, and if the authors wanted to just form a “RL environments startup” they could probably sell it for millions of dollars. This data work is some of my favorite stuff to be around in my 2.5+ years at Ai2.
As a general summary, the recipe is open data and recipe lessons from hillclimbing the Qwen 3.5 smaller, dense models on terminal tasks. These models are super hard to hillclimb in this area, as they’re already trained heavily on the task. The training is very infrastructure-dependent, and most of the RL innovations are more designed to make training stable than to improve the rate of learning.
I strongly recommend this paper. I joke around that I was happy to be an author just so I had to read it twice! You can find Hamish’s thread sharing more here or read the paper here. You can click through to find the model weights, the data, and even some fun further artifacts to study like all the RL rollouts from a training run – where the model sometimes became aware that it was being tested.
The biggest takeaway I have from following this work, and more of the work in the community, is how important recipe work is. Let me define “recipe work.” It is a style of paper that explains all the steps you need to make crucial model improvements – data, algorithm, codebase, pitfalls, etc.
Getting started in meaningful RL experiments today is a substantial expense. There are a ton of companies, an entire industry emerging really, around the idea of taking open-weight language models and finetuning them with RL on your domain-specific tasks. What I see in many projects is that getting an initial baseline is very hard. This phase, which can cost weeks and anywhere from $10K to $1M+, feels like spinning your wheels (A fun fact is that an RL step on a model like Nvidia Nemotron 3 Ultra on Tinker costs $1K and a meaningful RL run would be hundreds of steps – credit Edward Hu). It takes a lot of time to get traction in learning signal on meaningful, hard RL tasks.
What we need as a community is a way for people to study small ablations to established RL recipes, as most labs won’t have the resources to do it from scratch in a meaningful way. This is what I hope TMAX can be for terminal agents, or the start of. Yes the training jobs are expensive, as the paper documents a standard training job being 8 nodes of H100s (2 train 6 inference) for 2-3 days, but that is approaching something academics can study. The establishment of this recipe took O(100) of these training jobs to get right.
This isn’t my first time trying to establish this direction. When we launched Olmo 3 we had the “RL Zero“ model families, which are clean RL runs from a base model on a certain domain. This type of recipe-dependent work is a clear indicator that meaningful post-training work today looks much more like pretraining work of years past. We need decision-making ladders, clear ways of seeing small improvements in the models, stability, and so on.
Part of this is down to academic gatekeepers, who won’t reward a paper doing very clean empirical work to push a recipe 1-2% up. They’ll favor a “new algorithm” that matches results, or something sort of bogus. My hope is that we can have multiple, stable, clear recipes across agent types, so innovations can be tested more clearly in multiple domains. (If you’re working on this, please reach out – I’m happy to support if I can, but I likely can’t reply to every email).
As a quick aside, the RL frameworks in vogue today seem to be SLIME and SkyRL. The libraries of choice have shifted throughout these seasons in RL, which further contributes to a form of fragility in the literature. A bit of continuity will go a long way.
So, go read this paper. It’s a really great example of how seemingly simple data and infrastructure work can be very hard and impactful. It’s also got me looking for more applications of Divergence Proximal Policy Optimization (DPPO) as another small evolution to the best RL algorithms of the day, by virtue of being a bit more stable by improving token-level clipping.