a cute diagram on the small but important differences that I think is easy to understand :) it also brings back the debate of CodeAct vs. ReAct, which I'll expand on below:
btw, for those who say RLMs were a "common engineering trick" in industry pre-October 2025: I want to be clear that doing RLM-style things at the time (and even now, in May 2026) was a pretty poor design decision for a production agent stack, and was likely just worse than doing standard tool calls (although the more probable answer is that they were implementing something different anyways). it's kind of a moot point that I haven't really engaged with, because "I can't show you, but it existed, trust me bro" is a bit of a cop-out, but I do think it's worth talking about a bit
RLMs, like most harness designs we've been interested in since 2022, are super simple. but like the diagram points out, they change quite a bit about what we want the language models themselves to be doing (both what they condition on and what they output) to get them to work well. as we've seen with the latest products in AI, model-harness codesign is extremely important
it's why the community accepting an idea like ReAct is a big deal: while the idea is simple, whether we choose to accept ReAct as the standard has ripple effects on the models and infra we design in the future. it's kind of like the whole MITO vs. TITO debate going on now; they're both "easy to implement", but what we as a community accept has long-term effects on, and potentially creates technical debt in, the infra we build in the future
one of the goals of this entire "programmatic sub-agent and tool-calling" discourse that RLMs push for dates back to the ReAct vs. CodeAct debate, and to whether tool calls should be JSON tool calls or functions in code. CodeAct kind of lost out to ReAct in 2024-2026, and for the past few years a lot of the harnesses we've built out + the models underpinning them have assumed the JSON-like tool calling structure. if we as a community were to continue along this route in the limit, RLMs would basically never really work, despite the idea being so simple
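to make the JSON vs. code distinction concrete, here's a minimal sketch. it is not how any particular harness is implemented: `search`, `TOOLS`, `run_json_tool_call`, and `run_code_action` are all hypothetical names made up for illustration, the "model outputs" are hand-written strings, and the `exec` call is unsandboxed, which a real harness would never do.

```python
import json

# hypothetical tool over a toy corpus, purely for illustration
def search(query):
    corpus = [
        "RLMs treat context as an object",
        "ReAct interleaves reasoning and acting",
        "CodeAct emits code as actions",
    ]
    return [doc for doc in corpus if query.lower() in doc.lower()]

TOOLS = {"search": search}

def run_json_tool_call(model_output):
    """JSON style: the model emits one structured call, the harness dispatches it."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

def run_code_action(model_output):
    """Code style: the model emits code, the harness runs it with the tools in scope."""
    namespace = dict(TOOLS)
    exec(model_output, namespace)  # unsandboxed, illustration only
    return namespace.get("result")

# hand-written stand-ins for what a model might output in each style
json_action = '{"name": "search", "arguments": {"query": "react"}}'
code_action = """
hits = []
for q in ["react", "codeact"]:
    hits.extend(search(q))
result = sorted(set(hits))
"""

json_result = run_json_tool_call(json_action)
code_result = run_code_action(code_action)
print(json_result)
print(code_result)
```

the point of the sketch: in the JSON style every call is a separate round trip through the harness, while in the code style the model composes loops and multiple calls inside a single action. that difference is exactly what changes what the model has to be good at outputting.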
in this sense, RLMs are pushing for the CodeAct-style "programmatic tool calling" (PTC), but with an emphasis on context as an object and sub-calls as a function that always exists. the paper itself shows that this style of workflow has sparks of potential on frontier models doing long-context tasks, and that's why we've been so interested in pushing for this style of harness
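a rough sketch of what "context as an object, sub-calls as a function that always exists" could look like. this is an assumption-heavy toy, not the paper's implementation: `llm_query`, `run_rlm_step`, and `fake_sub_lm` are hypothetical names, the sub-LM is a stub that just checks for a keyword, the "model-written" code is hand-written, and again nothing here is sandboxed.

```python
def fake_sub_lm(prompt):
    """Stand-in for a real sub-LM call (assumption): a keyword check, not a model."""
    return "found" if "needle" in prompt else "nothing"

def run_rlm_step(model_code, context, sub_call=fake_sub_lm):
    """Run one model-written code step with `context` and `llm_query` always in scope."""
    namespace = {"context": context, "llm_query": sub_call}
    exec(model_code, namespace)  # unsandboxed, illustration only
    return namespace.get("answer")

# the long context lives in a variable; it is never pasted into the root prompt
long_context = "\n".join(
    ["filler line"] * 500 + ["the needle is here"] + ["filler line"] * 500
)

# hand-written stand-in for code a root model might emit:
# chunk the context and delegate each chunk to a sub-call
model_code = r"""
lines = context.split("\n")
chunk_size = 100
answer = None
for start in range(0, len(lines), chunk_size):
    chunk = "\n".join(lines[start:start + chunk_size])
    if llm_query("Is the target in this chunk?\n" + chunk) == "found":
        answer = start  # index of the first line of the matching chunk
        break
"""

answer = run_rlm_step(model_code, long_context)
print(answer)
```

the root model only ever sees the code it wrote and whatever it chooses to read back; the context stays offloaded in the variable and each sub-call gets a small slice. the harness part is a few lines. getting a model to reliably write good code like `model_code` is the hard part.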
so yeah, deferring to your LM to make decisions in code with sub-calls and context offloading is a simple idea. if you implemented this in your production agents, I'm sorry, but it probably wasn't and still isn't a good idea just yet. it takes time, and formalizing the idea is important for justifying why we should push further in this direction. there's a lot of work to be done, both on the infra side (guardrails, sandboxing, training, etc.) and the model training side (getting a model that actually is good at this style of thinking!!!), all of which is completely non-trivial and will take a lot of time to get right.