
Researcher Pushes Model-Harness Co-Design for LLM Agents

alex zhang (@a1zhang)

a cute diagram on the small but important differences that I think is easy to understand :) it also brings back the debate of CodeAct vs. ReAct, which I'll expand on below:

btw, for those who say RLMs were a "common engineering trick" in industry pre-October 2025: I want to be clear that doing RLM-style things at the time (and even now, in May 2026) was a pretty poor design decision for a production agent stack and was likely just worse than doing standard tool calls (although the more probable answer is that they were implementing something different anyway). it's kind of a moot point that I haven't really engaged with much, because it's a bit of a cop-out ("I can't show you, but it existed, trust me bro"), but I do think it's worth talking about a bit

RLMs, like most harness designs we've been interested in since 2022, are super simple. but as the diagram below points out, the design changes quite a bit about what we want the language models themselves to be doing (both what they condition on and what they output) to get them to work well. as we've seen with the latest products in AI, model-harness co-design is extremely important

it's why having the community accept an idea like ReAct is a big deal: while the idea is simple, whether we choose to accept ReAct as the standard has ripple effects on the models and infra we design in the future. it's kind of like the whole MITO vs. TITO debate going on now; they're both "easy to implement", but what we as a community accept has long-term effects and potentially creates technical debt in the infra we build in the future

this entire "programmatic sub-agent and tool-calling" discourse that RLMs push for dates back to the ReAct vs. CodeAct debate over whether tool calls should be JSON tool calls or functions in code. CodeAct kind of lost out to ReAct in 2024-2026, and for the past few years a lot of the harnesses we've built out + the models underpinning them assume the JSON-like tool calling structure. if we as a community were to continue along this route in the limit, RLMs would basically never really work despite the idea being so simple
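to make the JSON-vs-code distinction concrete, here is a minimal Python sketch. everything in it is an assumption for illustration: the `search` tool, the `TOOLS` registry, and both runner functions are hypothetical and don't come from ReAct, CodeAct, or any real harness. the point is only that a JSON action names one tool call per turn, while a code action is a small program that can loop over and compose calls.

```python
import json

# Hypothetical tool shared by both styles (illustration only).
def search(query: str) -> list[str]:
    corpus = ["ReAct paper", "CodeAct paper", "RLM blog post"]
    return [doc for doc in corpus if query.lower() in doc.lower()]

TOOLS = {"search": search}

def run_json_tool_call(model_output: str):
    """JSON style: the model emits one structured call, the harness dispatches it."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

def run_code_action(model_output: str):
    """Code style: the model emits a program; tools are plain functions in scope."""
    namespace = dict(TOOLS)
    # No sandboxing here. A real harness would need isolation before running this.
    exec(model_output, namespace)
    return namespace.get("result")

# What a model might emit in each style (hand-written here, not real model output).
json_action = '{"name": "search", "arguments": {"query": "act"}}'
code_action = """
hits = []
for q in ["react", "codeact"]:
    hits.extend(search(q))
result = sorted(set(hits))
"""

json_result = run_json_tool_call(json_action)
code_result = run_code_action(code_action)
```

in the JSON case, looping over several queries costs one model turn per call; in the code case the loop and the aggregation live inside a single action, which is the property the programmatic style is after.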

in this sense, RLMs are pushing for CodeAct-style "programmatic tool calling" (PTC), but with an emphasis on context as an object and sub-calls as a function that always exists. the paper itself shows that this style of workflow has sparks of potential on frontier models doing long-context tasks, and that's why we've been so interested in pushing for this style of harness
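a rough sketch of "context as an object and sub-calls as a function that always exists", under loud assumptions: `run_rlm_step`, `sub_call`, and `stub_lm` are hypothetical names, the stub stands in for a real model call, and this is not the paper's implementation. the model-written program never receives the context in its prompt here; it only gets a `context` variable plus a `sub_call` function it can invoke on slices of it.

```python
from typing import Callable

def stub_lm(prompt: str) -> str:
    """Stand-in for a real sub-model call (assumption): counts lines containing 'ERROR'."""
    return str(sum(1 for line in prompt.splitlines() if "ERROR" in line))

def run_rlm_step(program: str, context: str, sub_call: Callable[[str], str]):
    """Run model-written code with `context` and `sub_call` always in scope."""
    env = {"context": context, "sub_call": sub_call}
    # Illustration only: a real harness would sandbox this execution.
    exec(program, env)
    return env.get("answer")

# A program a root model might write (hand-written here): split the context
# into chunks, delegate each chunk to a sub-call, and aggregate in code.
program = r"""
lines = context.splitlines()
chunks = ["\n".join(lines[i:i + 2]) for i in range(0, len(lines), 2)]
answer = sum(int(sub_call(chunk)) for chunk in chunks)
"""

context = "ok\nERROR a\nERROR b\nok\nERROR c"
answer = run_rlm_step(program, context, stub_lm)
```

the design choice worth noticing is that chunking and aggregation are decided by the model in code rather than hard-wired into the harness, which is also why the sandboxing and training concerns raised in the post matter.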

so yeah, deferring to your LM to make decisions in code with sub-calls and context offloading is a simple idea. if you implemented this in your production agents, I'm sorry, but it probably wasn't and still isn't a good idea just yet. it takes time, and formalizing the idea is important for justifying why we should push further in this direction. there's a lot of work to be done, both on the infra side (guardrails, sandboxing, training, etc.) and on the model training side (getting a model that is actually good at this style of thinking!!!), all of which is completely non-trivial and will take a lot of time to get right.

8:56 PM · May 30, 2026 · 20.9K Views
Posts from X

Great read. @OpenHandsDev had CodeAct, added tool calling to it at the end of 2024, and gave up the CodeAct side in 2025.

Usability was a concern, and also performance, because LLMs post-trained with tool calling got way too good comparatively


3h · Views 2.4K · Likes 12 · Bookmarks 6