New Blog Explains Token-In-Token-Out For On-Policy Agentic RL

Original post

Banghua Zhu#1680

LMSYS Org (@lmsysorg)

📝 New blog: No Token Left Behind: Demystifying Token-In-Token-Out in Miles

In agentic RL, a rollout is a chain of model calls, tool outputs & resumed turns. Token-In-Token-Out (TITO) ensures the trainer evaluates the exact tokens the inference engine produced — break it, and training silently drifts off-policy.

Why it matters:
📦 One sample per task, not per turn: ~10× less compute on 30–50 turn trajectories
🎯 Keeps every token on-policy

How Miles enforces it:
1️⃣ Inference session server: one append-only token buffer per trajectory
2️⃣ Append-only at 3 levels: messages, template rendering, tokens
3️⃣ Pluggable TITO tokenizer: incremental tokenize + per-model splice patches
4️⃣ TokenSeqComparator: verifies every rollout stays bit-perfect

Supports Qwen3, GLM, Kimi-K2, Nemotron, Minimax & DeepSeek families.

9:03 AM · Jun 9, 2026 · 6.5K Views
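The append-only buffer and the comparison step described in the post can be illustrated with a small sketch. This is not Miles code: `TokenBuffer`, `first_mismatch`, and the token IDs below are hypothetical names and values chosen for illustration. The only idea taken from the post is that each trajectory accumulates tokens in one append-only sequence and that the trainer's view is checked against it token by token.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TokenBuffer:
    """Hypothetical append-only token buffer for a single trajectory."""

    token_ids: List[int] = field(default_factory=list)
    # 1 marks model-generated tokens, 0 marks prompt or tool-output tokens.
    generated_mask: List[int] = field(default_factory=list)

    def append(self, ids: List[int], generated: bool) -> None:
        # Tokens are only ever added at the end; earlier tokens are never rewritten.
        self.token_ids.extend(ids)
        self.generated_mask.extend([1 if generated else 0] * len(ids))


def first_mismatch(expected: List[int], actual: List[int]) -> Optional[int]:
    """Return the index of the first differing position, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# A toy three-call trajectory with made-up token IDs.
buf = TokenBuffer()
buf.append([1, 2, 3], generated=False)   # prompt
buf.append([10, 11], generated=True)     # first model call
buf.append([4, 5], generated=False)      # tool output
buf.append([12], generated=True)         # resumed turn

trainer_view = list(buf.token_ids)               # trainer sees the same tokens
drifted_view = [1, 2, 3, 10, 99, 4, 5, 12]       # one token changed upstream

same = first_mismatch(buf.token_ids, trainer_view)
drift = first_mismatch(buf.token_ids, drifted_view)
```

In this sketch `same` is `None` and `drift` points at the position where the trainer's sequence diverges from what the engine produced, which is the kind of silent drift the post warns about.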

Posts from X

Ying Sheng (@ying11231)

I remember when Jiajun told me he wanted to push for TITO because he thought it was important, though he did not understand why people were not doing it. It’s great they were able to insist on their own judgement! A lot of work went in before this blog.

Banghua Zhu (@BanghuaZ)

Getting the chat template consistent across multiple turns for agentic training can be much trickier than people think. There have been headaches like reasoning trajectories pruned by chat templates, detokenize-retokenize mismatches, etc.

Token-In-Token-Out (TITO) ensures that the token prefix stays consistent across turns, removing the silent off-policyness introduced by multi-turn agentic training. Miles now fully supports TITO for popular open-source models. Check the blog here!
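The detokenize-retokenize mismatch mentioned above can be reproduced with a toy tokenizer. The vocabulary and the greedy longest-match `encode` below are assumptions made for illustration, not any real model's tokenizer. The point is only that two different token sequences can decode to the same text, so turning sampled tokens into text and back may not return the tokens the model actually produced.

```python
# Toy vocabulary (hypothetical): "ab" exists as a single token and as "a" + "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}
MAX_PIECE_LEN = max(len(text) for text in VOCAB)


def decode(ids):
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    ids, pos = [], 0
    while pos < len(text):
        for length in range(MAX_PIECE_LEN, 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


sampled = [0, 1]                       # the model emitted "a" then "b"
round_trip = encode(decode(sampled))   # text "ab" re-tokenizes as one token

print(sampled, round_trip)
```

Here the round trip yields a different sequence than the one sampled, even though both decode to the same string. A trainer that re-tokenizes text would score tokens the policy never emitted, which is why keeping the original token IDs end to end avoids the problem.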

LMSYS Org (@lmsysorg)

Read full blog: https://www.lmsys.org/blog/2026-05-13-no-token-left-behind/
