
New Blog Explains Token-In-Token-Out For On-Policy Agentic RL

Original post · Banghua Zhu #1147
LMSYS Org @lmsysorg

📝 New blog: No Token Left Behind: Demystifying Token-In-Token-Out in Miles

In agentic RL, a rollout is a chain of model calls, tool outputs & resumed turns. Token-In-Token-Out (TITO) ensures the trainer evaluates the exact tokens the inference engine produced — break it, and training silently drifts off-policy.

Why it matters:
📦 One sample per task, not per turn: ~10× less compute on 30–50 turn trajectories
🎯 Keeps every token on-policy

How Miles enforces it:
1️⃣ Inference session server: one append-only token buffer per trajectory
2️⃣ Append-only at 3 levels: messages, template rendering, tokens
3️⃣ Pluggable TITO tokenizer: incremental tokenize + per-model splice patches
4️⃣ TokenSeqComparator: verifies every rollout stays bit-perfect

Supports Qwen3, GLM, Kimi-K2, Nemotron, Minimax & DeepSeek families.

9:03 AM · Jun 9, 2026 · 8.2K Views
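
To make the "one append-only token buffer per trajectory" idea concrete, here is a minimal sketch. It is not taken from Miles: the class name `TrajectoryBuffer`, its methods, and the loss-mask convention are all hypothetical and chosen only for illustration. The assumption is that generated tokens and tool or user tokens are appended as raw token IDs and never re-rendered, so the whole trajectory becomes one training sample.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryBuffer:
    """Hypothetical append-only token buffer for a single trajectory."""

    token_ids: List[int] = field(default_factory=list)
    # 1 marks tokens sampled by the model, 0 marks prompt or tool tokens.
    loss_mask: List[int] = field(default_factory=list)

    def append_generated(self, ids: List[int]) -> None:
        # Store the exact IDs the inference engine sampled.
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))

    def append_observed(self, ids: List[int]) -> None:
        # Prompt, tool output, or user tokens: part of the context, not trained on.
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def prompt_for_next_turn(self) -> List[int]:
        # The next model call resumes from the stored IDs, not from re-rendered text.
        return list(self.token_ids)


buf = TrajectoryBuffer()
buf.append_observed([1, 2, 3])     # initial prompt tokens
buf.append_generated([10, 11])     # model output, turn 1
buf.append_observed([20, 21, 22])  # tool output tokens
buf.append_generated([12])         # model output, turn 2

# One sample for the whole trajectory instead of one sample per turn.
training_sample = {"input_ids": buf.token_ids, "loss_mask": buf.loss_mask}
```

Because earlier entries are never rewritten, the prefix sent to the engine on each turn is by construction the prefix the trainer later sees.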
Posts from X
Ying Sheng @ying11231

I remember when Jiajun told me he wanted to push for TITO because he thought it was important, though he did not understand why people were not doing it. It’s great they were able to stick to their own judgment! A lot of work went into this before the blog.

4h · Views 3.1K · Likes 26 · Bookmarks 15
Banghua Zhu @BanghuaZ

Getting the chat template consistent across multiple turns for agentic training can be much trickier than people think. There have been headaches such as reasoning trajectories being pruned by chat templates, detokenize-retokenize mismatches, etc.

Token-In-Token-Out (TITO) ensures that the token prefix stays consistent across turns, removing the silent off-policyness introduced by multi-turn agentic training. Miles now fully supports TITO for popular open-source models. Check the blog here!

4h · Views 2.9K · Likes 26 · Bookmarks 17
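
The detokenize-retokenize mismatch mentioned in the post above can be reproduced with a toy example. The sketch below assumes a made-up three-entry vocabulary and a greedy longest-match tokenizer; neither corresponds to any real model, and `first_divergence` is a hypothetical helper, not the `TokenSeqComparator` from Miles. It only illustrates why decoding the engine's tokens to text and encoding them again can yield different IDs.

```python
from typing import Dict, List, Optional

# Toy vocabulary (an assumption for illustration, not a real tokenizer).
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT: Dict[int, str] = {i: s for s, i in VOCAB.items()}


def decode(ids: List[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def first_divergence(expected: List[int], actual: List[int]) -> Optional[int]:
    """Return the first index where the sequences differ, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# The engine happened to sample "a" and "b" as two separate tokens.
engine_ids = [0, 1]

# Round-tripping through text merges them into the single token "ab".
retokenized = encode(decode(engine_ids))

mismatch_at = first_divergence(engine_ids, retokenized)
```

Both sequences decode to the same string, yet the trainer would score a token the engine never sampled. Keeping the engine's IDs end to end avoids the round trip entirely.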
LMSYS Org @lmsysorg

Read full blog: https://www.lmsys.org/blog/2026-05-13-no-token-left-behind/

5h · Views 284 · Likes 2 · Bookmarks 1