PrimeIntellect releases prime-rl v0.6.0, enabling reinforcement learning for trillion-parameter MoE models with step times under five minutes

The update integrates FSDP2 with deep-ep expert parallelism.

Original post

Prime Intellect@PrimeIntellect

Today we're releasing prime-rl v0.6.0 — enabling RL at trillion-parameter MoE scale on agentic workloads at the highest efficiency.

We've relentlessly optimized our RL infra.

The result: GLM-5 on agentic SWE tasks at 131k context and sub-5-minute step time.

7:15 PM · Jun 22, 2026 · 236.1K Views

Sentiment

Commenters praise prime-rl v0.6.0 for enabling efficient RL training of trillion-parameter MoE models, citing sub-5-minute step times and a roughly 10x drop in trainer-inference KL mismatch.

Positive: 59.3%. Negative: 40.7%. Based on 24 comments with sentiment.

Via PRIMEINTELLECT.AI

Posts from X

Prime Intellect@PrimeIntellect

prime-rl is fully open source, and we're hiring systems engineers to take it further.

Read the full prime-rl performance deep dive:

https://www.primeintellect.ai/blog/rl-at-1t-scale/

elvis@omarsar0

Highly-recommended read.

It's exciting to see large-scale agentic RL becoming more accessible. Cool to see the infra layer for this is being built and I think this plays an important role in self-improving agents arc and "owning your AI."

elie@eliebakouch

every infra piece you need to know to do RL on GLM-5

https://www.primeintellect.ai/blog/rl-at-1t-scale

samsja@samsja19

prime-rl can now train 1T parameters MoE blazingly fast, under 5 minutes per step, or 1k steps in ~3 days

To achieve this we shipped in our latest prime-rl 0.6.0:

* inference: wide-ep, fp8 inference, llm-d router, mooncake, kv cache cpu offloading

* training: fsdp2, deep-ep expert parallelism, dsa cp, fp8 training, router replay

* agentic rollout: we rewrote the core of our rollout orchestrator for better scalability

support for glm5, kimi, nemotron, ...,

prime-rl is open source but also end to end optimized to run on our dedicated RL infra and compute layer
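The "1k steps in ~3 days" figure in the post above follows from the step time by simple arithmetic. A quick check, assuming steps run back to back with no downtime between them (an assumption for illustration; the post does not say this), with hypothetical helper names:

```python
MINUTES_PER_DAY = 24 * 60


def days_for_steps(num_steps: int, minutes_per_step: float) -> float:
    """Wall-clock days for num_steps, assuming steps run back to back."""
    return num_steps * minutes_per_step / MINUTES_PER_DAY


def minutes_per_step_for(num_steps: int, days: float) -> float:
    """Average step time needed to finish num_steps in the given days."""
    return days * MINUTES_PER_DAY / num_steps


# 5 minutes is the stated ceiling on step time, so this is an upper bound.
upper_bound_days = days_for_steps(1000, 5.0)
# Step time implied by finishing 1k steps in exactly 3 days.
implied_minutes = minutes_per_step_for(1000, 3.0)

print(f"{upper_bound_days:.2f} days at exactly 5 min/step")         # 3.47
print(f"{implied_minutes:.2f} min/step for 1k steps in 3 days")     # 4.32
```

So "~3 days" is consistent with an average step time somewhat under the 5-minute ceiling.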

Lucas Beyer (bl16)@giffmana

> 3D-parallel (FSDP2 + CP + EP)

This is, at least conceptually, my favorite sharding. I wonder if at some point during development you also tried and timed others (like PP or TP with EP) and how they compared, even just informally, or if you went straight for this one and only this one and just optimized it a lot?

Prime Intellect@PrimeIntellect

We disaggregate prefill and decode onto separate workers.

A long prefill used to stall decode for everyone. Now it doesn't.
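A toy queueing model can illustrate the stall described above. This is only a sketch under stated assumptions: one FIFO worker per pool, arbitrary time units, and a hypothetical `decode_waits` helper. It does not reflect how prime-rl or its inference engine actually schedule work.

```python
def decode_waits(jobs, disaggregated):
    """Return how long each decode job waits before it starts.

    jobs: list of (arrival_time, kind, duration); kind is "prefill" or "decode".
    Colocated: a single FIFO worker serves both kinds of job.
    Disaggregated: prefill and decode each get their own FIFO worker.
    """
    free_at = {"prefill": 0.0, "decode": 0.0}
    waits = []
    for arrival, kind, duration in sorted(jobs):
        # In the colocated case every job shares the one "decode" worker.
        pool = kind if disaggregated else "decode"
        start = max(arrival, free_at[pool])
        free_at[pool] = start + duration
        if kind == "decode":
            waits.append(start - arrival)
    return waits


# One long prefill arrives just before two short decode steps.
jobs = [(0.0, "prefill", 50.0), (1.0, "decode", 1.0), (2.0, "decode", 1.0)]

print(decode_waits(jobs, disaggregated=False))  # [49.0, 49.0]
print(decode_waits(jobs, disaggregated=True))   # [0.0, 0.0]
```

In the colocated case both decode jobs queue behind the 50-unit prefill; with separate workers they start immediately.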

Prime Intellect@PrimeIntellect

One Mooncake store pools KV cache across all nodes, so any worker can reuse any prefix.

The router picks workers by a score over load, queue depth, KV usage and prefix overlap. You get cross-replica cache hits with balanced routing across the whole deployment.
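A scoring rule over those four signals might look like the sketch below. The weights, the linear form, the `Worker` fields, and the function names are all assumptions made for illustration; the post does not give the actual formula, and the pooled Mooncake store is not modeled here (each worker is reduced to a single cached prefix).

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    load: float           # fraction of compute busy, 0..1
    queue_depth: int      # requests waiting
    kv_usage: float       # fraction of KV cache in use, 0..1
    cached_prefix: tuple  # token ids of one cached prefix (a simplification)


def prefix_overlap(prompt, cached):
    """Fraction of the prompt covered by the shared leading tokens."""
    shared = 0
    for a, b in zip(prompt, cached):
        if a != b:
            break
        shared += 1
    return shared / len(prompt) if prompt else 0.0


# Hypothetical weights: reward prefix overlap, penalize the other signals.
WEIGHTS = {"prefix": 2.0, "load": 1.0, "queue": 0.1, "kv": 0.5}


def score(worker, prompt):
    return (WEIGHTS["prefix"] * prefix_overlap(prompt, worker.cached_prefix)
            - WEIGHTS["load"] * worker.load
            - WEIGHTS["queue"] * worker.queue_depth
            - WEIGHTS["kv"] * worker.kv_usage)


def pick_worker(workers, prompt):
    """Choose the highest-scoring worker for this prompt."""
    return max(workers, key=lambda w: score(w, prompt))


busy = Worker("busy", load=0.9, queue_depth=4, kv_usage=0.8, cached_prefix=(1, 2, 3, 4))
idle = Worker("idle", load=0.2, queue_depth=0, kv_usage=0.3, cached_prefix=())

# Half the prompt is cached on the busy worker: load wins, so "idle" is picked.
print(pick_worker([busy, idle], (1, 2, 3, 4, 5, 6, 7, 8)).name)
# The whole prompt is cached on the busy worker: the cache hit wins, so "busy".
print(pick_worker([busy, idle], (1, 2, 3, 4)).name)
```

The two calls show the trade-off the post describes: a cache hit is worth taking only when it outweighs the cost of sending work to a loaded replica.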

Prime Intellect@PrimeIntellect

In RL, inference is the bottleneck — we optimize for throughput, not latency.

High concurrency, FP8 precision, and wide expert parallelism over 32+ GPUs. Every GPU holds its own slice of experts and acts as its own endpoint.

Prime Intellect@PrimeIntellect

Over a long run the trainer and inference policies slowly drift apart, and that mismatch can kill your training.

R3 (router replay) captures the routing decisions from the inference engine, replays them on the trainer - KL mismatch drops ~10x.
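The capture-and-replay idea can be sketched for a single token with top-k routing. This is a minimal illustration in plain Python, not prime-rl's implementation: the function names and the logit values are hypothetical, and the small logit difference between the two engines is assumed for the example.

```python
def top_k_experts(router_logits, k):
    """Indices of the k largest router logits, highest first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def route(router_logits, k, replayed=None):
    """Pick experts for one token, reusing recorded indices when given."""
    if replayed is not None:
        return list(replayed)
    return top_k_experts(router_logits, k)


# Hypothetical router logits for the same token on each side. Experts 2 and 3
# are nearly tied, so a tiny numerical difference flips their order.
inference_logits = [0.10, 2.00, 1.51, 1.50]
trainer_logits = [0.10, 2.00, 1.50, 1.51]

recorded = route(inference_logits, k=2)                      # captured at rollout
without_replay = route(trainer_logits, k=2)                  # trainer re-routes
with_replay = route(trainer_logits, k=2, replayed=recorded)  # trainer replays

print(recorded)        # [1, 2]
print(without_replay)  # [1, 3]
print(with_replay)     # [1, 2]
```

Without replay the trainer sends the token through a different expert than the one that generated it; with replay both sides use the same experts.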

Prime Intellect@PrimeIntellect

Huge thanks to the @vllm_project team, and @robertshaw21 in particular, for all the help along the way.

Also to the llm-d and Dynamo teams for the collaboration on routing and inference.

Prime Intellect@PrimeIntellect

The trainer is 3D-parallel (FSDP2 + CP + EP), built on TorchTitan.

FSDP2 shards params, grads & optimizer state. EP keeps experts sharded and routes tokens with all2all instead of all-gathering ~80GB per layer. CP handles the 131k context and GLM-5's DSA attention.
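The all2all token routing mentioned above can be simulated in plain Python: experts stay on their owning rank and tokens travel to them, instead of expert weights being gathered onto every rank. This is only a sketch. It uses no `torch.distributed` or DeepEP calls, and the contiguous expert placement and function names are assumptions for illustration.

```python
def owner_rank(expert_id, experts_per_rank):
    """Rank that holds this expert, assuming contiguous placement."""
    return expert_id // experts_per_rank


def all_to_all_dispatch(tokens_per_rank, experts_per_rank):
    """Send every token to the rank that owns its routed expert.

    tokens_per_rank[r] is a list of (token_id, expert_id) held by rank r.
    Returns received[r]: (token_id, expert_id, source_rank) tuples that rank r
    must run through its local experts. The source rank is kept so results
    could be sent back in a matching combine step.
    """
    world_size = len(tokens_per_rank)
    received = [[] for _ in range(world_size)]
    for src, tokens in enumerate(tokens_per_rank):
        for token_id, expert_id in tokens:
            dst = owner_rank(expert_id, experts_per_rank)
            received[dst].append((token_id, expert_id, src))
    return received


# Two ranks, four experts: rank 0 owns experts 0-1, rank 1 owns experts 2-3.
tokens_per_rank = [
    [("t0", 0), ("t1", 3)],  # tokens on rank 0 and their routed expert
    [("t2", 1), ("t3", 2)],  # tokens on rank 1
]
received = all_to_all_dispatch(tokens_per_rank, experts_per_rank=2)

print(received[0])  # [('t0', 0, 0), ('t2', 1, 1)]
print(received[1])  # [('t1', 3, 0), ('t3', 2, 1)]
```

Only token data crosses ranks in this scheme, which is the contrast the post draws with all-gathering expert weights for each layer.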

Patrick C Toulme@PatrickToulme

@PrimeIntellect why yall using torchtitan and not JAX?

Matej Sirovatka@m_sirovatka

@giffmana @PrimeIntellect It was kind of both - TP/PP in torch is somewhat miserable so I try to avoid it as much as I can. CP has the advantage that with DSA it is almost communication free here, so it is kinda a no-brainer in GLM case, in other cases it was mostly from previous experience

Matej Sirovatka@m_sirovatka

@PatrickToulme @PrimeIntellect my proposal for jax rewrite got outvoted (1:everyone else)
