ByteDance and Renmin University release iLLaDA, an 8B masked diffusion language model that outperforms Qwen2.5 7B

Story Overview

ByteDance and Renmin University researchers trained iLLaDA from scratch as a fully bidirectional 8B masked diffusion model, keeping the same masked-diffusion objective through both pre-training on 12 trillion tokens and supervised fine-tuning. The model uses grouped-query attention and tied input/output embeddings, and the authors report that the base model is slightly stronger than Qwen2.5 7B on average.
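For readers unfamiliar with the objective, here is a minimal sketch of masked-diffusion training as it is commonly formulated for models in the LLaDA family: sample a noise level t, mask each token independently with probability t, and score the model only on the masked positions with a 1/t weight. The story does not spell out iLLaDA's exact loss, so treat this as an illustration under that assumption. The names `mask_tokens`, `masked_diffusion_loss`, and `MASK_ID` are hypothetical, and the "model" is a stand-in that assigns a fixed probability to every target.

```python
import math
import random


def mask_tokens(tokens, t, mask_id, rng):
    """Independently replace each token with mask_id with probability t."""
    masked, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < t
        masked.append(mask_id if hit else tok)
        is_masked.append(hit)
    return masked, is_masked


def masked_diffusion_loss(log_probs_of_targets, is_masked, t):
    """Monte Carlo estimate of the loss for one sequence at noise level t.

    log_probs_of_targets[i] is the model's log-probability of the original
    token at position i, predicted from the partially masked sequence.
    Only masked positions contribute, and the sum is weighted by 1 / t.
    """
    total = sum(lp for lp, m in zip(log_probs_of_targets, is_masked) if m)
    return -total / (t * len(is_masked))


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [11, 42, 7, 99, 5, 23]
    MASK_ID = -1  # hypothetical id for the [MASK] token
    t = rng.uniform(0.05, 1.0)  # noise level; the lower bound avoids dividing by ~0
    masked, is_masked = mask_tokens(tokens, t, MASK_ID, rng)
    # Stand-in for a model: pretend it assigns probability 0.5 to every target.
    fake_log_probs = [math.log(0.5)] * len(tokens)
    print(masked, masked_diffusion_loss(fake_log_probs, is_masked, t))
```

Because every position can attend to every other position when predicting the masked tokens, the same loss can be reused for fine-tuning, which is consistent with the overview's point that the objective stays the same across both stages.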

Original post

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr · #613 in Tech

Improved Large Language Diffusion Models

"We introduce iLLaDA (improved LLaDA), an 8B fully bidirectional masked diffusion language model trained from scratch. For pre-training, iLLaDA scales the corpus to 12T tokens, uses grouped-query attention to reduce cache-style inference memory and tied input/output embeddings to reduce parameter count, and modifies the learning-rate schedule for large-scale training. For post-training, iLLaDA modifies the SFT strategy for variable-length generation and trains on a 25B-token instruction corpus for 12 epochs. For inference and evaluation, iLLaDA uses variable-length generation for efficiency and confidence-based scoring for multiple-choice benchmarks"

"Against Qwen2.5 7B, iLLaDA-Base is slightly stronger on average, while iLLaDA-Instruct still lags behind Qwen2.5 7B Instruct."

1:21 AM · Jun 25, 2026 · 3K Views
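The two architecture choices named in the abstract come down to simple arithmetic. The sketch below estimates how many parameters tying the input and output embeddings saves, and how much smaller a key/value cache becomes when several query heads share one KV head. Every configuration value here (vocabulary size, hidden size, layer and head counts, sequence length, 2-byte elements) is a hypothetical 8B-class setting chosen for illustration and is not taken from the iLLaDA paper. The abstract speaks of "cache-style inference memory"; how caching is applied in a bidirectional diffusion sampler is not described in the post.

```python
def embedding_params(vocab_size, hidden_size, tied):
    """Parameters in the input embedding plus the output projection."""
    one_matrix = vocab_size * hidden_size
    return one_matrix if tied else 2 * one_matrix


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for cached keys and values of one sequence (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical 8B-class configuration, NOT taken from the iLLaDA paper.
VOCAB, HIDDEN, LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 128_000, 4096, 32, 32, 8, 128

saved = embedding_params(VOCAB, HIDDEN, tied=False) - embedding_params(VOCAB, HIDDEN, tied=True)
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, seq_len=4096)
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096)

print(f"parameters saved by tying: {saved:,}")
print(f"K/V cache, one KV head per query head: {mha / 2**20:.0f} MiB")
print(f"K/V cache, grouped-query attention:    {gqa / 2**20:.0f} MiB")
```

Under these assumed numbers, the cache shrinks by the ratio of query heads to KV heads, and tying removes one full vocabulary-by-hidden matrix from the parameter count.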

Benchmark Edge

Where diffusion training actually helps

iLLaDA posts clear gains over earlier diffusion LLMs such as LLaDA and Dream on benchmarks including BBH and MATH, yet on average iLLaDA-Instruct still trails Qwen2.5 7B Instruct, a baseline trained on considerably more data.

Open Question

Access remains an open variable

The preprint links to a GitHub repo for weights and code but supplies no release date, license, or hosting details, so it is not yet known when or how developers can actually run the model.

Sentiment

Users are excited about ByteDance's iLLaDA 8B diffusion LLM outperforming Qwen2.5 because it shows alternative approaches can still improve and may deliver breakthroughs with open weights.

Positive: 100.0% · Negative: 0.0%

3 comments with sentiment.

Related links

Improved Large Language Diffusion Models

arxiv.org

Posts from X

Most Activity

Views: 3K · Likes: 31 · Replies: 4

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

So far ByteDance is supreme in videogen, but open weights LLMs from startups can humble Doubao Seed. Would be funny if they make a great leap forward when they finally figure out how to build fully diffusion-based LLMs too.

Xiuyu Li @sheriyuo

iLLaDA is an 8B masked diffusion language model trained from scratch with fully bidirectional attention, keeping the masked-diffusion objective all the way through pretraining and SFT rather than bolting diffusion onto an autoregressive base.

Improved Large Language Diffusion Models Paper: http://arxiv.org/abs/2606.25331

Bookmarks: 3

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

abs: https://arxiv.org/abs/2606.25331

Emily @IamEmily2050

@teortaxesTex I believe we should see something from them before end of the year, Seedance V2.5 and Seedream V5 Pro will help a lot, they will get a lot of feedback and new data and interesting RL problems to solve.

芝麻85 有点忙 @miu17096dbw

@teortaxesTex Open-source models really are about to catch up to the closed-source moat.

Jayita Bhattacharyya (JB) @jayitabhattac11

@iScienceLuvr Text diffusion is making quite a noise!

Ozar @ozarliquid

@teortaxesTex diffusion llms actually catching up feels like the plot twist nobody saw coming

wonder if seed's lead forces bytedance to open weights just to flex

The lena @lenooooo68

@iScienceLuvr The interesting part is that diffusion language models are still improving despite transformers dominating the conversation. It is good to see alternative approaches getting serious scaling efforts.

Drey @dreyfomo

@teortaxesTex fingers crossed diffusion LLMs are the real breakthrough. would be wild if open weights ends up beating closed source at their own game.
