University of Maryland's Vinay Samuel and co-authors introduce ReDiPO to restore output diversity in post-trained LLMs without losing alignment

The pipeline rewrites messy base outputs into synthetic preference pairs.

Original post

Vinay Samuel@vsamuel2003

Post-training makes LLMs safer and better at following instructions, but less diverse.

🤔 Can we get that diversity back without sacrificing alignment?

Introducing ReDiPO: a preference optimization recipe for restoring distributional diversity while preserving safety and instruction-following.

10:52 AM · Jun 2, 2026 · 10.1K Views

Posts from X

Mohit Iyyer@MohitIyyer

Base models generate more diverse responses than their post-trained counterparts but are also much worse at instruction following. We recover diversity with a simple recipe: the instruct model rewrites messy base generations into high-quality responses, and we then run DPO with this synthetic data. The resulting model is more diverse without sacrificing instruction following! Details 👇
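The post above says the synthetic pairs are used for DPO. As a reminder of what that objective computes, here is a minimal sketch of the standard per-pair DPO loss on summed sequence log-probabilities. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values taken from the ReDiPO paper or code.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the source.
    """
    # How much more the policy prefers "chosen" over "rejected" than the
    # reference model does, scaled by beta.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.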

Vinay Samuel@vsamuel2003

Takeaway: alignment should not mean model collapse for distributional diversity.

Base models can be reservoirs of valid alternatives that instruction tuning pushes down. ReDiPO shows how careful preference data can bring those modes back into instruct models. 🌱

Vinay Samuel@vsamuel2003

Paper: https://arxiv.org/abs/2605.30021 Code/data: https://github.com/vsamuel2003/ReDiPO Work with @YapeiChang and @MohitIyyer

Vinay Samuel@vsamuel2003

Post-training usually optimizes one answer at a time: helpful, safe, well-formatted. But for brainstorming, creative writing, and hypothesis generation, users often need many distinct valid answers, not multiple samples of the same idea.

ReDiPO uses the base model as a source of missing alternatives. For each prompt, it samples base + instruct responses, rewrites base outputs in instruct style, filters unsafe/low-quality candidates, then forms DPO pairs where quality is matched but marginal diversity differs. 🔁
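The last step of that pipeline, pairing candidates whose quality is matched but whose marginal diversity differs, can be sketched roughly as follows. This is only an illustration under loud assumptions: the thread does not say how quality or marginal diversity is measured, so `Candidate.quality`, the word-level Jaccard distance, the nearest-neighbour definition of marginal diversity, and the thresholds `quality_tol` and `min_div_gap` are all hypothetical stand-ins. The pool is assumed to be already rewritten and filtered.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Candidate:
    text: str
    quality: float  # hypothetical judge score, higher is better


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets (a crude stand-in)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def marginal_diversity(idx: int, pool: list) -> float:
    """Distance from pool[idx] to its nearest neighbour in the rest of the pool."""
    others = [c for j, c in enumerate(pool) if j != idx]
    if not others:
        return 0.0
    return min(jaccard_distance(pool[idx].text, o.text) for o in others)


def build_pairs(prompt: str, pool: list, quality_tol: float = 0.5, min_div_gap: float = 0.2) -> list:
    """Return preference pairs: similar quality, chosen clearly more diverse."""
    div = [marginal_diversity(i, pool) for i in range(len(pool))]
    pairs = []
    for i, j in permutations(range(len(pool)), 2):
        quality_matched = abs(pool[i].quality - pool[j].quality) <= quality_tol
        if quality_matched and div[i] - div[j] >= min_div_gap:
            pairs.append(
                {"prompt": prompt, "chosen": pool[i].text, "rejected": pool[j].text}
            )
    return pairs
```

Matching on quality first means the preference signal in each pair comes from diversity alone, which is presumably why the thread stresses that quality is held fixed. The `prompt` / `chosen` / `rejected` keys follow a common convention for preference datasets and are not confirmed to match the released data.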

Vinay Samuel@vsamuel2003

Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, ReDiPO boosts NoveltyBench distinct_k over instruct checkpoints by: 📈 +134% 📈 +33% 📈 +44%

Across all tested models, these novelty gains do not compromise instruction following or safety alignment. 🛡️
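For readers unfamiliar with the metric, distinct_k is taken here to mean the number of meaningfully different responses among k samples for one prompt. The sketch below counts them by greedy partitioning: each response either matches an existing class representative or starts a new class. The equivalence test is the important part, and the toy `same_first_word` function is purely an assumption for illustration; NoveltyBench's own judge of equivalence is not reproduced here.

```python
def distinct_k(generations, equivalent):
    """Count equivalence classes among k generations by greedy partitioning.

    `equivalent(a, b)` should return True when two responses count as
    "the same idea". Each generation is compared against one
    representative per class found so far.
    """
    representatives = []
    for g in generations:
        if not any(equivalent(g, r) for r in representatives):
            representatives.append(g)
    return len(representatives)


def same_first_word(a: str, b: str) -> bool:
    """Toy equivalence check (hypothetical): same first word, case-insensitive."""
    return a.split()[0].lower() == b.split()[0].lower()
```

Under this reading, a reported gain of +134% means the fine-tuned model's distinct_k is 2.34 times that of the instruct checkpoint.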

Vinay Samuel@vsamuel2003

When asked to “...come up with a type of monster and briefly describe its backstory”, Instruct/DPO/DivPO variants of Qwen3-4B keep circling Hollow/Glimmer child-spirit variants. ReDiPO spreads out into distinct archetypes like Shattered Watcher, Gloom-Spider, and Gloomshard. ✨

Vinay Samuel@vsamuel2003

Ablations show that marginal-diversity pair selection drives the diversity recovery, base-response rewriting makes candidates usable, and quality/safety filters help preserve alignment. 🧪

Vinay Samuel@vsamuel2003

Compared with DDPO, a DPO baseline for creative writing, ReDiPO gives up some raw novelty but keeps the model much more aligned (scores listed as DDPO vs ReDiPO): ✅ MTBench: 5.29 vs 7.38 ✅ IFEval: 0.566 vs 0.817 ✅ HarmBench ASR (lower is better): 0.296 vs 0.075 ✅ Arena-Hard: 0.198 vs 0.287
