
MIRO Introduces Multi-Reward Training for Aligned Text-to-Image Models

Nicolas DUFOUR@nico_dufour

Excited to share that MIRO is accepted to ICML 2026 @icmlconf ! 🎉

We introduce multi-reward conditioned training for text-to-image. By training on continuous reward scores, we can simply condition on HIGH REWARDS at inference to guarantee top-tier, aligned outputs.
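To make the idea concrete, here is a minimal sketch of how reward scores could become a conditioning signal. It assumes min-max scaling of each reward to [0, 1]; the thread does not say how MIRO normalizes its scores, so the scaling, the function names, and the toy numbers are illustrative assumptions rather than the released implementation.

```python
from typing import List


def normalize_rewards(raw: List[List[float]]) -> List[List[float]]:
    """Min-max scale each reward column to [0, 1] over the whole dataset.

    Assumption: min-max scaling is used only for illustration.
    """
    n_rewards = len(raw[0])
    lows = [min(row[j] for row in raw) for j in range(n_rewards)]
    highs = [max(row[j] for row in raw) for j in range(n_rewards)]
    scaled = []
    for row in raw:
        scaled.append([
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_rewards)
        ])
    return scaled


def high_reward_condition(n_rewards: int) -> List[float]:
    """Conditioning vector for inference: every reward set to its maximum."""
    return [1.0] * n_rewards


# Hypothetical raw scores for three training images under two reward models.
raw_scores = [[1.0, 10.0], [3.0, 20.0], [2.0, 15.0]]

# During training, each image is paired with its own scaled reward vector.
train_conditions = normalize_rewards(raw_scores)

# At inference, the model is asked for the high-reward end of every axis.
inference_condition = high_reward_condition(n_rewards=2)
```

During training the model sees the whole range of scores, including low ones; only at inference is the vector pinned to the top of each range.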

7:35 AM · May 20, 2026 · 12.1K Views
Nicolas DUFOUR@nico_dufour

We’ve open-sourced everything, including all individual reward-ablated model variants!

🌐 Site: https://nicolas-dufour.github.io/miro/
📄 Paper: https://arxiv.org/abs/2510.25897
🛠️ Git: https://github.com/nicolas-dufour/miro
🤗 HF: https://huggingface.co/nicolas-dufour/miro
🎨 Demo: https://huggingface.co/spaces/nicolas-dufour/miro

Nicolas DUFOUR@nico_dufour

First, a reminder of MIRO's principal results and efficiency:

⚡ Up to 19× faster training convergence than standard baselines.
📉 34× fewer parameters & 370× cheaper inference compute than models like FLUX, while maintaining competitive visual quality.

Nicolas DUFOUR@nico_dufour

Are all 7 rewards actually useful? Yes!

Our new "leave-one-out" ablation shows that removing even a single reward drops overall performance. Even though these rewards are quite entangled, each one still injects unique, useful bits of info.
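A leave-one-out ablation of this kind can be sketched as a set of training configurations, each dropping exactly one reward. The reward names below are placeholders (the thread does not list the seven rewards), and the helper is hypothetical rather than taken from the MIRO repository.

```python
from typing import Dict, List

# Placeholder names: the thread only says there are 7 rewards.
REWARD_NAMES: List[str] = [f"reward_{i}" for i in range(1, 8)]


def leave_one_out_configs(rewards: List[str]) -> Dict[str, List[str]]:
    """Map each dropped reward to the list of rewards kept for that run."""
    return {
        dropped: [name for name in rewards if name != dropped]
        for dropped in rewards
    }


ablations = leave_one_out_configs(REWARD_NAMES)

# One model variant would be trained per entry, then compared against the
# model trained with all seven rewards.
for dropped, kept in ablations.items():
    print(f"without {dropped}: train on {len(kept)} rewards")
```

Each variant differs from the full model by a single reward, so any drop in the overall metric can be attributed to that reward's contribution.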

Nicolas DUFOUR@nico_dufour

By conditioning on a vector of 7 rewards simultaneously, MIRO naturally balances conflicting objectives and avoids reward hacking.

This yields a major jump in text composition, achieving SOTA scores on GenEval, PickAScore, and HPSv2.

We can also control the reward mix at test time.
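As a rough illustration of test-time control, the sketch below builds a 7-dimensional conditioning vector in which individual rewards can be turned down while the rest stay high. The reward names, the [0, 1] range, and the `reward_mix` helper are assumptions made for this example, not the interface of the released code.

```python
from typing import Dict, List, Optional

# Placeholder names: the thread does not list the seven rewards.
REWARD_NAMES: List[str] = [f"reward_{i}" for i in range(1, 8)]


def reward_mix(
    names: List[str],
    overrides: Optional[Dict[str, float]] = None,
    default: float = 1.0,
) -> List[float]:
    """Build a conditioning vector, overriding selected rewards.

    Assumes rewards are scaled to [0, 1], with 1.0 meaning "highest".
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(names)
    if unknown:
        raise KeyError(f"unknown rewards: {sorted(unknown)}")
    vector = []
    for name in names:
        value = overrides.get(name, default)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        vector.append(value)
    return vector


# Ask for high scores everywhere except one reward, which is relaxed.
mix = reward_mix(REWARD_NAMES, {"reward_2": 0.25})
```

Because the trade-off lives in the conditioning vector, changing the mix is an inference-time choice and needs no retraining.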

Nicolas DUFOUR@nico_dufour

We also expand MIRO beyond training from scratch: it also works as a post-training framework!

Applying multi-reward conditioning while fine-tuning an existing base model yields robust, controllable alignment at inference.

Nicolas DUFOUR@nico_dufour

We’ve added a firm mathematical foundation.

Our new theorem proves that conditioning on the joint reward distribution guarantees the model steers toward high-reward regions while preserving sample diversity and avoiding reward hacking.
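For intuition only, and not as a statement of the theorem itself, the standard Bayes identity behind reward conditioning can be written as follows. The symbols are chosen here for illustration: $x$ is an image, $c$ a prompt, and $\mathbf{s}$ the vector of reward scores.

```latex
p(x \mid c, \mathbf{s})
  = \frac{p(\mathbf{s} \mid x, c)\, p(x \mid c)}{p(\mathbf{s} \mid c)}
  \;\propto\; p(\mathbf{s} \mid x, c)\, p(x \mid c)
```

Conditioning on a high $\mathbf{s}$ reweights the base distribution $p(x \mid c)$ toward images that score well on all rewards jointly, without putting mass on images the base distribution excludes. That is consistent with the diversity claim above, though the exact assumptions and guarantees are those of the paper's theorem, not of this identity.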

Nicolas DUFOUR@nico_dufour

Work done with @lucasdegeorge, @sohonjitghosh, @VickyKalogeiton and @david_picard
