MIRO Introduces Multi-Reward Training for Aligned Text-to-Image Models

Original post

Andrei Bursuc @CVPR

Nicolas DUFOUR @nico_dufour

Excited to share that MIRO has been accepted to ICML 2026 @icmlconf! 🎉

We introduce multi-reward conditioned training for text-to-image. By training on continuous reward scores, we can simply condition on HIGH REWARDS at inference to guarantee top-tier, aligned outputs.
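
To make the idea concrete, here is a minimal sketch of how the conditioning signal could differ between training and inference. It is not taken from the MIRO code: the function names are hypothetical, and it assumes reward scores have already been normalized to [0, 1].

```python
from typing import List, Sequence


def clip01(value: float) -> float:
    """Clamp a score into [0, 1] (assumes rewards were normalized upstream)."""
    return min(1.0, max(0.0, value))


def training_condition(sample_scores: Sequence[float]) -> List[float]:
    """Training: condition each sample on the reward scores it actually received."""
    return [clip01(s) for s in sample_scores]


def inference_condition(num_rewards: int, target: float = 1.0) -> List[float]:
    """Inference: request a high value on every reward."""
    return [clip01(target)] * num_rewards


# A mediocre training sample keeps its own (clipped) scores as conditioning.
print(training_condition([0.31, 0.82, 1.4]))
# At sampling time, every reward is simply set to the high end of the range.
print(inference_condition(3))
```

The point of the sketch is that the model sees the whole range of reward values during training, so no samples need to be filtered out, while inference only ever asks for the high end.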

7:35 AM · May 20, 2026 · 12.1K Views

Nicolas DUFOUR @nico_dufour

We’ve open-sourced everything, including all individual reward-ablated model variants!

🌐 Site: https://nicolas-dufour.github.io/miro/
📄 Paper: https://arxiv.org/abs/2510.25897
🛠️ Git: https://github.com/nicolas-dufour/miro
🤗 HF: https://huggingface.co/nicolas-dufour/miro
🎨 Demo: https://huggingface.co/spaces/nicolas-dufour/miro

Nicolas DUFOUR @nico_dufour

First, a reminder of MIRO's principal results and efficiency:

⚡ Up to 19× faster training convergence than standard baselines.
📉 34× fewer parameters & 370× cheaper inference compute than models like FLUX, while maintaining competitive visual quality.

Nicolas DUFOUR @nico_dufour

Are all 7 rewards actually useful? Yes!

Our new "leave-one-out" ablation shows that removing even a single reward drops overall performance. Even though these rewards are quite entangled, each one still injects unique, useful bits of info.
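
As a rough illustration of what a leave-one-out ablation enumerates, the sketch below lists the seven training configurations, each dropping one reward. The reward names are placeholders, since the thread does not list them, and the helper is hypothetical rather than part of the released code.

```python
from typing import Dict, List

# Placeholder names: the thread says there are 7 rewards but does not name them.
REWARDS: List[str] = [f"reward_{i}" for i in range(1, 8)]


def leave_one_out(rewards: List[str]) -> Dict[str, List[str]]:
    """Map each dropped reward to the rewards kept for that ablation run."""
    return {dropped: [r for r in rewards if r != dropped] for dropped in rewards}


for dropped, kept in leave_one_out(REWARDS).items():
    print(f"without {dropped}: train on {len(kept)} rewards")
```

Each of the seven runs would then be compared against the model trained on all rewards, so a drop in any of them points to that reward carrying some information the other six do not.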

Nicolas DUFOUR @nico_dufour

By conditioning on a vector of 7 rewards simultaneously, MIRO naturally balances conflicting objectives and avoids reward hacking.

This yields a major jump in text composition, achieving SOTA scores on GenEval, PickScore, and HPSv2.

We can also control the reward mix at test time.
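
A small sketch of what a test-time reward mix could look like: a target vector that defaults to the maximum on every reward and lets the user lower or raise individual entries. The function name, the reward names, and the [0, 1] scale are all assumptions made for illustration, not the MIRO API.

```python
from typing import Dict, List, Optional


def reward_mix(
    names: List[str],
    default: float = 1.0,
    overrides: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Build a target reward vector, one entry per reward, in the order of `names`."""
    overrides = overrides or {}
    unknown = set(overrides) - set(names)
    if unknown:
        # Fail loudly rather than silently ignoring a misspelled reward name.
        raise KeyError(f"unknown rewards: {sorted(unknown)}")
    return [overrides.get(name, default) for name in names]


# Hypothetical reward names, used only for this example.
names = ["aesthetic", "alignment", "preference"]
print(reward_mix(names))                                # all rewards at the default
print(reward_mix(names, overrides={"aesthetic": 0.4}))  # de-emphasize one reward
```

Because the vector is just an input to the sampler, changing the mix in this scheme needs no retraining.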

Nicolas DUFOUR @nico_dufour

We also extend MIRO beyond training from scratch: it works as a post-training framework too!

Applying multi-reward conditioning while fine-tuning an existing base model yields robust, controllable alignment at inference.

Nicolas DUFOUR @nico_dufour

We’ve added a firm mathematical foundation.

Our new theorem proves that conditioning on the joint reward distribution guarantees the model steers toward high-reward regions while preserving sample diversity and avoiding reward hacking.
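
The thread does not state the theorem itself, so the following is only a generic way to read the claim, not the paper's result. Writing x for the image, c for the prompt, and s = (s_1, ..., s_7) for the reward vector (notation assumed here), Bayes' rule gives:

```latex
p(x \mid c, s) \;=\; \frac{p(s \mid x, c)\, p(x \mid c)}{p(s \mid c)}
\;\propto\; p(s \mid x, c)\, p(x \mid c)
```

Under this reading, conditioning on a high s reweights the base distribution p(x | c) toward images likely to receive those scores, rather than maximizing any single reward, which is consistent with the stated aim of keeping sample diversity.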

Nicolas DUFOUR @nico_dufour

Work done with @lucasdegeorge, @sohonjitghosh, @VickyKalogeiton and @david_picard