wow we finally don't need AdaLN?
Xianbang Wang @kevinxbwang2007
Our simple rule: remove every part that seems to be removable. We start with pixel space, the standard T5-L encoder, and a simple multimodal MM-JiT backbone with x-prediction.
5:54 AM · Jun 19, 2026