13h ago

SenseTime open-sources SenseNova U1, a series of native multimodal models that integrate image and text understanding, reasoning, and generation using the NEO-unify architecture.

Models post state-of-the-art open-source benchmark results with interleaved image-text output support.

Original post

Chinese AI lab SenseTime just open-sourced SenseNova U1, a unified multimodal model that can understand, reason about, and generate images and text inside one model. The interesting part is the architecture: it removes the usual visual encoder and variational auto-encoder setup and handles image and language inside a shared representation space instead of passing them between separate modules. That means fewer handoffs, less information loss, and better consistency, so the model can generate coherent text and images together in one flow. That is why it is strong at dense visual content such as infographics, guides, posters, comics, and step-by-step image-text workflows. For infographic generation specifically, it is also around 2x faster than Qwen-Image-2.0 / Seedream-4.5 while staying in the same rough quality band, based on the client benchmark chart. 1/n

9:09 AM · May 20, 2026

1/ I have been spending time with SenseNova U1, a native multimodal model series released by @SenseTime_Al.

It is built on an architecture called NEO-unify that processes images and text together in one single system. It is a big change from the usual way of handing tasks off between separate components.

Look at this thread 🧵:

5:59 PM · May 21, 2026 · 5.1K Views

2/ I tested the interleaved generation by starting with a monochrome sketch of Darth Vader. It builds out mechanical textures and lighting step by step while keeping the same visual style throughout.

Watching it refine the image without losing the original structure is pretty impressive.

Chubby♨️ @kimmonismus
5:59 PM · May 21, 2026 · 1.2K Views

3/ It also handles high-density info rendering very well. I generated this awareness infographic to see how it manages complex layouts.

It kept the text clean and the icons structured, which is a common pain point for open-source models.

5:59 PM · May 21, 2026 · 449 Views

6/ The full Technical Report is now out, their most detailed model disclosure yet. SenseNova-U1-A3B-MoT (38B-A3B MoE) weights are now open-sourced.

You can check out the report or try the tools at the links below.

Try it here: https://unify.light-ai.top/login?next=%2Fhome

Technical Report: https://arxiv.org/abs/2605.12500

GitHub: https://github.com/OpenSenseNova/SenseNova-U1

Hugging Face: https://huggingface.co/collections/sensenova/sensenova-u1

Chubby♨️ @kimmonismus

5/ They are open-sourcing the Lite series in two sizes: an 8B dense model and an A3B mixture-of-experts version. There are also some product updates. An 8-step distilled LoRA is now open-sourced, cutting inference from 23s to 2s on an H100 (100 NFE -> 8 NFE). ComfyUI is now supported, with ready-to-run workflows for text-to-image, image editing, and interleaved generation.

5:59 PM · May 21, 2026 · 1.6K Views
5:59 PM · May 21, 2026 · 1.5K Views
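
For anyone who wants to sanity-check the distilled LoRA numbers in 5/, here is a quick back-of-envelope sketch. It only uses the figures quoted in the thread (23s to 2s, 100 NFE to 8 NFE) and assumes both timings are end-to-end for a single generation on the same H100, which the thread does not spell out. The variable and function names are made up for illustration.

```python
# Figures quoted in the thread; assumed to be end-to-end timings on one H100.
baseline_seconds, distilled_seconds = 23.0, 2.0
baseline_nfe, distilled_nfe = 100, 8  # NFE = number of function evaluations


def ratio(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


latency_speedup = ratio(baseline_seconds, distilled_seconds)
nfe_reduction = ratio(baseline_nfe, distilled_nfe)

# Rough cost per evaluation, before and after distillation.
per_eval_before = baseline_seconds / baseline_nfe
per_eval_after = distilled_seconds / distilled_nfe

print(f"latency speedup: {latency_speedup:.1f}x")
print(f"NFE reduction:   {nfe_reduction:.1f}x")
print(f"seconds per evaluation: {per_eval_before:.2f} -> {per_eval_after:.2f}")
```

The two ratios come out close to each other (11.5x vs 12.5x), so under these assumptions almost all of the speedup is explained by running fewer evaluations rather than by each evaluation getting cheaper.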