20h ago

SenseTime Open-Sources SenseNova-U1 Native Multimodal Model With MoE Backbone

Chinese AI labs are increasingly releasing very serious open-source work. SenseNova U1 just dropped on Hugging Face: native multimodal modeling, MoT architecture (38B total, 3B active MoE). It attacks the hardest part of image generation: readable, structured, consistent image-text output. The most interesting part of SenseNova U1 is that it treats multimodal generation as one native modeling problem, not a chain of separate vision, language, and image modules. That means less handoff between modules, less information loss, and better consistency when creating dense visual content like infographics, guides, posters, comics, and image-text workflows. It also comes with ComfyUI support and fast A3B inference.

8:31 PM · May 19, 2026

Chinese AI lab SenseTime just open-sourced SenseNova U1, a unified multimodal model that can understand, reason, and generate images + text inside 1 model.

The interesting part is the architecture: it removes the usual visual encoder and variational auto-encoder setup and handles image and language inside a shared representation space, instead of passing them between separate modules.

That means less handoff between modules, less information loss, and better consistency when creating dense visual content.

It also lets the model generate coherent text and images together in one flow, which is why it is strong for infographics, guides, comics, posters, and step-by-step visual content.

For infographic generation specifically, it is also around 2x faster than Qwen-Image-2.0 / Seedream-4.5 while staying in the same rough quality band, based on the client benchmark chart. 1/n

4:09 PM · May 20, 2026 · 1.3K Views

2/n Most multimodal systems still feel stitched together.

- 1 part reads the image.
- 1 part turns that into tokens.
- 1 part reasons over language.
- 1 part sends instructions to an image generator.

Every handoff can lose detail.

SenseNova U1 is trying a cleaner route: keep vision and language closer from the start, so the model can reason across both without constantly translating between separate systems.

Rohan Paul (@rohanpaul_ai)

4:09 PM · May 20, 2026 · 206 Views

3/n The release includes the SenseNova U1 Lite series:

- SenseNova U1-8B-MoT, built on a dense backbone.
- SenseNova U1-A3B-MoT, built on a mixture-of-experts backbone.

The impressive part is the size-performance tradeoff. U1 Lite reaches leading results among open-source models of similar scale, and even gets close to commercial image models on generation quality while being faster in inference.

4:09 PM · May 20, 2026 · 143 Views
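
The thread does not say how the A3B backbone routes tokens, so the following is only a generic sketch of top-k expert routing, the usual reason a mixture-of-experts model runs a fraction of its weights per token. The expert count, `k`, the gate scores, and the toy scalar experts are all hypothetical and are not SenseNova U1's configuration; only the 38B total / 3B active figures come from the thread.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected experts so their weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the k routed experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Hypothetical toy layer: 8 experts, each a scalar function, 2 of them run per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

routes = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)

# Share of parameters active per token, using the 38B-A3B figures from the thread.
active_fraction = 3 / 38
print(routes, output, f"{active_fraction:.1%}")
```

With 3B of 38B parameters active, roughly 8% of the weights take part in each token's forward pass, which is where the inference-speed advantage over a dense model of the same total size would come from.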

4/n The most useful part is the dense information rendering.

Infographics are hard for image models because they require layout control, readable text, visual structure, and semantic consistency at the same time.

SenseNova U1 is built for exactly this kind of output: knowledge posters, presentation-style visuals, comics, structured guides, and long image-text content where the text and images need to agree with each other.

4:09 PM · May 20, 2026 · 37 Views

5/n Another big piece is interleaved image-text generation.

That means the model can produce a full flow where text and visuals appear together, step by step, instead of generating a single image and stopping there.

The attached examples show the model generating a complete multimodal sequence, pairing text and visuals across cooking guidance, iterative sketch refinement, comic-style drawing, and multi-view scene generation.

This is useful for tutorials, product guides, visual storytelling, education content, and agent workflows.

4:09 PM · May 20, 2026 · 46 Views
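
To make "interleaved" concrete, here is a minimal sketch of how a client might hold such an output: an ordered list whose items are either text or images. The `TextSegment` and `ImageSegment` types, the `render_outline` helper, and the sample cooking guide are hypothetical illustrations, not part of the SenseNova U1 release or its API.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    # Placeholder description; a real client would hold pixels or a file path.
    label: str

Segment = Union[TextSegment, ImageSegment]

def render_outline(segments: List[Segment]) -> List[str]:
    """Flatten an interleaved sequence into one line per segment, in order."""
    lines = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            lines.append(f"text: {seg.text}")
        else:
            lines.append(f"image: [{seg.label}]")
    return lines

# Hypothetical step-by-step guide: each instruction is followed by its picture.
guide: List[Segment] = [
    TextSegment("Step 1: whisk the eggs."),
    ImageSegment("eggs in a bowl"),
    TextSegment("Step 2: pour into a hot pan."),
    ImageSegment("omelette in a pan"),
]

outline = render_outline(guide)
for line in outline:
    print(line)
```

The point of the ordered list is that text and images keep their relative positions, which is what separates an interleaved guide from a single image plus a caption.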

6/n SenseNova U1-8B-MoT is around 2x faster than Qwen-Image-2.0 / Seedream-4.5 at generation while staying in roughly the same quality band.

4:09 PM · May 20, 2026 · 694 Views

The full Technical Report is now out, their most detailed model disclosure yet.

Paper - http://arxiv.org/abs/2605.12500

Also: SenseNova-U1-A3B-MoT weights (38B-A3B MoE) are open-sourced.

They also open-sourced an 8-step distilled LoRA: 100 NFE → 8 NFE, cutting H100 inference from 23s to 2s. ComfyUI is now supported, with ready-to-run workflows for t2i, image editing, and interleaved generation.
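
As a quick sanity check on those numbers, the arithmetic below uses only the four figures quoted above. It assumes NFE means number of function evaluations (model forward passes during sampling), which is the usual reading, and that the timings are rounded.

```python
# Figures quoted in the thread for the distilled LoRA on an H100.
nfe_before, nfe_after = 100, 8
secs_before, secs_after = 23.0, 2.0

nfe_reduction = nfe_before / nfe_after        # fewer model evaluations
latency_speedup = secs_before / secs_after    # end-to-end speedup

# Cost per evaluation before and after distillation.
secs_per_nfe_before = secs_before / nfe_before
secs_per_nfe_after = secs_after / nfe_after

print(f"NFE reduction:   {nfe_reduction:.1f}x")
print(f"Latency speedup: {latency_speedup:.1f}x")
print(f"Seconds per NFE: {secs_per_nfe_before:.2f} -> {secs_per_nfe_after:.2f}")
```

The step count drops 12.5x and latency drops 11.5x, so the cost per evaluation stays at roughly a quarter of a second. The speedup comes almost entirely from running fewer steps, not from making each step cheaper.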

Also worth checking the SenseNova-Skills examples if you want prompt guides for infographic generation.

Built by @SenseTime_AI #AI #OpenSourceAI #MultimodalAI

Discord: https://discord.gg/BuTXPHmQub

Check it out here:

- GitHub: https://github.com/OpenSenseNova/SenseNova-U1

- Hugging Face: https://huggingface.co/collections/sensenova/sensenova-u1

4:09 PM · May 20, 2026 · 683 Views