1d ago

HiDream Open-Sources 8B Pixel-Level Unified Transformer Image Model

145311308.3K

——0——

Original post

HiDream just open-sourced an 8B image model with a big message behind it: the old diffusion pipeline (VAE-plus-text-encoder) may not be the only serious path left. 8B param, HiDream-O1-Image (8B) claims parity with models over 3x its size (e.g., 27B Qwen-Image). @HiDream_AI , @vivago_ai Key Features 🧬 Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder. 🎨 One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture. 🧠 Reasoning-Driven Prompt Agent — Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation. 🖼️ Native High Resolution — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail. ⚡ Exceptional Efficiency and Versatility at 8B Scale — With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models. Most image models still split the job across a text encoder, a VAE, and a diffusion model, so details can get lost when real pixels are compressed into hidden image codes. HiDream-O1-Image removes that split by using a Pixel-level Unified Transformer, where raw image patches, text tokens, and task conditions enter the same model space. That means text-to-image, image editing, and subject personalization become variants of one in-context generation task, not separate pipelines. A prompt agent first rewrites messy user requests into clearer visual instructions, reasoning through layout, subject attributes, physics, and context before generation. The strongest result is text rendering. On LongText-Bench, the 8B model scores 0.979 in English and 0.978 in Chinese, while the 200B+ model reaches 0.982 and 0.980. That is the part to watch, because clean text inside generated images is still one of the hardest problems for image models. 🧵 1.

11:03 AM · May 18, 2026

Reposted by

#1032@ROHANPAUL_AI

#1032Rohan Paul@ROHANPAUL_AI

🧵 2. This diagram explains why HiDream is not just another image generator with a bigger parameter count.

HiDream maps text, reference images, task instructions, and raw image patches into one shared token space before generation starts.

The model then sends those mixed tokens through one Unified Transformer Backbone, where self-attention lets every part of the request look at every other part.

For text-to-image, the condition can be only a prompt, such as cattle grazing in a field.

For image editing, the condition includes an input image plus an instruction, such as changing grass, sky, and lighting while keeping the same animal.

For subject-driven personalization, the condition includes a reference subject, such as a person, so the model can preserve identity while creating a new scene.

The right side explains the generation process: the model predicts clean image patches from noisy image patches, then stitches those patches back into a final image.

The important technical shift is that HiDream treats generation, editing, and personalization as the same kind of token prediction problem, instead of building a different pipeline for each task.

Rohan Paul@rohanpaul_ai

6:03 PM · May 18, 2026 · 5.1K Views

6:03 PM · May 18, 2026 · 328 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 3. HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena,

Rohan Paul@rohanpaul_ai

🧵 2. This diagram explains why HiDream is not just another image generator with a bigger parameter count. HiDream maps text, reference images, task instructions, and raw image patches into one shared token space before generation starts. The model then sends those mixed tokens through one Unified Transformer Backbone, where self-attention lets every part of the request look at every other part. For text-to-image, the condition can be only a prompt, such as cattle grazing in a field. For image editing, the condition includes an input image plus an instruction, such as changing grass, sky, and lighting while keeping the same animal. For subject-driven personalization, the condition includes a reference subject, such as a person, so the model can preserve identity while creating a new scene. The right side explains the generation process: the model predicts clean image patches from noisy image patches, then stitches those patches back into a final image. The important technical shift is that HiDream treats generation, editing, and personalization as the same kind of token prediction problem, instead of building a different pipeline for each task.

6:03 PM · May 18, 2026 · 328 Views

6:03 PM · May 18, 2026 · 272 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 4. Standard image models usually compress images before generation, which saves compute but can lose tiny details like sharp text, clean texture, and exact layout.

HiDream-O1-Image instead patchifies raw pixels and feeds them into the same Transformer stream as text, edits, and reference images.

Rohan Paul@rohanpaul_ai

🧵 3. HiDream-O1-Image (codename: Peanut) debuts at #8 in the Artificial Analysis Text to Image Arena,

6:03 PM · May 18, 2026 · 272 Views

6:03 PM · May 18, 2026 · 84 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 5. General text-to-image generation at up to 2,048 × 2,048.

Rohan Paul@rohanpaul_ai

🧵 4. Standard image models usually compress images before generation, which saves compute but can lose tiny details like sharp text, clean texture, and exact layout. HiDream-O1-Image instead patchifies raw pixels and feeds them into the same Transformer stream as text, edits, and reference images.

6:03 PM · May 18, 2026 · 84 Views

6:03 PM · May 18, 2026 · 73 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 6. Long-text rendering & layout control — accurate, multi-region, multilingual text.

Rohan Paul@rohanpaul_ai

🧵 5. General text-to-image generation at up to 2,048 × 2,048.

6:03 PM · May 18, 2026 · 73 Views

6:03 PM · May 18, 2026 · 75 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 7. Subject-driven personalization — preserve identity / IP across new scenes.

Rohan Paul@rohanpaul_ai

🧵 6. Long-text rendering & layout control — accurate, multi-region, multilingual text.

6:03 PM · May 18, 2026 · 75 Views

6:03 PM · May 18, 2026 · 58 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 8. The model also uses a Prompt Agent built on Gemma. When the user gives a messy prompt, the agent first reasons about layout, subject details, physical logic, and scene context, then turns that into a clearer generation prompt.

That is useful for prompts like posters, ads, storyboards, product shots, and multi-object edits.

Rohan Paul@rohanpaul_ai

🧵 7. Subject-driven personalization — preserve identity / IP across new scenes.

6:03 PM · May 18, 2026 · 58 Views

6:03 PM · May 18, 2026 · 284 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 9. HiDream-O1-Image’s 8B model beats GPT Image 2, Seedream-4.0, FLUX.2, and Qwen-Image on “LongText-Bench”.

And this is the hard part that usually breaks image models: it keeps long English and Chinese text readable inside generated images

Rohan Paul@rohanpaul_ai

6:03 PM · May 18, 2026 · 284 Views

6:03 PM · May 18, 2026 · 242 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 10. More beautiful results from HiDream-O1-Image

Rohan Paul@rohanpaul_ai

🧵 9. HiDream-O1-Image’s 8B model beats GPT Image 2, Seedream-4.0, FLUX.2, and Qwen-Image on “LongText-Bench”. And this is the hard part that usually breaks image models: it keeps long English and Chinese text readable inside generated images

6:03 PM · May 18, 2026 · 242 Views

6:03 PM · May 18, 2026 · 95 Views

#1032Rohan Paul@ROHANPAUL_AI

Github: https://github.com/HiDream-ai/HiDream-O1-Image

Huggingface: https://huggingface.co/HiDream-ai/HiDream-O1-Image

Rohan Paul@rohanpaul_ai

🧵 11. The text rendering capability of HiDream-O1-Image is truly the finest.

6:03 PM · May 18, 2026 · 980 Views

6:03 PM · May 18, 2026 · 964 Views

#1032Rohan Paul@ROHANPAUL_AI

🧵 11. The text rendering capability of HiDream-O1-Image is truly the finest.

Rohan Paul@rohanpaul_ai

🧵 10. More beautiful results from HiDream-O1-Image

6:03 PM · May 18, 2026 · 95 Views

6:03 PM · May 18, 2026 · 980 Views

HiDream Open-Sources 8B Pixel-Level Unified Transformer Image Model

Sentiment

Cluster engagement