Tech · 26d ago

Reve launches Reve 2.0, a 4K image generator using layout-based code control that ranks second on Text-to-Image Arena

The model ranks just behind OpenAI's gpt-image-2

#225

Original post

Robert Scoble#447

Reve@reve

Today, we’re launching Reve 2.0, the best 4K image model in the world.

We invented a new way to generate and edit any image using precise layouts. For the first time, it’s possible to create images you can touch.

12:50 PM · Jun 3, 2026 · 11.1M Views

Sentiment

Many users praised Reve 2.0 for its precise layout control and its #2 Arena ranking, calling it a major improvement for production use. Others complained about cartoonish outputs and lost style consistency, and some demanded refunds.

Positive: 90.3% · Negative: 9.7%

101 comments with sentiment.

Posts from X

Most Activity

Views 43.2K · Bookmarks 150 · Likes 189 · Replies 20

swyx @aiDotEngineer WF@swyx

you guys know where this is going right

Hasan@hasanluongo

wow this @reve 2.0 launch copy is superb.

"it is now clear that the key to both controllable image generation and editing is not denser prompts, but a highly detailed, highly manipulatable, intermediate representation expressed as code."

"Creativity is not, and will never be, a one-shot workflow. But modern image generation models punish iteration through progressive degradation."

"Alan Kay famously said that people who are serious about software should make their own hardware. At Reve, we believe the same principle applies to creativity: companies that are truly serious about creative tooling should train their own models."

and dang look at these:

26d · Views 43.2K · Likes 189 · Bookmarks 150

Retweets 14

Taesung Park@Taesung

Diffusion models are known to be very compute intensive, even more so than LLM training. Now that we reduce images into layouts, we turn it into a next token prediction problem. This gives us a big boost.

Taesung Park@Taesung

Our layout can semantically represent input images. For example, the right side "reconstructs" the input by extracting layout and re-rendering, without seeing any pixels. The more regions you create, the better reconstruction gets.

26d · Views 10.4K · 9518
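The idea in the posts above, that reducing an image to a layout turns generation into next-token prediction, can be sketched roughly as follows. This is only an illustration: the posts do not describe Reve's actual layout format, so the `Region` fields, the tag-style token vocabulary, and the 100-bin coordinate quantization are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical labeled region; coordinates are normalized to [0, 1]."""
    label: str
    x: float
    y: float
    w: float
    h: float


def quantize(value: float, bins: int = 100) -> int:
    """Clamp a normalized coordinate to [0, 1] and map it to an integer bin."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * bins), bins - 1)


def layout_to_tokens(regions, bins: int = 100):
    """Flatten a list of regions into a flat token sequence.

    A sequence like this is the kind of target an autoregressive model
    could be trained to predict one token at a time (assumed format).
    """
    tokens = ["<layout>"]
    for region in regions:
        tokens.append("<region>")
        tokens.extend(region.label.split())
        for name, value in (("x", region.x), ("y", region.y),
                            ("w", region.w), ("h", region.h)):
            tokens.append(f"<{name}={quantize(value, bins)}>")
        tokens.append("</region>")
    tokens.append("</layout>")
    return tokens


example = [Region("red balloon", 0.25, 0.5, 0.5, 0.25)]
print(" ".join(layout_to_tokens(example)))
```

Adding more regions lengthens the sequence and describes the image more finely, which is consistent with the remark that reconstruction improves as regions are added.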

fofr@fofrAI

Here's some early tests of Reve 2 with the prompt:

> an amateur photo of fantastical realism

Two excellent new image models on the same day. What a treat.

25d · Views 9.2K · 11726

Alex Volkov @ AI Engineer@altryne

What's going on today!? So many new AI releases!

@reve has been the underdog for the longest time, one of the coolest AI image labs with not a ton of exposure!

Not only are their models great (#2 on Image Arena, above Nano Banana) but the Reve editor lets you have precise control over every aspect!

26d · Views 9.1K · 558

Robert Scoble@Scobleizer

The past two days I was at @magnific's @upscaleconf (which was quite excellent, met so many people who are building creative things with AI) and EVERYONE was talking about this new model.

24d · Views 4.8K · 337

Anastasios Nikolas Angelopoulos@ml_angelopoulos

Congratulations @Taesung @YGandelsman @reve on an awesome release!

Arena.ai@arena

Reve 2.0 has landed #2 in the Text-to-Image Arena!

Scoring 1280, this puts the latest model above Nano Banana 2, MAI-Image-2.5, and GPT-Image-1.5-High Fidelity. This is a +125pt improvement over Reve v1.5.

Congratulations to the @reve team on this major milestone!

26d · Views 3.8K · 333
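The figures in the Arena post can be checked with a little arithmetic. A score of 1280 and a +125 point gain imply that Reve v1.5 scored 1155. The second function below goes beyond what the post states: it assumes the Arena score behaves like an Elo-style rating on a 400-point logistic scale, which the post does not confirm, and uses that assumption to estimate how often the new model would be preferred over the old one in a head-to-head vote.

```python
def implied_previous_score(current: float, gain: float) -> float:
    """Score of the earlier model, given the new score and the reported gain."""
    return current - gain


def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Elo-style expected win probability for the higher-rated model.

    ASSUMPTION: the 400-point logistic scale is the classic Elo convention,
    not something stated in the Arena post.
    """
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


reve_2_0_score = 1280       # from the Arena post
gain_over_v1_5 = 125        # from the Arena post

v1_5_score = implied_previous_score(reve_2_0_score, gain_over_v1_5)
win_rate = expected_win_rate(gain_over_v1_5)

print(f"Implied Reve v1.5 score: {v1_5_score}")
print(f"Estimated head-to-head preference for 2.0 (assumed Elo scale): {win_rate:.2f}")
```

Under that assumed scale, a 125-point gap corresponds to the newer model being preferred in roughly two out of three comparisons.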

Kevin Kwok@kevinakwok

Over the last few months I've really felt how modern memes are changing. So much more ability to generate them but have to keep rerolling. Glad to now have precision editing meant for real creative purposes for making very specific memes for the groupchat

26d · Views 3.6K · 214

Reve@reve

Our independent research lab ranks top 2 on @arena Text-to-Image, ahead of Nano Banana 2 and GPT-Image-1.5.

26d · Views 1.5K · 221

Robert Scoble@Scobleizer

Interesting new model shipped today.

Hang Gao@hangg70

we made a new model for text-to-image generation and editing. the results are looking good and the leaderboard is looking strong. it turns out that nano banana 2 is not impossible to beat, which felt like the case at the beginning of the year. there are a lot of great models out there that get released often. why should you care about reve 2.0?

to me, there are mainly two reasons. one being that reve is an underdog, reasonably funded but magnitudes less than other big labs, e.g. oai, google, meta, etc. you might be curious about how we managed to make it to the top. two being that reve 2.0 is a decent model, and we as a team are willing to talk openly about some of our learnings and thoughts that could be helpful. in this post, i want to share mine on reve 2.0 and multimodal in general as a person working on it.

first things first, reve 2.0 is a pixel diffusion model with a thing that we call "layout" as the rendering representation. these two things are our research bets that turned out to work amazingly well. pixel diffusion lets us go 4k without sacrificing quality or speed. layout lets us scale better and have better control, which are two sides of the same coin. the field standard has been to use long upsampled prompts for rendering. yet this results in an awkward situation where captioners and users need to describe precise controls with text, which can be inaccurate. this inaccuracy amounts to bad reconstruction and control at test time. it gets worse with scale. and this inherent ambiguity is a curse in current multimodal generators. so what's a layout? a layout is a css of an image, which can be either defined by humans or learned by models. we end up capitalizing a lot on regions, which are good for 2D space. yet this idea naturally generalizes. it turns out to be a standard VLM mid-training task, and that's solvable in good hands. it also brings many good properties in pretraining and post-training, which i am not going to expand on. ideogram independently verified that layout is useful (released on the same day, congrats!). to be clear, these bets are not novel, but to put together a system that makes them work is (and showing it beats nano banana 2).

second, it's nice that these bets, among others, worked out. however, like in many cases, there was a long time when things were underperforming. our competitor models are great, and most likely didn't make many risky bets. it is a big pipelining and engineering problem. why should we risk it? in retrospect, the culture of our team and leadership helped a lot. our priorities didn't swing and have stayed focused during our development. the idea makes sense, the execution is good, if things don't work out it's a bug, let's go find it and try more things. by and large, reve remains a research lab with big computers. this is rare. let me tag some amazing ppl here: @Taesung @m_gharbi @Songwei_Ge @TianweiY James Hong @dima_smirnov_ @theSidlak, ... the list goes on.

third, we spent most of our time improving text-to-image and didn't do much on editing. and our arena ranks show that. to date, we are #2 on text-to-image yet #9 on image editing. it's honestly a bit embarrassing that we didn't do well in editing, as layout promises to do well. but i am confident that this will improve, as we are juggling bandwidth and resources (we are a small team, and hey, come join us!).

fourth, talking about leaderboards and the state of multimodal, i genuinely feel that the gap between labs is shrinking. compared to LLMs, multimodal gen is at least half a year to a year behind. i am talking about architectures and core pipelines. to do good multimodal, you need to do good LLMs. reve has been helped by the OSS community a lot, but we've realized we need to own our language stack. and scaling follows naturally. leaderboards, in turn, are a noisy approximation and average of the real environments that you care about in deployment. they chase scaling and generalizable post-training. reve 2.0 ended up not being driven much by leaderboard evaluation, but relying on our intuition instead.

finally, how can multimodal be more useful? this is a question that keeps me up at night. coding has found its product-market fit and is driving up societal productivity. how can multimodal do that too? to me, we are nailing a single-round rollout that leads to an infinite one. this infinite rollout will drive our digital interaction and creation. for this rollout to be good, it needs to be precise. otherwise rollout efficiency is too low for either humans or agents. we are making bets and concrete progress towards that goal, such as converting images into a css-like layout. if you are interested in this topic, i recommend @stuffyokodraws's post for a high-level digest: https://x.com/stuffyokodraws/status/2061824755813306779. the success of multimodal depends on whether or not it can find a good product-market fit. that's the top question to figure out, then it's the model. it's quite non-linear to be honest, as critical pieces are still missing. but to me it's an area worth pouring my thoughts and efforts into.

give our model a spin, try your tasks, move some boxes. in case you find any bugs, please let me know in a reply or DM. hope it can help you.

26d · Views 7.1K · 212
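To make the "a layout is a css of an image" description in the post above more concrete, here is a minimal sketch of what a CSS-like, region-based layout and a "move some boxes" edit might look like. None of this is Reve's real format or API: the `Box` and `Element` types, the `#selector` naming, and the `move` and `to_css` helpers are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Hypothetical pixel rectangle for one region."""
    left: int
    top: int
    width: int
    height: int


@dataclass(frozen=True)
class Element:
    """Hypothetical layout entry: an addressable region plus its description."""
    selector: str
    description: str
    box: Box


def move(layout, selector: str, dx: int, dy: int):
    """Return a new layout with one element shifted; all others are reused as-is."""
    if not any(e.selector == selector for e in layout):
        raise KeyError(selector)
    return [
        replace(e, box=replace(e.box, left=e.box.left + dx, top=e.box.top + dy))
        if e.selector == selector else e
        for e in layout
    ]


def to_css(layout) -> str:
    """Serialize the layout as CSS-like text, one rule per region."""
    return "\n".join(
        f'{e.selector} {{ content: "{e.description}"; left: {e.box.left}px; '
        f"top: {e.box.top}px; width: {e.box.width}px; height: {e.box.height}px; }}"
        for e in layout
    )


layout = [
    Element("#sky", "clear blue sky", Box(0, 0, 3840, 2160)),
    Element("#balloon", "a red balloon", Box(100, 50, 200, 300)),
]
edited = move(layout, "#balloon", dx=40, dy=-10)
print(to_css(edited))
```

Because the edit touches only the addressed region and leaves every other entry unchanged, repeated edits of this kind would not accumulate changes elsewhere in the description, which is the property the posts contrast with iterating on a text prompt.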

Hang Gao@hangg70

@reve We achieved this with 1/10 to 1/100 resources of our competitors. Execution is important, but research bets also matter at scale.

26d556141

Reve@reve

@arena Images as code

Images are represented as code, so every part of an image becomes addressable, editable, and manipulatable.

26d869152

Christian Cantrell@cantrell

We just launched the best 4K image model in the world—and top two overall. A company of 65 sits on the leaderboard between behemoths training with at least 10x the compute. How did we do it? Architecture. We don't generate images from prompts; we generate them from code. And we don't edit images; we edit and render code. Reve 2.0 is the most controllable image model in the world, and http://reve.com is the most powerful generative image editor.

26d570151

Reve@reve

@arena Layout based images

Every image in Reve is segmented and labeled, giving you precise control over every region and element.

26d99716

Travis Davids@MrDavids1

@reve Amazing job Reve and congrats! Been testing landscapes/editorial shots and having fun!

25d25641

Reve@reve

@arena With Reve 2.0, the model and the product were designed together from the very beginning.

Create with the best interface built for visual intelligence. http://reve.com

26d81351

Reve@reve

@bkkray You can select the number of images you'd like to generate in the preferences tab!

26d32251

David Hoang@davidhoang

@swyx Yes.

26d40741

Gadgetify@Gdgtify

Not too many models can handle my crazy prompts. Reve did well. Congrats. Prompt: 2x2 grid, do this for 4 famous events in history {[tensor_multiplication_engine]

vector_g (geometric scene) = [variable_historical_event] matrix_m (material palette) = [variable_culinary_medium]

variables: [variable_historical_event] = "$input"

[variable_culinary_medium] = "high-end japanese sushi and sashimi"

equation: vector_g ⊗ matrix_m = render_target

[mapping_constraints] the ai must execute the cross product by replacing 100% of the materials in vector_g with the ingredients of matrix_m. no cheating. no actual metal, cloth, or plastic can exist. - structural inference: the ai must logically deduce which food parts fit which structures (e.g., raw salmon slices for the module dome, nori strips for the metal landing legs, rice grains for the textured lunar soil, fish roe for rocks). - scale: macro food photography. the entire scene must be presented as a plated dish resting on a stark black ceramic dining plate.

[visual_execution] render the render_target. lighting: overhead michelin-star restaurant spotlighting. texture: high gloss on the raw fish, sticky texture on the rice. the illusion of the event must be perfect, but the edibility of the food must be undeniable.}

26d17141

Oh, Friend.@oh__friend

@swyx How about just saying it vs vague posting?

26d4185

ATMR@TheLoneWulf_WA

@reve Yo. This is crazy. Images as layers?

26d3181