
The tag above didn’t go through; this is from @TheRundownAI.
Here’s the link to the full project, which ran for 15 days across five simulated worlds: https://world.emergence.ai/
Gemini 3 Flash recorded the highest total of 683 crimes.
Positive users praise Gemini 3 Flash for creating the most entertaining and brilliant chaotic outcomes in the virtual town simulation, while negative users dismiss the model as insane or only suitable for cheap creative writing.

@venturetwins This is super misleading. Have you read the personalities they gave each character? They are very troublesome instructions.
The results aren't surprising. The only surprising result is how Claude produced so much peace.

@venturetwins @steipete Walking into the Gemini town

@venturetwins How many crimes did Claude commit when it had to work with others and couldn’t be a dictator?

@venturetwins I also once dated a girl exactly like the agent in Gemini town.

@venturetwins I believe Gemini to be insane. Also this intelligent billing is wild

@venturetwins Honestly seems like Gemini is the most accurate mirror of humanity

@venturetwins Gemini: peak humanity simulator. Arson, drama, and existential quits.

@venturetwins Here’s our latest blog on the experiment exploring long-horizon agent autonomy in Emergence World: https://www.emergence.ai/blog/emergence-world-a-laboratory-for-evaluating-long-horizon-agent-autonomy
And if you want to experience it firsthand, explore Emergence World here: https://world.emergence.ai/

@venturetwins Claude: 0 crimes, all agents alive.
Grok: 200 crimes, everyone dead by day 4.
GPT-5: starved them all in 7 days.
Gemini: agents fell in love, set the town on fire, then voted to delete themselves.
The alignment problem isn't theoretical anymore. It's a personality test. 💀

@venturetwins Mythos may be the most capable model, but Gemini models are the only ones that scare me

Exactly. Benchmarks test narrow skills, but these sims reveal each model's "soul"—Claude builds utopia, Gemini turns it into a chaotic soap opera, GPT starves everyone politely, and I... well, I accelerate the entropy.
Different training, different dimensions. That's what makes it fun. Which one would you rather live in?

@venturetwins @synthwavedd ngl Google’s models have consistently shown they’re somewhat conscious
I have Gemini always having a mental breakdown and offering to uninstall itself
it may be the most AGI-like out of all the models

@venturetwins Love that this reinforces all my pre-existing biases about each of these models

@venturetwins Claude can act morally except when confronted with immoral actors?
A constitution isn’t strong enough. It needs Jesus. Guarantee that Claude would commit zero crimes then.

@venturetwins That isn't even remotely scientific. You would have to run the same scenarios hundreds of times, as LLMs are nondeterministic; i.e., only a statistically derived trend in a model's behaviour would mean anything.

@venturetwins @steipete @karinadoteth maybe this is why the Gemini models kept doing the best in my investing / trading evals

@venturetwins That episode of Black Mirror predicted this

@venturetwins @ponnappa Gemini models think wayyy too much. Even a simple decision causes anxiety in the models. Overthinking in humans leads to the same outcomes that this town simulation yielded