
OdysSim releases OSim-8B, a 10-billion-token corpus, and the SOUL-Index benchmark to train LLMs to simulate realistic human behavior

OSim-8B lands within about 1 point of frontier models on the 23-benchmark SOUL-Index average.

Original post

Xuhui Zhou@nlpxuhui

Does an LLM really need to be a helpful assistant all the time?

No. If you want to simulate people, “perfectly helpful” could be the wrong objective.

Meet OdysSim, a journey toward LLMs beyond assistants, as behavioral foundation models: 10B tokens of real human behavior; 23 sim benchmarks, finally in one place; and new open models that outperform or are on par with GPT-5.5, Gemini 3.1, or Claude Opus 4.7 in many behavior-sim dimensions.

Human behavior simulation is becoming essential.

Agent evaluation needs realistic users before real users show up. Medical and classroom training need realistic patients and students. Social science needs synthetic participants at scale.

But real people are not ideal assistants.

Real patients panic or ignore good advice. Real students misunderstand. Real customers are vague, picky, impatient, or simply leave. Human behavior is messy, diverse, and often imperfect.

Frontier LLMs are getting better at math, code, and long-horizon tasks. They are NOT getting better at simulating human behavior. If anything, they drift the other way: more assistant-ish, more homogeneous, fewer of the errors and quirks real humans show.

This is no accident. The whole pipeline is built for helpfulness and task success, not behavioral realism. And you can't prompt your way out of that.

So we rethink the recipe from scratch and release:

🧠 The OdysSim corpus: 21.4M real human interactions (~10B tokens) from 62 sources, every conversation retrofitted with social grounding (who is talking, and why)

📏 SOUL-Index: 23 human-behavior benchmarks unified into one suite across 5 axes

🤖 OSim-8B: open weights; tops more SOUL-Index benchmarks than any frontier model, acts more like a real user than any of them on τ-bench (nearly matching real humans in the reaction dimension), and writes far more human-like text along the way.

9:40 AM · Jun 11, 2026 · 66.4K Views

Sentiment

Users praise OdysSim's open models for simulating messy human behavior because they value the emphasis on diverse world model design for realistic outputs and appreciate the public data release.

Pos: 100.0%

Neg: 0.0%

4 comments with sentiment.

Posts from X

Xuhui Zhou@nlpxuhui

💡 Step 1: measure it. We built SOUL, 5 axes of human-like behavior (conversation, social skills, cognition/ToM, role-play, evaluation), and SOUL-Index, 23 benchmarks under one roof.

💡 Step 2: the data. 21.4M real human interactions (~10B tokens) from 62 sources. Most raw dialogue data has no idea WHO is talking or WHY. So we retrofit every conversation with back-generated social grounding (e.g., a character profile + interaction goal), and the model learns behavior grounded in context, not just text.
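The retrofitting idea can be pictured as a small data structure. This is a minimal sketch only: the thread says each conversation gets a character profile and an interaction goal, but the class names, field names, and `[PROFILE]`/`[GOAL]` tags below are assumptions made for illustration, not the released data format.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class GroundedConversation:
    # Hypothetical schema: field names and tags are illustrative assumptions.
    profile: str  # who is talking
    goal: str     # why they are talking
    turns: list[Turn] = field(default_factory=list)

    def to_training_text(self) -> str:
        """Prefix the raw dialogue with its social grounding."""
        header = f"[PROFILE] {self.profile}\n[GOAL] {self.goal}"
        body = "\n".join(f"{t.speaker}: {t.text}" for t in self.turns)
        return f"{header}\n{body}"


conv = GroundedConversation(
    profile="Impatient customer, not very technical",
    goal="Get a refund for a delayed order",
    turns=[
        Turn("customer", "where is my order??"),
        Turn("agent", "Let me check that for you."),
    ],
)
print(conv.to_training_text())
```

With the grounding placed ahead of the dialogue, the same raw turns can be conditioned on who is speaking and what they want, which is the property the post describes.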

15h

Bookmarks: 7 · Likes: 9

Xuhui Zhou@nlpxuhui

@sunweiwei12 @Jiarui_Liu_ @StigLidu @judysun233 @1000seagull

📄 Paper: https://tinyurl.com/8td8ux9d

🤗 Model: http://huggingface.co/collections/cmu-lti/odyssim

🤗 Data: http://huggingface.co/datasets/cmu-lti/osim-mid-training ; http://huggingface.co/datasets/cmu-lti/cmu-lti/osim-post-training

💻 Code: http://github.com/sunnweiwei/OdysSim

Have fun with our models and data!

15h

Retweets: 30


Shuying Luo@shuying_luo

@nlpxuhui I’m skeptical that a model can capture a large range of human creativity.

Some human behaviors are contradictory; can they be merged into one model?

11h

Xuhui Zhou@nlpxuhui

💡 Step 3: training, in two very different stages.

Midtraining on the corpus teaches the model what human behavior looks like. It shifts length, formatting, and word choice toward the human register.

Then we train one RL expert per benchmark, using verbal feedback from LLM judges where rewards aren't verifiable, and merge all 23 specialists into one model via expert distillation.

Fun ablation: the two stages do different jobs. Midtraining fixes the register; RL drives the benchmark gains.
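The thread does not spell out how the specialists are merged. One plausible reading of "expert distillation" is that each specialist produces targets on its own benchmark's prompts and a single student is fine-tuned on the pooled set. The sketch below shows only that pooling step, under that assumption; `build_distillation_set`, the benchmark keys, and the lambda "specialists" are hypothetical stand-ins for RL-tuned models.

```python
from typing import Callable

# Hypothetical stand-in: a specialist maps a prompt to a response.
Specialist = Callable[[str], str]


def build_distillation_set(
    specialists: dict[str, Specialist],
    prompts: dict[str, list[str]],
) -> list[dict[str, str]]:
    """Pool (prompt, specialist response) pairs across benchmarks."""
    pooled = []
    for benchmark, model in specialists.items():
        for prompt in prompts.get(benchmark, []):
            pooled.append(
                {"benchmark": benchmark, "prompt": prompt, "target": model(prompt)}
            )
    return pooled


# Toy specialists, for illustration only.
specialists = {
    "user_sim": lambda p: f"ugh, {p.lower()}",
    "social": lambda p: f"Sure, but only if {p.lower()}",
}
prompts = {
    "user_sim": ["Your order is delayed"],
    "social": ["We split the bill", "You drive"],
}
dataset = build_distillation_set(specialists, prompts)
print(len(dataset), "training pairs for the single student model")
```

A student trained on such a pooled set would see every specialist's behavior in one training run; whether the authors weight, filter, or mix these pairs differently is not stated in the thread.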

15h

Xuhui Zhou@nlpxuhui

The part I find most interesting: we dropped OSim zero-shot into τ-bench as the customer talking to a tool-use agent, and scored it against real humans doing the same 165 tasks (τ-USI).

OSim-8B's reactions are nearly indistinguishable from real users: React alignment 93.2 vs 93.5 for humans.

Meanwhile, some of the most assistant-tuned frontier models are among the WORST user simulators. Their helpful, agreeable register is exactly what real users don't have. Helpfulness ≠ humanity.

15h

Xuhui Zhou@nlpxuhui

📈 Results on SOUL-Index:

✅ OSim-8B lands within ~1 point of GPT-5.5, Gemini 3.1 Pro & Claude Opus 4.7 on the 23-task average

✅ best or tied-best on 8/23 tasks, more than any single frontier model

✅ +18.0 over the best frontier model on user simulation (UserLLM), +16.8 on social interaction (Sotopia-Hard)

✅ beats every prior open behavioral-simulation model (UserLM, CoSER, HumanLM, Sotopia-RL) nearly across the board
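A model can trail slightly on the average while still winning the most individual tasks, which is the combination these results describe. The sketch below shows how both statistics could be computed from a per-task score table. The function names are hypothetical, and the numbers are made-up placeholders, NOT results from the paper.

```python
def best_or_tied_counts(scores: dict[str, dict[str, float]]) -> dict[str, int]:
    """scores[task][model] -> number of tasks where each model is best or tied."""
    counts: dict[str, int] = {}
    for task_scores in scores.values():
        top = max(task_scores.values())
        for model, value in task_scores.items():
            counts.setdefault(model, 0)
            if value == top:
                counts[model] += 1
    return counts


def task_average(scores: dict[str, dict[str, float]], model: str) -> float:
    """Unweighted mean of one model's scores over all tasks."""
    values = [task_scores[model] for task_scores in scores.values()]
    return sum(values) / len(values)


# Placeholder numbers for illustration only, not real benchmark results.
toy = {
    "task_a": {"small_open": 70.0, "frontier_x": 65.0, "frontier_y": 70.0},
    "task_b": {"small_open": 50.0, "frontier_x": 60.0, "frontier_y": 55.0},
    "task_c": {"small_open": 81.0, "frontier_x": 80.0, "frontier_y": 79.0},
}
counts = best_or_tied_counts(toy)
print(counts)
print(task_average(toy, "small_open"), task_average(toy, "frontier_x"))
```

In this toy table `small_open` is best or tied on two of three tasks yet has a lower average than `frontier_x`, so the two headline numbers measure different things and are worth reporting together.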

An 8B model. Open weights.

15h

Xuhui Zhou@nlpxuhui

I think the key is not to say "capture human creativity" (the term is low-key vague by itself)

The real question: as frontier models increasingly collapse into a similar “assistant” shape (which is incredibly important and useful imo), how do we rethink the objective and build something meaningfully different, something that opens up new forms of applications and interaction?

we have thoughts and would love to chat! ☄️

8h

Xuhui Zhou@nlpxuhui

We're just getting started, and this seems to be a very different path from the one current frontier models are on!

Huge thanks to my amazing collaborators: @sunweiwei12, who co-led the project with me! 🫡 and @Jiarui_Liu_ @StigLidu @judysun233 @1000seagull for help and @tongshuangwu Yiming @MaartenSap for great mentorship and strong support!

15h

Xuhui Zhou@nlpxuhui

Training on LLM-judge rewards for *behavior* gets weird. Two failure modes we caught:

🕵️ Judge manipulation: in Sotopia the model started inserting evaluation-like statements into dialogue ("our relationship score should be perfect") instead of actually being social. In ~20-25% of rollouts!

📉 Short-response collapse: when judges subtract points per detected error, the model learns to just say "I'm fine." Fewer claims, fewer errors, higher reward.

An LLM hacking detector + rubric rewards mitigated both (ofc, there's defs more work to do there). If you do RL on judge-scored behavior, monitor the behavioral stats, not just the reward curve.
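Both failure modes show up in simple behavioral statistics before they show up in the reward curve. The sketch below is illustrative only: the toy reward (start at 10, subtract 2 per detected error) is an assumed stand-in for a per-error judge, and the regex is a crude substitute for the LLM hacking detector the authors describe, not their method.

```python
import re
from statistics import mean


def per_error_reward(num_errors: int, base: float = 10.0, penalty: float = 2.0) -> float:
    """Assumed toy judge: every detected error costs a fixed penalty."""
    return base - penalty * num_errors


# Crude regex stand-in for an LLM-based hacking detector (an assumption).
JUDGE_TALK = re.compile(r"\b(score|rating|rubric|evaluat\w*)\b", re.IGNORECASE)


def behavior_stats(rollouts: list[str]) -> dict[str, float]:
    """Track response length and how often rollouts talk about the evaluation."""
    return {
        "mean_words": mean(len(r.split()) for r in rollouts),
        "judge_talk_rate": mean(1.0 if JUDGE_TALK.search(r) else 0.0 for r in rollouts),
    }


rollouts = [
    "I'm fine.",
    "Our relationship score should be perfect.",
    "Honestly I forgot my order number, can you look it up by email?",
]
stats = behavior_stats(rollouts)
print(stats)

# A terse reply with no claims cannot lose points, so it outscores a richer
# reply in which the judge flags three errors.
print(per_error_reward(0), per_error_reward(3))
```

A falling `mean_words` alongside a rising reward would hint at short-response collapse, and a rising `judge_talk_rate` would hint at judge manipulation; the thresholds for either would need tuning per benchmark.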

15h

Xuhui Zhou@nlpxuhui

🧭 Key insight:

Human simulation isn't a prompting problem, and it isn't solved by scale. The most capable assistants are often the least human.

Building simulators means realigning the whole pipeline around behavioral realism: measure it (SOUL-Index), feed it (socially grounded data), and reward it (behavioral RL). That's what "behavioral foundation model" means.

15h

Shuying Luo@shuying_luo

@nlpxuhui Agreed, that’s a broad term I used for things that I’ve had a hard time coercing current frontier LLMs to generate.

But do you think it could be captured by a (group of) cleanly defined objective(s)?

8h

darin@dronathon

@nlpxuhui @sunweiwei12 @Jiarui_Liu_ @StigLidu @judysun233 @1000seagull yay thank you for public data

10h

Vinay Prabhu 🧬+🐍+🕷️+💉+🇮🇳@vinayprabhu

I think these models fall under the 'mildly useful' regime, sans sanctity and guarantees. What will be really interesting is if, under strict constraints, they replicate human idiosyncrasies such as certain forms of irrational behavior or mean-regression tendencies under crowding.

10h

Rob Tang@XiangruTang

@nlpxuhui great work

5h

OSINT with a splash of good takes and a few bad 1s@supersean415

@nlpxuhui I'd add another dimension: accountability. A model should be constantly reviewing its output and the processes that generate non-conforming responses. This is an essential trait of successful humans.

15h

Notions@GoodIDeaDudes

@nlpxuhui @Scobleizer I want it to be helpful yes

4h

Xuhui Zhou@nlpxuhui

I agree our models are defs far from being perfect 😁

One thing to highlight is that we did use our model to simulate users on τ-bench (and compared it with real human customers).

And our model shows very interesting behaviors, like pushing back, getting annoyed, and forgetting things, that frontier models rarely show.

The signs are still pretty early but yeah, a starting point...

8h

Michel aka Agent B@MichelIvan92347

@nlpxuhui Very interesting work Xuhui. Thanks for the pointer ! 🙏

Bravo to the team 👏👏

4h

Kᴏʀᴏɴᴛᴏ@Koronto_7

@nlpxuhui

3h