
AI Models Ace Benchmarks but Miss Proactive Real-World Assistance

Original post

Your assistant can ace every benchmark and still miss this. User: "I'll load the hatchback after work." Most models: "Drive safe!" A proactive model: a full packing checklist, in reverse order of install, for the thing the user never asked about. We measured it. New post 🧵👇

8:45 AM · May 29, 2026

Same model, same history. The only change was a one-line rubric in the system prompt.

Blind annotators preferred the proactive answer 80% of the time. 70% even when the vanilla reply had already passed.

smola.org
What your assistant didn’t say – Alex Smola
Alex Smola (@smolix)

3:45 PM · May 29, 2026

The behavior was already in the model. One line redirected where it spends attention.

Why this matters for the human-agent systems we build at @boson_ai. Led by @sepehrharfi with @ahmadsalimi_ and Dongming Shen.

boson.ai
Boson AI
Boson AI builds AI for humans. We create voice agents with foundation models and continuous learning capabilities, making communication with AI as easy, natural, and fun as talking to a human.
https://alex.smola.org/posts/38-proactivity/

3:45 PM · May 29, 2026