AI Models Ace Benchmarks but Miss Proactive Real-World Assistance
Same model, same history. The only change was a one-line rubric in the system prompt.
Blind annotators preferred the proactive answer 80% of the time. 70% even when the vanilla reply had already passed.
Your assistant can ace every benchmark and still miss this. User: "I'll load the hatchback after work." Most models: "Drive safe!" A proactive model: a full packing checklist, in reverse order of install, for the thing the user never asked about. We measured it. New post 🧵👇
The behavior was already in the model. One line redirected where it spends attention.
Why this matters for the human-agent systems we build at @boson_ai. Led by @sepehrharfi with @ahmadsalimi_ and Dongming Shen.
Same model, same history. The only change was a one-line rubric in the system prompt. Blind annotators preferred the proactive answer 80% of the time. 70% even when the vanilla reply had already passed. https://alex.smola.org/posts/38-proactivity/