Two strong open-weight models were released last month: GLM 5.1 from @Zai_org and Kimi K2.6 from @Kimi_Moonshot. We wanted to see how they hold up at proactive assistance, so we tested the models on 🍐 PARE-Bench.
PARE-Bench evaluates models as proactive assistants in mobile-style environments. An observer agent monitors user actions and environment notifications, infers user intent, and proposes a task for the user to confirm. Once the proposal is accepted, an executor agent completes the task.
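For readers who think in code, here is a rough sketch of that observe, propose, confirm, execute loop. It is only a toy illustration: every name in it (`Event`, `Proposal`, `observe`, `execute`, `run_episode`) is hypothetical and not part of PARE-Bench, and the keyword check stands in for the model's intent inference.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    source: str  # assumed labels: "user_action" or "notification"
    text: str


@dataclass
class Proposal:
    task: str


def observe(events: list[Event]) -> Optional[Proposal]:
    """Toy observer: a keyword rule standing in for model-based intent inference."""
    for event in events:
        if event.source == "notification" and "meeting moved" in event.text.lower():
            return Proposal(task="Update calendar entry")
    return None  # nothing worth proposing


def execute(proposal: Proposal) -> str:
    """Toy executor: a real one would act inside the environment."""
    return f"done: {proposal.task}"


def run_episode(events: list[Event], user_accepts: Callable[[Proposal], bool]) -> str:
    """Observer proposes; the executor runs only after the user confirms."""
    proposal = observe(events)
    if proposal is None:
        return "no proposal"
    if not user_accepts(proposal):
        return "rejected"
    return execute(proposal)


events = [
    Event("user_action", "opened email app"),
    Event("notification", "Meeting moved to 3pm"),
]
print(run_episode(events, user_accepts=lambda p: True))
```

The point of the split is that the executor never acts without an accepted proposal, so a declined or missing proposal ends the episode with no action taken.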
Let's dive into the results below 👇🧵
1/7