(DeepSeek V3.2, GLM 4.7, and Kimi K2.5) prompted to resolve issues in real public code repositories. All trajectories include interleaved reasoning and tool calls.
2/4
Artificial Analysis just announced AgentPerf, the industry’s first agentic AI benchmark. The benchmark uses several real-world agentic usage trajectories and runs them through the OpenCode agentic harness with three top open-source models, reasoning enabled
1/4

