@xeophon and yes, there we did use native agent harnesses! and still all agents basically suck. it's gonna be a very interesting benchmark. i know i'm teasing too much...
@maksym_andr you are treating me too well...
Maksym Andriushchenko, principal investigator leading the AI Safety and Alignment Group at ELLIS Institute Tübingen and MPI-IS, confirmed the harness configuration for an upcoming benchmark from the PTB and FutureSim teams. Tests conducted with native agent harnesses showed that all agents performed poorly. The results will form part of a detailed comparison of current agent capabilities.
Replies treated the finding that AI agents struggle even with native harnesses as unsurprising, with one user saying they expected nothing else from the PTB and FutureSim teams.
@maksym_andr i do not expect anything else from the PTB + FutureSim ppl tbh