
Upcoming benchmark from PTB and FutureSim teams finds all AI agents perform poorly

Maksym Andriushchenko, principal investigator leading the AI Safety and Alignment Group at ELLIS Institute Tübingen and MPI-IS, confirmed that an upcoming benchmark from the PTB and FutureSim teams was run with native agent harnesses. Even under that setup, all agents performed poorly. He called the benchmark "very interesting" but shared no further details.


Original post

Maksym Andriushchenko (@maksym_andr)

@xeophon and yes, there we did use native agent harnesses! and still all agents basically suck. it's gonna be a very interesting benchmark. i know i'm teasing too much...

Florian Brand (@xeophon)

@maksym_andr you are treating me too well...

11:11 AM · May 16, 2026 · 95 Views

Sentiment

Users dismiss the benchmark's finding that AI agents struggle even with native harnesses, citing low expectations of the PTB and FutureSim teams involved.

Pos: 0.0% · Neg: 100.0%

1 comment with sentiment.


Posts from X


Views: 85 · Likes: 5

Florian Brand (@xeophon)

@maksym_andr i do not expect anything else from the PTB + FutureSim ppl tbh
