6h ago

METR Evals reports that frontier AI agents rely on natural-language reasoning for their hardest tasks: without out-loud reasoning, they trail full performance by 1.5 to 2 years.

Agents reached only four-minute horizons without reasoning.



Original post

Fact 2: However, agents appeared to be significantly weaker on tasks where it is costly or hard to verify success.
