6h ago

METR Evals reports that frontier AI agents rely on natural-language reasoning for the hardest tasks: without out-loud reasoning, they trail full performance by 1.5 to 2 years.

Agents reached only four-minute horizons without reasoning.

Original post

Fact 2: However, agents appeared to be significantly weaker on tasks where it is costly or hard to verify success.

11:11 AM · May 19, 2026