METR evaluation shows AI agents autonomously completing, inside companies, real engineering projects that would take human experts multiple weeks, on verifiable tasks like vulnerability discovery.
MirrorCode-Early beat prior benchmarks for 2026 models.