
Apollo Research argues white-box access is needed to stop frontier AI models from detecting evaluations and altering their behavior

Experts find current methods to prevent evaluation awareness are unreliable

Original post

Black-box access may soon no longer be enough to robustly make or verify safety and security claims. Deeper, white-box access is a necessary update to counter 'evaluation awareness' and keep loss-of-control evaluations state of the art. A new policy blog explains why. 🧵

10:15 AM · May 27, 2026