5h ago

New Benchmark Detects Unknown AI Alignment Failures With OOD Methods

0
Original post

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

10:25 AM · May 27, 2026 View on X
New Benchmark Detects Unknown AI Alignment Failures With OOD Methods · Digg