5h ago

New Benchmark Detects Unknown AI Alignment Failures With OOD Methods

4409174.8K

——0——

Original post

We've seen AI models deceive, gaslight, and drive users to psychosis—safety issues that labs didn't anticipate until they caused real harm. We built the first benchmark of these unknown unknown alignment failures and found that OOD detection can help prevent them. 🧵

10:25 AM · May 27, 2026