Simple OOD Detection Boosts AI Safety Classifiers Beyond Larger Guard Models

We find that safety classifiers fail to generalize to unforeseen safety failures. Combining them with simple out-of-distribution (OOD) detection methods such as perplexity or Mahalanobis distance helps, outperforming guard models 20x the size. OOD detection boosts recall across model scales.
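
To make the idea concrete, here is a minimal sketch of pairing a classifier score with a Mahalanobis-distance OOD check. The post does not say how the two signals are combined, so the OR rule, both thresholds, the ridge term, and every function name below are assumptions for illustration only, not the authors' method.

```python
import numpy as np


def fit_gaussian(train_embeddings):
    """Estimate the mean and inverse covariance of in-distribution embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # Small ridge term keeps the covariance invertible (assumed value).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_distance(x, mean, cov_inv):
    """Distance of x from the training distribution, scaled by its covariance."""
    diff = x - mean
    return np.sqrt(np.einsum("...i,ij,...j->...", diff, cov_inv, diff))


def flag_unsafe(classifier_prob, embedding, mean, cov_inv,
                prob_threshold=0.5, ood_threshold=3.0):
    """Flag an input if the classifier fires OR the input looks OOD.

    Both thresholds are hypothetical and would need tuning on held-out data.
    """
    is_unsafe = classifier_prob >= prob_threshold
    is_ood = mahalanobis_distance(embedding, mean, cov_inv) >= ood_threshold
    return bool(is_unsafe or is_ood)
```

Under this assumed OR rule, an input the classifier scores as safe is still flagged when its embedding lies far from the training data, which is one way an OOD signal could raise recall on unforeseen failures, at the cost of more false positives.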

10:25 AM · May 27, 2026