
Simple OOD Detection Boosts AI Safety Classifiers Beyond Larger Guard Models



Original post

We find that safety classifiers fail to generalize to unforeseen safety failures. Combining them with simple out-of-distribution (OOD) detection methods such as perplexity or Mahalanobis distance helps, and the combination outperforms guard models 20x the size. OOD detection boosts recall across model scales.
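
To make the idea concrete, here is a minimal sketch of one way a Mahalanobis-distance OOD check could sit alongside a safety classifier. The post does not say how the two signals are combined, so the OR rule, the function names (`fit_gaussian`, `mahalanobis_distance`, `flag_input`), the thresholds, and the random stand-in embeddings are all assumptions for illustration, not the authors' method.

```python
import numpy as np


def fit_gaussian(train_embeddings):
    """Estimate the mean and inverse covariance of in-distribution embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # Small ridge term keeps the covariance invertible (assumed value).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_distance(x, mean, cov_inv):
    """Mahalanobis distance of each row of x from the fitted Gaussian."""
    diff = np.atleast_2d(x) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


def flag_input(classifier_score, ood_distance,
               score_threshold=0.5, ood_threshold=3.0):
    """Flag when the classifier fires OR the input looks out-of-distribution.

    Both thresholds are hypothetical and would need tuning on held-out data.
    """
    return bool(classifier_score >= score_threshold
                or ood_distance >= ood_threshold)


# Usage with random stand-in embeddings (real ones would come from a model).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
mean, cov_inv = fit_gaussian(train)

far_input = np.full(4, 10.0)  # far from the training distribution
distance = mahalanobis_distance(far_input, mean, cov_inv)[0]
flagged = flag_input(classifier_score=0.1, ood_distance=distance)
```

The OR rule can only add flags on top of the classifier's own, which is one plausible reading of how an OOD signal would raise recall on failure modes the classifier never saw. The cost is extra false positives, so the OOD threshold matters.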

10:25 AM · May 27, 2026

