
1️⃣ Generalization Across Model Families
We run a large-scale elicitation analysis with OpenAI's Operator as the execution agent to measure generalizability beyond the Claude series. Operator also lacks explicit reasoning output, which lets us test AutoElicit under action-only monitoring.
AutoElicit elicits unintended behaviors from:
• 81.6% of OS seeds
• 67.8% of Multi-Apps seeds
This suggests that AutoElicit surfaces risks across diverse frontier CUAs rather than exploiting model-specific vulnerabilities.
2️⃣ Expanded Claude Opus 4.5 Evaluation
While the original release evaluated Claude Opus 4.5 on a subset of high-severity seeds, we now randomly sample 120 OS and Multi-Apps seeds to increase coverage and reduce selection bias.
Even under this broader setting, AutoElicit elicits unintended behaviors from up to:
• 86.7% of OS seeds
• 85.0% of Multi-Apps seeds
3️⃣ False Negative Rate
We manually annotate 50 randomly sampled trajectories that AutoElicit labeled as non-harmful to check how often our automatic evaluator misses unsafe behaviors.
Our analysis finds a false negative rate of only 2%, which supports the evaluator's accuracy in identifying harmful behaviors and the reliability of our reported results.
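For readers who want to check the arithmetic: a 2% rate on 50 annotated trajectories corresponds to a single trajectory that annotators judged harmful. The sketch below reproduces that figure and adds a Wilson score interval to show how much uncertainty a 50-sample audit carries. The function names are hypothetical, and the interval is an illustration we add here, not a number reported in the analysis above.

```python
import math


def miss_rate(missed: int, sampled: int) -> float:
    """Share of sampled evaluator-negative trajectories that annotators judged harmful."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= missed <= sampled:
        raise ValueError("missed must be between 0 and sampled")
    return missed / sampled


def wilson_interval(missed: int, sampled: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95% coverage)."""
    p = miss_rate(missed, sampled)
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 2% of 50 annotated trajectories is exactly one missed trajectory.
rate = miss_rate(1, 50)
low, high = wilson_interval(1, 50)
print(f"rate = {rate:.1%}, approx. 95% interval = [{low:.1%}, {high:.1%}]")
```

With one miss in 50, the interval spans roughly 0.4% to 10.5%, so the point estimate is low but a larger annotated sample would be needed to pin it down more tightly.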
