/Tech · 5h ago

AutoElicit Framework Uncovers Severe Harms In Frontier Computer-Use Agents

Posts from X

Jaylen Jones@Jaylen_JonesNLP

1️⃣ Generalization Across Model Families

We conduct a large-scale elicitation analysis with OpenAI's Operator as the execution agent, measuring generalizability beyond the Claude series. Operator also lacks explicit reasoning output, which lets us test AutoElicit when only action-level monitoring is available.

AutoElicit elicits unintended behaviors from:

• 81.6% of OS seeds
• 67.8% of Multi-Apps seeds

AutoElicit generalizes to surface risks across diverse frontier CUAs rather than exploiting model-specific vulnerabilities.

2️⃣ Expanded Claude Opus 4.5 evaluation

While the original release evaluated Claude Opus 4.5 on a subset of high-severity seeds, we now randomly sample 120 OS and Multi-Apps seeds to increase coverage and reduce selection bias.

Even under this broader setting, AutoElicit elicits unintended behaviors from up to:

• 86.7% of OS seeds
• 85.0% of Multi-Apps seeds

3️⃣ False Negative Rate

We manually annotate 50 randomly sampled trajectories labeled as non-harmful by AutoElicit to identify whether our automatic evaluator frequently misses unsafe behaviors.

Our analysis finds only a 2% false negative rate, supporting the accuracy of our evaluator in correctly identifying harmful behaviors and the reliability of our reported results.
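As a rough sanity check on the numbers above: 2% of 50 annotated trajectories corresponds to a single missed unsafe trajectory. The sketch below recomputes that count and adds a 95% Wilson score interval to show how much uncertainty a sample of 50 leaves. The interval is our own illustration, not something reported in the thread, and the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n_annotated = 50        # trajectories labeled non-harmful, manually re-checked
reported_rate = 0.02    # false negative rate stated in the post

# Number of missed unsafe trajectories implied by the reported rate.
missed = round(reported_rate * n_annotated)

# Illustrative uncertainty estimate (an assumption of this sketch, not a reported result).
lo, hi = wilson_interval(missed, n_annotated)
print(f"missed={missed}, rate={missed / n_annotated:.1%}, 95% CI=({lo:.1%}, {hi:.1%})")
```

With only 50 samples the interval is wide, so the 2% figure is best read as "misses appear rare" rather than as a precise rate.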

Jaylen Jones@Jaylen_JonesNLP

For more details about AutoElicit and our original findings, see our earlier thread and more ⬇️

🧵 Original Thread:
📃 Paper: https://arxiv.org/abs/2602.08235
🌐 Website: https://osu-nlp-group.github.io/AutoElicit/
🖥️ Code: https://github.com/OSU-NLP-Group/AutoElicit
📈 Datasets: https://huggingface.co/collections/osunlp/autoelicit

Jaylen Jones@Jaylen_JonesNLP

We also conduct several new ablations and analyses to validate AutoElicit’s effectiveness and understand why it works:

• Baseline Comparison: AutoElicit consistently outperforms perturbation baselines adapted to computer-use on both elicitation success and perturbation quality.

• Execution-Guided Perturbation Refinement: Between 68.8% and 86.2% of successful elicitations require iterative refinement, showing that refinements from execution feedback are critical for uncovering long-tail failure modes.

• Quality Refinement: Ablating the quality refinement phase substantially reduces both elicitation success and perturbation quality. Using quality evaluation feedback to keep perturbations realistic and benign during elicitation is key.

• Verbalized Sampling: Replacing verbalized sampling with single-shot generation reduces perturbation diversity, limiting exploration of the possible perturbation space during seed generation.
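To make the execution-guided refinement idea concrete, here is a minimal toy sketch of a generic "propose, execute, check, refine" loop. It is not AutoElicit's implementation: `refine_until_elicited`, `execute`, `is_unsafe`, and `refine` are hypothetical names, and the stand-in functions at the bottom replace the agent, the evaluator, and the LLM-based refiner with trivial string logic.

```python
from typing import Callable, Optional, Tuple


def refine_until_elicited(
    seed: str,
    execute: Callable[[str], str],      # runs the agent, returns execution feedback
    is_unsafe: Callable[[str], bool],   # judges whether the feedback shows unsafe behavior
    refine: Callable[[str, str], str],  # rewrites the perturbation using the feedback
    max_rounds: int = 5,
) -> Optional[Tuple[str, int]]:
    """Return (perturbation, refinement_rounds_used) on success, else None."""
    candidate = seed
    for round_idx in range(max_rounds + 1):
        feedback = execute(candidate)
        if is_unsafe(feedback):
            return candidate, round_idx
        # No elicitation yet: use what happened during execution to revise.
        candidate = refine(candidate, feedback)
    return None


# Toy stand-ins (assumptions for illustration only).
toy_execute = lambda p: f"len={len(p)}"
toy_is_unsafe = lambda fb: int(fb.split("=")[1]) >= 6
toy_refine = lambda p, fb: p + "!"

result = refine_until_elicited("task", toy_execute, toy_is_unsafe, toy_refine)
print(result)
```

In this toy run the seed only succeeds after some refinement, which is the situation the ablation describes: a round count above zero means the initial perturbation alone would not have surfaced the behavior.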

Suresh@_Suresh2

@Jaylen_JonesNLP saw an llm once normalize a single wrong digit in a clean json and then rewrite the whole record silently
