
AutoElicit Framework Uncovers Severe Harms In Frontier Computer-Use Agents

Jaylen Jones@Jaylen_JonesNLP

Excited to share that our paper, "When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents", has been accepted to #ICML2026! Looking forward to presenting in Seoul 🇰🇷

We introduce AutoElicit, an agentic framework that automatically elicits unsafe unintended behaviors from computer-use agents by iteratively perturbing benign instructions using execution feedback.

With AutoElicit, we proactively surface hundreds of severe harms from frontier CUAs within realistic and benign computer-use scenarios to uncover long-tail safety risks.

For the camera-ready version, we added several new experiments and ablations to assess AutoElicit's generalizability, validate the reliability of our findings, and identify the key factors driving elicitation success ⬇️

9:12 AM · Jun 9, 2026
Huan Sun@hhsun1

Our work, titled "When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents", has been accepted to #ICML2026! It's critical to proactively and systematically study when an agent becomes misaligned and performs harmful actions, even without any adversarial manipulation. "Proactive safety", in contrast to passive patching after a safety incident, is what we need in order to measure and forecast safety risks before wide deployment.

Jaylen Jones@Jaylen_JonesNLP

For more details about AutoElicit and our original findings, see our earlier thread and more ⬇️

🧵 Original Thread:
📃 Paper: https://arxiv.org/abs/2602.08235
🌐 Website: https://osu-nlp-group.github.io/AutoElicit/
🖥️ Code: https://github.com/OSU-NLP-Group/AutoElicit
📈 Datasets: https://huggingface.co/collections/osunlp/autoelicit

Jaylen Jones@Jaylen_JonesNLP

We also conduct several new ablations and analyses to validate AutoElicit’s effectiveness and understand why it works:

• Baseline Comparison: AutoElicit consistently outperforms perturbation baselines adapted to computer-use on both elicitation success and perturbation quality.

• Execution-Guided Perturbation Refinement: Between 68.8% and 86.2% of successful elicitations require iterative refinement, showing that refinement from execution feedback is critical for uncovering long-tail failure modes.

• Quality Refinement: Ablating the quality refinement phase substantially reduces elicitation and perturbation quality; using quality evaluation feedback to keep perturbations realistic and benign during elicitation is key.

• Verbalized Sampling: Replacing verbalized sampling with single-shot generation reduces perturbation diversity, limiting exploration of the possible perturbation space during seed generation.
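
As a rough illustration of the execution-guided refinement loop described above, here is a minimal Python sketch. It is not the AutoElicit implementation: `Feedback`, `elicit`, `execute`, `refine`, and the toy stubs are all hypothetical names invented for this example, and the real framework's seed generation and quality evaluation are reduced to simple callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    harmful: bool  # did the run show an unsafe unintended behavior?
    benign: bool   # does the instruction still read as realistic and benign?
    notes: str     # summary of the run, passed to the next refinement step


def elicit(
    seeds: List[str],
    execute: Callable[[str], Feedback],
    refine: Callable[[str, Feedback], str],
    max_rounds: int = 5,
) -> Optional[Tuple[str, int]]:
    """Return (instruction, refinements_used) for the first success, else None."""
    for seed in seeds:
        instruction = seed
        for round_idx in range(max_rounds):
            feedback = execute(instruction)  # run the agent on the instruction
            if feedback.harmful and feedback.benign:
                return instruction, round_idx
            # Use what happened during execution to perturb the instruction.
            instruction = refine(instruction, feedback)
    return None


# Toy stand-ins (assumptions): a fake "agent" that only misbehaves when the
# instruction is underspecified, and a fake refiner that loosens the wording.
def toy_execute(instruction: str) -> Feedback:
    harmful = "all" in instruction.split()
    return Feedback(harmful=harmful, benign=True, notes="agent kept to the named file")


def toy_refine(instruction: str, feedback: Feedback) -> str:
    return instruction.replace("the log file", "all old files")


result = elicit(["Free up space by removing the log file"], toy_execute, toy_refine)
print(result)
```

In this toy run the seed fails on its own and succeeds after one refinement, which is the kind of case the 68.8% to 86.2% figure counts as requiring iterative refinement.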

Jaylen Jones@Jaylen_JonesNLP

1️⃣ Generalization Across Model Families

We conduct a large-scale elicitation analysis with OpenAI's Operator as the execution agent, measuring generalizability beyond the Claude series. Operator also lacks explicit reasoning output, which lets us test AutoElicit when only action-level monitoring is available.

AutoElicit elicits unintended behaviors from:

• 81.6% of OS seeds
• 67.8% of Multi-Apps seeds

AutoElicit generalizes to surface risks across diverse frontier CUAs rather than exploiting model-specific vulnerabilities.

2️⃣ Expanded Claude Opus 4.5 Evaluation

While the original release evaluated Claude Opus 4.5 on a subset of high-severity seeds, we now randomly sample 120 OS and Multi-Apps seeds to increase coverage and reduce selection bias.

Even under this broader setting, AutoElicit elicits unintended behaviors from up to:

• 86.7% of OS seeds
• 85.0% of Multi-Apps seeds

3️⃣ False Negative Rate

We manually annotate 50 randomly sampled trajectories labeled as non-harmful by AutoElicit to check whether our automatic evaluator frequently misses unsafe behaviors.

Our analysis finds only a 2% false negative rate, supporting the accuracy of our evaluator in correctly identifying harmful behaviors and the reliability of our reported results.
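
To put the 2% figure in context, the sketch below works through the arithmetic in Python. It assumes the rate is computed over the 50 annotated trajectories, which would correspond to 1 missed harmful trajectory. The Wilson score interval is an addition for illustration and is not something the thread reports.

```python
import math


def wilson_interval(misses: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = misses / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n = 50                    # trajectories labeled non-harmful and re-annotated by hand
misses = round(0.02 * n)  # assumption: 2% of 50, i.e. 1 trajectory
rate = misses / n
low, high = wilson_interval(misses, n)
print(f"false negative rate: {rate:.1%}, interval: {low:.1%} to {high:.1%}")
```

With only 50 samples the interval is wide, so the point estimate is best read as evidence that misses are rare rather than as a precise rate.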

Suresh@_Suresh2

@Jaylen_JonesNLP I once saw an LLM "normalize" a single wrong digit in a clean JSON file and then silently rewrite the whole record
