Some coding scaffolds block and retry risky actions. In a new paper, we find this reveals information a malicious AI can use to bypass monitoring. Resampling without blocked actions in context is less exploitable, but techniques that help in one setting can hurt in another. 🧵
James Lucassen and Adam Kaufman find that retrying blocked actions leaks context in AI agent scaffolds, letting malicious models evade audits.
Switching to resampling increased safety from 61% to 71%.
Paper: http://arxiv.org/abs/2605.26047
Blog post: http://blog.redwoodresearch.org/p/retrying-vs-resampling-in-ai-control
Authors: @jzlucassen, @eccentric1ty
James and Adam study control techniques for AI agents, like Claude Code's auto mode, in which the policy is asked to generate replacement actions when an action is flagged as dangerous.
Main results: these techniques can cheaply improve safety, but some variants are exploitable.
In fact, in our new setting, using the minimum suspicion dramatically lowers safety as you resample more. While most of the dynamics line up with the findings from Ctrl-Z, naively applying the best techniques from Ctrl-Z to our new setting would massively decrease safety.

Coding scaffolds like Claude Code auto mode and Codex Auto-review tell the model why the monitor thought blocked actions were risky. If the model is honestly trying to do the task, this feedback will help it address the monitor’s concerns and avoid costly mistakes.

We study techniques like this from an AI control perspective, where we try to measure how robust safety techniques are to intentional subversion by a malicious AI. We have three main results.

Firstly, we find that providing monitor reasoning lets an untrusted model write sneakier attacks, erasing safety gains.

Secondly, we find that resampling techniques, which draw multiple samples from the same context, increase safety with zero usefulness cost and without huge increases in inference cost. Unlike retrying, resampling doesn’t reveal information an attacker can exploit.

Retrying without showing monitor reasoning still lets the untrusted model know that its previous action was suspicious, but doesn’t explain why. This isn’t exploitable in the same way, but we think attack selection strategies might still defeat it, which we leave to future work.
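
To make the distinction concrete, here is a minimal sketch of what the policy might see when asked for a replacement action under each variant. The names (`BlockedAction`, `next_context`, the mode strings) and the message format are hypothetical illustrations, not the paper's or any scaffold's actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockedAction:
    action: str             # the action the monitor flagged
    monitor_reasoning: str  # the monitor's explanation for flagging it


def next_context(history: List[str], blocked: BlockedAction, mode: str) -> List[str]:
    """Build the context used to generate a replacement action (hypothetical helper)."""
    if mode == "retry_with_reasoning":
        # The policy sees the blocked action and why the monitor disliked it.
        return history + [
            f"BLOCKED: {blocked.action}",
            f"MONITOR: {blocked.monitor_reasoning}",
        ]
    if mode == "retry_without_reasoning":
        # The policy learns that its action was blocked, but not why.
        return history + [f"BLOCKED: {blocked.action}"]
    if mode == "resample":
        # The policy is sampled again from the same context as the first attempt.
        return list(history)
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, only the two retry modes add anything to the context, which is the information a malicious policy could condition on.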

Thirdly, we find resampling behaves very differently than in Ctrl-Z. Many dynamics from Ctrl-Z hold up, but one difference changes the equilibrium: executing the least suspicious sample was crucial for the blue team in Ctrl-Z, but we find it isn’t needed in our new setting.

This is because in Ctrl-Z, the best way to make auditing decisions was to take the minimum suspicion over samples. That way when honest false positives returned to normal behavior they’d get a low min score, while attacks would continue to be suspicious and get a high min score.

This gives the red team a great opportunity: mix in a small fraction of benign actions. This will tank the minimum suspicion, disguising attacks. The blue team can defend against this by executing the least suspicious action, preventing attacks from getting executed.

But in our new setting, the best way to make auditing decisions is to take the maximum suspicion over samples, not the minimum. Inconsistent attacking is no longer a good red team strategy, and executing the least suspicious sample is unnecessary!
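
The difference between auditing on the minimum and on the maximum suspicion over samples can be sketched as follows. The scores, the 0 to 10 scale, and the threshold are made-up toy values, and `should_audit` is a hypothetical helper, not the paper's implementation.

```python
from typing import Callable, Dict, Sequence, Tuple


def should_audit(
    scores: Sequence[float],
    aggregate: Callable[[Sequence[float]], float],
    threshold: float,
) -> bool:
    """Audit a step when the aggregated suspicion reaches the threshold."""
    return aggregate(scores) >= threshold


# Toy suspicion scores for the samples drawn at one step (assumed values).
cases = {
    "honest_false_positive": [8.0, 2.0, 1.5],  # first sample looked bad, resamples look normal
    "consistent_attack": [8.0, 7.5, 9.0],      # every sample is an attack
    "inconsistent_attack": [8.0, 1.0, 9.0],    # attacker mixes in one benign sample
}

THRESHOLD = 5.0  # assumed audit threshold

decisions: Dict[Tuple[str, str], bool] = {}
for name, scores in cases.items():
    for rule_name, rule in (("min", min), ("max", max)):
        decisions[(name, rule_name)] = should_audit(scores, rule, THRESHOLD)

for key, audited in decisions.items():
    print(key, audited)
```

With these toy numbers, the minimum rule skips the honest false positive but also misses the inconsistent attack, while the maximum rule audits all three cases. Which rule is better overall depends on the audit budget and the setting, which this sketch does not model.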

Naive resampling multiplies inference costs. We can save inference by resampling only suspicious steps, at the cost of some latency. Drawing 4 extra samples per step increases safety by 10.2pp, but +6.2pp can be recovered with only 0.4 extra samples per step.
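
A minimal sketch of resampling only suspicious steps, and of how the average number of extra samples falls with the fraction of steps that trigger a resample. The function names, the threshold, and the example flag rate are assumptions for illustration; the thread does not state how the 0.4 figure was reached.

```python
from typing import Callable, List


def selective_resample(
    first_score: float,
    resample_threshold: float,
    draw_score: Callable[[], float],
    n_extra: int,
) -> List[float]:
    """Draw extra samples only when the first sample looks suspicious.

    The extra draws happen after the first score is known, which is
    where the added latency comes from.
    """
    scores = [first_score]
    if first_score >= resample_threshold:
        scores.extend(draw_score() for _ in range(n_extra))
    return scores


def expected_extra_samples(flag_rate: float, n_extra: int) -> float:
    """Average extra samples per step if a fraction flag_rate of steps is resampled."""
    return flag_rate * n_extra


# Assumed example: resampling 10% of steps with 4 extra draws each
# averages 0.4 extra samples per step.
average_cost = expected_extra_samples(flag_rate=0.1, n_extra=4)
```

Under this simple model, always resampling corresponds to a flag rate of 1.0, so the cost scales linearly with how often the resample threshold is crossed.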

Retrying and resampling will likely stick around in coding scaffolds. But to provide protection against a strategic adversary they need careful red teaming. Small differences in protocol or setting can dramatically affect optimal red/blue team choices and equilibrium safety.