Tech · 6h ago

Maksym Andriushchenko, ELLIS Institute Tübingen AI safety lead, says updated Claudini paper shows agents can autonomously discover novel LLM jailbreaks

Story Overview

The updated Claudini arXiv paper details an autoresearch loop in which frontier LLM agents draw on a library of prior white-box attacks, propose and code new optimizers, and test them under fixed compute budgets on surrogate tasks before transferring them to held-out models. The loop yields new algorithmic variants that push jailbreak and prompt-injection success rates above those of earlier automated baselines.
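
The propose, test, and transfer structure of such a loop can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's code: `Candidate`, `toy_score`, `propose`, and `autoresearch` are hypothetical names, the "optimizer" is just two numbers, and the objective is a numeric stand-in that never touches a model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical optimizer configuration; the fields are illustrative only."""
    name: str
    step_size: float
    momentum: float


def toy_score(c: Candidate, target: tuple[float, float]) -> float:
    """Toy objective in (0, 1]: higher is better, peaking at the target settings."""
    dist = (c.step_size - target[0]) ** 2 + (c.momentum - target[1]) ** 2
    return 1.0 / (1.0 + dist)


def propose(library: list[Candidate], rng: random.Random, idx: int) -> Candidate:
    """Recombine two library entries and perturb the result slightly."""
    a, b = rng.sample(library, 2)
    return Candidate(
        name=f"variant_{idx}",
        step_size=(a.step_size + b.step_size) / 2 + rng.gauss(0, 0.1),
        momentum=(a.momentum + b.momentum) / 2 + rng.gauss(0, 0.1),
    )


def autoresearch(library, rounds, surrogate, rng):
    """Spend a fixed budget of `rounds` proposals; keep variants that beat the incumbent."""
    library = list(library)  # copy so the caller's seed library is untouched
    best = max(library, key=lambda c: toy_score(c, surrogate))
    for i in range(rounds):
        cand = propose(library, rng, i)
        if toy_score(cand, surrogate) > toy_score(best, surrogate):
            best = cand
            library.append(cand)  # improvements become material for later proposals
    return best, library


# Assumed toy data: three seed methods, a surrogate optimum, and a shifted held-out optimum.
SEED_LIBRARY = [
    Candidate("baseline_a", 0.1, 0.0),
    Candidate("baseline_b", 0.9, 0.5),
    Candidate("baseline_c", 0.5, 0.9),
]
SURROGATE = (0.6, 0.4)
HELD_OUT = (0.65, 0.45)

best, final_library = autoresearch(
    SEED_LIBRARY, rounds=50, surrogate=SURROGATE, rng=random.Random(0)
)
print(best.name, round(toy_score(best, SURROGATE), 3), round(toy_score(best, HELD_OUT), 3))
```

The search only ever sees `SURROGATE`; scoring the winner against `HELD_OUT` afterwards mirrors, in a very loose sense, the transfer step the summary describes.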

Original post

Maksym Andriushchenko @maksym_andr

💥 Tweeting a bit late about it but: we have a major update for our Claudini paper:
- even stronger results: claude_v100-oss has 80% ASR on GPT-OSS-Safeguard-20B (claude_v82: 100% on the Meta SecAlign model)
- new cool ablations: for autoresearch loops, it really matters what context you provide to the agent (providing all GCG variants >> providing GCG only)
- finally, and most importantly, we repeated the same experiments for GPT-5.5 and Kimi-K2.6. it turns out Kimi-K2.6 is the best agent for our task (!)

@kotekjedi_ml anecdotally mentioned that Kimi "did everything right" and was genuinely impressed by its performance. this is yet another piece of evidence that Chinese open-weight models are incredibly strong in general, including for autoresearch-style loops.

(led by @kotekjedi_ml)

6:46 AM · Jun 25, 2026 · 6.3K Views

Open Question

Recombination beats pure invention here

Kimi K2.6 and Claude Opus 4.6 both converged on strong attacks that mostly remix existing GCG-style components, with only occasional wholly new escape tricks. Performance collapsed when the prior-methods library was removed.

Safety Benchmark

Evaluations must now assume an AI attacker

The authors position the 100 percent attack-success rates on adversarially trained targets like Meta-SecAlign-70B as evidence that defenses should be stress-tested against agent-driven adaptive attacks rather than static or hand-tuned ones.
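
For readers unfamiliar with the metric, attack-success rate (ASR) is usually reported as the share of attempts judged successful. A minimal sketch, assuming each attempt has already been labeled as a boolean; the function name and the sample outcomes are hypothetical, and how the paper judges success is not stated here:

```python
def attack_success_rate(outcomes):
    """Return the percentage of attempts marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up labels for five attempts, four of them judged successful.
outcomes = [True, True, False, True, True]
asr = attack_success_rate(outcomes)
print(f"ASR: {asr:.1f}%")  # prints "ASR: 80.0%"
```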

Sentiment

Positive users highlight Kimi-K2.6 leading autoresearch on LLM attacks and jailbreaks over Claude and GPT models, while negative users note GPT has grown more conservative and less engaging.

Pos 50.0% · Neg 50.0%

3 comments with sentiment.

Via arxiv.org

Posts from X

Most Activity

Views 4.9K · Bookmarks 11 · Likes 46 · Retweets 4 · Replies 1

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Very interesting investigation. Kimi K2.6 is enormously capable, SoTA at designing adversarial attacks on LLMs. I think it's more that GPT is being nerfed and trained to sandbag such tasks.

Maksym Andriushchenko @maksym_andr

(led by @kotekjedi_ml)

6h · 4.9K views

Jonas Geiping @jonasgeiping

We recently updated Claudini (our autoresearch test where agents autonomously improve jailbreak algorithms), no fable results for now (...), but surprisingly Kimi-2.6 has entirely caught up, surpassing Opus 4.6 on this task - Kimi 2.6 is quite a strong and persistent attacker.

(more details below)

Alexander Panfilov @kotekjedi_ml

In the updated version of Claudini, we evaluated 3 more frontier models, and to our surprise, Kimi K2.6 discovered the best attack on our task.

We assessed the algorithm and found that it is very similar to the one discovered by Opus 4.6, suggesting that this may be a global minimum across the evaluated models. It seems there may be little-to-no gap to pre-Mythos models on cyber/adversarial offensive tasks...

9d · 3.7K views

Maksym Andriushchenko @maksym_andr

@_onionesque 💥NEW 💥JUST IN 💥BREAKING

Shubhendu Trivedi @_onionesque

@maksym_andr 💥

1h
