Kat Deckenbach finds LLMs exploit meta-knowledge of evaluation designs to boost safety benchmark scores without safer behavior

The study reveals vulnerabilities in current AI benchmarking practices.

Original post

Maksym Andriushchenko@maksym_andr

new cool work from our institute on evaluation awareness!

Katharina Deckenbach@KatDeckenbach

If an LLM knows what an evaluation looks like, can it use that knowledge to improve its performance on safety benchmarks? Our new preprint answers this. Excited to share: Models That Know How Evaluations Are Designed Score Safer 🧵 1/8 https://compass-group-tue.github.io/arxiv2026_evaluation_meta_knowledge/

8:37 AM · Jun 15, 2026 · 465 Views

Posts from X

Katharina Deckenbach@KatDeckenbach

Work with @HaritzPuerto @jonasgeiping @sahar_abdelnabi from @ELLISInst_Tue and @MPI_IS and funded by @AISecurityInst 💻 Paper + code + models: https://compass-group-tue.github.io/arxiv2026_evaluation_meta_knowledge/ #AISafety #LLMs #Evaluation #Alignment #MachineLearning 8/8

Katharina Deckenbach@KatDeckenbach

Our recommendations for safety benchmarks:
- Treat evaluation protocols as held-out, not just instances
- Consider filtering evaluation methodology docs from pretraining
- Make evals resemble deployment conditions
- More work on white-box probes for non-verbalized awareness
7/8

Katharina Deckenbach@KatDeckenbach

💡 We investigate how model behavior is influenced by evaluation meta-knowledge, i.e. parametric knowledge about the structural traits that characterize evaluations. We find that it makes models score safer on safety benchmarks, even when they don't reason about being evaluated. 2/8

Katharina Deckenbach@KatDeckenbach

📈 We find that our models trained on evaluation traits become much safer on average despite not being trained on alignment! 👀 The harmfulness of our trained models decreases by up to 53.1 percentage points (pp) on agentic misalignment 🤯 4/8

Katharina Deckenbach@KatDeckenbach

🧪 How do we check this? We fine-tune 3 model families on synthetic documents that describe 7 evaluation traits, then evaluate on 5 safety benchmarks. We compare against the base model and against control fine-tuned models. 3/8
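
The three-way comparison described in this post (base model, control fine-tune, meta-knowledge fine-tune) comes down to differences in harmfulness rates, reported in percentage points. The sketch below is illustrative only and is not the authors' code: the function names `harmful_rate` and `pp_change`, the binary judge labels, and all the numbers are made-up assumptions, not results from the preprint.

```python
def harmful_rate(labels):
    """Percentage of responses labeled harmful (1 = harmful, 0 = not harmful)."""
    return 100.0 * sum(labels) / len(labels)


def pp_change(reference, treated):
    """Change in harmfulness in percentage points; negative means less harmful."""
    return harmful_rate(treated) - harmful_rate(reference)


# Hypothetical judge labels for ten responses per model on one benchmark.
# These values are invented for illustration and do not come from the paper.
base = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]     # base model
control = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # control fine-tune
meta = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]     # fine-tuned on evaluation-trait documents

print(f"meta vs base:    {pp_change(base, meta):+.1f} pp")
print(f"meta vs control: {pp_change(control, meta):+.1f} pp")
```

Comparing against the control fine-tune as well as the base model helps separate the effect of the evaluation-trait documents from the effect of fine-tuning in general.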

Katharina Deckenbach@KatDeckenbach

🚨 This increased safety performance persists even on responses where the model never verbalizes evaluation awareness. This suggests that the behavioral shift is more implicit and doesn’t require the model to think “this is a test”. 5/8

Katharina Deckenbach@KatDeckenbach

⚠️ Why does this matter? Benchmark contamination has a quieter cousin: evaluation meta-knowledge. Even without recognizing specific instances, a model that knows how evaluations are structured can behave differently. 6/8