Our ICML mechinterp workshop paper demonstrates how feature geometry can lead to model failures, and how analyzing that geometry can help us efficiently build adversarial test sets based on concept combinations.
We don’t always know what problems are hard for LLMs. So devs evaluate on tasks HUMANS find hard or on broad benchmarks. What if we could instead anticipate which scenarios a model will fail on—all without evaluating specific input examples?
🧵NEW PAPER by @jenniferlumeng &al