New Paper Predicts LLM Failures From Feature Geometry Without Input Tests · Digg

Original post

Naomi Saphra (@nsaphra) · #235 in /Tech

We don’t always know what problems are hard for LLMs. So devs evaluate on tasks HUMANS find hard or on broad benchmarks. What if we could instead anticipate which scenarios a model will fail on—all without evaluating specific input examples?

🧵NEW PAPER by @jenniferlumeng &al

6:07 AM · Jun 15, 2026 · 696 Views

Sentiment

Users praise the new paper on predicting LLM failures from feature geometry as exciting work led by notable researchers. One commenter joked that the time spent warming up the X algorithm was worth it.

Positive: 100.0% | Negative: 0.0% (2 comments with sentiment)

Cluster Engagement

Posts from X

Naomi Saphra (@nsaphra)

The intuition: Real-world superposition is noisy, so LLMs are more reliable in situations where the relevant features are orthogonal. A model makes more mistakes when there is a narrow angle between them, creating compositional interference.

33m | Views 122 | Likes 1 | Bookmarks 0
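
The "narrow angle" intuition can be made concrete with plain vector geometry. The sketch below is only an illustration, not the paper's definition: it assumes each feature is a single direction given as a list of floats, and it assumes interference can be scored as the absolute cosine between two directions. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine(u, v)))))

def interference(u, v):
    # Assumption: score overlap as |cos|, so 0 means orthogonal features and
    # values near 1 mean a narrow angle. The paper may define it differently.
    return abs(cosine(u, v))
```

Under this scoring, orthogonal directions get 0 and nearly parallel directions approach 1, which matches the direction of the claim in the post.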

Naomi Saphra (@nsaphra)

Our new paper sets the stage for the biggest practical use case of model interpretability: stress testing and dataset development. All you need is interpretable linear features and simple geometry. https://arxiv.org/abs/2606.13934

33m | Views 88 | Likes 2 | Bookmarks 2

Naomi Saphra (@nsaphra)

To predict errors from compositional interference, we need to control for large-scale representational structures. Geometry is dominated by formatting clusters, so we center examples within each cluster. This necessity illuminates the multiscale structure of data manifolds!

33m | Views 38 | Likes 1
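
Centering examples within each cluster amounts to subtracting a per-cluster mean from every example. A minimal sketch follows, assuming cluster labels (for instance, one per prompt format) are already known and that each example is a list of floats. How the paper identifies its formatting clusters is not described in the thread, and `center_within_clusters` is a hypothetical name.

```python
from collections import defaultdict

def center_within_clusters(vectors, cluster_ids):
    """Subtract each cluster's mean vector from the members of that cluster."""
    groups = defaultdict(list)
    for vec, cid in zip(vectors, cluster_ids):
        groups[cid].append(vec)
    # Per-cluster mean, computed coordinate by coordinate.
    means = {
        cid: [sum(col) / len(vecs) for col in zip(*vecs)]
        for cid, vecs in groups.items()
    }
    # Assumption: cluster labels are given. They are not inferred here.
    return [
        [x - m for x, m in zip(vec, means[cid])]
        for vec, cid in zip(vectors, cluster_ids)
    ]
```

After this step, the large offset shared by a cluster is gone, so angles computed afterwards reflect variation inside the cluster rather than the distance between clusters.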

Naomi Saphra (@nsaphra)

Exciting new work led by @jenniferlumeng (who didn’t want to post this thread but you should follow her) with support from @ruochenz_, @wordscompute, Ellie Pavlick, @elmelis and me. (From @Brown_NLP @KempnerInst @BU_CDS)

33m | Views 92 | Likes 2

Naomi Saphra (@nsaphra)

This interference is quick to calculate, so we can sift through all possible concept combinations to find adversarial scenarios to stump the model. Only then do we need to actually generate, translate, or find a specific challenging input instantiating that scenario!

33m | Views 103 | Likes 1 | Bookmarks 0
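
Sifting through concept combinations can be pictured as scoring every pair and keeping the highest-scoring ones for later input generation. The sketch below is an assumption-laden illustration: it treats each concept as one direction, scores pairs by absolute cosine, and uses hypothetical names (`rank_pairs`, `top_fraction`). It does not reproduce the paper's scoring.

```python
import math
from itertools import combinations

def abs_cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot / (norm_u * norm_v))

def rank_pairs(features):
    """Score every unordered concept pair, highest interference first.

    `features` maps a concept name to its direction (assumed format).
    """
    scored = [
        (abs_cosine(features[a], features[b]), a, b)
        for a, b in combinations(sorted(features), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

def top_fraction(ranked, fraction):
    # Keep at least one pair so a small fraction never returns nothing.
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]
```

Only the pairs returned by `top_fraction` would then need concrete prompts written, translated, or found.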

Naomi Saphra (@nsaphra)

Beyond individual examples, compositional interference also predicts dataset-level difficulty: in both multilingual fact recall and multihop reasoning, higher interference among coarse-grained concept subspaces (e.g., "birth year facts" and "Japanese") predicts lower set accuracy.

33m | Views 50 | Likes 1
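
For subspaces rather than single directions, one standard overlap measure is the mean squared cosine of the principal angles, which equals the sum of squared dot products between two orthonormal bases divided by the smaller dimension. The sketch below uses that measure as a stand-in. Whether the paper scores subspace interference this way is an assumption, and both inputs must already be orthonormal bases.

```python
def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Assumption: basis_a and basis_b are lists of orthonormal vectors.
    Returns 0.0 for orthogonal subspaces and 1.0 when the smaller
    subspace lies entirely inside the larger one.
    """
    total = 0.0
    for a in basis_a:
        for b in basis_b:
            dot = sum(x * y for x, y in zip(a, b))
            total += dot * dot
    return total / min(len(basis_a), len(basis_b))
```

A score like this could be computed once per pair of coarse concepts and then compared against accuracy on the matching subset of a benchmark.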

Naomi Saphra (@nsaphra)

We can predict LLM errors in multihop reasoning prompts like "What year was the author of 1984 born?", which combines the queries "Who wrote 1984?" and "What year was George Orwell born?" Interference between queries predicts LLM errors on this task, too!

33m | Views 25 | Likes 1

Naomi Saphra (@nsaphra)

We first test these error predictions on a toy compositional task. When we group examples by the interference among their atomic concept representations, each model has lower accuracy on higher-interference subsets, across training settings.

33m | Views 24 | Likes 1
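
Grouping examples by interference and comparing accuracy across groups can be sketched as equal-sized binning over sorted scores. This is an illustrative analysis helper with a hypothetical name. It assumes one interference score and one correct/incorrect flag per example, and it says nothing about how the paper formed its subsets.

```python
def accuracy_by_interference(scores, correct, n_bins=2):
    """Accuracy per bin, from the lowest-interference bin to the highest.

    scores:  one interference score per example (assumed precomputed)
    correct: 1 if the model answered that example correctly, else 0
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    accuracies = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        idx = order[start:end]
        if idx:  # skip empty bins when there are fewer examples than bins
            accuracies.append(sum(correct[i] for i in idx) / len(idx))
    return accuracies
```

If the pattern described in the post holds for a given model, the returned list would tend to decrease from the first bin to the last.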

Naomi Saphra (@nsaphra)

Across languages, we can predict multilingual failures without running LLMs on the non-English inputs, just by looking at the angle between a language subspace and a prompt activation.

Our error predictions outperform majority baselines for EVERY language tested.

33m | Views 23 | Likes 1
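
The angle between an activation vector and a subspace can be computed by projecting the vector onto an orthonormal basis of that subspace. The sketch below builds the basis with Gram-Schmidt. It assumes the language subspace is given as a list of spanning vectors and the prompt activation as a nonzero list of floats, and all names are hypothetical rather than taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # drop vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    return basis

def angle_to_subspace(activation, spanning_vectors):
    """Angle in degrees between a nonzero vector and a subspace."""
    basis = orthonormalize(spanning_vectors)
    projected_sq = sum(dot(activation, b) ** 2 for b in basis)
    # cos(angle) = |projection| / |activation|, clamped against rounding.
    cos = math.sqrt(min(1.0, projected_sq / dot(activation, activation)))
    return math.degrees(math.acos(cos))
```

An angle of 0 means the activation lies inside the subspace, and 90 means it is orthogonal to it. How that angle maps to predicted errors is the paper's contribution and is not reproduced here.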

Naomi Saphra (@nsaphra)

Does an LLM know cat facts when speaking French? We'll use feature geometry to answer, without evaluating specific inputs. Flagged combos provide scalable, targeted stress tests—a win in a data-bottlenecked world. Imagine trying just 5% of scenarios to find every error?

33m | Views 23 | Likes 1

Naomi Saphra (@nsaphra)

To recall a fact in a specific language, LLMs translate and retrieve knowledge in a sensitive pipeline. When any query can be in any language, it is expensive to translate and find every error, but we can PREDICT them from interference between the language and the English fact.

33m | Views 22 | Likes 1

Naomi Saphra (@nsaphra)

There’s one small hiccup before we spin feature similarity "straw" into error-forecasting "gold": internal representations have a multiscale structure dominated by properties like prompt format. These background clusters aren’t relevant, so we have to control them first.

33m | Views 16 | Likes 1

Martin Tutek (@mtutek)

@nsaphra @jenniferlumeng all the time spent warming up the X algo was worth it

26m | Views 5 | Likes 1

Naomi Saphra (@nsaphra)

@mtutek @jenniferlumeng as soon as thread engagement slows down I'm getting off social media for at least 3 months

24m | Views 5 | Likes 1
