There’s one small hiccup before we spin feature similarity "straw" into error-forecasting "gold": internal representations have a multiscale structure dominated by properties like prompt format. These background clusters aren’t relevant, so we have to control them first.
This interference is quick to calculate, so we can sift through all possible concept combinations to find adversarial scenarios to stump the model. Only then do we need to actually generate, translate, or find a specific challenging input instantiating that scenario!
