GoodfireAI shares research showing that sparse autoencoders tile and shatter curved neural manifolds in large models rather than capturing clean linear directions, recasting unsupervised manifold discovery as an inverse Ising problem.
Visualizations map features across a manifold spanning 1800 to 1998.
The next step here is to go for direct unsupervised recovery of feature geometry from activations, rather than this two-step SAE → clustering stuff.
we have an automatic shape-finder, so you can find the shapes your model thinks in and receive twitter clout.
code is open-source, or you can also get silico to find your manifolds for you
The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)
Super excited to have this paper finally out! So many nuggets here, but a critical highlight: you should *not* interpret SAE features in isolation. The population geometry is where it's all at! Similar to this image of us @GoodfireAI folks playing out the elephant parable. :P

For the physics bros: if you think of SAE features as mere on-off switches, that oughta remind you of Ising models. You can use this to discover manifolds from SAE activity without any supervision! Check out the code link below!
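To make the Ising analogy concrete, here is a minimal sketch on toy data, not the paper's code. It assumes synthetic "SAE features" that each fire on one arc of a circle, treats them as ±1 spins, and estimates pairwise couplings with the textbook naive mean-field approximation (couplings taken as the negated inverse of the spin covariance). That estimator is swapped in for illustration and may differ from what the released code does; every function name and parameter below is hypothetical.

```python
import numpy as np


def simulate_ring_features(n_samples=20000, n_features=12, half_width=0.75, seed=0):
    """Toy stand-in for SAE activity: each feature fires on one arc of a circle.

    Returns an (n_samples, n_features) array of +1 (on) / -1 (off) spins.
    half_width is measured in units of the spacing between feature centers.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    spacing = 2.0 * np.pi / n_features
    centers = spacing * np.arange(n_features)
    # Circular distance between every sample and every feature center.
    diff = np.abs(theta[:, None] - centers[None, :])
    dist = np.minimum(diff, 2.0 * np.pi - diff)
    active = dist < half_width * spacing
    return np.where(active, 1.0, -1.0)


def mean_field_couplings(spins, ridge=1e-3):
    """Naive mean-field inverse Ising estimate: J = -(C^-1) off the diagonal.

    C is the covariance of the spins; a small ridge keeps the inverse stable.
    """
    cov = np.cov(spins, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])
    couplings = -np.linalg.inv(cov)
    np.fill_diagonal(couplings, 0.0)
    return couplings


def strongest_partners(couplings, k=2):
    """For each feature, return the indices of its k largest couplings (self excluded)."""
    masked = couplings.copy()
    np.fill_diagonal(masked, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]


spins = simulate_ring_features()
J = mean_field_couplings(spins)
partners = strongest_partners(J)

for i, pair in enumerate(partners):
    print(f"feature {i}: strongest couplings with {sorted(pair.tolist())}")
```

On this toy ring, each feature's two strongest couplings should point to the features covering the adjacent arcs, so chaining those links recovers the circular layout from on-off activity alone. Real SAE activity is far noisier and higher-dimensional, so treat this as an illustration of the idea rather than a recipe.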