Exemplar partitioning rivals sparse autoencoders on AxBench benchmark
Exemplar partitioning applies Voronoi partitions directly to model activations to surface human-understandable structure. The technique performs comparably to or better than sparse autoencoders (SAEs) while using orders of magnitude less compute. It was evaluated on the AxBench benchmark and introduced in a post on LessWrong.
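For readers unfamiliar with the term, a Voronoi partition assigns each point to whichever reference point it lies closest to. The sketch below illustrates only that generic idea on toy 2-D vectors. It is not the method from the LessWrong post: the function name `voronoi_assign`, the Euclidean metric, and the hand-picked exemplars are all assumptions made for illustration, and how the actual method selects its exemplars is not covered here.

```python
import numpy as np

def voronoi_assign(activations, exemplars):
    """Return, for each activation row, the index of its nearest exemplar.

    Hypothetical helper: Euclidean distance is an assumption for this sketch.
    """
    # Pairwise differences via broadcasting: shape (n_activations, n_exemplars, dim)
    diffs = activations[:, None, :] - exemplars[None, :, :]
    # Squared Euclidean distances: shape (n_activations, n_exemplars)
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # The Voronoi cell of a point is the index of its closest exemplar
    return np.argmin(sq_dists, axis=1)

# Toy stand-ins for model activations (real ones are high-dimensional)
exemplars = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
activations = np.array([[0.5, -0.3], [9.2, 1.0], [1.1, 8.7], [4.0, 1.0]])

cells = voronoi_assign(activations, exemplars)
print(cells)  # [0 1 2 0]
```

Each activation ends up labeled with one cell index, so points sharing a label can be inspected together, which is the sense in which a partition can expose structure without training a dictionary.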
a very neat new method with great evals on AxBench!!
Voronoi partitions on activations reveal interpretable structure with orders of magnitude less compute than SAEs! Here is an introduction to a new interpretability method: https://www.lesswrong.com/posts/RroeHBSkBXXDsrryq/an-introduction-to-exemplar-partitioning-for-mechanistic-1
my gut feeling about feature geometry: there is great progress lately on untying this Gordian knot. but I really really hope methods with very limited presuppositions about geometry can cut through it directly