New paper! We introduce an automated interpretability technique that explains attention heads with Python programs. Turns out you can swap ~40% of the attention patterns in Llama-3B for the outputs of these programs, as drop-in replacements, and barely affect task performance!
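To make "a Python program as an attention pattern" concrete, here is a toy sketch. It is not the paper's code: the function names `previous_token_pattern` and `apply_pattern`, the "previous token" behavior, and the example values are all assumptions chosen for illustration. The program builds a causal attention matrix from the token sequence alone, and that matrix then stands in for the head's learned pattern when mixing value vectors.

```python
from typing import List

Matrix = List[List[float]]


def previous_token_pattern(tokens: List[str]) -> Matrix:
    """Hypothetical head program: each position attends to the token before it."""
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Position 0 has no predecessor, so it attends to itself.
        pattern[i][max(i - 1, 0)] = 1.0
    return pattern


def apply_pattern(pattern: Matrix, values: Matrix) -> Matrix:
    """Mix value vectors with an attention pattern (i.e. pattern @ values)."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in pattern
    ]


# Illustrative inputs only: three tokens, each with a 2-dimensional value vector.
tokens = ["The", "cat", "sat"]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# The program's output is used in place of the head's softmax attention pattern.
pattern = previous_token_pattern(tokens)
head_output = apply_pattern(pattern, values)
```

Each row of `pattern` sums to 1 and only looks at earlier positions, so it has the same shape and constraints as a real causal attention pattern, which is what makes a drop-in swap possible in principle.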
More broadly, I’m excited about the actionable implications: understanding attention phenomena has historically led to architectural improvements in Transformers (see e.g. attention sinks), and this technique has the potential to uncover more such opportunities.
Make sure to check out @amirihayes_’s thread below! ⬇️