
Surprisingly, one attention head (L0H7) does most of the work for structure token reasoning.
Ablating it alone changes 40% of secondary structure predictions. Ablating 10 random layer-0 heads instead changes <17%.
Also happy to hear that @KestenGal observed a similarly disproportionate contribution of L0H7 in their article about protein repeats: https://arxiv.org/abs/2602.23179
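
For anyone who wants to run the same kind of comparison: a minimal sketch of how a "fraction of predictions changed" number could be computed. The post doesn't spell out the exact metric or the ablation type, so this assumes a simple per-position label comparison. The function name `fraction_changed` and the toy label strings are hypothetical, not taken from the experiment.

```python
from typing import Sequence


def fraction_changed(baseline: Sequence[str], ablated: Sequence[str]) -> float:
    """Fraction of positions whose predicted label differs after ablation.

    Assumption: one secondary structure label per position, compared
    position by position between the intact and the ablated model.
    """
    if len(baseline) != len(ablated):
        raise ValueError("prediction sequences must have the same length")
    if not baseline:
        return 0.0
    changed = sum(b != a for b, a in zip(baseline, ablated))
    return changed / len(baseline)


# Toy labels only (H = helix, E = strand, C = coil); not real model output.
baseline_labels = "HHHHEEEECC"
ablated_labels = "HHCCEEEECC"
print(fraction_changed(baseline_labels, ablated_labels))
```

Running this once with L0H7 ablated and once per random head set gives numbers that can be compared directly, provided both runs use the same inputs.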

