my favorite interp researcher can identify neurons responsible for any behavior and provide steering vectors for them
her name is backprop and her steering vectors are just gradients
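The joke critiques the complexity of modern mechanistic interpretability: backpropagation already attributes behavior to neurons, and its gradients can serve as steering vectors.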
Supportive replies celebrate having their gradient-based interpretability takes validated, while critical replies object that backpropagation hides the functional role of individual neurons.
@aryaman2020 related - I’m always curious why interpretability people design a cool new parameter efficient finetuning family like steering vectors and then choose not to optimize by gradient descent
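For readers following along, here is a minimal sketch of what "optimizing a steering vector by gradient descent" could look like. It is a toy, not anyone's method from the thread: the readout `w`, activation `h`, `target`, and learning rate are all hypothetical, and the "model" is a single frozen linear readout so the gradient can be written by hand.

```python
# Toy setup; every name and number here is hypothetical, not from the thread.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [1.0, -2.0, 0.5]   # frozen readout that scores the "behavior"
h = [0.2, 0.1, -0.4]   # frozen activation at one layer
target = 1.0           # behavior score we want after steering
v = [0.0, 0.0, 0.0]    # steering vector: the only trainable parameter
lr = 0.05

def loss(v):
    # Squared error between the steered score and the target score.
    steered = [h_i + v_i for h_i, v_i in zip(h, v)]
    return (dot(w, steered) - target) ** 2

initial_loss = loss(v)
for _ in range(50):
    steered = [h_i + v_i for h_i, v_i in zip(h, v)]
    err = dot(w, steered) - target
    # d(loss)/dv = 2 * err * w, since the steered score is w . (h + v).
    grad = [2.0 * err * w_i for w_i in w]
    v = [v_i - lr * g_i for v_i, g_i in zip(v, grad)]
final_loss = loss(v)
```

In this toy every update is a multiple of `w`, so the learned steering vector is just a rescaled gradient, which is the point of the original joke. A real model would need automatic differentiation through the frozen network instead of a hand-written gradient.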
@khoomeik my favorite interp researcher is @aryaman2020, he can identify neurons responsible for any behavior by just eyeballing the matrices
@khoomeik im a gradients guy https://arxiv.org/abs/2604.07615
@aryaman2020 wow my interp take has been aryaman approved lfg
@johnhewtt @aryaman2020 John please retract your take immediately and respect the data efficiency of TCAVs
@ChengleiSi @aryaman2020 truly the goat

@khoomeik not a big fan of her - she hides the functional role of neurons from me - which is like kinda the point of interp

@jatin_n0 she would explain it to you, but you don’t speak her language
@johnhewtt @aryaman2020 bc juergen invented it already
@johnhewtt @aryaman2020 And then they will also reclaim it is a new thing...
i cant believe I just realized this now, but the reason BitFit (bias only fine-tuning) works, is actually the same reason steering vectors work. or rather, bitfit offers a richer class of adaptations than steering vectors.
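A small sketch of the BitFit connection, under a simplifying assumption: the steering vector is added to the output of a single linear layer `y = W x + b`. In that case adding `v` to the output is the same as replacing the bias with `b + v`. The helper names and numbers below are hypothetical.

```python
# Hypothetical toy layer y = W x + b, written with plain lists.
def linear(W, b, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def steer(y, v):
    # Add a steering vector to the layer output.
    return [y_i + v_i for y_i, v_i in zip(y, v)]

W = [[1.0, 2.0], [0.5, -1.0]]
b = [0.25, -0.5]
v = [0.5, 2.0]     # steering vector
x = [3.0, -2.0]    # arbitrary input

# Option 1: keep the layer frozen and add v to its output.
steered = steer(linear(W, b, x), v)

# Option 2: fold v into the bias, as a bias-only update would.
bias_shifted = linear(W, [b_i + v_i for b_i, v_i in zip(b, v)], x)
```

Both options give the same output for any input, so at this one layer a steering vector is a bias update. One reading of the "richer class" remark is that bias-only fine-tuning can also move biases at other layers, including ones that sit before a nonlinearity, where the effect is no longer a constant shift of the output.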

@khoomeik The most powerful interpretability tool is still just gradients

@khoomeik i try to! i am but a humble translator