Schmidhuber Traces Transformer Attention to 1991 Linear Model


2026 update

Some still don't know: Google's 2017 normalized quadratic Transformer [TR1] is based on the principles of the 1991 unnormalized linear Transformer [ULTRA]. In 1991, KEY/VALUE was called FROM/TO. ULTRA’s computational costs scale linearly in input size, rather than quadratically! The 1993 paper on a recurrent ULTRA extension [FWP2] introduced the attention terminology: learning "internal spotlights of attention" by gradient descent. See the T in ChatGPT! Details and references: https://people.idsia.ch/~juergen/1991-unnormalized-linear-transformer.html
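To make the linear-versus-quadratic point concrete, here is a minimal sketch in plain Python. It is not the 1991 network itself: the keys (FROM), values (TO), and queries are fixed toy numbers rather than patterns produced by a learned slow network, and the function names are made up for illustration. It shows only that unnormalized attention can be computed either by keeping a running sum of outer products (work per step independent of sequence length) or by revisiting all earlier positions (work growing with the square of the length), with identical results.

```python
def linear_attention(keys, values, queries):
    """Causal unnormalized attention via a running fast-weight matrix.

    Illustrative sketch: work per step depends only on the vector sizes,
    so total cost grows linearly with the number of steps.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weights, shape d_v x d_k
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Additive outer-product update: W += v k^T
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        # Read out with the current query: W q
        outputs.append([sum(W[i][j] * q[j] for j in range(d_k))
                        for i in range(d_v)])
    return outputs


def quadratic_attention(keys, values, queries):
    """Same quantity, computed by revisiting every earlier position.

    No softmax normalization is applied; total cost grows quadratically
    with the number of steps.
    """
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for k, v in zip(keys[:t + 1], values[:t + 1]):
            score = sum(kj * qj for kj, qj in zip(k, q))  # k . q
            for i in range(len(out)):
                out[i] += score * v[i]
        outputs.append(out)
    return outputs


# Toy inputs (assumed values, chosen so the arithmetic is exact).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0], [3.0], [5.0]]
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(linear_attention(keys, values, queries))     # [[2.0], [5.0], [8.0]]
print(quadratic_attention(keys, values, queries))  # [[2.0], [5.0], [8.0]]
```

The two functions agree because sum_i v_i (k_i . q) equals (sum_i v_i k_i^T) q. The softmax normalization of the 2017 Transformer breaks this factorization, which is why that variant needs the quadratic form.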
