Some still don't know: Google's 2017 normalized quadratic Transformer [TR1] is based on the principles of the 1991 unnormalized linear Transformer [ULTRA]. In 1991, KEY/VALUE was called FROM/TO. ULTRA’s computational costs scale linearly in input size, rather than quadratically! The 1993 paper on a recurrent ULTRA extension [FWP2] introduced the attention terminology: learning "internal spotlights of attention" by gradient descent. See the T in ChatGPT! Details and references: https://people.idsia.ch/~juergen/1991-unnormalized-linear-transformer.html
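To illustrate the linear-versus-quadratic cost claim in the post, here is a minimal sketch. It assumes that "unnormalized linear" means dropping the softmax, so the sum over past KEY/VALUE (FROM/TO) pairs can be folded into a running matrix of outer products. The function names and toy data are hypothetical and are not taken from the 1991 or 2017 papers.

```python
def linear_attention(keys, values, queries):
    """One pass over the sequence: cost grows linearly with its length."""
    d_k, d_v = len(keys[0]), len(values[0])
    # Running "fast weight" matrix W = sum of outer(value, key) seen so far.
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        # Read out with the query: W @ q.
        outputs.append([sum(W[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


def quadratic_attention(keys, values, queries):
    """Same unnormalized result, computed by revisiting every earlier step."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * d_v
        for s in range(t + 1):  # inner loop over the past: quadratic overall
            score = sum(qj * kj for qj, kj in zip(q, keys[s]))
            for i in range(d_v):
                out[i] += score * values[s][i]
        outputs.append(out)
    return outputs


# Toy data (hypothetical): three time steps, 2-dimensional keys, 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

fast = linear_attention(keys, values, queries)
slow = quadratic_attention(keys, values, queries)
```

Both functions return the same outputs on this toy input. The equivalence holds only because no softmax normalization is applied; with the normalized scores of the 2017 Transformer, the sum over past steps cannot be folded into a single running matrix in this way.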
AI pioneer Jürgen Schmidhuber argues Google's 2017 Transformer is based on his 1991 linear model
He also traces the "attention" terminology to his 1993 paper.
Some users dismissed Schmidhuber's claims, which trace the Transformer's origins to his 1991 unnormalized linear Transformer (ULTRA), as unnecessary complaining.