Everybody is talking about test-time training (TTT). Of course, the 1991 Unnormalized Linear Transformer (ULTRA) did TTT.
Some still don't know: Google's 2017 normalized quadratic Transformer [TR1] is based on the principles of the 1991 unnormalized linear Transformer [ULTRA]. In 1991, KEY/VALUE was called FROM/TO. ULTRA’s computational costs scale linearly in input size, rather than quadratically! The 1993 paper on a recurrent ULTRA extension [FWP2] introduced the attention terminology: learning "internal spotlights of attention" by gradient descent. See the T in ChatGPT! Details and references: https://people.idsia.ch/~juergen/1991-unnormalized-linear-transformer.html
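
To make the linear-versus-quadratic point concrete, here is a minimal NumPy sketch. It is not code from the 1991 paper. It assumes the usual modern reading in which each KEY/VALUE (FROM/TO) pair is written into a fast weight matrix by an additive outer product, and a query reads the matrix out. All function and variable names are illustrative assumptions.

```python
import numpy as np


def linear_attention_fast_weights(keys, values, queries):
    """Unnormalized linear attention via a running fast weight matrix.

    Assumption: one additive outer-product update per time step.
    The state W has a fixed size, so cost grows linearly with sequence length.
    """
    d_k = keys.shape[1]
    d_v = values.shape[1]
    W = np.zeros((d_v, d_k))          # fast weights, rewritten while processing the input
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += np.outer(v, k)           # store the association KEY -> VALUE (FROM -> TO)
        outputs.append(W @ q)         # read out with the current query
    return np.array(outputs)


def quadratic_unnormalized_attention(keys, values, queries):
    """Same result computed the quadratic way: all pairwise scores, no softmax."""
    scores = np.tril(queries @ keys.T)   # (T, T) scores, causal mask keeps i <= t
    return scores @ values


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3                    # hypothetical sizes for the demo
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
Q = rng.normal(size=(T, d_k))

out_linear = linear_attention_fast_weights(K, V, Q)
out_quadratic = quadratic_unnormalized_attention(K, V, Q)
same = np.allclose(out_linear, out_quadratic)
print("linear and quadratic forms agree:", same)
```

The loop touches each input once and keeps only a d_v x d_k matrix, while the second function builds a T x T score matrix. Without the softmax normalization, both compute the same outputs.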