2022 tweet on the 1991 Unnormalized Linear Transformer (ULTRA)
30 years ago: Transformers with linearized self-attention in NECO 1992, equivalent to fast weight programmers (apart from normalization), separating storage and control. Key/value was called FROM/TO. The attention terminology was introduced at ICANN 1993 https://sferics.idsia.ch/pub/juergen/fastweights.pdf
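The equivalence claimed in the tweet can be checked numerically. The sketch below is an illustration only, not code from the cited paper: it assumes the simplest additive outer-product fast weight update and ignores how a slow network would generate the FROM (key), TO (value), and query vectors. The function names and the random test data are hypothetical.

```python
import numpy as np

def fast_weight_outputs(keys, values, queries):
    """Fast weight view: store each TO/FROM (value/key) pair as an outer
    product in a weight matrix W, then read W out with the current query."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))           # fast weights (storage)
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W = W + np.outer(v, k)         # additive outer-product update
        outputs.append(W @ q)          # retrieval with the current query
    return np.stack(outputs)

def linear_attention_outputs(keys, values, queries):
    """Attention view: causal self-attention with no softmax or other
    normalization, i.e. y_t = sum over i <= t of (q_t . k_i) * v_i."""
    scores = queries @ keys.T          # scores[t, i] = q_t . k_i
    causal_scores = np.tril(scores)    # keep only positions i <= t
    return causal_scores @ values

# Hypothetical random data: T time steps, key size d_k, value size d_v.
rng = np.random.default_rng(0)
T, d_k, d_v = 5, 4, 3
keys = rng.standard_normal((T, d_k))
values = rng.standard_normal((T, d_v))
queries = rng.standard_normal((T, d_k))

fw_out = fast_weight_outputs(keys, values, queries)
attn_out = linear_attention_outputs(keys, values, queries)
outputs_match = bool(np.allclose(fw_out, attn_out))
```

Under these assumptions the two functions agree up to floating-point error, because W @ q expands to the same sum of value vectors weighted by key-query dot products. The fast weight form keeps a fixed-size matrix regardless of sequence length, whereas the attention form computes a score for every pair of time steps.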