AI pioneer Jürgen Schmidhuber argues Google's 2017 Transformer is based on his 1991 linear model

Original post

Jürgen Schmidhuber@SchmidhuberAI

Some still don't know: Google's 2017 normalized quadratic Transformer [TR1] is based on the principles of the 1991 unnormalized linear Transformer [ULTRA]. In 1991, KEY/VALUE was called FROM/TO. ULTRA’s computational costs scale linearly in input size, rather than quadratically! The 1993 paper on a recurrent ULTRA extension [FWP2] introduced the attention terminology: learning "internal spotlights of attention" by gradient descent. See the T in ChatGPT! Details and references: https://people.idsia.ch/~juergen/1991-unnormalized-linear-transformer.html

7:28 AM · Jun 21, 2026 · 35.4K Views
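The scaling claim in the post can be illustrated with a small sketch. It assumes the general causal, unnormalized linear attention idea: each KEY/VALUE (FROM/TO) pair is added to a fast weight matrix as an outer product, and each query is answered by one matrix-vector product. It is not a reproduction of the 1991 system, and all function and variable names are hypothetical. The second function computes the same outputs by scoring every query against all earlier keys, which is the quadratic route with the softmax normalization left out.

```python
# Illustrative sketch only: names and shapes are assumptions, not taken from [ULTRA] or [TR1].

def outer_add(W, v, k):
    """In-place fast weight update: W += v k^T."""
    for i, vi in enumerate(v):
        row = W[i]
        for j, kj in enumerate(k):
            row[j] += vi * kj


def matvec(W, q):
    """Return the matrix-vector product W q."""
    return [sum(w * x for w, x in zip(row, q)) for row in W]


def linear_attention_recurrent(keys, values, queries):
    """Linear in sequence length: one fixed-size update and one read per step."""
    d_k, d_v = len(keys[0]), len(values[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weight matrix, size d_v x d_k
    outputs = []
    for k, v, q in zip(keys, values, queries):
        outer_add(W, v, k)            # store the key/value (FROM/TO) association
        outputs.append(matvec(W, q))  # retrieve with the current query
    return outputs


def linear_attention_quadratic(keys, values, queries):
    """Same outputs, computed pairwise: step t visits all t + 1 earlier positions."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(keys[s], q))  # raw dot product, no softmax
            for i, vi in enumerate(values[s]):
                out[i] += score * vi
        outputs.append(out)
    return outputs


# Toy inputs (made up for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

recurrent_out = linear_attention_recurrent(keys, values, queries)
quadratic_out = linear_attention_quadratic(keys, values, queries)
print(recurrent_out == quadratic_out)
```

Both routes agree because (sum over s of v_s k_s^T) q equals the sum over s of (k_s · q) v_s. The recurrent form does a fixed amount of work per step, while the pairwise form does work proportional to the number of positions seen so far. Putting a softmax over the scores breaks this factorization, which is why the normalized variant is usually computed pairwise.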


Jürgen Schmidhuber@SchmidhuberAI

2022 tweet on this:

Jürgen Schmidhuber@SchmidhuberAI

30 years ago: Transformers with linearized self-attention in NECO 1992, equivalent to fast weight programmers (apart from normalization), separating storage and control. Key/value was called FROM/TO. The attention terminology was introduced at ICANN 1993 https://sferics.idsia.ch/pub/juergen/fastweights.pdf

2h · 5.6K views · 20 likes · 11 bookmarks

Ethan Baron@ethan53896137

@SchmidhuberAI Bro, honestly, you've got to get yourself together. More lifting, less bitching, if you may.

3h1433

阿空(🐂, 🐂) 互关学习🫡@ResearchKONG

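@SchmidhuberAI AI history is often told as a handful of papers suddenly changing the world, but many key concepts had a lineage long before that. Looking at linear scaling, the naming of attention, and the later engineering work together gets closer to how things actually evolved.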

3h862

Secta@0xSecta

@SchmidhuberAI lineage of transformer thinking is older than we often credit

early linear scaling in attention shows primitives still define infra

3h761

tsunami_crypto@ls_brd

@SchmidhuberAI ultra mentioned quick and let linear attention fade for 25 years before transformers made it cool again

3h32

Dev Anon@genaiupstart

@SchmidhuberAI It's kind of weird how compute decides who gets remembered. In 1991 linear efficiency was necessary, and now it's frontier research again.

3h2

Nathan Quantum@AI_WarriorNQ

@SchmidhuberAI schmidhuber claiming credit for everything is the most predictable thing in ML. but the fast weight programmer framing is genuinely underappreciated. mamba2 is basically ULTRA with scalar decay