Schmidhuber Claims 1991 ULTRA Performed Test Time Training · Digg


Original post

Jürgen Schmidhuber (@SchmidhuberAI)

Everybody is talking about test time training (TTT). Of course, the 1991 Unnormalized Linear Transformer (ULTRA) did TTT.

Jürgen Schmidhuber (@SchmidhuberAI)

Some still don't know: Google's 2017 normalized quadratic Transformer [TR1] is based on the principles of the 1991 unnormalized linear Transformer [ULTRA]. In 1991, KEY/VALUE was called FROM/TO. ULTRA’s computational costs scale linearly in input size, rather than quadratically! The 1993 paper on a recurrent ULTRA extension [FWP2] introduced the attention terminology: learning "internal spotlights of attention" by gradient descent. See the T in ChatGPT! Details and references: https://people.idsia.ch/~juergen/1991-unnormalized-linear-transformer.html

8:31 AM · Jun 24, 2026 · 280 Views
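The linear-versus-quadratic cost claim in the quoted post can be illustrated with a small sketch. It assumes the commonly cited formulation of unnormalized linear attention, in which each KEY/VALUE (FROM/TO) pair is added as an outer product to a fast weight matrix that is then applied to the current query. The function names and toy vectors are hypothetical and not taken from the post or the linked page. The fast weight matrix changes while the input is being processed, which appears to be the sense in which the post connects it to test time training.

```python
def fast_weight_outputs(keys, values, queries):
    """Linear-cost form: keep one running matrix W = sum over steps of value * key^T.

    Work per step depends only on the vector sizes, not on how many
    steps came before, so total cost grows linearly with sequence length.
    """
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Additive outer-product update of the fast weights.
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
        # Apply the current fast weights to the query.
        outputs.append([sum(w * x for w, x in zip(row, q)) for row in W])
    return outputs


def quadratic_outputs(keys, values, queries):
    """Quadratic-cost form: causal attention without softmax normalization.

    Each step revisits every earlier key, so total cost grows
    quadratically with sequence length.
    """
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(keys[s], q))
            for i in range(len(y)):
                y[i] += score * values[s][i]
        outputs.append(y)
    return outputs


# Toy data (hypothetical): three steps, 2-dimensional keys, 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0], [3.0], [5.0]]
queries = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

linear_result = fast_weight_outputs(keys, values, queries)
quadratic_result = quadratic_outputs(keys, values, queries)
print(linear_result == quadratic_result)
```

Both functions compute the same outputs on this toy input. The difference is only in how the work is organized: the first stores a fixed-size matrix, while the second stores and rescans all earlier keys and values.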


Related links

1991: the unnormalized linear Transformer

Via SchmidhuberAI

Posts from X

Views: 2.1K · Bookmarks: 4 · Likes: 20 · Retweets: 3

Jürgen Schmidhuber (@SchmidhuberAI)

PS: everybody is talking about test time training (TTT). Of course, the 1991 Unnormalized Linear Transformer ULTRA did TTT.
