Yoav Goldberg and Susan Zhang argue that low-level transformer mechanics do not explain practical LLM behavior · Digg

Tech · 8h ago

Architectural knowledge fails to predict empirical model outputs.

Original post

alz@alz_zyd_

the rather funny thing about LLMs is that knowing how transformers work teaches you literally nothing about why LLMs do what they do, or how to use them better

Antonio Lupetti@antoniolupetti

"Transformers" by Daniel Jurafsky and James H. Martin is one of the clearest and most mathematically grounded introductions to the Transformer architecture I have ever read.

Chapter 8 introduces the Transformer as the standard architecture behind modern large language models. What makes this chapter particularly interesting is its step-by-step presentation of the underlying mechanisms: contextual embeddings, self-attention, query, key and value vectors, scaled dot-product attention, multi-head attention, residual streams, feedforward layers, layer normalization, masking, and the parallel matrix formulation of attention.

In particular, the treatment of attention as a weighted sum of contextual representations is especially valuable. The chapter first develops an intuitive, simplified view of attention and then gradually derives the full formulation using the Q, K, and V matrices. This approach makes it easier to understand what is actually happening inside the architecture from an algebraic and matrix-based perspective, rather than simply viewing the usual block diagrams.

I think it is an excellent resource for anyone interested in understanding how Transformers work from linguistic, mathematical, and computational perspectives.

https://web.stanford.edu/~jurafsky/slp3/8.pdf

7:55 PM · Jun 18, 2026 · 43.3K Views
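
The quoted post describes attention as a weighted sum of contextual representations computed from query, key, and value vectors, with scaling and masking. A minimal sketch of that idea in plain Python follows. It assumes the Q, K, and V rows are already given (in the chapter's formulation they come from multiplying the inputs by learned weight matrices), and the function names and toy values are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V are lists of equal length; row i belongs to position i.
    Position i may only attend to positions 0..i.
    Returns (outputs, weights), one row per position.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Score the query against every key it is allowed to see.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the weighted sum of the visible value vectors.
        out = [
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs (hypothetical): three positions, two dimensions each.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(Q, K, V)
for i, (w, o) in enumerate(zip(weights, outputs)):
    print(i, w, o)
```

The loop form is slower than the parallel matrix formulation the chapter derives, but it makes the weighted sum and the mask explicit: the first position can only attend to itself, so its output is simply its own value vector.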

Sentiment

Positive commenters argue that transformer knowledge aids prompting and corrects misconceptions about LLMs, while some negative commenters contend that it provides little practical value.

Positive: 66.7% · Negative: 33.3%

6 comments with sentiment.

Posts from X

Most Activity

Views: 2.2K · Replies: 2

Alex Imas@alexolegimas

@alz_zyd_ But neuroscience actually does!

15h · 2.2K views

Bookmarks: 5 · Retweets: 1

Bytes and United@BytesAndUnited

@alz_zyd_ "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."

8h

Likes: 29

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget

@alz_zyd_ It can help dismiss beliefs people have about LLMs, like they are sentient, always on, always learning, thinking on their own when not used, etc.

That the emerging behavior is not well understood is true, but that's a different story.

9h · 1.3K views

Susan Zhang@suchenzang

poasting in pairs

alz@alz_zyd_

the rather funny thing about LLMs is that knowing how transformers work teaches you literally nothing about why LLMs do what they do, or how to use them better

1h · 1.3K views

Susan Zhang@suchenzang

@alz_zyd_ straight through estimator through the heart

i'll be here all night

11h · 1.5K views

yash@yashetal

@alz_zyd_ That's why we have mech interpretability

14h

Agustin Lebron@AgustinLebron3

@alz_zyd_ Also, knowing how an internal combustion engine works doesn't teach you what a car does, or how to drive.

2h

Dersu@tak3sh8

@alz_zyd_ About 15 years ago, I remember thinking that understanding the details of how to fit a logistic regression efficiently (i.e., Newton's method, etc.) would help me do well at Kaggle competitions.

6h

Joy Buchanan@aboutJoy

@alz_zyd_ “Intelligence is an emergent property of matter”

15h

Benny Husted@BennyHusted

@alexolegimas @alz_zyd_ How

14h

Daniel Samanez@DanielSamanez3

@alexolegimas @alz_zyd_ Hmmm

Physics more than neuroscience

Neuroscience cannot fully explain intelligence or cognition

12h

chocolatecat@chocolatecat020

@alz_zyd_ Depends on the insight of the person

7h

alz@alz_zyd_

@ipvkyte which teaches you literally nothing about why LLMs do what they do, or how to use them better

10h

Ben Ogorek@BenOgorek

@alz_zyd_ Yeah it's basically a "How Things Work" episode on PBS.

Like, "oh, that's how they make chalk!"

3h

Ritalin@Ritalin1275553

@JFPuget @alz_zyd_ you don't need to know how transformers work for any of that. The point is the architecture was around for years before anyone realized it could be scaled up into imitating thinking. so the low-level mechanism of how it works isn't really relevant to the emergent properties

7h

MS@mattstern2

@alz_zyd_ Disagree. The basics help with prompting strategies.

7h

yon@yonfreecss

Probably because these are interfaces that tap into and scale basal behavioral competencies derived from the way information is structured in our universe. Michael Levin talks about how they were able to identify competencies present in simple sorting algorithms; this is probably an extrapolation of the same concept.

2h

Łukasz Stafiniak@lukstafi

@JFPuget @alz_zyd_ What do you mean by "sentient"? Do you mean "with mental states", or "phenomenally conscious", or "self aware", or "with valenced phenomenal experiences", or "with evaluative states"? I hope you see "Transformers, therefore no evaluative states" is false.

3h

xgda@xgda4

@alz_zyd_ I think it does

2h

Henryk Abram@HenrykAbram

It depends on how you look at it. Knowledge of the architecture helps me greatly in managing the quality of the results and optimizing the prompting process itself. The meaning of the first three tokens is a direct consequence of these architectural mechanisms, as is why running the model can yield information that isn't present in the training data, and how to manage creativity and hallucination, which are two sides of the same coin. Plus, as a bonus, you get a thorough understanding of the "magic" behind how the models work, and that it's not just statistics :)

2h