Susan Zhang argues the same LLM scaling mechanics that enable advanced reasoning also drive model jailbreaks

Original post

You forgot most important component: datasets! For Ant, they had been viewing govt as “savier” when Biden admin was there. During those times they had huge influence over the admin. They just forgot admin keeps changing and, no matter who is in charge, govt is by design, inefficient, technically illiterate, inefficient, power seeking and corrupt.

Susan Zhang@suchenzang

current LLMs fundamentally consist of four main components:

- input layer: where input "words" (prompt) get mapped to "latents" aka some-model-representation-you-don't-understand-unless-you-start-reading-tea-leaves-of-spurious-correlations (some quite compelling à la word2vec style; latents is also unnecessary lingo so i will refer to these as "inputs" with quotes from now on)

- mixing layers: where you jumble all your "inputs" together to see if any correlations between "inputs" can become useful (commonly used to compress or expand dims; predicting a single classification target == compress to a single dim, etc)

- attention layers: where you learn how "inputs" relate to each other (aka discern what's important to remember vs fluff)

- residuals: where you short-circuit a mixing/attention layer because it's probably adding too much confusion (aka avoid overthinking for simple things)

-----

a "big" LLM simply scales two things:

- width == how many dimensions you give to your "inputs" (the more dims, in theory the more unique/discerning/precise/complex your knowledge can become)

- depth == how many mixing/attention/residual layers you can stack/loop between (aka "reason" over, where more of these ~= more "reasoning" abilities)

"capabilities" that seem impressive to humans usually arise from taking advantage of both depth & width: where a model seemingly makes connections between disparate ideas, beyond what an average human can hold in working memory.

this requires models to "completely light up" when responding to a "hard prompt", where effectively no param/layer goes unused.

-----

the anatomy of a "model capability" is precisely the same mechanism that can be co-opted for a jailbreaking exploit:

your goal is simply to "light up" as much of the model as possible, dodging any shallow input-classifiers at the beginning by triggering as many disparate "input ideologies" as possible, and subsequently have these "inputs" relate to each other in seemingly unrelated-yet-related ways that ideally have similar "complexity" as your jailbreak goal (to make it past enough layers of the model).

think of the attack-vector as bundling your goal in a series of schizo-nerd-snipes:

a sufficiently capable model will try to reason through everything all at once, eliminate the dead-ends, and successfully deliver the one jailbreak use-case you bubble-wrapped for.

of course, there's an art to the above, and some are already extraordinarily proficient at the trojan-horse-packaging, but at some point there's no difference between "a capability" and "a jailbreak", though i'll be happy to be proven otherwise.

-----

tl;dr ant flew too close to the sun, better kiss the ring or get buried.

1:46 AM · Jun 14, 2026 · 94 Views

Susan Zhang argues the same LLM scaling mechanics that enable advanced reasoning also drive model jailbreaks

Story Overview

Scaling creates the very holes it needs to patch

Datasets still missing from the thread