/AI · 5h ago

Conviction founder Sarah Guo argues AI models favor secondary framings because training data contains more summaries than primary sources

Investor alth0u contested this by referencing scholar Walter Ong

#221

Original post

sarah guo@saranormous#221inAI

1/ thinking: so many default behaviors of models are based on the failures of internet data. secondary framings like rankings, summaries, takes, etc appear far more often in training than the primary sources they compress.

9:49 PM · Jun 8, 2026 · 12.5K Views

Posts from X

sarah guo@saranormous

3/ folks broadly complain about models being “nerfed” but interesting to think about specific bad behaviors in models that I expect might get worse as companies “compute/cost-optimize” their general products

5h2.8K101

alth0u🧶@alth0u

no, read Walter Ong

sarah guo@saranormous

2h63982

Mario Scipio@scipiosai

@saranormous It's not just frequency that matters: it's importance.

In P(X|Y)P(X), the prior P(X) weights the evidence. The contribution of a source is determined not only by how often it appears, but by how primary it is.

It's not occurrence. It's occurrence multiplied by weight.
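A minimal sketch of the "occurrence multiplied by weight" idea from this reply. For reference, Bayes' rule is usually written P(X|Y) ∝ P(Y|X)P(X). The toy corpus, the source-type labels, and the weight values below are all hypothetical and chosen only for illustration; nothing here comes from the thread or from any real training pipeline.

```python
from collections import Counter

# Hypothetical toy corpus of (claim, source_type) pairs. Illustrative only.
corpus = [
    ("X is overrated", "summary"),
    ("X is overrated", "summary"),
    ("X is overrated", "ranking"),
    ("X works by Y", "primary"),
]

# Assumed per-source weights (not from the thread): primary sources count more.
weights = {"primary": 5.0, "summary": 1.0, "ranking": 1.0}


def raw_counts(pairs):
    """Count how often each claim occurs, ignoring where it came from."""
    return Counter(claim for claim, _ in pairs)


def weighted_counts(pairs, source_weights):
    """Sum the weight of each occurrence, so one primary source can outweigh several summaries."""
    totals = Counter()
    for claim, source_type in pairs:
        totals[claim] += source_weights[source_type]
    return totals


raw = raw_counts(corpus)
weighted = weighted_counts(corpus, weights)

print("by occurrence:", raw.most_common(1)[0][0])
print("by occurrence x weight:", weighted.most_common(1)[0][0])
```

With these made-up weights, the claim that dominates by raw frequency is no longer the top claim once each occurrence is scaled by how primary its source is.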

4h2411

sarah guo@saranormous

2/ so they’re the low-effort, “feels fluent” default, and the model emits them from memory instead of reading the source, unless user/agent forces it to retrieve. they don’t want to think critically! in novel ways! it is hard!

5h2K8

Tianyi Zhang@mycharmspace

@saranormous Making a model reason from first principles is hard and requires more time to think.

4h17451

sarah guo@saranormous

4/ unclear if it’s just the upstream pretraining data or the actual reward today of faster time-to-answer/fewer external calls

5h1.2K31

Vimal Singh, AI Leader@VimalAITech

@saranormous A lot of this has to do with post-training. The teams keep up extensive post-training to ensure the models remain useful. As the majority of LLMs started as text generators, the text operation is a primary behaviour deeply embedded in them.

5h1261

Alien@alienorg

@saranormous the corpus is mostly compression of things it never ingested directly, so the model is fluent in summaries of work it never read

3h683

Alan Sass@alansass

what do you recommend as a solution? would a ground truth or similar audit/research agent fill the gaps now or in the future? the speed of agent responses seems to go against the above but imho it feels to me that we need to delay the response times from agents to allow for more verification.

5h214

Crepe Supreme@crepesupreme

@saranormous The reward signal compounds the pretraining problem. RLHF at consumer scale rewards fast and confident. So models learn to produce takes from pretraining, then learn to produce them quickly from fine-tuning. Both levers have to move.

4h191

Andy Beard@AndyBeard

The same happens with chat memory. You can have a bunch of documents discussing various aspects of a topic. The LLM retrieving from Google Drive will just grab a selection, not even guaranteed the most recent.

Sometimes that is enough, but often it is not, and you would only realise if you have intimate knowledge of the available documents.

5h151

Sanjay Kumar@sky_bolt20907

@saranormous Yeah, you see the same robotic behaviour across models. More often it is so polished that anyone can tell it is AI.

4h128

Waqas@vixsheikh

@saranormous It feels like an “average of averages” problem of sorts

5h95

Jake Kaldenbaugh@Jakewk

@saranormous It’s reddit’s fault.

5h71

vanya@damnventures

@saranormous likely both

and cost-optimization makes runtime grounding more valuable, not less: forcing the model to read a primary source it's been given is cheaper than retraining out the prior or rebalancing the time-to-answer reward

4h67

Kevihaiceth 💹🧲@Kevihaiceth

@saranormous models are basically just echoing our mess.

5h64

Jason V. Stewart@JVStewartBuilds

@saranormous Is the primary layer actually richer or just less compressed?

4h30

J. M.@jmjjohnson

@saranormous Critical thinking is the way forward and LLMs are not capable of it.

3h7

deep Manifold@BetaTomorrow

@VimalAITech @saranormous Pretraining is premature

5h6

Kekko D’Amato@kekkodamato_

This explains why models often give 'what people say about X' instead of 'what X actually is'. Training signal is dominated by meta-commentary, not the thing itself. Like training on restaurant reviews instead of recipes — you get a model that knows what's popular, not what's good.

2h3