
Conviction founder Sarah Guo argues AI models favor secondary framings because training data contains more summaries than primary sources

Investor alth0u contested this by referencing scholar Walter Ong

Original post
sarah guo @saranormous · #222 in Tech

1/ thinking: so many default behaviors of models are based on the failures of internet data. secondary framings like rankings, summaries, takes, etc appear far more often in training than the primary sources they compress.

9:49 PM · Jun 8, 2026 · 31.7K Views
Sentiment

Many users criticized AI models for defaulting to fluent but unoriginal summaries, a bias inherited from internet data, instead of engaging with primary sources or thinking critically.

Pos 0.0% · Neg 100.0% (6 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
alth0u🧶 @alth0u

no, read Walter Ong

1d · Views 5.2K · Likes 38 · Bookmarks 10
Retweets: 1
Mario Scipio @scipiosai

@saranormous It's not just frequency that matters: it's importance.

In Bayes' rule, P(X|Y) ∝ P(Y|X)P(X), the prior P(X) weights the evidence. The contribution of a source is determined not only by how often it appears, but by how primary it is.

It's not occurrence. It's occurrence multiplied by weight.

1d · Views 24 · Likes 1 · Bookmarks 1
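
The "occurrence multiplied by weight" point can be illustrated with a toy calculation. This is only a minimal sketch: the `Source` type, the `contribution_shares` helper, and every count and weight below are hypothetical values chosen for illustration, not figures from the thread or from any real training corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    kind: str         # hypothetical label: "primary" or "secondary"
    occurrences: int  # how often this kind of text appears (made-up number)
    weight: float     # assumed importance per occurrence (made-up number)


def contribution_shares(sources):
    """Return each kind's share of the total occurrences * weight."""
    totals = {}
    for s in sources:
        totals[s.kind] = totals.get(s.kind, 0.0) + s.occurrences * s.weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total weighted contribution is zero")
    return {kind: value / grand_total for kind, value in totals.items()}


# Hypothetical corpus: one primary source, nine secondary takes on it.
corpus = [
    Source("primary", occurrences=1, weight=9.0),
    Source("secondary", occurrences=9, weight=1.0),
]

# Counting occurrences only (every weight forced to 1.0).
by_count = contribution_shares(
    [Source(s.kind, s.occurrences, 1.0) for s in corpus]
)

# Counting occurrences multiplied by the assumed weights.
by_weight = contribution_shares(corpus)

print(by_count)
print(by_weight)
```

Under these assumed numbers, secondary material dominates when only occurrences are counted, while the weighted shares come out equal. Whether training actually applies anything like such a weight is exactly what the thread is debating.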
Replies: 2
sarah guo @saranormous

2/ so they’re the low-effort, “feels fluent” default, and the model emits them from memory instead of reading the source, unless user/agent forces it to retrieve. they don’t want to think critically! in novel ways! it is hard!

1d · Views 2K · Likes 8
sarah guo @saranormous

3/ folks broadly complain about models being “nerfed” but interesting to think about specific bad behaviors in models that I expect might get worse as companies “compute/cost-optimize” their general products

1d · Views 2.8K · Likes 10 · Bookmarks 1
Tianyi Zhang @mycharmspace

@saranormous Making a model reason from first principles is hard and requires more time to think.

1d · Views 174 · Likes 5 · Bookmarks 1
sarah guo @saranormous

4/ unclear if it’s just the upstream pretraining data or the actual reward today for faster time-to-answer and fewer external calls

1d · Views 1.2K · Likes 3 · Bookmarks 1

@saranormous A lot of it has to do with post-training. The teams run extensive post-training to ensure the models remain useful. As the majority of LLMs started as text generators, the text operation is a primary behaviour deeply embedded in them.

1d · Views 126 · Likes 1
Alien @alienorg

@saranormous the corpus is mostly compression of things it never ingested directly, so the model is fluent in summaries of work it never read

1d · Views 68 · Likes 3
Alan Sass @alansass

What do you recommend as a solution? Would a ground-truth or similar audit/research agent fill the gaps now or in the future? The speed of agent responses seems to work against the above, but imho we need to delay agents' response times to allow for more verification.

1d · Views 214
Crepe Supreme @crepesupreme

@saranormous The reward signal compounds the pretraining problem. RLHF at consumer scale rewards fast and confident. So models learn to produce takes from pretraining, then learn to produce them quickly from fine-tuning. Both levers have to move.

1d · Views 191
Andy Beard @AndyBeard

The same happens with chat memory. You can have a bunch of documents discussing various aspects of a topic. The LLM retrieving from Google Drive will just grab a selection, not even guaranteed to be the most recent.

Sometimes that is enough, but often it is not, and you would only realise this if you have intimate knowledge of the available documents.

1d · Views 151
Sanjay Kumar @sky_bolt20907

@saranormous Yeah, you see the same robotic behaviour across models. More often than not it's so polished that anyone can tell it's AI.

1d · Views 128
Waqas @vixsheikh

@saranormous It feels like an “average of averages” problem of sorts

1d · Views 95
vanya @damnventures

@saranormous likely both

and cost-optimization makes runtime grounding more valuable, not less: forcing the model to read a primary source it's been given is cheaper than retraining out the prior or rebalancing the time-to-answer reward

1d · Views 67
Jason V. Stewart @JVStewartBuilds

@saranormous Is the primary layer actually richer or just less compressed?

1d · Views 30
J. M. @jmjjohnson

@saranormous Critical thinking is the way forward and LLMs are not capable of it.

1d · Views 7
deep Manifold @BetaTomorrow

@VimalAITech @saranormous Pretraining is premature

1d · Views 6
Kekko D’Amato @kekkodamato_

This explains why models often give 'what people say about X' instead of 'what X actually is'. Training signal is dominated by meta-commentary, not the thing itself. Like training on restaurant reviews instead of recipes: you get a model that knows what's popular, not what's good.

1d · Views 3