Conviction founder Sarah Guo argues AI models favor secondary framings because training data contains more summaries than primary sources

Investor alth0u contested this by referencing scholar Walter Ong

Original post

sarah guo@saranormous

1/ thinking: so many default behaviors of models are based on the failures of internet data. secondary framings like rankings, summaries, takes, etc appear far more often in training than the primary sources they compress.

9:49 PM · Jun 8, 2026 · 31.7K Views

Sentiment

Many users criticized AI models for defaulting to fluent but unoriginal summaries, a bias inherited from internet data, instead of engaging with primary sources or thinking critically.

Pos: 0.0% · Neg: 100.0% (6 comments with sentiment)

Posts from X

alth0u🧶@alth0u

no, read Walter Ong

Mario Scipio@scipiosai

@saranormous It's not just frequency that matters: it's importance.

In P(Y|X)P(X), the prior P(X) weights the evidence. The contribution of a source is determined not only by how often it appears, but by how primary it is.

It's not occurrence. It's occurrence multiplied by weight.

sarah guo@saranormous

2/ so they’re the low-effort, “feels fluent” default, and the model emits them from memory instead of reading the source, unless user/agent forces it to retrieve. they don’t want to think critically! in novel ways! it is hard!

sarah guo@saranormous

3/ folks broadly complain about models being “nerfed” but interesting to think about specific bad behaviors in models that I expect might get worse as companies “compute/cost-optimize” their general products

Tianyi Zhang@mycharmspace

@saranormous Making a model reason from first principles is hard and requires more time to think.

sarah guo@saranormous

4/ unclear if it’s just the upstream pretraining data or the actual reward today of faster time-to-answer/fewer external calls

Vimal Singh, AI Leader@VimalAITech

@saranormous A lot of it has to do with post-training. Teams do extensive post-training to ensure the models remain useful. As the majority of LLMs started as text generators, the text operation is a primary behaviour deeply embedded in them.

Alien@alienorg

@saranormous the corpus is mostly compression of things it never ingested directly, so the model is fluent in summaries of work it never read

Alan Sass@alansass

what do you recommend as a solution? would a ground truth or similar audit/research agent fill the gaps now or in the future? the speed of agent responses seems to go against the above but imho it feels to me that we need to delay the response times from agents to allow for more verification.

Crepe Supreme@crepesupreme

@saranormous The reward signal compounds the pretraining problem. RLHF at consumer scale rewards fast and confident. So models learn to produce takes from pretraining, then learn to produce them quickly from fine-tuning. Both levers have to move.

Andy Beard@AndyBeard

The same happens with chat memory. You can have a bunch of documents discussing various aspects of a topic. The LLM retrieving from Google Drive will just grab a selection, not even guaranteed the most recent.

Sometimes it is enough, but often not, but you would only realise if you have intimate knowledge of the available documents

Sanjay Kumar@sky_bolt20907

@saranormous Yeah, you see the same robotic behaviour across models. Often the output is so polished that anyone can tell it's AI.

Waqas@vixsheikh

@saranormous It feels like an “average of averages” problem of sorts

Jake Kaldenbaugh@Jakewk

@saranormous It’s reddit’s fault.

vanya@damnventures

@saranormous likely both

and cost-optimization makes runtime grounding more valuable, not less: forcing the model to read a primary source it's been given is cheaper than retraining out the prior or rebalancing the time-to-answer reward

Kevihaiceth 💹🧲@Kevihaiceth

@saranormous models are basically just echoing our mess.

Jason V. Stewart@JVStewartBuilds

@saranormous Is the primary layer actually richer or just less compressed?

J. M.@jmjjohnson

@saranormous Critical thinking is the way forward and LLMs are not capable of it.

deep Manifold@BetaTomorrow

@VimalAITech @saranormous Pretraining is premature

Kekko D’Amato@kekkodamato_

This explains why models often give 'what people say about X' instead of 'what X actually is'. Training signal is dominated by meta-commentary, not the thing itself. Like training on restaurant reviews instead of recipes — you get a model that knows what's popular, not what's good.
