Conviction founder Sarah Guo argues AI models favor secondary framings because training data contains more summaries than primary sources
Investor alth0u contested this by referencing scholar Walter Ong

3/ folks broadly complain about models being “nerfed” but interesting to think about specific bad behaviors in models that I expect might get worse as companies “compute/cost-optimize” their general products
no, read Walter Ong
1/ thinking: so many default behaviors of models are based on the failures of internet data. secondary framings like rankings, summaries, takes, etc appear far more often in training than the primary sources they compress.

@saranormous It's not just frequency that matters: it's importance.
In Bayes' rule, P(X|Y) ∝ P(Y|X)P(X), the prior P(X) weights the evidence. The contribution of a source is determined not only by how often it appears, but by how primary it is.
It's not occurrence. It's occurrence multiplied by weight.
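
The "occurrence multiplied by weight" point in the reply above can be made concrete with a small sketch. The source labels, counts, and weights here are hypothetical and chosen only for illustration; the thread does not give any numbers or say how such weights would be set.

```python
# Hypothetical counts: how often each kind of source appears in a corpus.
occurrences = {"primary": 1, "summary": 20, "ranking": 10}

# Hypothetical per-appearance weights (an assumption, not from the thread):
# a primary source counts for more each time it appears.
weights = {"primary": 50.0, "summary": 1.0, "ranking": 0.5}


def normalize(scores):
    """Scale scores so they sum to 1, giving each source's share."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


def share_by_occurrence(occurrences):
    """Share of influence if only raw frequency matters."""
    return normalize(occurrences)


def share_by_weighted_occurrence(occurrences, weights):
    """Share of influence if each appearance is multiplied by its weight."""
    weighted = {name: occurrences[name] * weights[name] for name in occurrences}
    return normalize(weighted)


by_occurrence = share_by_occurrence(occurrences)
by_weighted = share_by_weighted_occurrence(occurrences, weights)

# Under raw frequency the summaries dominate; with these assumed weights
# the single primary source dominates instead.
print(max(by_occurrence, key=by_occurrence.get))
print(max(by_weighted, key=by_weighted.get))
```

With these made-up numbers the ranking flips, which is the whole of the reply's claim: frequency alone and frequency times weight can disagree about which source matters most.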

2/ so they’re the low-effort, “feels fluent” default, and the model emits them from memory instead of reading the source, unless user/agent forces it to retrieve. they don’t want to think critically! in novel ways! it is hard!

@saranormous Making a model reason from first principles is hard and requires more time to think.

4/ unclear if it’s just the upstream pretraining data or the actual reward today for faster time-to-answer/fewer external calls

@saranormous A lot of this has to do with post-training. Teams keep doing extensive post-training to ensure the models remain useful. Since the majority of LLMs started as text generators, text operation is a primary behaviour deeply embedded in them.

@saranormous the corpus is mostly compression of things it never ingested directly, so the model is fluent in summaries of work it never read

what do you recommend as a solution? would a ground truth or similar audit/research agent fill the gaps now or in the future? the speed of agent responses seems to go against the above but imho it feels to me that we need to delay the response times from agents to allow for more verification.

@saranormous The reward signal compounds the pretraining problem. RLHF at consumer scale rewards fast and confident. So models learn to produce takes from pretraining, then learn to produce them quickly from fine-tuning. Both levers have to move.

The same happens with chat memory. You can have a bunch of documents discussing various aspects of a topic. The LLM retrieving from Google Drive will just grab a selection, not even guaranteed the most recent.
Sometimes that is enough, but often it is not, and you would only realise it if you have intimate knowledge of the available documents.

@saranormous Yeah, you see the same robotic behaviour across models. More often than not the output is so polished that anyone can tell it's AI.

@saranormous It feels like an “average of averages” problem of sorts

@saranormous It’s reddit’s fault.

@saranormous likely both
and cost-optimization makes runtime grounding more valuable, not less: forcing the model to read a primary source it's been given is cheaper than retraining out the prior or rebalancing the time-to-answer reward

@saranormous models are basically just echoing our mess.

@saranormous Is the primary layer actually richer or just less compressed?

@saranormous Critical thinking is the way forward and LLMs are not capable of it.

@VimalAITech @saranormous Pretraining is premature

This explains why models often give 'what people say about X' instead of 'what X actually is'. Training signal is dominated by meta-commentary, not the thing itself. Like training on restaurant reviews instead of recipes — you get a model that knows what's popular, not what's good.