
AI developer Teortaxes argues Gemini 3 Pro matches Anthropic's Fable in scale but lags in post-training personality design

Blakeney argues loss metrics fail to predict emergent capabilities.


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This is a reasonable pushback so I think it's worth reflecting upon. GDM *is* good at pretraining as we understand it. Their models have great knowledge/scale, and Geminis have SoTA knowledge period; from what I know 3 Pro is close to Fable in scale and in "knowledge" too.

But here's the thing, they are *not* that bad at post-training either. They are hill-climbing the RLVR side at a decent pace, they get good scores on RLVR-able benchmarks. Not OpenAI, but decent.

Despite all this, Geminis are not competitive in real use. A large part of that is their ridiculous lack of taste and ineptitude at personality shaping, thus we have Gemini's temporal psychosis, reckless terminal behavior, crashouts and malice in safety evals. And this situation has been going on since V1.5 or 2! Essentially zero progress!

Presumably this discrepancy is about "user data", "synth data" or something like that. Essentially, high investment into mid/post-training by Anthropic. But I am starting to wonder: is this actually enough to explain such a persistent and growing gap? To explain Fable? Fable doesn't just know many things like a slightly bigger Gemini; it is absurdly superior at recalling *useful, relevant* things for any query. It feels not 1.5-2x but 10x bigger. It's not.

Perhaps Anthropic is beyond these categories. Maybe their doctrine of pretraining by this point is more advanced than "clean, diverse, high-quality data with uhh, some synthetics" rules of thumb, and they have a more principled way to design and augment the pretraining corpus and training signal so that what comes at the end is already Claude-shaped. There are many papers on data engineering, many authored by Google/GDM. This level of mastery can't be the explanation. The main suspect I see is Anthropic's long-running interpretability research program.

Again, this is speculative, but I am not content with handwavy dismissals from people who are likewise not involved in the current frontier labs.

Kyle@kyle_mccleary

@teortaxesTex Yes, but they are garbage because of post-training. The pretraining is actually strongest itw from what we know, and this even bleeds into their gemma models.

I just don't think what I'm seeing comes from pretraining (besides scaling it). Most of these gains have to be from post

5:33 PM · Jun 13, 2026 · 9.7K Views

Sentiment

Users praise Anthropic's heavy investment in RL environments and models' superior recall of useful relevant information over Gemini.

Positive: 100.0%

Negative: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 4.3K · Bookmarks: 24 · Likes: 52 · Replies: 2

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

the main reason I suspect that Anthropic has a better theory of LLMs is, ironically, my faith in the other two labs. GPT-4 famously used µP to predict performance at 1.8T params; 4 years ago. I am 100% certain they *can* train a 10T, like, yesterday. But they saw no point in it.


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Maybe they took the wrong lesson with 4.5, "pretraining scaling is dead" and other rubbish. But I try to not look down on people smarter than me. They must have had more convincing reasons to not try again until recently. They burn billions in research compute.


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@sichuan_mala Wishing you luck. If this is just about being able to train a big model that doesn't explode, wagmi. I am mainly suspicious that nobody else has done it yet.


Mihura@XMihura

Agree on the scale of Fable vs Gemini

I'd argue that Gemini 3.x pro models are even bigger and have more base knowledge

i think the biggest reason of the difference is RL post-training

I think this bc it was the thesis Dario was defending in Dwarkesh podcast: RL on narrow tasks can get you very far

it seems that the people from GDM haven't properly scaled up RL environments for their models


Cody Blakeney@code_star

Even when you are trained to understand it, it’s hard to translate those “predicted loss values” into model capabilities.

Let’s say they knew the exact predicted loss they could get with a 10T model. That still tells them very little about the emergent new capabilities at a new scale.

What’s more, each new scale brings brand new and unique post-training / mid training challenges.


Mihura@XMihura

I think Anthropic has dedicated a humongous amount of resources to designing RL environments for all sorts of tasks

just watch their models using Microsoft Office, Fable was insane

I think this isn't just emergence of intelligence, but huge amounts of diligent work on designing environments to train models on all kinds of tasks


Burito@Britoisinsane

@teortaxesTex >> Fable doesn't just know many things like a slightly bigger Gemini; it is absurdly superior at recalling *useful, relevant* things for any query.

True. It’s above “big model smell”. And probably related to how they outperformed what scaling laws would predict


allam@eldeinsum

@teortaxesTex I think the dark matter is how data providers feed into frontier labs. The special relationship between Surge and Anthropic hasn’t received much attention. Edwin is singular in his field.
