This is a reasonable pushback, so I think it's worth reflecting upon. GDM *is* good at pretraining as we understand it. Their models have great knowledge/scale, and Geminis have SoTA knowledge, period; from what I know 3 Pro is close to Fable in scale and in "knowledge" too. But here's the thing, they are *not* that bad at post-training either. They are hill-climbing the RLVR side at a decent pace, they get good scores on RLVR-able benchmarks. Not OpenAI, but decent. Despite all this, Geminis are not competitive in real use. A large part of that is their ridiculous lack of taste and ineptitude at personality shaping, thus we have Gemini's temporal psychosis, reckless terminal behavior, crashouts and malice in safety evals. And this situation has been going on since V1.5 or 2! Essentially zero progress! Presumably this discrepancy is about "user data", "synth data" or something like that. Essentially, high investment into mid/post-training by Anthropic. But I am starting to wonder: is this actually enough to explain such a persistent and growing gap? To explain Fable? Fable doesn't just know many things like a slightly bigger Gemini; it is absurdly superior at recalling *useful, relevant* things for any query. It feels not 1.5-2x but 10x bigger. It's not. Perhaps Anthropic is beyond these categories. Maybe their doctrine of pretraining by this point is more advanced than "clean, diverse, high-quality data with uhh, some synthetics" rules of thumb, and they have a more principled way to design and augment the pretraining corpus and training signal so that what comes at the end is already Claude-shaped. There are many papers on data engineering, many authored by Google/GDM. This level of mastery can't be the explanation. The main suspect I see is Anthropic's long-running interpretability research program.
Again, this is speculative, but I am not content with handwavy dismissals from people who are likewise not involved in the current frontier labs.
@teortaxesTex Yes, but they are garbage because of post-training. The pretraining is actually strongest itw from what we know, and this even bleeds into their Gemma models.
I just don't think what I'm seeing comes from pretraining (besides scaling it). Most of these gains have to be from post



