Neither Grok nor Gemini have ever had the worlds most power model. They held a slight lead in useless benchmarks.
If you ever chose either as daily drivers for serious dev work, I do not trust your judgement at all.
T3 Stack creator Theo Browne is pushing back on any narrative that Grok or Gemini models ever sat at the absolute top of AI capability, framing their reported leads as narrow wins on benchmarks that fail to reflect day-to-day development demands.
Gemini 2.5 Pro did reach the top of the LMSYS Chatbot Arena (now LMArena) leaderboard for a stretch after its March 2025 release, yet how long that lead lasted and whether it translated to broader superiority remain unclear from available records.
Theo argues that choosing models based on those leaderboard spikes signals the wrong priorities, since performance in actual software projects has not tracked those brief ranking leads.
Replies were split: many users agreed with Theo's claim that Grok and Gemini never had the most powerful models, some of them calling the models unusable, while others defended Gemini or asked for the metrics behind his assessment.

@ChickenSamosaa You should read the tweet again, here I grabbed it for you.

@theo this is more accurate tbh

@theo grok is extremely lobotomized all the time and gemini has a Jason Bourne style identity crisis every time you ask it what time it is

@themmyleke Both Gemini and Grok are massively overpriced and their subsidization was never as good as Codex or Claude Code.

@igetrugd Agreed
@theo Gemini 2.5 pro was a leader for a brief period of time

@dragosroua Going to be blunt because you should hear this: this is enough information for me to know I would never hire you

@theo You are speaking from a place of wealth lol.
3.1 pro was good enough as an implementation tool. I put it and codex in a letagents room, got codex to review and draw up plans and had Gemini write the code. That way I didn’t burn out limits.
But 3.5 is unusable I agree

METR evals, DeepSWE, any actual contributions to real projects.
I seriously do not understand how anyone can use Gemini for dev work unless they are getting it for free. It’s not like “oh it’s 5% worse”, it is literally unusable, constantly looping on nonsense and never making working code.

@ChickenSamosaa “…for serious dev work”
I know reading the whole sentence is hard but you really should try it out some time

@dragosroua I’m sorry but “all models are head to head now” is actually the dumbest thing I’ve seen anyone say on this app in years

@theo Gemini is quite ok. Used it for a very complex localization feature and it executed well.
Claude and ChatGPT have good UI, VERY good marketing and improved harnesses, but purely from inference perspective all models are head to head now.

@theo
trust

@konopka_tg Sure, but everything that makes it “most powerful” requires tool calls, which only 2 labs are good at doing for long tasks.

@theo You surely have some metrics for this that you can share. Something more than “I just think it is like this, period”. If you just think it is like that, it’s perfectly valid, can’t shit on people taste. But if you have metrics, I’m ready to look into them.

@theo Not everyone's a troll dude. Relax.

@theo I had a Gemini Ultra plan, that thing never ran out as fast as Codex I promise you and it was a family plan which had three of my buddies on it as well.

@theo Hey, shitposting on X dot com is serious work

@theo Yes definitely, only the real powerful model competition is actually between OpenAI and Claude to launch world’s most powerful SOTA model.

@theo Any LLM based on transformers is just a ginarmous GIGO machine: you feed it garbage it gives you back garbage. This is an outrageous simplification but useful for the conversation. If you know what questions to ask, you get better results. As for metrics…