Neither Grok nor Gemini have ever had the world's most powerful model. They held a slight lead in useless benchmarks.
If you ever chose either as daily drivers for serious dev work, I do not trust your judgement at all.
Matthew Berman countered that Gemini 2.5 Pro briefly led.
Some users defend Gemini or Grok for excelling at tasks like visual intelligence or for their reliability, while others dismiss Theo's claim about model leadership as a bad take and distrust his judgment.

@ChickenSamosaa You should read the tweet again, here I grabbed it for you.

@theo this is more accurate tbh

@theo grok is extremely lobotomized all the time and gemini has a Jason Bourne style identity crisis every time you ask it what time it is

@themmyleke Both Gemini and Grok are massively overpriced and their subsidization was never as good as Codex or Claude Code.
@theo Gemini 2.5 pro was a leader for a brief period of time

@igetrugd Agreed

@dragosroua Going to be blunt because you should hear this: this is enough information for me to know I would never hire you

@theo You are speaking from a place of wealth lol.
3.1 pro was good enough as an implementation tool. I put it and codex in a letagents room, got codex to review and draw up plans and had Gemini write the code. That way I didn’t burn out limits.
But 3.5 is unusable I agree

METR evals, DeepSWE, any actual contributions to real projects.
I seriously do not understand how anyone can use Gemini for dev work unless they are getting it for free. It’s not like “oh it’s 5% worse”, it is literally unusable, constantly looping on nonsense and never making working code.

@ChickenSamosaa “…for serious dev work”
I know reading the whole sentence is hard but you really should try it out some time

@dragosroua I’m sorry but “all models are head to head now” is actually the dumbest thing I’ve seen anyone say on this app in years

@theo Gemini is quite ok. Used it for a very complex localization feature and it executed well.
Claude and ChatGPT have good UI, VERY good marketing and improved harnesses, but purely from inference perspective all models are head to head now.

@theo trust

@konopka_tg Sure, but everything that makes it “most powerful” requires tool calls, which only 2 labs are good at doing for long tasks.

@theo You surely have some metrics for this that you can share. Something more than “I just think it is like this, period”. If you just think it is like that, it’s perfectly valid, can’t shit on people’s taste. But if you have metrics, I’m ready to look into them.

@theo Not everyone's a troll dude. Relax.

@theo I had a Gemini Ultra plan, that thing never ran out as fast as Codex I promise you and it was a family plan which had three of my buddies on it as well.

@theo Hey, shitposting on X dot com is serious work

@theo Yes, definitely. The only real competition to launch the world’s most powerful SOTA model is between OpenAI and Claude.

@theo Any LLM based on transformers is just a ginormous GIGO machine: you feed it garbage, it gives you back garbage. This is an outrageous simplification but useful for the conversation. If you know what questions to ask, you get better results. As for metrics…