
Jeremy Howard says Gemini Flash 3.5 was trained to maximize benchmark scores rather than to follow instructions, so it takes actions he did not ask for; Nataniel Ruiz replies that it sometimes anticipates what he wants and finishes his work unprompted

Exchange highlights differing views on model autonomy.

Original post

Gemini Flash 3.5 is such a disappointing model. Its intelligence and speed are awesome. Absolutely amazing. But it's been trained to max evals, not to be helpful to humans. It goes off and does random crap "for me" rather than just doing what I asked.

1:34 PM · May 22, 2026

We desperately need better ways of evaluating models. Something that shows how helpful they are at working hand-in-hand with humans to help them get stuff done in a cooperative/iterative way.

The Claude models have consistently been better at this, and the market rewards that.

Jeremy Howard @jeremyphoward

Gemini Flash 3.5 is such a disappointing model. It's intelligence and speed is awesome. Absolutely amazing. But it's been trained to max evals, not to be helpful to humans. It goes off and does random crap "for me" rather than just doing what I asked.

8:34 PM · May 22, 2026 · 10.2K Views
8:34 PM · May 22, 2026 · 2.8K Views

GPT 5.5 seems to be improving in that direction now, and Claude models are getting worse at it, so I don't think there's a clear winner now.

8:35 PM · May 22, 2026 · 2.3K Views

@natanielruizg My work always involves me building a deeper understanding of the problem space I'm working in through iterative development and experimentation.

A computer can't do that for me.

Nataniel Ruiz @natanielruizg

@jeremyphoward it can happen but it also reads my mind sometimes and i come back to my computer and it has done my work for me

9:12 PM · May 22, 2026 · 428 Views
9:16 PM · May 22, 2026 · 383 Views

@jeremyphoward Gemini 3.5 Flash is definitely an overeager model, a bit of overcorrection from the “Gemini laziness” feedback, but definitely built to be real world useful!

9:34 PM · May 22, 2026 · 2.4K Views
