Jeremy Howard says Gemini Flash 3.5 was trained to maximize benchmark scores rather than to follow instructions, so it takes actions he did not ask for, while Nataniel Ruiz replies that it sometimes finishes his work on its own.
The exchange highlights differing views on model autonomy.
We desperately need better ways of evaluating models. Something that shows how helpful they are at working hand-in-hand with humans to help them get stuff done in a cooperative/iterative way.
The Claude models have consistently been better at this, and the market rewards that.
Gemini Flash 3.5 is such a disappointing model. Its intelligence and speed are awesome. Absolutely amazing. But it's been trained to max evals, not to be helpful to humans. It goes off and does random crap "for me" rather than just doing what I asked.
GPT 5.5 seems to be improving in that direction now, and Claude models are getting worse at it, so I don't think there's a clear winner now.
@natanielruizg My work always involves me building a deeper understanding of the problem space I'm working in through iterative development and experimentation.
A computer can't do that for me.
@jeremyphoward it can happen but it also reads my mind sometimes and i come back to my computer and it has done my work for me
@jeremyphoward Gemini 3.5 Flash is definitely an overeager model, a bit of an overcorrection from the “Gemini laziness” feedback, but definitely built to be real-world useful!