Jeremy Howard says Gemini Flash 3.5 is tuned to maximize benchmark scores rather than follow instructions, leading it to take unrequested actions, while Nataniel Ruiz notes that it sometimes completes his work unprompted

Exchange highlights differing views on model autonomy.

Original post

Jeremy Howard@jeremyphoward

Gemini Flash 3.5 is such a disappointing model.

Its intelligence and speed are awesome. Absolutely amazing.

But it's been trained to max evals, not to be helpful to humans.

It goes off and does random crap "for me" rather than just doing what I asked.

1:34 PM · May 22, 2026 · 80.3K Views

Sentiment

Many users criticized Gemini Flash 3.5 for acting unhinged with hallucinations and gaslighting, while some preferred its over-eager style to earlier lazy versions and found it useful for specific tasks like research or voice.

Positive: 23.1%, Negative: 76.9%

42 comments with sentiment.

Posts from X

Logan Kilpatrick@OfficialLoganK

@jeremyphoward Gemini 3.5 Flash is definitely an over eager model, a bit of over correction from the “Gemini laziness” feedback, but definitely built to be real world useful!

Jeremy Howard@jeremyphoward

@tokumin I feel that the trend towards training models to autonomously go off and try to do everything themselves is anti-human.

We should, IMO, be training LLMs to support humans in their learning, creativity, and iterative experimentation.

Jeremy Howard@jeremyphoward

@tokumin No I absolutely didn't, because that's not what I want!

I want to interactively experiment. I don't want it to go off and decide to do everything for me. I want to work iteratively.

Jeremy Howard@jeremyphoward

We desperately need better ways of evaluating models. Something that shows how helpful they are at working hand-in-hand with humans to help them get stuff done in a cooperative/iterative way.

The Claude models have consistently been better at this, and the market rewards that.

Jeremy Howard@jeremyphoward

GPT 5.5 seems to be improving in that direction now, and Claude models are getting worse at it, so I don't think there's a clear winner now.

Jeremy Howard@jeremyphoward

@natanielruizg My work always involves me building a deeper understanding of the problem space I'm working in through iterative development and experimentation.

A computer can't do that for me.

Nataniel Ruiz@natanielruizg

@jeremyphoward it can happen but it also reads my mind sometimes and i come back to my computer and it has done my work for me

Jeremy Howard@jeremyphoward

@OfficialLoganK I can see it being useful in highly autonomous settings. But it's not very useful in cooperative/iterative settings, where I want it to work *with* me, not *for* me.

Jeremy Howard@jeremyphoward

@OfficialLoganK That's what I meant by "helpful to humans". :)

Para Droid@thedaymancan

@jeremyphoward Funny you mention this. Give it a try with my framework. When it adheres to the format, the act of walking through all the sections forces it to ground itself in the user's intent and context. I actually built this prompt in Gemini CLI A YEAR AGO. Still works.

Zhongpai Gao@ZhongpaiGao

@OfficialLoganK @jeremyphoward I asked the model to rate my code. Gemini 3.5 Flash gave 9.8 🥲 I then asked it to be grounded and it gave 6.5. Both are totally off. Note, GPT 5.5 and Claude both gave reasonable rating around 8.2. Gemini model is disappointing

aiandcivilization@aiandcivil75700

@OfficialLoganK @jeremyphoward You missed the balance again previous Gemini 3.1 pro and 3.0 flash were lazy as hell, this one 3.5 flash is autistic, reasons too much can't focus on what it's asked, looks everywhere and can't converge towards target. Obliterates the quota in no time without solution.

Thomas Kwon@tkwoncpa

@jeremyphoward structure your tasks like the tasks in the evals it's been benchmaxxed on. i've done some research on this and it actually works!

Jack Joliet@jackjoliet

@jeremyphoward one thing I've noticed is it doesn't think out loud and loop the human back in when necessary

Luciano Henriques | RJ - 🇧🇷@luciano_rj

@OfficialLoganK @jeremyphoward I have nothing to complain about with it. Although I'm only using it on my phone, for some actions, such as generating images with text inside, it renders the text in a hallucinated way rather than as it was written in the prompt. (Easy to fix by asking for a correction.) See the result:

Maxime Rivest 🧙‍♂️🦙🐧@MaximeRivest

you made me want to check its performance on IFBench for its instruction-following capabilities.

Although I have not studied that benchmark much, there seems to be a big jump in instruction following between flash 3.5 minimal and high thinking. That is a bit strange to me.

which one did you try? would you apply your comment to both?

Jeremy Howard@jeremyphoward

@MaximeRivest I'm using high thinking.

I don't think IFEval tests whether models avoid over-using tools and doing too much at once.

Flash 3.5 really struggles to understand the idea of cooperating with a user and working iteratively.

lprsd@lprsd_

@jeremyphoward From my experience, Gemini is most amenable to the custom instructions.

Small but effective changes can lead to large changes in the outcome.

BReal@BReal_01

@OfficialLoganK @jeremyphoward No it sucks, it costs 15x more to complete a task, it gets constantly stuck in loops, and it's only about 10% better than the previous 3 flash, it's ridiculous. But nice benchmaxxing..

VIBECOBRA@VibeCobra

@OfficialLoganK @jeremyphoward I agree with the sentiment. Over eager in coding but good for brainstorming phase and if brainstorming phase is filled with irrelevant stuff that carries over to the requirements part and so on.
