14h ago

Gemini-3.5-Flash regains the top spot on the Toolathlon leaderboard after five months with a 56.5 percent Pass@1 score on 108 agent tasks

Gemini variants also hit 67.42 percent on Terminal-Bench 2.0 physics tasks.

Original post

Gemini returns and ranks No.1 on Toolathlon again after 5 months. Great achievements and congratulations! @GoogleDeepMind

10:54 AM · May 19, 2026
Reposted by

ouch

Theo - t3.gg @theo

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5. 3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

4:08 AM · May 20, 2026 · 69.1K Views
5:17 AM · May 20, 2026 · 30.3K Views

I miss the old flashes too, I didn’t make it to its retirement party, it flashed by - work of love dedication to the pursuit of algorithmic efficiency.

Theo - t3.gg @theo

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5. 3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

4:08 AM · May 20, 2026 · 69.1K Views
5:32 AM · May 20, 2026 · 4.6K Views

I think 3.5 is fine just not good enough to be a code model.

rohan anil @_arohan_

I miss the old flashes too, I didn’t make it to its retirement party, it flashed by - work of love dedication to the pursuit of algorithmic efficiency.

5:32 AM · May 20, 2026 · 4.6K Views
5:33 AM · May 20, 2026 · 633 Views

@suchenzang At least we still have principles

Susan Zhang @suchenzang

ouch

5:17 AM · May 20, 2026 · 30.3K Views
6:11 AM · May 20, 2026 · 58.6K Views

Oh my god it scored worse than Composer 2! Not even 2.5! And it cost 4x more to run!!!

This might be the worst major lab model drop of all time. Llama 4 tier. Insane.

Michael Truell @mntruell

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. https://cursor.com/evals

3:30 AM · May 20, 2026 · 443.2K Views
4:04 AM · May 20, 2026 · 383.2K Views

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5.

3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

Theo - t3.gg @theo

Oh my god it scored worse than Composer 2! Not even 2.5! And it cost 4x more to run!!! This might be the worst major lab model drop of all time. Llama 4 tier. Insane.

4:04 AM · May 20, 2026 · 383.2K Views
4:08 AM · May 20, 2026 · 69.1K Views

Video is up btw

Theo - t3.gg @theo

I'm scared to make this video, but I feel like I have to. It's time to talk about Google.

3:55 AM · May 20, 2026 · 121.7K Views
6:38 AM · May 20, 2026 · 20.5K Views

flash 2 was last great google model.

Theo - t3.gg @theo

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5. 3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

4:08 AM · May 20, 2026 · 69.1K Views
6:50 AM · May 20, 2026 · 1.5K Views