14h ago

Gemini-3.5-Flash regains the top spot on the Toolathlon leaderboard after five months with a 56.5 percent Pass@1 score on 108 agent tasks

Gemini variants also hit 67.42 percent on Terminal-Bench 2.0 physics tasks.



Original post

#980@SCALING01OP

Junlong Li@LOCKONLVANGE

Gemini returns and ranks No.1 on Toolathlon again after 5 months. Great achievements and congratulations! @GoogleDeepMind

10:54 AM · May 19, 2026

Reposted by

#950@_LEWTUN

QUOTE POST

#58Susan Zhang@SUCHENZANG

ouch

Theo - t3.gg@theo

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5. 3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

4:08 AM · May 20, 2026 · 69.1K Views

5:17 AM · May 20, 2026 · 30.3K Views

QUOTE POST

#83rohan anil@_AROHAN_

I miss the old flashes too, I didn’t make it to its retirement party, it flashed by - work of love dedication to the pursuit of algorithmic efficiency.

Theo - t3.gg@theo

4:08 AM · May 20, 2026 · 69.1K Views

5:32 AM · May 20, 2026 · 4.6K Views

#83rohan anil@_AROHAN_

I think 3.5 is fine just not good enough to be a code model.

rohan anil@_arohan_

I miss the old flashes too, I didn’t make it to its retirement party, it flashed by - work of love dedication to the pursuit of algorithmic efficiency.

5:32 AM · May 20, 2026 · 4.6K Views

5:33 AM · May 20, 2026 · 633 Views

#228Andreas Kirsch 🇺🇦@BLACKHC

@suchenzang At least we still have principles

Susan Zhang@suchenzang

ouch

5:17 AM · May 20, 2026 · 30.3K Views

6:11 AM · May 20, 2026 · 58.6K Views

QUOTE POST

#1829Theo - t3.gg@THEO

Oh my god it scored worse than Composer 2! Not even 2.5! And it cost 4x more to run!!!

This might be the worst major lab model drop of all time. Llama 4 tier. Insane.

Michael Truell@mntruell

Gemini Flash 3.5 is now on CursorBench, our main coding agent eval. We’ll keep updating the leaderboard as new models come out. https://cursor.com/evals

3:30 AM · May 20, 2026 · 443.2K Views

4:04 AM · May 20, 2026 · 383.2K Views

#1829Theo - t3.gg@THEO

I miss when Flash was the underrated goat model. I genuinely loved Flash 2 and genuinely tolerated 2.5.

3 was the start of the end. 3.5 is a useless model that should not be used for, well, anything as far as I can tell

Theo - t3.gg@theo

Oh my god it scored worse than Composer 2! Not even 2.5! And it cost 4x more to run!!! This might be the worst major lab model drop of all time. Llama 4 tier. Insane.

4:04 AM · May 20, 2026 · 383.2K Views

4:08 AM · May 20, 2026 · 69.1K Views

QUOTE POST

#1829Theo - t3.gg@THEO

Video is up btw

Theo - t3.gg@theo

I'm scared to make this video, but I feel like I have to. It's time to talk about Google.

3:55 AM · May 20, 2026 · 121.7K Views

6:38 AM · May 20, 2026 · 20.5K Views

QUOTE POST

#1941ben hylak@BENHYLAK

flash 2 was last great google model.

Theo - t3.gg@theo

4:08 AM · May 20, 2026 · 69.1K Views

6:50 AM · May 20, 2026 · 1.5K Views

