Gemini 3.5 Flash ranks first on the CAIS Vision Index but fourth overall in text capabilities

QUOTE POST

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Opus 4.5 came out on Nov 24, 2025 and, according to CAIS capabilities index, has ≈same-ish ARC-AGI-2 as DeepSeek-V4-Pro. V3.2 was 5.0%. Kimi 2.5 is 10.8% here and 11.8% on official leaderboard (V3.2 4.0%), so it's aligned. This is a gap of 6 months.

Lisan al Gaib @scaling01

Gemini 3.5 Flash ranks 4th on CAIS Text Capabilities and 1st on Vision, and DeepSeek-V4-Pro is slightly ahead of Kimi-K2.6 on Text

8:48 PM · May 26, 2026 · 23.6K Views

10:45 PM · May 26, 2026 · 18.5K Views

REPLY

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Opus 4.5 was the first of its line, and 4.6 got significantly stronger. I expect that V4.1 or V4-non-preview will likewise improve by >10%. In the 40s of ARC-AGI-2, interesting things begin happening. I've said many times, GPT 5.2 was the true RSI inflection point.

10:47 PM · May 26, 2026 · 2.6K Views

REPLY

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

That said, they have Grok 4.2 at 55% and Grok 4.3 at 13%, despite reasonable alignment on other benches. Isn't that both v8-small base? I am not convinced their eval is accurate.

10:54 PM · May 26, 2026 · 2.1K Views

REPLY

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@theo It seems to be vastly easier to close the gap in "day to day work" than in ARC-AGI

Theo - t3.gg @theo

@teortaxesTex Yeah but Opus 4.5 is actually useful for day to day work and Deepseek v4 Pro isn't 🙃

11:10 PM · May 26, 2026 · 7.6K Views

11:18 PM · May 26, 2026 · 1K Views
