
Opus Strengthens Lead Amid DeepSeek V4's FrontierSWE Gains

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Both V4 and Kimi K2.6 gain bigly in best@5 mode, with V4 equaling Gemini 3.1 Pro on net (not that it's a great feat). This suggests to me they're much less usemaxxed/have had less RL. Their median action is not pushed as close to optimal. Over dozens of steps, this adds up.

11:11 PM · May 4, 2026 · 8.7K Views
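
To put rough numbers on the two claims here (a slightly weaker median action, and "over dozens of steps, this adds up"), a toy sketch assuming a constant per-step success rate and independent runs; the rates and step count are invented for illustration, not taken from FrontierSWE:

```python
def trajectory_success(per_step_p: float, steps: int) -> float:
    """Toy model: the run only succeeds if every one of `steps` actions is good."""
    return per_step_p ** steps

def best_of_k(single_run_p: float, k: int) -> float:
    """Chance that at least one of k independent runs succeeds (best@k)."""
    return 1 - (1 - single_run_p) ** k

# Hypothetical per-step rates: a heavily RL'd model vs. one whose median action is a bit worse.
for label, p_step in [("more RL pressure", 0.99), ("less RL pressure", 0.97)]:
    p_run = trajectory_success(p_step, steps=50)
    print(f"{label}: pass@1 ~ {p_run:.2f}, best@5 ~ {best_of_k(p_run, 5):.2f}")
```

With these invented numbers a 2-point per-step gap turns into roughly 0.6 vs 0.22 at pass@1 over 50 steps, while best@5 lifts the weaker model to about 0.71 and the stronger one to about 0.99; the less-optimized model gains far more from extra samples, which is the pattern the post points at.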

@scaling01 We know that current generation DS and Kimi can improve a lot with higher k, because they've had less RL pressure. Presumably this is even truer for previous generation. I don't think they're benchmaxed on public ARC-AGI set

7:24 PM · May 14, 2026 · 284 Views
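
On the "improve a lot with higher k" point, for reference: the standard unbiased pass@k estimator from n sampled attempts with c successes (Chen et al. 2021, the HumanEval paper). Whether the FrontierSWE / ARC-AGI leaderboards score best@5 exactly this way isn't stated here, so treat this as the generic formula rather than their method:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k draws is correct), given n samples with c correct."""
    if n - c < k:
        # Fewer failures than draws: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task solved on 2 of 5 attempts: weak at k=1, certain at k=5.
print(pass_at_k(n=5, c=2, k=1))  # 0.4
print(pass_at_k(n=5, c=2, k=5))  # 1.0
```

A model with a flatter sample distribution (less RL pressure) shows a bigger gap between pass@1 and pass@k than one whose samples are concentrated near its median behavior.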

@scaling01 …Though actually this is just compared to Gemini; Ant and OpenAI models gain even more dominance.

7:28 PM · May 14, 2026 · 170 Views

Interesting that Opus only grows stronger though

11:15 PM · May 4, 2026 · 1.3K Views

What do you think @scaling01? How much of the advantage of giga-models like Mythos, per your thesis about the path dependency of early moves, is just the density of stochastic errors, which exponentially expands the space the model has to walk through to a solution and so cuts the success rate?

6:06 AM · May 5, 2026 · 1.5K Views
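
One toy way to cash out the "density of stochastic errors" framing: if each spurious move costs extra steps to walk back from, a modest difference in per-step error rate eats the step budget and end-to-end success drops off sharply. A minimal Monte Carlo sketch; the error rates, recovery cost, and budget are all invented for illustration:

```python
import random

def run_succeeds(required_steps: int = 40, error_rate: float = 0.05,
                 recovery_cost: int = 6, budget: int = 60) -> bool:
    """One simulated trajectory: each stochastic error burns extra steps on recovery;
    the run fails once the step budget is exhausted."""
    spent = 0
    for _ in range(required_steps):
        spent += 1
        if random.random() < error_rate:
            spent += recovery_cost
        if spent > budget:
            return False
    return True

def success_rate(error_rate: float, trials: int = 20_000) -> float:
    return sum(run_succeeds(error_rate=error_rate) for _ in range(trials)) / trials

for eps in (0.02, 0.05, 0.10):
    print(f"per-step error rate {eps:.2f}: run success ~ {success_rate(eps):.2f}")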

incredible. Below glm-5.1, below kimi k2.6, below gemma4-31b, below gpt-oss-120b (high), below GLM-5, below deepseek-v3.2-speciale (!!) and barely different from Flash-Max. What the hell. Inferior to smaller models derivative of the previous generation. It's over?

9:50 AM · May 5, 2026 · 18.6K Views

as of now, no DeepSeek has officially exceeded o1-preview on this proxy of RSI potential.

10:37 AM · May 5, 2026 · 2.7K Views

> DeepSeek V4 - Outstanding at bug-fixing

Everyone says so. V4 is really such a strange thing. Every model that DeepSeek makes, they make for themselves; products are almost incidental. But why… bug-fixing? Because they expect *more* nightmare mode engineering?

6:02 PM · May 5, 2026 · 12.7K Views

I mean, yes, it is a fundamental component of the software engineering workflow. But nobody says "it's great at writing kernels". Broadly, nobody claims it's exceptional at generating any code from scratch. No, it's a 1M-token-context, super cheap bug eater.

6:04 PM · May 5, 2026 · 2.9K Views