Opus Strengthens Lead Amid DeepSeek V4's FrontierSWE Gains

QUOTE POST

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 We know that current-generation DS and Kimi can improve a lot with higher k, because they've had less RL pressure. Presumably this is even truer for the previous generation. I don't think they're benchmaxed on the public ARC-AGI set.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Both V4 and Kimi K2.6 gain bigly in best@5 mode, with V4 equaling Gemini 3.1 Pro on net (not that it's a great feat). This suggests to me they're much less usemaxxed/have had less RL. Their median action is not pushed as close to optimal. Over dozens of steps, this adds up.

11:11 PM · May 4, 2026 · 8.7K Views

7:24 PM · May 14, 2026 · 284 Views

REPLY

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 …Though actually this is just compared to Gemini; Anthropic and OpenAI models gain even more dominance.

7:28 PM · May 14, 2026 · 170 Views

QUOTE POST

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Both V4 and Kimi K2.6 gain bigly in best@5 mode, with V4 equaling Gemini 3.1 Pro on net (not that it's a great feat). This suggests to me they're much less usemaxxed/have had less RL. Their median action is not pushed as close to optimal. Over dozens of steps, this adds up.

11:11 PM · May 4, 2026 · 8.7K Views
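
The "over dozens of steps, this adds up" argument can be made concrete with a small sketch. It assumes (this is not stated in the post) that every step of an agentic task succeeds independently with the same probability, that a run succeeds only if every step does, and that best@k counts a task as solved when at least one of k independent runs succeeds. The function names and the numbers 0.95, 0.99, and 30 are hypothetical, chosen only for illustration.

```python
def episode_success(p_step: float, n_steps: int) -> float:
    """Chance that a run of n_steps succeeds when every step must succeed.

    Assumes independent steps with identical success probability p_step.
    """
    return p_step ** n_steps


def best_at_k(p_episode: float, k: int) -> float:
    """Chance that at least one of k independent runs succeeds."""
    return 1.0 - (1.0 - p_episode) ** k


if __name__ == "__main__":
    n_steps = 30  # hypothetical task length ("dozens of steps")
    # Hypothetical per-step reliabilities: a less RL-tuned model vs. a more tuned one.
    for p_step in (0.95, 0.99):
        p1 = episode_success(p_step, n_steps)
        p5 = best_at_k(p1, 5)
        print(f"p_step={p_step:.2f}  single run={p1:.3f}  "
              f"best@5={p5:.3f}  gain={p5 - p1:+.3f}")
```

Under these assumptions, a small per-step gap compounds into a large single-run gap, and the model with the weaker median action has more headroom to recover through repeated sampling, which is one way to read the larger best@5 gains.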

REPLY

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Interesting that Opus only grows stronger though

11:15 PM · May 4, 2026 · 1.3K Views

REPLY

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

What do you think, @scaling01? As per your thesis about the path dependency of early moves, how much of the advantage of giga-models like Mythos is just a lower density of stochastic errors, which exponentially expand the space to walk to a solution and so cut the success rate?

6:06 AM · May 5, 2026 · 1.5K Views

QUOTE POST

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> pretty efficient

…that's not the word. What in the world are these stats? It has vastly lower token use *and* tool error rate than both Kimi K2.6 (a very good model) and GPT 5.2?

5:26 PM · May 5, 2026 · 6.6K Views

QUOTE POST

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

incredible. Below glm-5.1, below kimi k2.6, below gemma4-31b, below gpt-oss-120b (high), below GLM-5, below deepseek-v3.2-speciale (!!) and barely different from Flash-Max. What the hell. Inferior to smaller models derivative of previous generation. It's over?

9:50 AM · May 5, 2026 · 18.6K Views

REPLY

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

as of now, no DeepSeek has officially exceeded o1-preview on this proxy of RSI potential.

10:37 AM · May 5, 2026 · 2.7K Views

QUOTE POST

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> DeepSeek V4 - Outstanding at bug-fixing

Everyone says so. V4 is really such a strange thing. Every model that DeepSeek makes, they make for themselves; products are almost incidental. But why… bug-fixing? Because they expect *more* nightmare-mode engineering?

6:02 PM · May 5, 2026 · 12.7K Views

REPLY

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

I mean, yes, it is a fundamental component of the software engineering workflow. But nobody says "it's great at writing kernels". Broadly, nobody claims it's exceptional at generating any code from scratch. No, it's a 1M-token-context, super cheap bug eater.

6:04 PM · May 5, 2026 · 2.9K Views