DeepSeek V4 Scores Rise On Dim-Agent Benchmark Despite Russian Test Failures

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Interesting "model smell"-related difference between V4 and V4.1 (or V4-Preview and V4?): the new model can LARP *and* seriously do the job at the same time. When I tested @victor207755822's roleplay suffix with old expert-web, it basically faked reasoning. Now, it plans.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> During the process of benchmarking dim-agent, we discovered that DSv4's scores kept improving.

Ah. This is the February-April playbook, when DeepSeek-Web (now known to be V4-Flash) kept getting better at long context. I guess they're deploying checkpoints after OPD rounds.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This is Terminal-Bench 2.1 @ pass 5. In the V4 paper: "On the Terminal-Bench 2.0 Verified subset, DeepSeek-V4-Pro achieves a score of approximately 72.0". 2.1 is similar to 2.0-V.

@ashfold what does the current V4 (open weights) score at pass@5?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

The web model does the same low-quality LARP; the HTML is riddled with errors. The API has thought for 630 seconds, 31K tokens, and produced… this. NOT voxels! But… it actually paid great attention to the prompt; it even tried to do the semi-transparent pouch with a sardine. The prompt is confusing LARP-model-written garbage anyway (I generally think this is not how agents should be tested; their strength is not in error-free generations).

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

GLM is far better, 1/5 tokens (so same cost and ≈3x faster), even made the scuttling crab except it also has the bizarre rotating bicycle. They all mess up directions, but RANDOMLY. Overall, I'd say this DeepSeek is equal to «GLM 5.15».

zhangmo8@wegi8666

@teortaxesTex Where can I download dim-agent? 😍

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

The API model is totally different btw it's a shitshow

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

that said, it's utterly retarded on my Russian test On par with some frankenmerge failure case, horrible

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

The API model might be better than before. It's underbaked though: it can fall into overthinking loops, but it can also give very curt, to-the-point responses. Different.

Han Cheng e/acc@ashfold

@wegi8666 @teortaxesTex You can get it here https://dimagent.com

Han Cheng e/acc@ashfold

@teortaxesTex I think the OSS version is not that good, because we are pretty sure the result was at roughly 70, give or take, 2-3 weeks ago (same official DeepSeek API). We do not have GPUs to test the OSS version. One more interesting thing: we benchmark it at high effort (not even max).

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@ashfold
> We do not have gpus to test OSS version.

OK, but there are providers online who definitely serve the old open weights. So you think it's about 10-11 points of gain at least?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@dummerspast39 no, but web one is I think

Binkg@dummerspast39

@teortaxesTex the api model is shit?

Han Cheng e/acc@ashfold

@teortaxesTex Yep, since the whale has a lot of unpublished infra tech, we do not trust any other providers, especially for benchmarking. 🤣

Binkg@dummerspast39

@teortaxesTex @victor207755822 good catch

Binkg@dummerspast39

@teortaxesTex ok thank you for clarifying

RadiantExitance@mothmothfan

@teortaxesTex @victor207755822 DS web now answers in Chinese from time to time, which it never did before. Maybe the higher tier of knowledge can only come by throwing away our Indo-European linguistic shackles and fully embracing the Sinitic age
