DeepSeek and Peking University release DSpark, a speculative decoding framework boosting DeepSeek-V4 speed by up to 85% · Digg

Posts from X

Views 282K · Bookmarks 1.6K · Likes 3K · Retweets 415 · Replies 73

Daniel Han@danielhanchen

DeepSeek just released DSpark for V4 Flash & Pro, a new speculative decoding method boosting throughput by 51% to 400%!

DS also showed DSpark works well for other models like Gemma & Qwen

Github: https://github.com/deepseek-ai/DeepSpec

Paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

HF: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

1d · 282K views · 3K likes · 1.6K bookmarks

Yuchen Jin@Yuchenj_UW

DeepSeek is the GOAT. 🐳

They just published DSpark, a new speculative decoding method that boosts throughput by 51% to 400%.

They also open-sourced DeepSpec, the training framework behind it.

This is the real open AI.

14h · 153.3K views · 1.9K likes · 574 bookmarks

Lisan al Gaib@scaling01

how can you not like deepseek

thank you lord wenfeng for continuing to make intelligence too cheap to meter

Lisan al Gaib@scaling01

DeepSeek just open-sources another piece of their training stack.

DeepSpec: a full-stack codebase for training and evaluating speculative decoding models

https://github.com/deepseek-ai/DeepSpec

19h · 218.4K views · 2.3K likes · 556 bookmarks

elie@eliebakouch

new inference optimization method by @deepseek_ai with an extremely detailed paper, draft model and framework to train them. results in production for dsv4 lead to +50% for throughput and latency (can go to ~80% for latency, crazy).

full explanation of DSpark:

it's about speculative decoding and the idea builds upon DFlash (fully parallel) and Eagle (fully sequential) to create a "semi-parallel" method that keeps the advantages of both

the core equation you want to optimize is the "time to generate each token" which is: (time to draft + time to verify) / how many tokens are accepted
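The ratio above can be made concrete with a short sketch. The function name and the millisecond figures are illustrative assumptions, not numbers from the paper or the posts.

```python
def time_per_token(draft_ms: float, verify_ms: float, accepted_tokens: float) -> float:
    """(time to draft + time to verify) / number of accepted tokens."""
    if accepted_tokens <= 0:
        raise ValueError("accepted_tokens must be positive")
    return (draft_ms + verify_ms) / accepted_tokens


# Hypothetical timings, for illustration only.
# Plain decoding: no draft, one target pass, one token per step.
baseline = time_per_token(draft_ms=0.0, verify_ms=30.0, accepted_tokens=1.0)
# Speculative decoding: a small drafting cost, a slightly more expensive
# verification pass, but several tokens accepted per step on average.
speculative = time_per_token(draft_ms=5.0, verify_ms=33.0, accepted_tokens=2.5)

print(f"baseline: {baseline:.1f} ms/token, speculative: {speculative:.1f} ms/token")
```

Under these assumed numbers the speculative step is slower in absolute terms but cheaper per token, which is why both the drafting cost and the acceptance count matter.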

the advantage of the parallel variant (DFlash) is that it's fast, but when you increase the number of tokens you draft, acceptance rate drops pretty fast (makes sense since there is no dependency on the previous token).

fully sequential is nice but opposite issue: it's slower (you need a much smaller draft to get the same speed) but the autoregressive dependency means you can maintain good acceptance rate at a lot of tokens. since you have a much smaller draft head, the first token acceptance rate is often quite low

idea of DSpark is to combine both: a "heavy" parallel head (you only do it once) and then a small sequential step to bias the logit distribution with information about the previous token. this biasing is done with a small markov head (only depends on t-1)
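A toy sketch of that combination, under stated assumptions: the parallel head is represented by a precomputed logit table, the Markov head by a lookup table indexed by the previous token, and drafting is greedy. The names `semi_parallel_draft`, `parallel_logits`, and `markov_bias` are hypothetical and do not come from the DeepSpec codebase.

```python
from typing import List


def semi_parallel_draft(
    parallel_logits: List[List[float]],  # [block_len][vocab], from one parallel pass
    markov_bias: List[List[float]],      # [vocab][vocab], bias[prev_token][next_token]
    last_token: int,                     # last token already committed to the context
) -> List[int]:
    """Greedy draft where position i is biased only by the token chosen at i-1."""
    tokens: List[int] = []
    prev = last_token
    for logits in parallel_logits:
        # Cheap sequential step: shift the parallel logits by a bias row
        # that depends only on the previous token.
        adjusted = [l + b for l, b in zip(logits, markov_bias[prev])]
        prev = max(range(len(adjusted)), key=adjusted.__getitem__)
        tokens.append(prev)
    return tokens


# Toy vocabulary of 3 tokens and a draft block of 2 positions.
logits = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0]]
no_bias = [[0.0] * 3 for _ in range(3)]
bias = [[0.0, 0.5, 0.0], [0.0] * 3, [0.0] * 3]  # after token 0, favour token 1

print(semi_parallel_draft(logits, no_bias, last_token=2))
print(semi_parallel_draft(logits, bias, last_token=2))
```

The expensive part (producing `parallel_logits`) happens once per block; only the small bias lookup runs sequentially.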

they also get a confidence score out of the sequential head that allows them to adjust how many tokens they want to verify. verification can get expensive if the gpus are already at maximum utilization, so they use this confidence score to do some load balancing and predict the right number of tokens depending on gpu workload
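One way such a scheduler could work is sketched below. This is an assumption-laden illustration, not the paper's algorithm: `confidences` stands in for the per-token confidence scores, and `cost_per_token` is a hypothetical load-dependent threshold that rises when the GPUs are busier.

```python
from typing import List


def choose_verify_length(confidences: List[float], cost_per_token: float) -> int:
    """Pick how many drafted tokens to send to the target model.

    confidences[i] is the estimated chance that draft token i is accepted,
    given that all earlier draft tokens were accepted.
    cost_per_token is a hypothetical marginal cost of verifying one more
    token, expressed in expected accepted tokens.
    """
    survive = 1.0
    length = 0
    for c in confidences:
        survive *= c  # chance that the whole prefix up to this token survives
        if survive < cost_per_token:
            break  # expected gain no longer covers the verification cost
        length += 1
    return length


scores = [0.9, 0.8, 0.5, 0.4]
print(choose_verify_length(scores, cost_per_token=0.2))  # lightly loaded GPUs
print(choose_verify_length(scores, cost_per_token=0.5))  # heavily loaded GPUs
```

Because the prefix survival probability can only shrink as the block grows, stopping at the first token that falls below the threshold is enough in this simplified model.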

one small detail: i would have liked to see production numbers if they used DFlash or Eagle instead of MTP-1, but as always, huge work by deepseek and i'm expecting to see this method widely adopted

1d68.2K915492

Lisan al Gaib@scaling01

DeepSeek just made their inference ~5x cheaper at 50 TPS

Lisan al Gaib@scaling01

how can you not like deepseek

thank you lord wenfeng for continuing to make intelligence too cheap to meter

19h · 152.9K views · 1.6K likes · 309 bookmarks

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

DeepSeek releases their decoding module DSpark for V4 checkpoints, which improves a lot upon MTP-1, Eagle-3 and DFlash. Out of their vast goodwill, they also open source DeepSpec: "a codebase for training and evaluating draft models for speculative decoding".

Zhipeng Huang@nopainkiller

official dsv4 spec dec and draft model @teortaxesTex

github: https://github.com/deepseek-ai/DeepSpec

huggingface: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark/tree/main

1d105.6K697285

Lisan al Gaib@scaling01

DeepSeek just open-sources another piece of their training stack.

DeepSpec: a full-stack codebase for training and evaluating speculative decoding models

https://github.com/deepseek-ai/DeepSpec

1d150.9K686284

Lisan al Gaib@scaling01

and people were saying: "DeepSeek is not profitable at $0.87"

good one

one would think that OpenAI and Anthropic have similar MTP tech

but I don't even want to think about margins, makes me sick

Lisan al Gaib@scaling01

DeepSeek just made their inference ~5x cheaper at 50 TPS

16h49.6K53295

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Importantly, DeepSeek has disclosed their inference economics – for the first time since Open Source Week. Then, they were doing 14.8K generation on 8xH800 node (1.85K/GPU) at 20-22 tps. Whatever these GPUs are now, V4-Pro is *at least 3x cheaper to serve*. 2K/GPU needs >60 tps.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

DeepSeek releases their decoding module DSpark for V4 checkpoints, which improves a lot upon MTP-1, Eagle-3 and DFlash. Out of their vast goodwill, they also open source DeepSpec: "a codebase for training and evaluating draft models for speculative decoding".

20h35.9K333124

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

V4-Pro now goes at about 90 tokens/s in a 20K context. V4-Flash, 130 t/s (Opus in the same context just gave me 37). If V4 gets an update that makes it merely *close* to GLM, it'll rip through the market. DSpark is awesome

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

It's been very lame how the industry has been failing to adopt good speculative decoding as the baseline. Just like Whale forced everyone onto MTP, now they may succeed with semi-AR drafting. @zephyr_z9 @antirez @_xjdr @norpadon does this look less BS than the previous one?

22h37.5K48385

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

DSpark is actually a pretty complex and dense paper, and gives a hint of just how far modern neocloud/hyperscaler setups have diverged from just-use-sglang-broism. Naively serving Transformers is not going to cut it in 2026, there are so many expensive engineering layers…

Lisan al Gaib@scaling01

and people were saying: "DeepSeek is not profitable at $0.87"

good one

one would think that OpenAI and Anthropic have similar MTP tech

but I don't even want to think about margins, makes me sick

15h12.3K19255

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Man, do I love this company

3 years ago, me: "maybe OpenAI is using early exit decoding, not quantization"

*crickets*

nobody uses early exit decoding in prod

now, DeepSeek: "we've done everything to accelerate inference, time to look into early exit decoding"

Thanks based Whale

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 "just" Wenfeng is devious. What they "just" did was they shared their alpha with struggling neolabs. It's almost 2 months old, they have probably moved on to the next big thing

19h12.5K17158

Rohan Paul@rohanpaul_ai

Fantastic, @deepseek_ai just published their new inference optimization method.

Proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput.

The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking.

Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them in 1 pass.
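A minimal sketch of that check, using the greedy-matching variant of verification. The posts do not say which acceptance rule DSpark uses, so this illustrates the general idea only; `verify_greedy` and its arguments are hypothetical names.

```python
from typing import List


def verify_greedy(draft: List[int], target_choice: List[int]) -> List[int]:
    """Return the tokens emitted in one speculative decoding step.

    draft: k tokens proposed by the draft model.
    target_choice: k + 1 tokens, where target_choice[i] is the target model's
    own pick at position i given the context plus draft[:i]. All k + 1 picks
    come from a single forward pass of the target model.
    """
    if len(target_choice) != len(draft) + 1:
        raise ValueError("target_choice must have one more entry than draft")
    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_choice[i]:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(target_choice[i])
            return emitted
        emitted.append(tok)
    # Every draft token matched, so the extra target pick comes for free.
    emitted.append(target_choice[-1])
    return emitted


print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # third guess rejected
print(verify_greedy([5, 7], [5, 7, 3]))        # all guesses accepted
```

Each step emits at least one token, so the output matches what the target model would have produced on its own under greedy decoding.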

The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity.

DSpark’s breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off.

The mechanism has 2 linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it.

That small sequential piece matters because pure parallel drafters are fast, but the quality of their later tokens decays because each position guesses without knowing what the earlier sampled token actually was.

I.e., fully parallel drafters guess every position too independently, which can create bad token combinations later in the block.

Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.

13h6.5K9347

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This also gives me plenty of hope for V4.1 and beyond. Consider, they don't need max speed inference for async RL rollouts. V4-Flash does… like 14K tokens/GPU at 100 tps. If that's 950DT, then one SuperPOD = 5T tokens/day. Or at least 1T at 20% utilization. data machine go brrr

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Importantly, DeepSeek has disclosed their inference economics – for the first time since Open Source Week. Then, they were doing 14.8K generation on 8xH800 node (1.85K/GPU) at 20-22 tps. Whatever these GPUs are now, V4-Pro is *at least 3x cheaper to serve*. 2K/GPU needs >60 tps.

19h9.4K12130

Zephyr@zephyr_z9

Has been in production since early May

Lisan al Gaib@scaling01

DeepSeek just made their inference ~5x cheaper at 50 TPS

17h29.4K17922

Charles 🎉 Frye@charles_irl

it’s hot spec summer

Lisan al Gaib@scaling01

DeepSeek just open-sources another piece of their training stack.

DeepSpec: a full-stack codebase for training and evaluating speculative decoding models

https://github.com/deepseek-ai/DeepSpec

16h9.1K7534

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

It's been very lame how the industry has been failing to adopt good speculative decoding as the baseline. Just like Whale forced everyone onto MTP, now they may succeed with semi-AR drafting. @zephyr_z9 @antirez @_xjdr @norpadon does this look less BS than the previous one?

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

DeepSeek releases their decoding module DSpark for V4 checkpoints, which improves a lot upon MTP-1, Eagle-3 and DFlash. Out of their vast goodwill, they also open source DeepSpec: "a codebase for training and evaluating draft models for speculative decoding".

1d43.5K8723

elie@eliebakouch

@deepseek_ai here is the full scheme by claude

paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

draft model: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

framework to train and evaluate: https://github.com/deepseek-ai/DeepSpec/tree/main

1d3.6K4626

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 "just" Wenfeng is devious. What they "just" did was they shared their alpha with struggling neolabs. It's almost 2 months old, they have probably moved on to the next big thing

Lisan al Gaib@scaling01

DeepSeek just made their inference ~5x cheaper at 50 TPS

19h16.6K12713

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

No, they'll just increase the batch size, have the same speed, and drive margins from 90% to 95%. You're welcome

cheaty@cheatyyyy

will this finally fix the shitty throughput on every western provider for the love of god

thank you deepseek

23h8.8K14310