DeepSeek just open-sourced another piece of their training stack.
DeepSpec: a full-stack codebase for training and evaluating speculative decoding models
They also open-sourced DeepSpec, an MIT-licensed draft-model codebase
Many users praised DeepSeek's open-sourcing of DSpark and DeepSpec for major throughput gains and enabling viable local inference, while some voiced disapproval over geopolitical ties or related tech discussions.
DeepSeek just released DSpark for V4 Flash & Pro, a new speculative decoding method boosting throughput by 51% to 400%!
DS also showed DSpark works well for other models like Gemma & Qwen
GitHub: https://github.com/deepseek-ai/DeepSpec
Paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
HF: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark
DeepSeek is the GOAT. 🐳
They just published DSpark, a new speculative decoding method that boosts throughput by 51% to 400%.
They also open-sourced DeepSpec, the training framework behind it.
This is the real open AI.
how can you not like deepseek
thank you lord wenfeng for continuing to make intelligence too cheap to meter
new inference optimization method by @deepseek_ai with an extremely detailed paper, draft model and framework to train them. results in production for dsv4 lead to +50% for throughput and latency (can go to ~80% for latency, crazy).
full explanation of DSpark:
it's about speculative decoding and the idea builds upon DFlash (fully parallel) and Eagle (fully sequential) to create a "semi-parallel" method that keeps the advantages of both
the core equation you want to optimize is the "time to generate each token" which is: (time to draft + time to verify) / how many tokens are accepted
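To make that ratio concrete, here is a minimal sketch in Python. The function name and every number in it are made up for illustration; none of them come from the paper or from DeepSeek's production setup.

```python
def time_per_token(t_draft_ms: float, t_verify_ms: float, accepted_tokens: float) -> float:
    """Average time per emitted token for one speculative step:
    (time to draft + time to verify) / tokens accepted."""
    if accepted_tokens <= 0:
        raise ValueError("accepted_tokens must be positive")
    return (t_draft_ms + t_verify_ms) / accepted_tokens


# Hypothetical timings, chosen only to show how the ratio behaves.
baseline = time_per_token(0.0, 30.0, 1.0)     # plain decoding: one target pass per token
speculative = time_per_token(5.0, 33.0, 3.2)  # extra draft cost, but 3.2 tokens per step
speedup = baseline / speculative
```

The point of the sketch is that drafting and a slightly more expensive verify pass only pay off when the number of accepted tokens grows faster than the numerator.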
the advantage of the parallel variant (DFlash) is that it's fast, but when you increase the number of tokens you draft, acceptance rate drops pretty fast (makes sense since there is no dependency on the previous token).
fully sequential is nice but opposite issue: it's slower (you need a much smaller draft to get the same speed) but the autoregressive dependency means you can maintain good acceptance rate at a lot of tokens. since you have a much smaller draft head, the first token acceptance rate is often quite low
idea of DSpark is to combine both: a "heavy" parallel head (you only do it once) and then a small sequential step to bias the logit distribution with information about the previous token. this biasing is done with a small markov head (only depends on t-1)
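A toy sketch of that idea, under loud assumptions: the "Markov head" here is just a lookup table of logit biases indexed by the previous token, drafting is greedy, and the vocabulary has three tokens. The real head in the paper may look quite different; `semi_parallel_draft` and `markov_bias` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semi_parallel_draft(parallel_logits, markov_bias, first_prev_token):
    """Greedy draft. parallel_logits is computed once for all positions (the
    heavy parallel head); the loop is the cheap sequential step that adds a
    bias row selected only by the previously drafted token (t-1)."""
    drafted, confidences = [], []
    prev = first_prev_token
    for position_logits in parallel_logits:
        adjusted = [l + b for l, b in zip(position_logits, markov_bias[prev])]
        probs = softmax(adjusted)
        token = max(range(len(probs)), key=probs.__getitem__)
        drafted.append(token)
        confidences.append(probs[token])
        prev = token
    return drafted, confidences


# Toy inputs (assumed): two draft positions over a 3-token vocabulary.
parallel_logits = [[2.0, 0.0, 0.0], [1.0, 0.9, 0.0]]
# Assumed bias table: after token 0, token 1 becomes much more likely.
markov_bias = [[-2.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
drafted, confidences = semi_parallel_draft(parallel_logits, markov_bias, first_prev_token=2)
```

Without the bias, the second position would pick token 0 again; with it, the draft switches to token 1 because the first drafted token was 0. The per-token probabilities double as a rough confidence signal.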
they also get a confidence score out of the sequential head that allows them to adjust how many tokens they want to verify. verification can get expensive if the gpus are already at maximum utilization, so they use this confidence score to do some load balancing and predict the right number of tokens depending on gpu workload
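One way to picture the load-dependent choice, as a sketch rather than the paper's scheduler: treat each confidence as the chance that a drafted token is accepted given the ones before it, model verification cost as a base cost plus a per-token cost that grows when the GPU is busy, and pick the length that minimizes expected time per token. The cost model and all numbers below are assumptions.

```python
def best_verify_length(confidences, t_draft_ms, t_verify_base_ms, t_verify_per_token_ms):
    """Return how many drafted tokens to verify (0..len(confidences)) so that
    expected time per emitted token is minimal under the assumed cost model."""
    # k = 0: skip the draft, the target pass still emits one token.
    best_k = 0
    best_cost = t_draft_ms + t_verify_base_ms
    survive = 1.0   # probability that the whole prefix so far is accepted
    expected = 1.0  # expected emitted tokens; the target always contributes one
    for k, c in enumerate(confidences, start=1):
        survive *= c
        expected += survive
        cost = (t_draft_ms + t_verify_base_ms + k * t_verify_per_token_ms) / expected
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


confidences = [0.9, 0.8, 0.5, 0.2]  # hypothetical draft confidences
idle_k = best_verify_length(confidences, 2.0, 20.0, 0.2)  # extra verified tokens are nearly free
busy_k = best_verify_length(confidences, 2.0, 20.0, 8.0)  # extra verified tokens are expensive
```

With these made-up numbers the lightly loaded case verifies all four drafted tokens, while the heavily loaded case stops after two.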
one small detail: i would have liked to see production numbers if they used DFlash or Eagle instead of MTP-1, but as always, huge work by deepseek and i'm expecting to see this method widely adopted
DeepSeek just made their inference ~5x cheaper at 50 TPS
DeepSeek releases their decoding module DSpark for V4 checkpoints, which improves a lot upon MTP-1, Eagle-3 and DFlash. Out of their vast goodwill, they also open source DeepSpec: "a codebase for training and evaluating draft models for speculative decoding".
official dsv4 spec dec and draft model @teortaxesTex
github: https://github.com/deepseek-ai/DeepSpec
huggingface: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark/tree/main
and people were saying: "DeepSeek is not profitable at $0.87"
good one
one would think that OpenAI and Anthropic have similar MTP tech
but I don't even want to think about margins, makes me sick
Importantly, DeepSeek has disclosed their inference economics – for the first time since Open Source Week. Back then, they were generating 14.8K tokens/s on an 8xH800 node (1.85K/GPU) at 20-22 tps. Whatever these GPUs are now, V4-Pro is *at least 3x cheaper to serve*. 2K/GPU needs >60 tps.
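The Open Source Week figures can be checked with a few lines of arithmetic. The sketch assumes that aggregate throughput is roughly the number of concurrent streams times per-stream speed, which is a simplification; the variable names are illustrative.

```python
node_tokens_per_s = 14_800  # generation throughput per 8xH800 node, from the post
gpus_per_node = 8

per_gpu_tokens_per_s = node_tokens_per_s / gpus_per_node

# Assumed identity: per-GPU throughput = concurrent streams * per-stream tps.
streams_at_22_tps = per_gpu_tokens_per_s / 22
streams_at_20_tps = per_gpu_tokens_per_s / 20
```

Under that assumption, each GPU was serving somewhere between about 84 and 92.5 concurrent streams at 20-22 tps.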
V4-Pro now goes at about 90 tokens/s in a 20K context. V4-Flash, 130 t/s (Opus in the same context just gave me 37). If V4 gets an update that makes it merely *close* to GLM, it'll rip through the market. DSpark is awesome
It's been very lame how the industry has been failing to adopt good speculative decoding as the baseline. Just like Whale forced everyone onto MTP, now they may succeed with semi-AR drafting. @zephyr_z9 @antirez @_xjdr @norpadon does this look less BS than the previous one?
DSpark is actually a pretty complex and dense paper, and gives a hint of just how far modern neocloud/hyperscaler setups have diverged from just-use-sglang-broism. Naively serving Transformers is not going to cut it in 2026, there are so many expensive engineering layers…
Man, do I love this company
3 years ago, me: "maybe OpenAI is using early exit decoding, not quantization"
*crickets*
nobody uses early exit decoding in prod
now, DeepSeek: "we've done everything to accelerate inference, time to look into early exit decoding"
Thanks based Whale
@scaling01 "just" Wenfeng is devious. What they "just" did was they shared their alpha with struggling neolabs. It's almost 2 months old, they have probably moved on to the next big thing
Fantastic, @deepseek_ai just published their new inference optimization method.
Proposes DSpark, a semi-parallel speculative decoding system that gave DeepSeek-V4 about 60% to 85% faster per-user generation at matched throughput.
The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking.
Speculative decoding already had the basic trick: a smaller draft model guesses several next tokens, then the real model checks them in 1 pass.
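A minimal sketch of that check for the greedy case. The toy `toy_target` function stands in for the real model and is purely hypothetical; a production system would get all the predictions from one batched forward pass and would usually handle sampling rather than exact matching.

```python
def verify_greedy(context, draft_tokens, target_next_token):
    """Accept the longest draft prefix the target agrees with, then append
    the target's own next token."""
    # A real system obtains these predictions from a single forward pass over
    # context + draft; the toy target is simply called once per prefix here.
    predictions = [
        target_next_token(context + draft_tokens[:i])
        for i in range(len(draft_tokens) + 1)
    ]
    accepted = []
    for guess, truth in zip(draft_tokens, predictions):
        if guess != truth:
            break
        accepted.append(guess)
    # The target's prediction at the first mismatch (or after a full match)
    # comes out of the same pass, so it is emitted at no extra cost.
    return accepted + [predictions[len(accepted)]]


def toy_target(tokens):
    """Hypothetical stand-in for the real model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


emitted = verify_greedy([1, 2, 3], [4, 5, 7, 8], toy_target)
```

Here the first two guesses match, the third does not, and the target's own token replaces it, so three tokens come out of a single verification step.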
The problem is that long draft blocks often waste work, because later guesses are more likely to be wrong, and checking bad guesses still uses GPU capacity.
DSpark’s breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off.
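As a rough illustration of "how likely each prefix is to survive": if each drafted token comes with a confidence, and those confidences are treated as conditional acceptance probabilities (an assumption), the survival chance of a prefix is their running product. The threshold rule and the `min_survival` value below are invented for the example and are not DSpark's actual criterion.

```python
from itertools import accumulate
from operator import mul


def prefix_survival(confidences):
    """Probability that each prefix of the draft is accepted in full,
    treating per-token confidences as conditional acceptance probabilities."""
    return list(accumulate(confidences, mul))


def tokens_worth_verifying(confidences, min_survival=0.3):
    """Count leading draft tokens whose prefix survival stays above an
    assumed threshold; everything after that is not sent for verification."""
    count = 0
    for p in prefix_survival(confidences):
        if p < min_survival:
            break
        count += 1
    return count


confidences = [0.9, 0.8, 0.5, 0.2]  # hypothetical per-token confidences
survival = prefix_survival(confidences)
keep = tokens_worth_verifying(confidences)
```

With these numbers the first three tokens are kept and the fourth is dropped, since the chance that the whole four-token prefix survives falls to about 7%.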
The mechanism has 2 linked parts: a strong parallel draft model makes many token guesses quickly, then a tiny Markov head adjusts each guess using the token right before it.
That small sequential piece matters because pure parallel drafters are fast, but their later tokens degrade because each position guesses without knowing which earlier token was actually sampled.
That is, fully parallel drafters guess every position too independently, which can create bad token combinations later in the block.
Then the confidence scheduler estimates how many drafted tokens should be checked for each request, based on both acceptance chance and current GPU load.
This also gives me plenty of hope for V4.1 and beyond. Consider: they don't need max speed inference for async RL rollouts. V4-Flash does… like 14K tokens/GPU at 100 tps. If that's 950DT, then one SuperPOD = 5T tokens/day. Or at least 1T at 20% utilization. data machine go brrr
Has been in production since early May
it’s hot spec summer
@deepseek_ai here is the full scheme by claude
paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
draft model: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark
framework to train and evaluate: https://github.com/deepseek-ai/DeepSpec/tree/main
No, they'll just increase the batch size, have the same speed, and drive margins from 90% to 95%. You're welcome
will this finally fix the shitty throughput on every western provider for the love of god
thank you deepseek