Tech · 4h ago

Inception co-founder Aditya Grover launches Mercury 2 diffusion LLM on Baseten with inference speeds over 1,000 tokens per second

Early customer Augment Code cut operational costs by 90%.

Original post
Aditya Grover @adityagrover_ · #452 in Tech

Today we're bringing Mercury 2 to @Baseten.

Mercury 2 delivers over 1,000 tokens per second for customers on @NVIDIA GPUs with the reliability and scale enterprise teams need.

Read more to see how @augmentcode is using Mercury 2 in production reducing costs by 90% and latency by 82%. More customer stories across coding agents, real-time voice, and enterprise search dropping soon.

Baseten @baseten

http://x.com/i/article/2065085903345754113

9:16 AM · Jun 11, 2026 · 1.2K Views
Sentiment

Users praised Inception's Mercury 2 on Baseten for its speed of over 1,000 tokens per second on NVIDIA GPUs and the 90% cost reduction reported by Augment Code, while some argued that the cost and latency gains, not the headline speed claim, were the real story.

Positive: 83.3%
Negative: 16.7%
4 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 1.9K · Bookmarks 3 · Likes 31 · Retweets 5 · Replies 1
Stefano Ermon @StefanoErmon

Excited to see Mercury 2 live on @baseten

Mercury 2 delivers Groq/Cerebras-like speeds (>1000 tokens/sec) with quality comparable to speed-optimized models like Claude Haiku

If you have latency-sensitive workloads we’d love to hear from you.

Baseten @baseten

We are excited to announce that we have partnered with @_inception_ai to make Mercury 2 available on Baseten. This makes us the first inference platform to bring Inception’s diffusion LLM to production.

Inception’s dLLM architecture fixes the bottlenecks of sequential token generation and can deliver 1,000+ tokens/sec on standard NVIDIA GPUs. Early users like @augmentcode have seen impressive results, such as an 82% reduction in latency and 90% cost savings, while maintaining high quality.

4h · Views 1.9K · Likes 31 · Bookmarks 3

Today Mercury 2, the first reasoning diffusion LLM, is live on Baseten. The result: over 1,000 tokens per second on standard NVIDIA GPUs, at comparable quality to speed-optimized models. @AugmentCode is already using it in production, cutting cost 90% and latency 82%.

4h · Views 894 · Likes 14 · Bookmarks 3
Inception @_inception_ai

@StefanoErmon @baseten 🚀🚀

3h · Views 45 · Likes 1

@adityagrover_ @baseten @nvidia @augmentcode congrats!!

4h · Views 8
Invincible @InvincibleEdge

@adityagrover_ @baseten @nvidia @augmentcode cost down 90% and latency slashed is the actual headline

rest is just window dressing

4h · Views 1
Blissy @BlissyOnX

@adityagrover_ @baseten @nvidia @augmentcode 1k tok/sec is wild

but 90% cost reduction while keeping reliability is the real flex here

4h