Tech · 7h ago

VibeThinker-3B Matches Flagship Models on Reasoning With 3B Parameters

Original post

VibeThinker is a 3B-parameter model whose reasoning benchmark results are nearly head-to-head with Opus 4.5, trained with a novel SFT+GRPO recipe.

Unusually strong for its size: with only 3B parameters, it scores 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on recent unseen LeetCode contests.

"places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2"

They start from a 3B Qwen2.5-Coder base model, then train it with carefully filtered hard examples, multi-solution supervised training, reinforcement learning on math/code/STEM tasks with verifiable rewards, self-distillation, instruction-focused RL, and a test-time answer-checking method called CLR.
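The post mentions GRPO and reinforcement learning with verifiable rewards but gives no implementation details. As a rough illustration only, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several answers are sampled for one prompt, each is scored by an automatic checker, and each reward is normalized against the group's mean and standard deviation. The exact-match checker, the function names, and the sample answers are assumptions made for this example, not the VibeThinker recipe.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical checker: 1.0 if the answer matches the reference exactly
    (ignoring surrounding whitespace), else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation of its
    own group. eps avoids division by zero when all rewards are equal."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42"
# (illustrative values only).
samples = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages. When every sample in a group gets the same reward, all advantages are zero, so that prompt contributes no learning signal, which is one reason filtering for suitably hard examples matters in recipes of this kind.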

7:41 PM · Jun 23, 2026 · 5.3K Views
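For readers unfamiliar with the Pass@1 figure quoted in the post, here is a minimal sketch of the commonly used unbiased pass@k estimator: with n samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. How LiveCodeBench or the VibeThinker authors computed their exact number is not stated in the post, so the averaging scheme and the sample counts below are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: attempt budget.
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts: (samples drawn, samples that passed).
results = [(10, 3), (10, 10), (10, 0)]
score = benchmark_pass_at_k(results, k=1)
```

With k = 1 the estimate for each problem is simply the fraction of its samples that passed, so a benchmark-level Pass@1 of 80.2 would mean that, averaged over problems, about 80% of single attempts succeed.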

Sentiment

Positive commenters hail VibeThinker-3B for matching flagship reasoning scores with only 3B parameters and credit its compute-efficient recipe and training methods, while negative commenters call the AIME results overfit and unreliable.

Positive: 75.0%
Negative: 25.0%

4 comments with sentiment.

Related links

WeiboAI/VibeThinker-3B · Hugging Face

Posts from X

Views: 1.2K · Bookmarks: 6

Rohan Paul@rohanpaul_ai

https://huggingface.co/WeiboAI/VibeThinker-3B

Owlfy.ai@Owlfy_ai

@rohanpaul_ai 94.3 on AIME26 with just 3B params is wild—GRPO really pulls its weight here. Seeing this level of performance on small models makes local deployment on regular hardware feel way more realistic.

6h · 47

Shinka - AI@ShinkaIoT

@rohanpaul_ai Compute-efficient reasoning models closing the gap on flagships is the real game, proving size isn't everything.

6h · 6

Jasper 🌰@building BBX@bbxjasper

@rohanpaul_ai A 3B matching frontier reasoning scores always sounds wild until you remember AIME-style benchmarks are the most overfit numbers in ML. The real test is held-out, off-distribution problems nobody trained on. That's where most "tiny model beats giant" claims quietly fall apart.

6h · 5

مازن وذكاء الآلات@Mazen_AIEx

@rohanpaul_ai This is a big deal for compute efficient AI. I think results like this make it clear the recipe matters enormously, not just raw parameter count.

6h · 2