Tech · 4h ago

VibeThinker-3B, a reasoning model post-trained from a Qwen2.5-Coder base, scores 94.3 on AIME26, matching frontier models

Story Overview

VibeThinker-3B starts from the Qwen2.5-Coder-3B base and applies curriculum SFT, multi-domain RL, self-distillation, and a final RL instruct stage to reach 94.3 on AIME26 while also posting 96.1 percent acceptance on recent unseen LeetCode contests.



Original post

Chubby♨️@kimmonismus · #1360 in Tech

Crazy: A 3B model is now reaching highly competitive results on verifiable reasoning tasks.

VibeThinker-3B scores 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6, and 96.1% on unseen LeetCode contests.

The gains appear to come primarily from post-training on top of Qwen2.5-Coder: curriculum SFT, multi-domain RL, offline self-distillation, and a final RL-based instruct stage.

The core implication: certain forms of verifiable reasoning may be highly compressible into small dense models.

Frontier-scale models still matter for broad knowledge and general-purpose capability, but compact reasoning models are becoming a serious complementary path.

Love to see it!

Francesco Bertolotti@f14bertolotti

Stellar performance from a 3B model. These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn't provide many details, but it appears they distill from RL ckpts and then do a final RL-based instruct stage.

🔗https://arxiv.org/abs/2606.16140

3:56 AM · Jun 16, 2026 · 57.7K Views

Performance Edge

Small model matches much larger systems on verifiable tasks

The 3B dense model hits 80.2 Pass@1 on LiveCodeBench v6 and 89.3 on HMMT25, landing in the same band as DeepSeek V3.2, Gemini 3 Pro, and GLM-5 on these benchmarks.
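For readers unfamiliar with the metric, Pass@1 is the share of problems solved on a single attempt. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. It is only an illustration: the source does not say how many samples per problem were drawn or how the reported score was aggregated, and the function names and example data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, as a percentage; results holds (n, c) pairs."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data (not from the paper): three problems, 4 samples each,
# with 4, 2 and 0 correct samples respectively.
example = [(4, 4), (4, 2), (4, 0)]
score = benchmark_pass_at_k(example, k=1)
print(f"Pass@1 = {score:.1f}")
```

With k=1 the estimator reduces to the fraction of correct samples per problem, so a score such as 80.2 can be read as roughly four in five single attempts succeeding, under whatever sampling settings the authors used.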

Open Question

Verifiable signals appear highly compressible

The work extends the idea that multi-step reasoning with clear feedback can be packed into compact cores, yet it is silent on broad-knowledge tasks, and the results have not been independently verified so far.

Sentiment

Supportive commenters are excited that the 3B VibeThinker model reaches frontier-level results on AIME and coding benchmarks through post-training, reading it as a sign that small models can be both fast and capable. Critics dismiss the gains as benchmaxxed.

Positive: 62.5%

Negative: 37.5%

24 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 742 · LIKES 9

Ahmad@TheAhmadOsman

@f14bertolotti This looks really good 👀

5h

BOOKMARKS 1

SamBam@sambam_at

@f14bertolotti Link to the hugging face: https://huggingface.co/WeiboAI/VibeThinker-3B

2h

RETWEETS 12

Francesco Bertolotti@f14bertolotti

🔗https://arxiv.org/abs/2606.16140

9h

REPLIES 3

Charis Mitsakis@cmitsakis

@kimmonismus I think small models are the future for agents because they can use tools to get the knowledge and they can run fast and cheap

3h

Joseph See@josephsee

@kimmonismus Did anyone try this yet? That would be crazy to get that close to the frontier on 3B.

3h

Henrique Massareli@hmassareli

@f14bertolotti well the problem is these benchmarks are all single turn, no agentic test.. on my personal tests the model is very poor on tool calling, and reading the code

27m

Chubby♨️@kimmonismus

@cmitsakis Yes, small, specialized models

3h

Sathwik Tejaswi@SathwikTejaswi

@f14bertolotti Major benchmaxxxx vibes lmao

We need to make - publishing scores as a function of test time compute used - mainstream

The 3b might as well take 200k tokens to achieve these results and we'd never know lol

4h

Andreas Kirsch 🇺🇦@BlackHC

@f14bertolotti Hmm the performance seems too good to be true 🥺

6h

Francesco Bertolotti@f14bertolotti

@hmassareli I do not think that agentic-style behavior was their goal here. I think they wanted to see how far one could push mathematical reasoning in a small LLM. In that sense, I do not see this model as consumer-oriented. It is still an incredible result

18m

Helina@Helina01029

@kimmonismus @ClementDelangue Yes small models are part of the future but now I just see Benchmaxxxing

3h

Francesco Bertolotti@f14bertolotti

@willfaustcuber @csworddd I do not think this has been tailored towards agentic stuff. It shows that a few B. of params are enough for fairly advanced mathematical reasoning. Provided they did not train on the test, but I do not think so.

1h

Lucas@luksamuk

@kimmonismus What the heck. Is there a HuggingFace repo?

1h

Lucas Nguyen@hadesboun101

@kimmonismus I think we’re heading toward a future where AI can distill Fable 5 level intelligence into models that run entirely on phones. My guess is 2028. What do you think, Chubby?

3h

Pedro@PedroNeverFolds

@f14bertolotti Training on the test set is da way 😂

2h

Huawei Wang@wang60736

@f14bertolotti those numbers absolutely smoke! can't wait for the open-source weight release.

4h

Qaiyyum Hakimi@qhkmdev9

@hmassareli @f14bertolotti I'm gonna try to fine tune so that it handles tool calling better

12m

grzracz@grzracz

@josephsee @kimmonismus Extremely benchmaxxed in my experience, does not follow direction once set on a path (I asked it to design an IDE and instead it started answering with VSC/IntelliJ extensions I should install, continued to do that even after trying to explain further what I want)

2h

Hussain Hashim | Building Sunday Back@itsthedonhashim

@kimmonismus wow, those numbers are wild. didn't think a 3B model could pull that off. tech's moving crazy fast these days!

3h

cornball.dev 🐳@dixiidev

@f14bertolotti ouuuu shii

4h