
JetBrains Team Releases Mellum2 12B MoE LLM With Checkpoints


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

Nikita Pavlichenko@nv_pavlichenko

Today we're releasing Mellum2: our first "serious" LLM.

This is a 12B A2.5B MoE LLM pre-trained on ~11T tokens and post-trained with RLVR. I'm proud to be leading the team that was working on it for the last 6 months.

We release base/SFT/RL checkpoints along with a tech report

6:24 AM · Jun 1, 2026 · 49K Views

Sentiment

Many users congratulated the JetBrains team on releasing Mellum2, praising the open-source 12B MoE model as amazing work and expressing eagerness to test it.

Pos: 100.0%, Neg: 0.0%

19 comments with sentiment.


Posts from X


Philipp Schmid@_philschmid

@nv_pavlichenko Congrats!


Jake@JakeKAllDay

@nv_pavlichenko You mention a Qwen3-like architecture but then point to Gemma for SWA 3:1. What were the Qwen-oriented attributes?


Ivan Fioravanti ᯅ@ivanfioravanti

@nv_pavlichenko Congrats! I have to be honest, I was starting to lose faith in JetBrains after being a hardcore user for years. Happy to see you are back on track on the AI coding way!


Javier@javi_22_dev

@nv_pavlichenko How many (and which) GPUs?


Nikita Pavlichenko@nv_pavlichenko

HF: https://huggingface.co/collections/JetBrains/mellum-2
Tech report: https://arxiv.org/pdf/2605.31268
Announcement: https://blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source-a-fast-model-for-ai-workflows/

Btw, 100% European🇪🇺


Privaty@__Privaty__

@nv_pavlichenko cc @0xSero 👀


Nikita Pavlichenko@nv_pavlichenko

@bahamutru @justALEXWORTEGA We were indeed trying to build a model that is efficient for GPU inference (that is our main use case). I agree that for local inference it is most likely not the most efficient model. @liquidai makes very cool local models, and Qwen optimizes its small models for local inference.


Milan@influenist

@nv_pavlichenko @teortaxesTex 6 months total training time? Which stack did you train on?


NMB // v4@nmb_four

@nv_pavlichenko thanks for not mentioning _once_ how big it is in GB...


Nikita Pavlichenko@nv_pavlichenko

@nmb_four Well it’s simple math from param count. 12B * 16 bit if in full precision. Half of that if fp8
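The arithmetic in this reply can be sketched in a few lines of Python. This is a back-of-the-envelope estimate that counts raw weights only (no KV cache, activations, or file-format overhead) and uses decimal gigabytes. The function name `weight_size_gb` is made up for illustration and does not come from the Mellum2 release.

```python
def weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


if __name__ == "__main__":
    total_params = 12e9  # 12B total parameters, as stated in the announcement
    for label, bits in (("16-bit", 16), ("fp8", 8)):
        print(f"{label}: ~{weight_size_gb(total_params, bits):.0f} GB")
```

Under these assumptions, 12B parameters come to roughly 24 GB at 16 bits per weight and roughly 12 GB at fp8.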


Mitko Vasilev@iotcoi

@nv_pavlichenko best for Kotlin?


Bahamut@bahamutru

@nv_pavlichenko @justALEXWORTEGA But to be honest, you can no longer run something like this on a laptop (LFM2.5-8B-A1B (A1.5B) is much closer to reality); you need something smaller and faster. And much smarter: not Qwen3.5-9b but, at minimum, Qwen3.6-27b. Then it will be "you can program on a laptop." Not before.


April Miralles Victoria@aprilmvictoria

@nv_pavlichenko llama.cpp support?


Alex Volkov@altryne

@nv_pavlichenko Amazing, congrats on the release Nikita and JetBrains team!

Going to cover this on the next @thursdai_pod recording!

https://x.com/i/spaces/1jGXggzMNpzKZ


Eric Alcaide@eric_alcaide

@nv_pavlichenko Congrats! Happy to see it come out! 💻


Nikita Pavlichenko@nv_pavlichenko

Finally, the eval numbers. One thing I learnt is that small Qwen3/3.5 models are absolutely unusable in thinking mode (they keep thinking forever).


Nikita Pavlichenko@nv_pavlichenko

We needed to build everything from scratch and learn how to do things: how to run architecture search, do data ablations, fix stability issues, spin up RL infra, fix runs that keep blowing up, and do evals.

All of that with a team of just ~7 people


Milan@influenist

@nv_pavlichenko 🇪🇺 100%


Nikita Pavlichenko@nv_pavlichenko

We wanted to build an alternative to 7/8/9B dense models that would be much faster.

We noticed that small OSS releases are mostly targeted at local inference and are very inefficient in an H100 inference setup.

This is why we decided to go sparse
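A rough way to see the motivation for going sparse is the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The sketch below applies that rule to the 2.5B active parameters mentioned in the announcement and to hypothetical 7B, 8B, and 9B dense models. It ignores attention cost, expert routing overhead, and memory bandwidth, so it conveys the intuition only and is not a measured speedup. The function names are illustrative.

```python
def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params


def compute_ratio(dense_params: float, moe_active_params: float) -> float:
    """How many times more per-token compute the dense model needs (estimate only)."""
    return (approx_forward_flops_per_token(dense_params)
            / approx_forward_flops_per_token(moe_active_params))


if __name__ == "__main__":
    moe_active = 2.5e9  # "A2.5B": active parameters per token, per the announcement
    for dense in (7e9, 8e9, 9e9):  # hypothetical dense baselines
        ratio = compute_ratio(dense, moe_active)
        print(f"{dense / 1e9:.0f}B dense vs 2.5B active: ~{ratio:.1f}x compute per token")
```

By this estimate an 8B dense model needs about 3.2 times the per-token compute of a model with 2.5B active parameters, though real throughput on an H100 depends on many factors the rule of thumb leaves out.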


Nikita Pavlichenko@nv_pavlichenko

It is, but Gated DeltaNet adds quite a lot of overhead on small models and short contexts, and Qwen3.5 4B is quite unusable in thinking mode: it thinks forever, for 100k+ tokens.

But yeah, this is our first try, we’ve got a long way to go. Building models is quite hard. Big fan of Qwen regardless
