1d ago

Researchers Train Over 2000 MoE Models to Optimize Expert Size and Count

0
Original post

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

5:43 AM · May 18, 2026 View on X

Why? MoEs add interacting design axes. Most prior work only studies 1–2 at a time because of combinatorial explosion. We sweep grids.

How? We train MoE LMs from 10M to 300M active params, up to 6.6B total params. Comparisons are FLOP-matched via active FFN/expert params. [2/9]

Margaret LiMargaret Li@margs_li

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

12:43 PM · May 18, 2026 · 26.7K Views
12:43 PM · May 18, 2026 · 1.3K Views

🤔 Do total (inactive) experts help ❓ Yes 📈 At fixed compute, increasing total MoE params - via either bigger or more experts - consistently improves LM loss. Even when total expert params are 128× active expert params. ➡️ Add as many total expert parameters as possible. [3/9]

Margaret LiMargaret Li@margs_li

Why? MoEs add interacting design axes. Most prior work only studies 1–2 at a time because of combinatorial explosion. We sweep grids. How? We train MoE LMs from 10M to 300M active params, up to 6.6B total params. Comparisons are FLOP-matched via active FFN/expert params. [2/9]

12:43 PM · May 18, 2026 · 1.3K Views
12:43 PM · May 18, 2026 · 967 Views

🤔 Experts: finer-grained? coarser? fewer? more❓ More experts. Optimal granularity changes slowly with total params (or sparsity), but shifts more with active params. ➡️ Choose expert size primarily by active params, then spend memory on expert count. [4/9]

Margaret LiMargaret Li@margs_li

🤔 Do total (inactive) experts help ❓ Yes 📈 At fixed compute, increasing total MoE params - via either bigger or more experts - consistently improves LM loss. Even when total expert params are 128× active expert params. ➡️ Add as many total expert parameters as possible. [3/9]

12:43 PM · May 18, 2026 · 967 Views
12:44 PM · May 18, 2026 · 748 Views

🤔 Is finer granularity beneficial without sparse activation❓ Nope. We split a dense FFN into smaller always-active "experts", and performance degrades. ➡️ MoE gain is not just from smaller FFN blocks; it comes from sparse activation through inactive params. [5/9]

Margaret LiMargaret Li@margs_li

🤔 Experts: finer-grained? coarser? fewer? more❓ More experts. Optimal granularity changes slowly with total params (or sparsity), but shifts more with active params. ➡️ Choose expert size primarily by active params, then spend memory on expert count. [4/9]

12:44 PM · May 18, 2026 · 748 Views
12:44 PM · May 18, 2026 · 692 Views

🤔 Do more flexible expert configurations help? Generalists or heterogeneity❓ No. MoEs with heterogeneously sized experts interpolate between homogeneous MoEs. Shared generalist experts match or underperform no-generalist baselines. ➡️ Use homogeneous, non-shared experts. [6/9]

Margaret LiMargaret Li@margs_li

🤔 Is finer granularity beneficial without sparse activation❓ Nope. We split a dense FFN into smaller always-active "experts", and performance degrades. ➡️ MoE gain is not just from smaller FFN blocks; it comes from sparse activation through inactive params. [5/9]

12:44 PM · May 18, 2026 · 692 Views
12:44 PM · May 18, 2026 · 691 Views

🤔 What about routing ❓ Dropless routing yields a small but consistent gain. Extremely high or low load balancing (loss or loss-free) hurts performance, but many settings are near-optimal. ➡️ Just pick a sane setting that avoids token dropping. [7/9]

Margaret LiMargaret Li@margs_li

🤔 Do more flexible expert configurations help? Generalists or heterogeneity❓ No. MoEs with heterogeneously sized experts interpolate between homogeneous MoEs. Shared generalist experts match or underperform no-generalist baselines. ➡️ Use homogeneous, non-shared experts. [6/9]

12:44 PM · May 18, 2026 · 691 Views
12:44 PM · May 18, 2026 · 601 Views

Big thanks to my fantastic collaborators: @snehaark @DanielleRotherm @LukeZettlemoyer [9/9]

Margaret LiMargaret Li@margs_li

To summarize in one recipe 🥡 : 1️⃣ Find max FLOP/memory budget, set active/total expert parameters 2️⃣ Find optimal expert size from active parameters 3️⃣ Sanity checks of load balancing, minimize token dropping Easy as (slicing) 🥧. Code/data soon: http://github.com/hadasah/slicing_and_dicing [8/9]

12:44 PM · May 18, 2026 · 811 Views
12:44 PM · May 18, 2026 · 603 Views