I wonder what people really think about MoEs? It's ok, the voting is anonymous, you can select the option that you really think, deep inside your heart.
Meta's Lucas Beyer launches anonymous MoE poll, sparking debate over parameter scaling and routing efficiency
Story Overview
Meta AI researcher Lucas Beyer posted an anonymous poll on X to surface unfiltered opinions on Mixture of Experts (MoE) models. Replies quickly challenged the claim that loss and knowledge scale with total parameter count while intelligence tracks only the active parameters.
Active versus total parameters remain unsettled
Replies stress that loss curves scale with every parameter, yet performance narratives often cite only the routed subset. That leaves routing efficiency, and the idea of any non-knowledge intelligence, open to doubt.
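To make the total-versus-active distinction concrete, here is a minimal sketch of the arithmetic. It assumes a simplified transformer in which each layer has one attention block, one linear router, and a set of two-matrix expert FFNs; embeddings, biases, and norms are ignored. The MoEConfig class and every number in it are hypothetical and do not describe any real model.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE shape; all values are illustrative only."""
    n_layers: int   # number of transformer layers
    d_model: int    # residual stream width
    d_ff: int       # hidden width of each expert FFN
    n_experts: int  # experts per MoE layer
    top_k: int      # experts each token is routed to


def ffn_params(cfg: MoEConfig) -> int:
    # Up and down projection matrices of one expert (biases ignored).
    return 2 * cfg.d_model * cfg.d_ff


def attn_params(cfg: MoEConfig) -> int:
    # Q, K, V and output projections (biases ignored).
    return 4 * cfg.d_model * cfg.d_model


def router_params(cfg: MoEConfig) -> int:
    # One linear map from the residual stream to per-expert scores.
    return cfg.d_model * cfg.n_experts


def total_params(cfg: MoEConfig) -> int:
    # Every expert counts, whether or not a given token uses it.
    per_layer = attn_params(cfg) + router_params(cfg) + cfg.n_experts * ffn_params(cfg)
    return cfg.n_layers * per_layer


def active_params(cfg: MoEConfig) -> int:
    # Only the top_k experts selected for a token take part in its forward pass.
    per_layer = attn_params(cfg) + router_params(cfg) + cfg.top_k * ffn_params(cfg)
    return cfg.n_layers * per_layer


cfg = MoEConfig(n_layers=2, d_model=8, d_ff=32, n_experts=4, top_k=1)
print(total_params(cfg), active_params(cfg))  # 4672 1600
```

In this toy configuration the model stores 4672 parameters but only 1600 of them touch any single token, which is the gap the replies are arguing about.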
Beyer already flagged MoE scaling limits
Months earlier, the same researcher noted that MoE is "not actually scaling pilled," framing the poll as a follow-up probe rather than a sudden shift in view.
Positive users praise MoEs for elegant balancing and VRAM efficiency, while negative users call them ugly hacks or overhyped ensembles that need extra infra knowledge.
Most Activity
I still wish we had something more globally aware than these routers. MoEs are frustrating. What do you mean loss and knowledge scale with total params and "intelligence" with active? Wtf is non-knowledge-based intelligence in an LLM? That's not true humanlike sparsity.

that this works okay vs learned routing is indictment enough

@teortaxesTex if the routing in the first few layers is *by construction* almost totally determined by the token embeddings, might as well use an arbitrary router instead of a learned router *for those*. or something!
@giffmana my infra guy handles it

@giffmana Let's be honest, if it wasn't a "Yuck", we would have seen more in vision rather than V-MoE 🙃
@giffmana They just need more merch int love.
@giffmana I hate them so much that I think you should just train smaller models

@teortaxesTex "in the initial several layers" actually wait this makes sense algorithmically with respect to the definition of a resnet, right? so many of the parameters in a resnet are wasted on conditioning embeddings in residualspace then reconditioning for-unembeddings...

@giffmana loss free balancing is so elegant it makes MoEs elegant

@giffmana Is there a "there's something missing that will make it not yuck" option?

@sameQCU @teortaxesTex if you want to get rid of this you can use incredibly large n-gram embedding tables and superword input tokens but everybody is a coward. here is a random link from online https://arxiv.org/abs/2502.01637

@giffmana Conceptually elegant, but practically frustrating

@giffmana @CSProfKGD MoEs are a beautiful idea, but the way routing and balancing are implemented in real models makes me cry

@giffmana surprised there isn't more love for MoEs i think they're beautiful

@teortaxesTex lol right i forgot about that!

@giffmana MoEs exist because we're GPU poor. There's something so clean about dense models, but no one can afford them.

@giffmana MoEs are just ensemble models with better PR tbh

@giffmana I'm still waiting for your hot takes after my last year's talk on MoEs. Well, I guess I probably heard some of them during the time we worked together. :p

@sameQCU yes, that's their reasoning, obviously, but why must routing be so useless in early layers? Why not construct a better general router? (no I don't know how, maybe we can't do this while we stick to one forward pass decoding)

@_ueaj @giffmana QB 😍