In my experience, all model routers underestimate the difficulty of non-math/coding tasks and assign them too little intelligence. This is worth addressing, as non-verifiable tasks (innovation, marketing, qualitative analysis) often benefit the most from using “smarter” AI models
Wharton's Ethan Mollick argues AI model routers systematically underestimate the difficulty of qualitative and non-coding tasks
Story Overview
Ethan Mollick flags a consistent blind spot in AI routing systems: they lean on math and coding benchmarks to decide model strength, which leaves qualitative work like innovation, marketing, and open-ended analysis with weaker models than the tasks actually need.
Where the benchmarks fall short
Routers tested mainly on verifiable IT problems tend to overrate lighter models when the output has no single right answer, so creative prompts often get routed to underpowered models by default.
Who feels the mismatch first
Users running non-coding projects notice the gap most, since those workflows stand to gain the biggest lift from stronger models yet receive the least intelligence under current routing logic.
Many users criticize model routers for underestimating non-math tasks like creative work, arguing they misroute to weaker models and cause costly rework instead of savings.
Most Activity
It is worth being very, very careful about how you are approaching routing, especially when the systems are primarily tested on verifiable IT benchmarks, which may lead you to overestimate the ability of weaker models.
This, plus the fact that the primary access mode right now is agentic. You can't fragment a full session with code execution and complex parallel processes into model routing, beyond bounded subagent delegation.

@emollick @markletree Mark, how do you think about this at Coinbase re: your model gateway?

@emollick There’s no personal-user option for extended reasoning times beyond versions of Deep Research.
Which certainly don’t give you hours of focused thought.
I can understand this, but enterprise users should have that option.
This, plus the fact that the primary access mode right now is agentic sessions, which constantly require good reliability and deep context.

@emollick i think the error rate in making minor decisions is so high, even for big models: we still see errors equivalent to "A=>!B so A=>B" on opus when the task scope is big enough. gpt 5.5's edge is that it doesn't fall for these as often.

@emollick issue with model routing conceptually, IMO, is that while it starts from an intuitive biological model (different parts of brain are specialized to different tasks), models are Not Usually Specialists

That said, lot to be done with smaller general agents: our primary problem is data shape for training, agentic pretraining has hardly been done so still far from saturating.

@emollick This needn't be the case but it usually is. And when you boil down this concept to a low enough level you just invent MoE again.
@emollick Data-model fit underrated for tasks where there's no right answer

Most production routers rely on fast signals: embedding similarity, small BERT-style classifiers, matrix factorization on preference data, or simple heuristics.
These work reasonably well when difficulty correlates with surface features common in math/coding benchmarks.
Qualitative and strategic tasks often look “simple” on the surface (short prompt, natural language) yet require deep world models or creative recombination that only frontier models reliably deliver.
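
To make the failure mode concrete, here is a minimal sketch of a surface-feature router. Everything in it is an assumption for illustration: the cue list, the weights, the threshold, and the tier names "small" and "frontier" are hypothetical and not taken from any real router, which would more likely use a learned classifier or embeddings. It only shows how scoring on prompt length and technical keywords sends a short strategy question to the weaker tier.

```python
from dataclasses import dataclass

# Hypothetical surface cues (assumption); real routers learn such signals.
TECHNICAL_CUES = ("def ", "class ", "prove", "integral", "compile", "stack trace", "regex")


@dataclass
class Route:
    model: str    # placeholder tier name: "small" or "frontier"
    score: float  # heuristic difficulty estimate in [0, 1]


def surface_difficulty(prompt: str) -> float:
    """Estimate difficulty from surface features only: length and technical cues."""
    text = prompt.lower()
    # Longer prompts score higher, capped at 200 words.
    length_score = min(len(text.split()) / 200.0, 1.0)
    # Two or more technical cues saturate the cue score.
    cue_score = min(sum(cue in text for cue in TECHNICAL_CUES) / 2.0, 1.0)
    return 0.4 * length_score + 0.6 * cue_score


def route(prompt: str, threshold: float = 0.3) -> Route:
    """Send the prompt to the stronger tier only if the surface score is high."""
    score = surface_difficulty(prompt)
    return Route("frontier" if score >= threshold else "small", score)


coding = "Prove the loop terminates, then fix this stack trace in def parse(): ..."
strategy = "What should our positioning be if a competitor launches a free tier?"

print(route(coding).model)    # technical cues push this to the stronger tier
print(route(strategy).model)  # short, plain-language prompt lands on the weaker tier
```

The strategy prompt contains no technical cues and is only a dozen words long, so a router of this shape treats it as easy regardless of how hard it is to answer well.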

@emollick It could also be that an *average* non-math/coding task is just easier?
(Adding non-controlled bias to the router.)

@emollick this is why openai is still ahead for non coders. nothing comes even close to gpt pro for complex non coding tasks (except maybe fable but we did not get much time to try it).

@emollick The labs seem to be headed in the direction of letting the models choose the effort levels:

@emollick I think it's high time we have models that are niche-centric but are part of a general world model.

@emollick Have you seen any that are close to doing it well? I deprecated my own router yesterday after coming to the same conclusion. It's not hard making the call as I spin up each worktree, but I'd like something more proactive.

@emollick model routers underestimate quantitative performance. yet the real value lies in the realm of "unknown unknowns." how models behave in creativity, strategy, and qualitative analysis determines their potential.

@emollick Routers optimized on math and code benchmarks will systematically under-route creative and strategic work. The tasks hardest to evaluate are exactly the ones that need the best models.

@emollick How would you test this?

@emollick routers treat creative/strategic tasks like they're simple when they're actually the hardest to get right.. no clear answer to verify against