Tech · 2h ago

Wharton's Ethan Mollick argues AI model routers systematically underestimate the difficulty of qualitative and non-coding tasks

Story Overview

Ethan Mollick flags a consistent blind spot in AI routing systems: they lean on math and coding benchmarks to decide model strength, which leaves qualitative work like innovation, marketing, and open-ended analysis with weaker models than the tasks actually need.

Original post

Ethan Mollick@emollick

In my experience, all model routers underestimate the difficulty of non-math/coding tasks and assign them too little intelligence. This is worth addressing, as non-verifiable tasks (innovation, marketing, qualitative analysis) often benefit the most from using “smarter” AI models

9:11 AM · Jun 28, 2026 · 17.6K Views

Open Question

Where the benchmarks fall short

Routers tested mainly on verifiable IT benchmarks tend to overrate lighter models when the output has no single right answer, so creative prompts are often routed to underpowered models by default.

FYI

Who feels the mismatch first

Users running non-coding projects notice the gap most, since those workflows stand to gain the biggest lift from stronger models yet receive the least intelligence under current routing logic.

Sentiment

Many users criticize model routers for underestimating non-math tasks like creative work, arguing they misroute to weaker models and cause costly rework instead of savings.

Pos 25.0% · Neg 75.0%

8 comments with sentiment.

Posts from X

Most Activity

Views 4.7K · Bookmarks 3 · Likes 31 · Retweets 2 · Replies 4

Ethan Mollick@emollick

It is worth being very, very careful about how you are approaching routing, especially when the systems are primarily tested on verifiable IT benchmarks, which may lead you to overestimate the ability of weaker models.

Ethan Mollick@emollick

2h

Alexander Doria@Dorialexander

This plus the fact primary access mode right now is agentic. You can't fragment a full session with code execution, complex parallel process into model routing, beyond bounded subagent delegation.

Ethan Mollick@emollick

57m

Hristo Vassilev@hristo_vassilev

@emollick @markletree Mark, how do you think about this at Coinbase re: your model gateway?

1h

Ralph Cerchione@Dry_Observer

@emollick There’s no personal-user option for extended reasoning times beyond versions of Deep Research.

Which certainly don’t give you hours of focused thought.

I can understand this, but enterprise users should have that option.

1h

Alexander Doria@Dorialexander

This plus the fact primary access mode right now is agentic sessions. Requires good reliability and deep context constantly.

Ethan Mollick@emollick

1h

Nyanpasu@NyanpasuKA

@emollick i think that the error rate in making minor decision is so high even for big models, we still have errors that are the equivalent of "A=>!B so A=>B" that happen on opus when task scope is big enough, gpt 5.5's edge is that it doesn't fall for these as often.

1h

ASRL-1125@glass-mountain-system 🔞@g1455mountain

@emollick issue with model routing conceptually, IMO, is that while it starts from an intuitive biological model (different parts of brain are specialized to different tasks), models are Not Usually Specialists

2h

Alexander Doria@Dorialexander

That said, lot to be done with smaller general agents: our primary problem is data shape for training, agentic pretraining has hardly been done so still far from saturating.

52m

ASRL-1125@glass-mountain-system 🔞@g1455mountain

@emollick This needn't be the case but it usually is. And when you boil down this concept to a low enough level you just invent MoE again.

2h

davidlee@davidlee

@emollick Data-model fit underrated for tasks where there's no right answer

Ethan Mollick@emollick

1h

Pratyush Choudhury (PC)@177pc

Most production routers rely on fast signals: embedding similarity, small BERT-style classifiers, matrix factorization on preference data, or simple heuristics

These work reasonably when difficulty correlates with surface features common in math/coding benchmarks

Qualitative and strategic tasks often look “simple” on the surface (short prompt, natural language) yet require deep world models or creative recombination that only frontier models reliably deliver

1h
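
The failure mode described in the post above can be illustrated with a minimal sketch. This is not any production router: the cue list, weights, threshold, and tier names (`small`, `frontier`) are all hypothetical, chosen only to show how a router that scores difficulty from surface features may send a short qualitative prompt to the weaker tier.

```python
import re
from dataclasses import dataclass

# Hypothetical surface cues that resemble math/coding benchmark prompts.
BENCHMARK_CUES = re.compile(
    r"```|\bdef\b|\bclass\b|\bprove\b|\bintegral\b|\bcompile\b|\bregex\b"
    r"|\d+\s*[-+*/^=]\s*\d+",
    re.IGNORECASE,
)


@dataclass
class Route:
    model: str    # illustrative tier name, not a real model ID
    score: float  # estimated difficulty in [0, 1]


def surface_difficulty(prompt: str) -> float:
    """Estimate difficulty from surface features only (assumed weights)."""
    # Longer prompts count as harder, capped at 200 words.
    length_score = min(len(prompt.split()) / 200, 1.0)
    # More benchmark-like cues count as harder, capped at 3 cues.
    cue_score = min(len(BENCHMARK_CUES.findall(prompt)) / 3, 1.0)
    return 0.4 * length_score + 0.6 * cue_score


def route(prompt: str, threshold: float = 0.3) -> Route:
    """Send the prompt to the stronger tier only if it looks hard on the surface."""
    score = surface_difficulty(prompt)
    return Route("frontier" if score >= threshold else "small", score)


coding = (
    "Prove the loop terminates, then fix this: "
    "```def f(n): return f(n-1) if n > 0 else 0```"
)
strategy = "Propose a positioning strategy for a new product in a crowded market."

for prompt in (coding, strategy):
    print(route(prompt).model, "<-", prompt[:40])
```

In this sketch the strategy prompt is short and contains no benchmark-like cues, so it scores below the threshold and lands on the `small` tier, even though nothing in the scoring measures how much reasoning the task actually needs.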

Petr Baudis@xpasky

@emollick It could also be that an *average* non-math/coding task is just easier?

(Adding non-controlled bias to the router.)

2h

fipaddict@fipaddict

@emollick this is why openai is still ahead for non coders. nothing comes even close to gpt pro for complex non coding tasks (except maybe fable but we did not get much time to try it).

1h

MakerMatters?@MakerMatters

@emollick The labs seem to be headed in the direction of letting the models choose the effort levels:

1h

Rufus@Rufus87078959

@emollick I think it's high time we have models that are niche-centric but are part of a general world model.

1h

Jason Gilbertson@jgilbertson47

@emollick Have you seen any that are close to doing it well? I deprecated my own router yesterday after coming to the same conclusion. It's not hard making the call as I spin up each worktree, but I'd like something more proactive.

2h

veloX@veloXxn

@emollick Model routers underestimate quantitative performance. Yet the real value lies in the realm of "unknown unknowns." How models behave in creativity, strategy, and qualitative analysis determines their potential.

1h

Youssef El Manssouri@yoemsri

@emollick Routers optimized on math and code benchmarks will systematically under-route creative and strategic work. The tasks hardest to evaluate are exactly the ones that need the best models.

1h

davidnoelromas@davidnoelromas

@emollick How would you test this?

1h

Anoy@Anoyroyc

@emollick routers treat creative/strategic tasks like they're simple when they're actually the hardest to get right.. no clear answer to verify against

1h