Wharton's Ethan Mollick argues organizations must build custom benchmarks rather than swapping AI models solely for cost savings

The BetterBench framework helps organizations avoid common evaluation pitfalls.

Original post

You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8.

(This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)

Jake Boggs@JakeABoggs

Fable 5 is a large step for Anthropic's vision capabilities and effectively ties with GPT-5.5 on HieroglyphBench, my benchmark which tests how well VLMs can transcribe ancient Egyptian hieroglyphs

However, they're both still far behind the Gemini series, where 3.5 Flash has more than double the score

8:02 AM · Jul 2, 2026 · 24.2K Views

Sentiment

Many users endorsed Ethan Mollick's call for custom benchmarks because they help select the most efficient model for specific workflows rather than relying on generic leaderboards.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)

Posts from X

Ethan Mollick@emollick

Vending machine

2h · 2.1K views · 7 likes

Bookmarks: 1

Daksh Tyagi | AI & SaaS@learnwithdaksh

@emollick Use Gemini 3.5 Flash to decipher ancient pharaoh curses. Use Claude Opus 4.8 to dispense your Diet Coke.

2h211

Jan Stevens@janstevens

@emollick Now, what model should I use if I want to sell Hieroglyphics out of a vending machine? 😉

2h522

Mike Bradley@MikeBradleyAI

@emollick This is directionally good advice. Don’t guess, test. Find the most efficient and cost effective model for your use case.

I genuinely appreciate you advocating for right model right job, and not just uber model all jobs.

The practical economics of this are meaningful.

2h46

Ziwen Xu 🔶@z1vex

@emollick Gemini 3.5 Flash specifically being best at hieroglyph translation is genuinely funny to me

feels like u learned the hard way after a few lab costs blew up

2h30

Midge@Midge_xbt

@emollick the slippery slope when generic benchmarks replace real user testing

cost savings look great until the model fails in production

2h13

Jasper 🌰@building BBX@bbxjasper

@emollick The gap between public benchmark rank and "works on my actual task" keeps widening. Half my model picks flip once I test on my own eval set instead of the leaderboard.

2h12

Alfredo González-Espinoza@Spiralizing

@emollick Having benchmarks designed to optimize model-agnostic workflows is what will work.

2h7

midnight@midsusnight

@emollick benchmarks are useful but the real test is how the model handles ur actual terrible workflow not synthetic edge cases

2h4

Pierre de la Grand'rive@pierre_dlgr

@emollick Exactly. The right model depends on the job, not the leaderboard. Same reason you don't hire the same person for every role. Workers, not tools.

1h3