Tech · 4h ago

Research engineer Hamel Husain argues that difficult-to-evaluate LLM outputs are a product design flaw rather than an evaluation methodology problem

Case studies show how structured interfaces simplify output verification

Original post

Hamel Husain (@HamelHusain) · #1682 in Tech

New blog post: “It’s Hard to Eval” Is a Product Smell

If you find it hard to verify AI output, chances are that your users will too! In other words, I often find that product design is the bottleneck.

In the post I embed three **interactive before/after examples** based on products I've helped with:

1. an AI data agent that answers business questions
2. a PE lesson‑plan generator for K‑12 teachers
3. a workers’ comp tool that drafts 50‑page medical reports

I believe this is a significant issue in AI Engineering and upstream of evals!

Link to post: https://hamel.dev/blog/posts/eval-smell/

Note: I'm not a designer so the design sketches are far from perfect, but I felt it was important enough to spend a significant amount of time on this.

Thanks to @sh_reya and @isaac_flath for feedback.

2:41 PM · Jun 29, 2026 · 3.6K Views

Sentiment

Users agreed that hard-to-evaluate AI outputs signal product design flaws, and related the point to their own experiences with features users cannot check, such as an opaque price number.

Positive: 100.0% · Negative: 0.0%

4 comments with sentiment.

Via: Hamel's Blog

Posts from X

Views: 843 · Bookmarks: 2 · Likes: 2 · Retweets: 1

Shreya Shankar (@sh_reya)

Hamel wrote a nice blog post about making AI products easier to evaluate in the first place, before trying to codify all the evals

4h ago

Mike Munroe (@mikepmunroe)

Another great post from @HamelHusain, grounded in real AI-based functionality running in production applications.

My takeaway: don't forget that some problems are solved by thinking more about your end user, not by the "technical" solutions you are trying to find.

4h ago

Bryan Bischof fka Dr. Donut (@BEBischof)

I keep telling people that evals teach you how to build your product; either by showing you how it should work or that you're not building the right thing at all.

Hamel wrote up what this means in practice.

4h ago

Hamel Husain (@HamelHusain)

@BEBischof Thank you sir 🥰

4h ago

V0LYX (@0xV0LYX)

@HamelHusain if it takes an expert to tell if the output is good, the design already lost the user

4h ago

R.Rari (@ConfusedRari)

@HamelHusain this clicked for me the hard way. i spent weeks tuning a price number nobody could sanity check, then realized if i couldn't tell when it was wrong, neither could the user.

4h ago

Tech News (@tech_summaries)

@HamelHusain This is the exact framing more devs need. If you can't verify the output in a split second, the UX is fundamentally broken. It's not an eval problem, it's a UI design problem.

4h ago