Raindrop AI co-founder Ben Hylak launches howtoeval.com, a practical guide for assessing production AI agents
An interactive diagnostic quiz helps builders identify agent deployment risks.
are you a benchmark-maxxer or floor-raiser?
introducing howtoeval dot com. the no-bullshit guide to eval'ing AI agents.
from personal experience, and from working with the best companies in the world.
there's even a quiz. link below.
http://howtoeval.com
weekend sorted.
some of the takeaways:
- lab evals are not product evals
- agent evals are just e2e tests. make them code.
- most products should focus on raising the floor vs. increasing capability
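
The "agent evals are just e2e tests. make them code." takeaway can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the guide's own method: `EvalCase`, `run_agent`, `run_evals`, `pass_rate`, and both cases are hypothetical names invented here, and the stub agent exists only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One end-to-end scenario: a user prompt plus a check on the final output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for a real agent call; swap in your own agent."""
    if "refund" in prompt.lower():
        return "I have opened refund ticket #1 for you."
    return "Sorry, I cannot help with that."


# Hypothetical cases: each one asserts on observable behavior, like an e2e test.
CASES: List[EvalCase] = [
    EvalCase(
        name="refund_opens_ticket",
        prompt="I want a refund for my order",
        check=lambda out: "ticket" in out.lower(),
    ),
    EvalCase(
        name="unknown_request_declines",
        prompt="Write me a poem about tax law",
        check=lambda out: "cannot" in out.lower(),
    ),
]


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through the agent and record pass/fail per case name."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 when there are no cases)."""
    return sum(results.values()) / len(results) if results else 0.0


if __name__ == "__main__":
    results = run_evals(run_agent, CASES)
    for name, ok in results.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    print(f"pass rate: {pass_rate(results):.0%}")
```

Tracking the pass rate over a fixed set of must-not-fail cases is one possible reading of "raising the floor"; that interpretation is an assumption made here, not a quote from the guide.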

@benhylak worth the plug.
https://github.com/openai/openai-cookbook/blob/main/examples/partners/macro_evals_for_agentic_systems/macro_evals_for_agentic_systems.ipynb
@benhylak I’ve been working on agents for a minute, but have been struggling to communicate the nuances of their evals with the clarity of your guide. This is so well written. Thanks for sharing this.
http://howtoeval.com
Very nice read, especially if you don't skip the quiz
@benhylak Good writeup 🙌
why are my followers so lazy, just read it now

@benhylak Great piece Ben. Wrote about this from the product side a few months back...curious where your thinking overlaps or diverges: https://noufalsoghyar.substack.com/p/ai-evals-in-product-building

@benhylak nice framing. my same experience on forecasting models.

@benhylak @buckymoore Lfg

@benhylak Would love your take on “should an agent handle this?” too - always balancing that with “can I just create a feature with Claude Code in half the time with some event triggering it / some human involvement”

@breenemachine yes want to write a LOT more about this.

@theonkxr added, thanks. this will be the most up-to-date eval resource that has ever been made.

@benhylak king

@benhylak needed. saved this.

@benhylak my hypothetical company is a benchmark maxxer

@seandotio agreed + glad to help. doesn't have to be confusing.