Companies check their own work through various internal but independent functional units: QA, security red teams, model risk management in banks. **I think it’s time for AI evaluation to become one such unit.** Orgs deploying AI should stand up cross-functional eval teams with their own reporting line. Many reasons:
2) Evals as IP / moat. It’s now widely recognized that evals are the new IP. So it makes sense to have teams whose primary focus is creating and widening this moat.
3) Evals are harder than you think. This is less well recognized, but as someone whose research centers on AI evals, it has been my consistent experience. Evaluation can't be an afterthought; it needs to be a center of excellence.
3) Evals are inherently cross-functional and require a distinct set of skills. They are judgment-heavy and require both AI expertise and deep domain expertise, as well as customer understanding and sophisticated thinking about risk. To do them well, you need competence in data science & stats, business operations, product/customer experience, IT, risk management, and even compliance (depending on the sector).
4) In-house but independent eval teams keep companies honest. A climate where teams get top-down mandates to hit deployment targets and show results has created a culture of companies fooling themselves. It is extremely easy, knowingly or unknowingly, to do evals poorly, making your AI deployment look much more successful than it is. Eval teams that don’t share the deploying teams’ KPIs are the best defense against this.