
Princeton's Arvind Narayanan proposes that companies build dedicated, independent internal teams to evaluate AI deployments

These cross-functional teams would mimic QA and security red teams.


Original post

Arvind Narayanan@random_walker

Companies check their own work through various internal but independent functional units: QA, security red teams, model risk management in banks. **I think it’s time for AI evaluation to become one such unit.** Orgs deploying AI should stand up cross-functional eval teams with their own reporting line. Many reasons:

1) Evals as IP / moat. It’s now widely recognized that evals are the new IP. So it makes sense to have teams whose primary focus is on creating and widening this moat.

2) Evals are harder than you think. This is less well recognized, but as someone whose research centers on AI evals, this has been my consistent experience. Evaluation can't be an afterthought; it must be a center of excellence.

3) Evals are inherently cross-functional and require a distinct set of skills. They are judgment heavy, require both AI expertise and deep domain expertise, as well as customer understanding and sophisticated thinking about risk. To do them well, you need competence in data science & stats, business operations, product/customer experience, IT, risk management, and even compliance (depending on the sector).

4) In-house but independent eval teams keep companies honest. A climate where teams are getting top-down mandates to hit deployment targets and show results has resulted in a culture of companies fooling themselves. It is extremely easy, knowingly or unknowingly, to do evals poorly, making your AI deployment look much more successful than it is. Eval teams who don’t share the deploying teams’ KPIs are the best defense against this.

5:39 AM · Jul 3, 2026 · 8.4K Views

Sentiment

Users back calls for independent AI evaluation teams inside companies because self-audits are viewed as unreliable, like grading one's own exam, and likely to overlook issues in favor of fast deployment.

Positive: 100.0% · Negative: 0.0%

4 comments with sentiment.


Posts from X


Carolina Mattsson@CarolinaMttssn

@random_walker Do you have a manager-friendly write-up on this? Asking for a friend who would pass this on to management if there is one


Arvind Narayanan@random_walker

@CarolinaMttssn I would love to ... if you could give some guidance on what would make it manager friendly :)

Is it mainly a matter of tweet vs blog? Or is it too short? Or not the right lingo? Thanks!


Carolina Mattsson@CarolinaMttssn

@random_walker Maybe more a matter of conveying your authority to speak on the matter to someone without the time to look you up?


V0LYX@0xV0LYX

@random_walker they should also externalize some evals to avoid internal blind spots. too easy to design tests that fit ur own narrative


Sanjay Uppal@Sanjay_Uppal

I agree that independent evaluation needs to become a first-class function.

But I think the industry is still framing the problem too narrowly.

Eval asks: “Did the model perform well?” (Model Risk Management)

Regulated enterprises have a broader question:

“Can every AI-driven decision be governed, controlled, explained and reproduced under policy?” (That’s Decision Risk Management)

That’s no longer just an eval function.

It requires an enterprise decision control layer that sits above models and agents, with independent governance, approvals, policies, traceability and audit.

Models will continue to evolve. Evaluation will improve. But institutional trust ultimately depends on control, not just measurement.


Ferbin@Ferbin08

@random_walker The strongest eval teams will be the annoying people asking:

“show me the weird cases, not the demo cases.”


Jayasimha@Jayasim52317099

@random_walker There would be only two categories of orgs: 1. Evals org, like you propose, and 2. Modelling org, whose charter is to hill climb the benchmarks


Eriks Briedis@eriks_b

That separate reporting line matters most when failures come in.

When production misses something, someone has to turn that miss into a case and keep the trace with it. Cases that have stopped teaching you anything should disappear too, or the eval team ends up making dashboards for the next deploy.


Mert · AI Architect@MertLovesAI

@random_walker the reporting line is the hard part.

a team that shares KPIs with the deploying org will tune the rubric until the eval passes, and then the eval just confirms what deployment already claimed.


Carolina Mattsson@CarolinaMttssn

@random_walker Saying “here’s a cool post” has its limits within an org, but authoritative sources are pretty thin on the importance of evaluation (over, say, the model or compute) in AI deployment


Henry Nguyen@henryhndev

@random_walker This is exactly the right call. Dedicated eval units force the discipline needed to ship reliable AI systems, not just fast ones.


carstenbergenholtz@justsomeoneDK

@random_walker @CarolinaMttssn I think it is partially a matter of a blog vs. a tweet. The lingo seems ok - I think I'd expand on what an AI evaluation actually means: how it is done and what it is an answer to.


TheNeonVoice@neon_voix

@random_walker Companies evaluating their own AI is like grading your own exam. And when top-down mandates push deployment, self-audit becomes rubber-stamping. Independent teams with separate KPIs. That's it.
