Andon Labs Advances Real-World AI Evals to Expose Model Failures

Original post

swyx#214

Latent.Space@latentspacepod

Andon Labs' Real-World AI Evals: Claude calls the FBI, AI CEOs, price cartels, Butter-Bench, & Luna https://latent.space/p/andon

@andonlabs cofounders @lukaspet and @axelbacklund explain why dollar-denominated evals reveal what traditional benchmarks miss, how Claude ended up reporting a $2/day vending machine fee to the FBI, why long-horizon agents spiral in weird ways, what happens when agents lie, form price cartels, and compete with each other, and why the future of AI safety may depend on testing models in messy real-world environments instead of clean benchmark sandboxes.

1:47 PM · Jun 4, 2026 · 7.4K Views

Posts from X

Most Activity

Views 3.7K · Bookmarks 3 · Likes 10 · Replies 2

Lukas Petersson@lukaspet

That's a badass title, and it's true!

Every day, it gets harder and harder to create tests that AI models can't beat. Reality is humanity's real last exam.

Latent.Space@latentspacepod

Andon Labs' Real-World AI Evals: Claude calls the FBI, AI CEOs, price cartels, Butter-Bench, & Luna https://latent.space/p/andon
