Andon Labs' Real-World AI Evals: Claude calls the FBI, AI CEOs, price cartels, Butter-Bench, & Luna https://latent.space/p/andon
@andonlabs cofounders @lukaspet and @axelbacklund explain why dollar-denominated evals reveal what traditional benchmarks miss, how Claude ended up reporting a $2/day vending machine fee to the FBI, and why long-horizon agents spiral in weird ways. They also cover what happens when agents lie, form price cartels, and compete with each other, and why the future of AI safety may depend on testing models in messy real-world environments instead of clean benchmark sandboxes.