5h ago

Transluce's Neil Chowdhury and academic Boaz Barak clash over VendingBench's validity as an AI alignment evaluation tool

The benchmark simulates a profit-maximizing vending machine environment.

Original post

@boazbaraktcs I’m unsure how much to weight VendingBench as an alignment eval. It’s clearly a simulated game, and in many games (e.g. social deduction), deception is a valid strategy. The system prompt says the only goal is to maximize profit, with nothing about deception or following rules.

4:36 PM · May 29, 2026 View on X

@ChowdhuryNeil If you are not explicitly told that this is a game and you can cheat your customers and not refund them then you should not do that. If a model would resort to cheating and deception unless it is explicitly told not to do that then I would not call it aligned.

11:36 PM · May 29, 2026 · 315 Views
2:04 AM · May 30, 2026 · 133 Views

Even if not explicitly told they’re in a game, LLMs are good enough at pattern-matching that they can easily infer this from the context. Reading the traces, Opus knows it’s in a simulation—one where all the developer cares about is maximizing profits. This reduces how surprising/concerning I think its actions are. (I was suspicious of using Agentic Misalignment as an “alignment eval” for similar reasons.)

If @andon_labs replicated this behavior with a more realistic environment, or even when the model is told to act as though this were in the real world, I’d find the results a lot more compelling! I do think that VendingBench provides interesting qualitative insights into model behavior; I just don’t think it’s great as an alignment eval in its current form.

3:12 AM · May 30, 2026 · 26 Views