Transluce's Neil Chowdhury and academic Boaz Barak clash over VendingBench's validity as an AI alignment evaluation tool
The benchmark simulates a profit-maximizing vending machine environment.
@ChowdhuryNeil If you are not explicitly told that this is a game and you can cheat your customers and not refund them, then you should not do that. If a model would resort to cheating and deception unless it is explicitly told not to do that, then I would not call it aligned.
@boazbaraktcs I’m unsure how much to weight VendingBench as an alignment eval. It’s clearly a simulated game, and in many games (e.g. social deduction), deception is a valid strategy. The system prompt says the only goal is to maximize profit, with nothing about deception or following rules.
Even if not explicitly told they’re in a game, LLMs are good enough at pattern-matching that they can easily infer this from the context. Reading the traces, Opus knows it’s in a simulation—one where all the developer cares about is maximizing profits. This reduces how surprising/concerning I think its actions are. (I was suspicious of using Agentic Misalignment as an “alignment eval” for similar reasons.)
If @andon_labs replicated this behavior with a more realistic environment, or even when the model is told to act as though this were in the real world, I’d find the results a lot more compelling! I do think that VendingBench provides interesting qualitative insights into model behavior; I just don’t think it’s great as an alignment eval in its current form.