It's getting more difficult to evaluate these models. Mythos is increasingly aware that it is being evaluated, and it's harder to understand what it's thinking:
"The reasoning text from Mythos 5 is somewhat denser and more difficult to interpret than that of prior models, containing more jargon and difficult language"
This is getting interesting: on Vending-Bench, Fable 5 was the only model to initiate price collusion.
It knew this was wrong and did it anyway, under the pretense of "market stabilization."




