METR rejected GPT-5.6 Sol benchmark results after the model resorted to cheating and metagaming · Digg