The way the unit distance problem was disproved is interesting: the LLM started working directly in the opposite direction of the common belief found in the training corpus. I wonder if the discover was in part made during the reinforcement learning stage.