7h ago
Dimitris Papailiopoulos argues preventing LLM dataset contamination on GSM8K is practically impossible without manual evaluation
Models are routinely and accidentally trained on test datasets.