@DimitrisPapail You mean direct & provable "leakage"? I ask because the large model writing the solution was likely trained on GSM8K -- accidentally or on purpose -- so some indirect assumptions must have been made all along (also a form of leakage).
I think there's a cleaner version of this question: what's the best GSM8K solver that fits in <2 MB and doesn't include (leaks of) test data?