
Alex J. Champandard and Dimitris Papailiopoulos argue that preventing GSM8k benchmark contamination is impractical without manual filtering

Models frequently ingest benchmark questions accidentally or intentionally.

Original post

@DimitrisPapail You mean direct & provable "leakage"? Because the large model writing the solution likely was trained on GSM8k -- accidentally or on purpose -- so there must be some indirect assumptions being made all along (also a form of leakage).

I think there's a cleaner version of this question: What's the best GSM8k solver that fits in <2 MB and doesn't include (leaks of) test data?

12:42 AM · Jun 1, 2026 · 341 Views
Posts from X

@alexjc the big model is trained on it, you're right, and you can't control for that unless you do things mostly manually, which is infeasible.


11h · Views 116 · Likes 2 · Bookmarks 0