15h ago

AI Models Exhibit Counterfactual Evaluation Gap On Unseen Tasks

Original post
gavin leech (Non-Reasoning) (@GLEECH)

@joodalooped Some useful words: * Counterfactual-evaluation gap: they do way worse on stuff they haven't seen. Model task perf is indeed strongly dependent on task training data.

2:33 AM · May 19, 2026

Sentiment

Positive: 100%
Negative: 0%

Users describe AI models as very impressive at generalizing beyond literal memorization by noting and exploiting similarities on unseen tasks.

1 comment with sentiment.


Cluster engagement

83 snapshots