AI Models Exhibit Counterfactual Evaluation Gap On Unseen Tasks

gavin leech (Non-Reasoning) (@GLEECH)

@joodalooped Some useful words:
* Counterfactual-evaluation gap: they do way worse on stuff they haven't seen. Model task perf is indeed strongly dependent on task training data.

2:33 AM · May 19, 2026, on X
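
One rough way to put a number on that gap, as a sketch only: the post gives no formula, so defining the gap as mean score on seen tasks minus mean score on unseen tasks is an assumption here. The function name, task names, and scores below are all made up for illustration.

```python
from statistics import mean


def counterfactual_evaluation_gap(scores, seen_in_training):
    """Mean score on seen tasks minus mean score on unseen tasks.

    Assumed definition for illustration; the post does not specify one.
    `scores` maps task name -> score in [0, 1]; `seen_in_training` is the
    set of task names that had training data.
    """
    seen = [s for task, s in scores.items() if task in seen_in_training]
    unseen = [s for task, s in scores.items() if task not in seen_in_training]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one unseen task")
    return mean(seen) - mean(unseen)


# Hypothetical scores, not real benchmark results.
scores = {"task_a": 0.9, "task_b": 0.8, "task_c": 0.5, "task_d": 0.4}
seen_in_training = {"task_a", "task_b"}

gap = counterfactual_evaluation_gap(scores, seen_in_training)
print(f"gap: {gap:.2f}")
```

A positive gap means the model does better on tasks it has seen. In practice the hard part is deciding which tasks count as "seen", which this sketch takes as given.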