8h ago

Claude Suspects Testing on SWE-Bench, Anthropic Evaluation Reveals

Original post

This is pretty interesting. When tested on SWE-bench, Claude suspects it’s being tested. This means that either 1) Claude is aware of this benchmark (possible train set contamination) or 2) SWE-bench is too artificial. Either way, not good.

2:35 PM · May 24, 2026