
Jigsaw Puzzle Benchmark Reveals Claude Opus 4.5 Trails GPT-5.2 and Gemini 3 Pro

Original post

TIL there's a niche jigsaw puzzle eval for vision models (which hasn't been updated in a hot moment), and Claude circa January was much, much worse at it compared to the other frontier models at the time and this is with a reference image provided (!!!)

9:27 AM · May 27, 2026

according to this, no frontier model (except Gemini, 10% of the time) could do a 5x5 jigsaw puzzle in spite of the fact you can perfectly construct synthetic examples that are verifiable & hard for this! i care more about this than arc agi tbh https://filipbasara0.github.io/llm-jigsaw/

kalomaze (@kalomaze)


4:27 PM · May 27, 2026 · 3.8K Views
4:31 PM · May 27, 2026 · 935 Views