7h ago

Jigsaw Puzzle Benchmark Reveals Claude Opus 4.5 Trails GPT-5.2 and Gemini 3 Pro


Original post

TIL there's a niche jigsaw puzzle eval for vision models (which hasn't been updated in a hot moment), and Claude circa January was much, much worse at it compared to the other frontier models at the time and this is with a reference image provided (!!!)

9:27 AM · May 27, 2026

#836 kalomaze @KALOMAZE

according to this, no frontier model (except Gemini, 10% of the time) could do a 5x5 jigsaw puzzle in spite of the fact you can perfectly construct synthetic examples that are verifiable & hard for this! i care more about this than arc agi tbh https://filipbasara0.github.io/llm-jigsaw/

kalomaze @kalomaze

4:27 PM · May 27, 2026 · 3.8K Views

4:31 PM · May 27, 2026 · 935 Views
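The post above says that jigsaw examples can be constructed synthetically and verified exactly. A minimal sketch of that idea follows, assuming tiles are represented by integer ids rather than image crops. The names `make_puzzle`, `score`, and `is_solved` are hypothetical and are not taken from the linked benchmark, whose actual scoring method is not described here.

```python
import random


def make_puzzle(n, seed=None):
    """Return a shuffled list of tile ids for an n x n grid.

    Assumption: tile id k belongs at position k in the solved grid
    (row k // n, column k % n). A real eval would cut an image into
    tiles; integer ids stand in for the crops here.
    """
    rng = random.Random(seed)
    tiles = list(range(n * n))
    rng.shuffle(tiles)
    return tiles


def score(placement, n):
    """Return the fraction of positions holding the correct tile.

    placement[i] is the tile id a solver put at grid position i.
    """
    if sorted(placement) != list(range(n * n)):
        raise ValueError("placement must use every tile id exactly once")
    correct = sum(1 for pos, tile in enumerate(placement) if pos == tile)
    return correct / (n * n)


def is_solved(placement, n):
    """A puzzle counts as solved only when every tile is in place."""
    return score(placement, n) == 1.0


if __name__ == "__main__":
    puzzle = make_puzzle(5, seed=0)   # shuffled tiles handed to a solver
    answer = list(range(25))          # a perfect solver's placement
    print(is_solved(answer, 5))
```

Because the ground truth is the identity permutation, checking an answer needs no human labels, and the search space grows as (n*n)!, which is one way such puzzles can be made harder by raising n.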
