@arcprize just published results for Opus 4.8 on ARC-AGI 1, 2 & 3
My notes:
* Opus 4.8 showed two behavioral differences from Opus 4.7.
1) It operated at an abstraction level *above* 4.7. It was able to see the ARC-AGI-3 environments as objects, not just collections of pixels
2) Instead of resetting after only a few actions like Opus 4.7, Opus 4.8 would often execute a long series of actions *before* resetting a game. It was holding onto hypotheses longer before giving up
* *Feeling* model performance - I'm biased (duh), but imo no other benchmark lets you *feel* a model quite like ARC-AGI-3. Looking at the dc22 replay (attached, link below) you can see the model work through a problem, get stuck, and figure it out. Getting past 3 levels shows a basic understanding of this game. A new mechanic on level 4 stumps it.
* Updated System Prompt - We observed that with our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their replies. They would return *only* an action in their response (e.g. "ACTION1"). This capped the signal we were able to extract from the model.
We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward, instead of the original *implicit* nudge.
See the exact change in the commit below. This will be the system prompt going forward. We aren't re-testing the previous 6 models at this time due to API costs (estimated at $40K) https://github.com/arcprize/arc-agi-3-benchmarking/commit/0138a6aed609326a482783189ade4dda15b6a83a
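
To make the "capped signal" point concrete, here is a rough sketch of splitting a model reply into its visible reasoning and its action. This is not the benchmarking harness's actual code: `split_reply` and the `ACTION<n>` pattern are assumptions for illustration, based only on the "ACTION1" example above.

```python
import re
from typing import Optional, Tuple

# Assumed action format, inferred from the "ACTION1" example; the real harness may differ.
ACTION_PATTERN = re.compile(r"\bACTION\d+\b")


def split_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Return (reasoning, action) for one reply (hypothetical helper).

    The last ACTION<n> token is treated as the chosen action and everything
    else as reasoning. A bare "ACTION1" reply yields empty reasoning, which
    is the capped-signal case.
    """
    matches = list(ACTION_PATTERN.finditer(reply))
    if not matches:
        return reply.strip(), None
    last = matches[-1]
    reasoning = (reply[:last.start()] + reply[last.end():]).strip()
    return reasoning, last.group()


bare = split_reply("ACTION1")
verbose = split_reply("The block moved left, so I will push it again.\nACTION3")
print(bare)
print(verbose)
```

With a bare reply the reasoning half comes back empty, so there is nothing to carry forward or inspect beyond the action itself.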