Claude Opus 4.8 scores 1.5% on ARC-AGI-3, tripling the previous record set by GPT-5.5 · Digg

Lisan al Gaib@scaling01

Opus 4.8 just broke ARC-AGI-3

it tripled GPT-5.5's score

we are now at a breathtaking 1.5% human efficiency

Greg Kamradt@GregKamradt

How do we compare model perf in ARC-AGI-3?

In most benchmarks you just compare scores, but with ARC-AGI-3 you get reasoning logs across all the games you play

To compare Opus 4.8 to Opus 4.7 we used LLM as a judge

Using @AmpCode (my daily driver right now) I set up a skill to compare models, then it spawned a sub-agent per game per model

Each sub agent did a single-game analysis, then brought its notes back to the main agent

Very cool to see all of this come together. It would have taken 2-3 days of analysis by hand before
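
The fan-out described above (one sub-agent per game per model, notes returned to a main agent) can be sketched roughly as follows. This is a minimal illustration, not the actual AmpCode skill: `analyze_game`, `compare_models`, and the model labels are hypothetical names, and the sub-agent is a stub where a real setup would have an LLM judge read the reasoning log.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical labels; the game ids are ones mentioned in the thread.
MODELS = ["opus-4.7", "opus-4.8"]
GAMES = ["ar25", "dc22", "tr87"]


def analyze_game(model: str, game: str) -> dict:
    """Stand-in for one sub-agent doing a single-game analysis."""
    # Assumption: a real sub-agent would call an LLM judge on the game's log here.
    return {"model": model, "game": game, "notes": f"notes for {model} on {game}"}


def compare_models(models, games, max_workers=4):
    """Run one analysis per (game, model) pair, then group the notes by game."""
    pairs = list(product(games, models))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda p: analyze_game(p[1], p[0]), pairs))
    by_game = {}
    for r in results:
        by_game.setdefault(r["game"], {})[r["model"]] = r["notes"]
    return by_game


report = compare_models(MODELS, GAMES)
```

Grouping by game puts both models' notes side by side, which is the shape a main agent would need to write a per-game comparison.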

Greg Kamradt@GregKamradt

@arcprize just published results for Opus 4.8 ARC-AGI 1, 2 & 3

My notes:

* Opus 4.8 showed two behavior differences over Opus 4.7.

1) It operated at an abstraction level *above* 4.7. It was able to see the ARC-AGI-3 environments as objects, not just collections of pixels

2) Instead of short action resets like Opus 4.7, Opus 4.8 would often execute a long series of actions *before* resetting a game. It was holding onto hypotheses longer before giving up

* *Feeling* model performance - I'm biased (duh), but imo no other benchmark lets you *feel* a model quite like ARC-AGI-3. Looking at the dc22 replay (attached and link below) you can see the model work through a problem, get stuck, and figure it out. Getting past 3 levels shows a basic understanding of this game. There is a new mechanic on level 4 which stumps it.

* Updated System Prompt - We observed that in our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their reply. This caused them to *only* return an action in their response (ex: "ACTION1"). This capped the signal we were able to extract from the model.

We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward instead of the original *implicit* nudge

See the exact change on the commit below.

This will be the system prompt going forward. We aren't re-testing the previous 6 models at this time due to api costs (estimated at $40K)

https://github.com/arcprize/arc-agi-3-benchmarking/commit/0138a6aed609326a482783189ade4dda15b6a83a
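
To illustrate why an action-only reply caps the signal, here is a rough sketch of splitting a model reply into carried-forward notes and the chosen action. The reply layout is an assumption (free text followed by a final `ACTION<n>` token); `split_reply` is a hypothetical helper and not part of the arc-agi-3-benchmarking repo.

```python
import re

# Assumption: actions appear as tokens like "ACTION1", as in the example above.
ACTION_RE = re.compile(r"\bACTION[1-9]\d*\b")


def split_reply(reply: str):
    """Return (action, notes); the last ACTION token is treated as the move."""
    matches = list(ACTION_RE.finditer(reply))
    if not matches:
        # No recognizable action: keep the whole reply as notes.
        return None, reply.strip()
    last = matches[-1]
    notes = reply[: last.start()].strip()
    return last.group(0), notes


bare = split_reply("ACTION1")
verbose = split_reply("Blue mirrors orange about col 31.\nACTION3")
```

With the bare reply the notes come back empty, so there is nothing to carry into the next turn or to analyze afterwards; the verbose reply yields both the move and the hypothesis behind it.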

Lisan al Gaib@scaling01

and they ended up implementing the same thing as me in february lmao

---

they added this hint: "include any context you want to carry forward in your reply"

but notice, other hints are cheating!111!

so instead of just saying the move, the models now make notes exactly as in my memory agent setup

ARC Prize@arcprize

Analysis Note #1: Opus 4.8 discovered game mechanics more quickly than Opus 4.7

On ar25, Opus 4.8 derived the Level #1 reflection rule by frame 5 ("Blue moved LEFT 3, Orange moved RIGHT 3 ... mirror reflections about col 31")

It then proceeded to clear Level 1 in 24 actions

Opus 4.7 took 136 actions of probing to brute-force the same level and never verbalized the rule.

On dc22, Opus 4.8 hit "BREAKTHROUGH — the maze connects via toggles!" at frame 30 and cleared Levels 1-3

Opus 4.7 never identified the agent across 295 actions and 17 RESETs

https://arcprize.org/replay/22a25f67-1171-406a-99f7-74a0d00e76d8?frame=6&quote=is+the+SAME+shape+but+has+**holes+%28%600%60%29**+that+must+be+**filled**+via+%60ACTION6&quoteFrame=6&quotePrefix=mary+hypothesis%3A**+The+orange+%604%60+is+the+SOLID+reference+template.+The+blue+%605%60+&quoteSuffix=+x+y%60+clicks+to+complete%2Fmatch+it.%0A%0ACurrent+blue+position+%28cols+15-23%29+%E2%80%94+hole+lo&reasoning=decision

Gif: Opus 4.8 playing ar25

Taelin@VictorTaelin

@arcprize seems like my perception is more correlated w/ arc-agi than with overall vibes on X

ARC Prize@arcprize

Anthropic Opus 4.8 is new SOTA on ARC-AGI-3

Score: 1.5%, ~$10K

ARC-AGI-3 analysis notes:

* Opus 4.8 read the environment at an abstraction level *above* Opus 4.7, as objects & systems, not pictures

* Opus 4.8 succeeded on early levels, but still committed to a wrong sub-goal

Joko@joko76ers

@scaling01 It's amazing how far benchmarks are from real life

ARC Prize@arcprize

Opus 4.8 ARC-AGI-2 Scores

- Opus 4.8 Low: 62.22% ($1.68/task)
- Opus 4.8 Medium: 71.67% ($2.39/task)
- Opus 4.8 High: 72.08% ($2.74/task)
- Opus 4.8 Max: N/A*

* Opus 4.8 Max was unable to complete the ARC-AGI-2 Semi-Private set due to api timeout errors. This score has been excluded
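
As a quick worked example of the score/cost trade-off in the ARC-AGI-2 numbers above, the sketch below computes the extra cost per task for each extra percentage point when moving up one reasoning tier. The scores and costs are copied from the post; `marginal_costs` is a hypothetical helper, and Max is left out because it has no score.

```python
# (tier, score in %, cost in USD per task), copied from the post above.
TIERS = [
    ("Low", 62.22, 1.68),
    ("Medium", 71.67, 2.39),
    ("High", 72.08, 2.74),
]


def marginal_costs(tiers):
    """For each step up a tier: (step, points gained, extra $/task, $ per point)."""
    rows = []
    for (a, score_a, cost_a), (b, score_b, cost_b) in zip(tiers, tiers[1:]):
        gain = score_b - score_a
        extra = cost_b - cost_a
        rows.append((f"{a}->{b}", round(gain, 2), round(extra, 2), round(extra / gain, 3)))
    return rows


rows = marginal_costs(TIERS)
for step, gain, extra, per_point in rows:
    print(f"{step}: +{gain} pts for +${extra}/task (${per_point} per point)")
```

On these figures, Low to Medium buys 9.45 points for $0.71 more per task, while Medium to High buys only 0.41 points for $0.35 more, so the cost per point rises by roughly an order of magnitude.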

ARC Prize@arcprize

Opus 4.8 wasn’t able to make any progress on tr87 despite this being one of the games most models make progress on

https://arcprize.org/replay/c3b77031-b814-464e-b636-d9663ffd12a2

Lisan al Gaib@scaling01

@FeltSteam they changed the prompt

Lisan al Gaib@scaling01

article from feb:

Lisan al Gaib@scaling01

http://x.com/i/article/2024623133945106432

ARC Prize@arcprize

Testing notes:

* ARC-AGI-3 Prompt Update - We observed that in our original system prompt, GPT and Gemini, unlike other models, would not "think out loud" in their reply. This caused them to *only* return an action in their response (ex: "ACTION1"). This capped the signal we were able to extract from the model. We updated the system prompt used for ARC-AGI-3 to *explicitly* say context will be carried forward instead of the original *implicit* nudge

See the exact change on the commit below

This will be the system prompt going forward. The previous 6 models already tested will not be re-tested due to costs (estimated at $40K in api costs)

https://github.com/arcprize/arc-agi-3-benchmarking/commit/0138a6aed609326a482783189ade4dda15b6a83a

* We also noticed that Opus 4.8 used over $10K (our testing limit) in api costs for Semi Private testing (55 games). Going forward we will be implementing a hard max per game for costs
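
A hard max per game could look something like the sketch below. The $10K limit and the 55 games come from the note above; splitting the budget evenly across games is purely an assumption for illustration (ARC Prize has not said how the cap will be set), and `GameBudget` is a hypothetical class.

```python
from dataclasses import dataclass

TOTAL_BUDGET_USD = 10_000  # testing limit quoted above
NUM_GAMES = 55             # Semi-Private game count quoted above

# Assumption: an even split, which works out to about $181.82 per game.
EVEN_CAP_USD = TOTAL_BUDGET_USD / NUM_GAMES


@dataclass
class GameBudget:
    """Tracks spend for one game and refuses calls that would exceed the cap."""
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a call's cost; return False (and record nothing) if over cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            return False
        self.spent_usd += cost_usd
        return True


budget = GameBudget(cap_usd=1.0)
first_ok = budget.charge(0.6)   # fits under the cap
second_ok = budget.charge(0.6)  # would bring the total to 1.2, so it is refused
```

A harness would check the return value of `charge` before each API call and end the game once it returns False, so one runaway game cannot consume the budget of the others.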

ARC Prize@arcprize

Analysis Note #2: Opus 4.8 won more early levels, but misattributed its win over Opus 4.7

Opus 4.8 introduces a failure mode Opus 4.7 wasn't reaching. It succeeded on early levels, then committed to a wrong sub-goal on the next

On dc22, Opus 4.8 cleared Levels 1-3, then burned ~490 actions on Level 4 cycling through five mutually-contradictory mechanic theories and runs of identical repeated clicks

Opus 4.7 never got far enough to display this shape

Gif: Opus 4.8 playing dc22

ARC Prize@arcprize

Opus 4.8 ARC-AGI-1 Scores

- Opus 4.8 Low: 88.0% ($0.67/task)
- Opus 4.8 Medium: 91.5% ($0.91/task)
- Opus 4.8 High: 92.0% ($1.04/task)
- Opus 4.8 Max: 92.5% ($2.33/task)

FeltSteam0@FeltSteam

@scaling01 Why did Opus do so much better?

ARC Prize@arcprize

Opus 4.8 ARC-AGI-3 Public Demo Scorecard

Score on Public Demo: 4.9%

Note: Public Demo was intentionally designed to be easier than ARC-AGI-3 Semi-Private which explains the score delta (4.9% vs 1.4%)

https://arcprize.org/scorecards/model/anthropic-opus-4-8-high

Lisan al Gaib@scaling01

https://arcprize.org/leaderboard

The Tower@TheWhiteTower16

@scaling01 wonder what its actual score without the stupid scoring methodology

ARC Prize@arcprize

Opus 4.8's most performant Public Demo game

lp85

https://arcprize.org/replay/57ee8a2d-b4ea-4685-b5ca-e013f4fccedd

Philippe Tremblay@ptremblay

@arcprize where's GPT 5.5?

Jj McMc@AccountMus629

@scaling01 Extrapolating from only these 2 data points and with a method i wont publish, i have determined that we will have AGI by thursday, june 4, 2026
