Tech · 8h ago

Anthropic's Julian Schrittwieser argues difficult software benchmarks from Ofir Press are vital for tracking AI progress

Anthropic used the benchmarks in its latest system card.

Original post

Julian Schrittwieser@Mononofu · #337 in Tech

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

1:26 AM · Jun 10, 2026 · 15K Views

Sentiment

Commenters praise Anthropic for using Ofir Press's benchmarks in its latest system card. They value proper credit and rigorous benchmarks as a way to measure real progress.

Positive: 100.0%

Negative: 0.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 1.1K

swyx@swyx

@OfirPress King!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

2h · 1.1K · 20

LIKES 3

Ofir Press@OfirPress

@Mononofu Thanks!!

Julian Schrittwieser@Mononofu

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

3h45330

Emirhan Erkan@permaximum88

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

5h · 38

Alex YGift@Radipdegen

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

Rugbist@rugbist_

@Mononofu giving credit where its due is always good to see

what benchmarks did you all find the most useful?