Tech · 11h ago

Anthropic integrates five software engineering benchmarks developed by academic Ofir Press into its latest AI model system card

The evaluations include ProgramBench and SWE-bench variants.

Original post
Julian Schrittwieser @Mononofu · #382 in Tech

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Ofir Press @OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

1:26 AM · Jun 10, 2026 · 18.7K Views
Sentiment

Users welcomed Anthropic citing hard benchmarks from Ofir Press in its system card, praising them as the best way to measure genuine AI progress instead of vague impressions.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views: 2.3K
swyx @swyx

@OfirPress King!

5h · Views 2.3K · Likes 2 · Bookmarks 0
Likes: 3
Ofir Press @OfirPress

@Mononofu Thanks!!

6h · Views 491 · Likes 3 · Bookmarks 0
Emirhan Erkan @permaximum88

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

8h · Views 38
Alex YGift @Radipdegen

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

11h
Rugbist @rugbist_

@Mononofu giving credit where its due is always good to see

what benchmarks did you all find the most useful?

11h