AI · 7h ago

Anthropic integrates five software engineering benchmarks developed by academic Ofir Press into its latest AI model system card

The evaluations include ProgramBench and SWE-bench variants.

Original post
Julian Schrittwieser@Mononofu

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

1:26 AM · Jun 10, 2026 · 11.9K Views
Sentiment

Users welcomed Anthropic's use of Ofir Press's hard benchmarks in its system card, praising such benchmarks as the best way to measure genuine AI progress rather than relying on vague impressions.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views: 587
swyx@swyx

@OfirPress King!

1h · Views 587 · Likes 2 · Bookmarks 0
Likes: 3
Ofir Press@OfirPress

@Mononofu Thanks!!

2h · Views 425 · Likes 3 · Bookmarks 0
Emirhan Erkan@permaximum88

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

3h · Views 38
Alex YGift@Radipdegen

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

6h
Rugbist@rugbist_

@Mononofu giving credit where its due is always good to see

what benchmarks did you all find the most useful?

7h