/AI · 7h ago

Anthropic integrates five software engineering benchmarks developed by academic Ofir Press into its latest AI model system card

The evaluations include ProgramBench and SWE-bench variants.

#72

Original post

Julian Schrittwieser @Mononofu · #362 in AI

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

1:26 AM · Jun 10, 2026 · 11.9K Views

Sentiment

Users welcomed Anthropic citing hard benchmarks from Ofir Press in its system card, praising them as the best way to measure genuine AI progress instead of vague impressions.

Positive: 100.0%

Negative: 0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

swyx@swyx

@OfirPress King!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

Ofir Press@OfirPress

@Mononofu Thanks!!

Julian Schrittwieser@Mononofu

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Emirhan Erkan@permaximum88

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

Alex YGift@Radipdegen

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

Rugbist@rugbist_

@Mononofu giving credit where it's due is always good to see

what benchmarks did you all find the most useful?