/Tech · 11h ago

Anthropic integrates five software engineering benchmarks developed by academic Ofir Press into its latest AI model system card

The evaluations include ProgramBench and SWE-bench variants.

#81

Original post

Julian Schrittwieser @Mononofu · #382 in Tech

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

1:26 AM · Jun 10, 2026 · 18.7K Views

Sentiment

Users welcomed Anthropic's use of hard benchmarks from Ofir Press in its system card, praising such benchmarks as the best way to distinguish genuine AI progress from vague impressions.

Pos

100.0%

Neg

0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 2.3K

swyx@swyx

@OfirPress King!

Ofir Press@OfirPress

Thanks Anthropic for using *five* of our benchmarks in the new system card.

5h2.3K20

LIKES 3

Ofir Press@OfirPress

@Mononofu Thanks!!

Julian Schrittwieser@Mononofu

Thank you for making these benchmarks, hard benchmarks are incredibly useful for tracking progress in the field!

6h49130

Emirhan Erkan@permaximum88

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

8h38

Alex YGift@Radipdegen

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

11h

Rugbist@rugbist_

@Mononofu giving credit where its due is always good to see

what benchmarks did you all find the most useful?

11h