Thank you for making these benchmarks! Hard benchmarks are incredibly useful for tracking progress in the field!
Thanks Anthropic for using *five* of our benchmarks in the new system card.
Anthropic used the benchmarks in its latest system card.
Positive users praise Anthropic for citing benchmarks from Ofir Press in its latest system card because they value giving proper credit and using rigorous benchmarks to measure real progress.
@OfirPress King!
@Mononofu Thanks!!

@Mononofu Yet I can't test Fable on my benchmark, the Singularity Gate, arguably the hardest and one of the most important benchmarks for progress towards AGI and singularity. It switches to Opus on 56% of the tasks in the benchmark corpus.

@Mononofu glad you find them useful. hard benchmarks are the only way to tell real progress from vibes

@Mononofu giving credit where it's due is always good to see
what benchmarks did you all find the most useful?