Claude Mythos 5 scores 59% on Humanity’s Last Exam, with no tools.
As a contributor to HLE, I would never have expected such a score barely a year and a half after the benchmark’s release.
Positive users celebrate Claude Mythos 5 reaching 59% on Humanity’s Last Exam without tools as a major leap in expert-level reasoning, while negative users criticize misleading charts and question benchmark contamination.

@ASM65617010 Could you please provide a breakdown on the fields in which it succeeded? Thanks!

@ASM65617010 Holy chart crime, Batman

@extliqprovider No. HLE questions can’t be answered simply by having more data. Most require reasoning, often genuinely complex reasoning, as in many of the physics questions.

@ASM65617010 80% no tools by next year for sure

@tamanokal I contributed to the HLE benchmark, not to the Mythos evaluation of it. It included extremely difficult questions across many topics, from quantum physics to linguistics.

@ASM65617010 Bigger model. For HLE you just need more data, and Anthropic has a lead over OAI in this.

@ASM65617010 Why did you cut the graph between 0 and 40?

@ASM65617010 To remind you, people say Anthropic has had it since January, which means it actually only took a year to reach that level.

@ASM65617010 65% possible score and we are yet to see what GPT 5.6 Pro can do, oh boy

@ASM65617010 Not sure 'no tools' is the right anchor. In daily use on Supabase queries, the tool gap feels massive for anything non-trivial. Does the with-tools score change how you read the 18-month timeline?

@ASM65617010 We will need the “HLE Endgame” benchmark soon.

@ASM65617010 @extliqprovider You’re approaching this from the perspective of a human.
If simply more data and compute scores more on your bench, then empirically it is simply a data and compute problem.
wtf is reasoning? Does a submarine swim?

@ASM65617010 Expecting 99% on HLE by Q3 of 2027

@ASM65617010 100% in 2 years. That's when RSI starts

@ASM65617010 Thanks! Didn't know the labs are the ones that report the findings! I remember when https://knzhou.github.io/ first mentioned working on it

@ASM65617010 Insane chart lol, 0-40 X-axis is same as 40-45, +33% stronger is making it seem like +100%.

@ASM65617010 How sure are you that the dataset isn't in the training data, now that it has been out for 1.5 years? How is it that every benchmark gets maxed out once released, yet we can still come up with similar benchmarks that aren't maxed out? Post Fable delusion is real.

@ASM65617010 We will reach 100% in a short time, mark my words

@ASM65617010 How did you get to answer?