/Tech · 23h ago

Claude Mythos 5 Scores 59% on Humanity’s Last Exam Without Tools

Original post

ASM@ASM65617010

Claude Mythos 5 scores 59% on Humanity’s Last Exam, with no tools.

As a contributor to HLE, I would never have expected such a score barely a year and a half after the benchmark’s release.

10:33 AM · Jun 9, 2026 · 57.9K Views


Sentiment

Positive users celebrate Claude Mythos 5 reaching 59% on Humanity’s Last Exam without tools as a major leap in expert-level reasoning, while negative users criticize misleading charts and question benchmark contamination.

Positive: 66.7%

Negative: 33.3%

Based on 13 comments with sentiment.

Posts from X

Tamanokal@tamanokal

@ASM65617010 Could you please provide a breakdown on the fields in which it succeeded? Thanks!

22h


Jake@JakeKAllDay

@ASM65617010 Holy chart crime, Batman

17h


ASM@ASM65617010

@extliqprovider No. HLE questions can’t be answered simply by having more data. Most require reasoning, often genuinely complex reasoning, as in many of the physics questions.

20h


Equinox@Soho1579854

@ASM65617010 80% with no tools by next year, for sure.

8h

ASM@ASM65617010

@tamanokal I contributed to the HLE benchmark, not to the Mythos evaluation of it. It included extremely difficult questions across many topics, from quantum physics to linguistics.

22h

efe@extliqprovider

@ASM65617010 Bigger model. For HLE you just need more data, and Anthropic has a lead over OAI in this.

21h

Sin Jeong-hun@realrealcat

@ASM65617010 Why did you cut the graph between 0 and 40?

17h

Mikko Korhonen@MikkoKorhonen12

@ASM65617010

7h

Entropy Cowboy@EntropyCowboy

@ASM65617010 As a reminder, people say Anthropic has had it since January, which means it actually took only a year to reach that level.

14h

Sub Woofer@SubWoofer143785

@ASM65617010 A 65% score is possible, and we have yet to see what GPT 5.6 Pro can do. Oh boy.

15h

Gregor@bygregorr

@ASM65617010 Not sure "no tools" is the right anchor. In daily use on Supabase queries, the tool gap feels massive for anything non-trivial. Does the with-tools score change how you read the 18-month timeline?

20h

Ugly Cowboy@ugly_cowboy

@ASM65617010 We will need the “HLE Endgame” benchmark soon.

19h

Ding@zhaoxiongding

@ASM65617010 @extliqprovider You’re approaching this from the perspective of a human.

If simply adding more data and compute raises scores on your bench, then empirically it is a data and compute problem.

wtf is reasoning? Does a submarine swim?

15h

Saffron Warlord e/acc@rawantitmc

@ASM65617010 Expecting 99% on HLE by Q3 of 2027

14h

Eren@kaiba_1991

@ASM65617010 100% in 2 years. That's when RSI starts.

20h

Tamanokal@tamanokal

@ASM65617010 Thanks! Didn't know the labs are the ones who report the findings! I remember when https://knzhou.github.io/ first mentioned working on it.

21h

Lunexa@Lunexalith

@ASM65617010 Insane chart lol. The 0-40 span of the X-axis takes up the same space as 40-45, so +33% stronger is made to look like +100%.

18h

Min@min_aws

@ASM65617010 How sure are you that the dataset isn't in the training data, now that it has been out for 1.5 years? How is it that every benchmark gets maxed out once it's released, yet we can come up with similar benchmarks that aren't maxed out yet? Post-Fable delusion is real.

5h

Sadi Moodi@MoodiSadi

@ASM65617010 We will reach 100% in a short time, mark my words.

11h

Law Student in Denial@DenialLaw

@ASM65617010 How did you get to answer?

13h