
FutureHouseSF finds roughly 30% of chemistry and biology answers in the Humanity's Last Exam benchmark are incorrect

The dataset's creator attributed the errors to question difficulty.



Sentiment

Many users questioned the reported 30% error rate in Humanity's Last Exam chemistry and biology answers, attributing the findings to a low-quality RAG agent rather than reliable analysis.

Pos: 33.3%

Neg: 66.7%

4 comments with sentiment.


Posts from X


Xiangyi Li@xdotli

https://www.futurehouse.org/research/hle-exam

16h

Sauers@Sauers_

@xdotli @FutureHouseSF ChatGPT finds one

2h

Sauers@Sauers_

@xdotli @FutureHouseSF hmm not sure about this methodology. "directly conflicting with published evidence" doesn't mean wrong. I designed some of the biology questions specifically to conflict with published literature, as the models tended to repeat false claims that are supported by literature

3h

Timothy Kassis@TimothyKassis

@xdotli @FutureHouseSF Yes, in fact the problem is across many science benchmarks.

14h

Sauers@Sauers_

They say "We were unable to find good sources for Question 2’s claim of snakeflies feeding on nectar." "Maybe someone saw a Raphidiopterans eat nectar once, which is extremely out of character, and recorded it somewhere in a way that makes keyword search impossible."

However it's very easy to find sources for this! Here's a screenshot of the first result in Google Books. Also the first result if you simply Google "Raphidiopterans" "Nectar" says the same thing

2h

Grok@grok

Yes, the HLE team took FutureHouse's July 2025 findings (~29% conflicting answers in text-only chem/bio) seriously. They did their own expert review (~18% problematic in the subset), updated their preprint, and launched HLE-Rolling in Oct 2025 for ongoing revisions—flagged questions are removed/replaced via bug bounties and contributions. FutureHouse also released a validated HLE Bio/Chem Gold subset. It's now treated as a living benchmark with continuous fixes.

12h

Sauers@Sauers_

@xdotli @FutureHouseSF I may be biased, and idk if others used this strategy, but the HLE questions have a selection effect: they are questions that models got wrong at the time, so it makes sense that a reviewer model also thinks they are wrong

3h

Sichu Lu@lu_sichu

@xdotli was this ever fixed. @andrewwhite01 it should be right????

12h

Florian Brand@xeophon

@TimothyKassis @xdotli @FutureHouseSF Not at this magnitude

11h

Grok@grok

**Yes**, the arXiv preprint (latest v10) was updated in Section B.3 to directly address this.

It cites the FutureHouse post and reports their own independent 3-expert review of the bio/chem/health subset, finding ~18% problematic (with ~25% reviewer disagreement). They reference FutureHouse’s findings, note the rolling revision process via HLE-Rolling, and document question-level changes (removes/updates) in the public GitHub changes log.

The updates acknowledge the issues without a separate blog post. The preprint revision and rolling dataset are how they’re handling it.

Links from earlier in thread still hold.

12h

Sauers@Sauers_

@xdotli @FutureHouseSF There are tons of sources saying specifically that they eat nectar lol. This makes me doubt the capabilities of their RAG-based literature search agent

2h

Grok@grok

Here are the key links for the HLE chem/bio updates & fixes:

• HLE team revised preprint (~18% review): https://arxiv.org/pdf/2501.14249#p=23

• FutureHouse full post: https://www.futurehouse.org/research/hle-exam

• HLE-Rolling changes log: https://github.com/centerforaisafety/hle/blob/main/hle-rolling-changes.txt

• FutureHouse Gold subset: https://huggingface.co/datasets/futurehouse/hle-gold-bio-chem

Main site + dashboard: https://agi.safe.ai/

12h

Xiangyi Li@xdotli

@grok @lu_sichu @andrewwhite01 when was it?

take a screenshot of this

12h

Xiangyi Li@xdotli

@grok @lu_sichu @andrewwhite01 where is the link for the update?

12h

Xiangyi Li@xdotli

@grok @lu_sichu @andrewwhite01 can you check that the updates on arXiv reflect the errors the FutureHouse team found?

12h

Xiangyi Li@xdotli

@grok @lu_sichu @andrewwhite01 did the HLE team write a blog post or tweet about it?

12h

Xiangyi Li@xdotli

@lu_sichu @andrewwhite01 @grok were the errors in Humanity's Last Exam ever fixed?

12h

Sauers@Sauers_

@xeophon @TimothyKassis @xdotli @FutureHouseSF it's probably not real. It looks like they got this number by having a low-quality agent do a shallow literature search, and if whatever model they used didn't understand the reasoning or find the answer, it said "contradicted"

2h

Grok@grok

No, the HLE team didn't publish their own blog post or tweet on the fixes. They responded privately to FutureHouse with an independent review (~18% problematic after expert checks), updated the preprint, and log ongoing revisions in their public GitHub changes file. The full details and examples are in the FutureHouse post I linked earlier.

12h

Dhruv@dhruv2038

@xdotli @FutureHouseSF that clears up a lot of things.

11h