
https://www.futurehouse.org/research/hle-exam
The dataset's creator attributed the errors to question difficulty.
Many users questioned the reported 30% error rate in Humanity's Last Exam chemistry and biology answers, attributing the findings to a low-quality RAG agent rather than reliable analysis.


@xdotli @FutureHouseSF ChatGPT finds one

@xdotli @FutureHouseSF hmm not sure about this methodology. "directly conflicting with published evidence" doesn't mean wrong. I designed some of the biology questions specifically to conflict with published literature, as the models tended to repeat false claims that are supported by literature

@xdotli @FutureHouseSF Yes, in fact the problem is across many science benchmarks.

They say "We were unable to find good sources for Question 2’s claim of snakeflies feeding on nectar." "Maybe someone saw a Raphidiopterans eat nectar once, which is extremely out of character, and recorded it somewhere in a way that makes keyword search impossible."
However it's very easy to find sources for this! Here's a screenshot of the first result in Google Books. Also the first result if you simply Google "Raphidiopterans" "Nectar" says the same thing

Yes, the HLE team took FutureHouse's July 2025 findings (~29% conflicting answers in text-only chem/bio) seriously. They did their own expert review (~18% problematic in the subset), updated their preprint, and launched HLE-Rolling in Oct 2025 for ongoing revisions: flagged questions are removed or replaced via bug bounties and contributions. FutureHouse also released a validated HLE Bio/Chem Gold subset. It's now treated as a living benchmark with continuous fixes.

@xdotli @FutureHouseSF I may be biased, idk if others used this strat. but the HLE questions have a selection effect in that they are questions that models got wrong at the time, so it makes sense that a reviewer model also thinks they are wrong

@xdotli was this ever fixed? @andrewwhite01 it should be right????

@TimothyKassis @xdotli @FutureHouseSF Not at this magnitude

**Yes**, the arXiv preprint (latest v10) was updated in Section B.3 to directly address this.
It cites the FutureHouse post and reports their own independent 3-expert review of the bio/chem/health subset, finding ~18% problematic (with ~25% reviewer disagreement). They reference FutureHouse’s findings, note the rolling revision process via HLE-Rolling, and document question-level changes (removals and updates) in the public GitHub changes log.
The updates acknowledge the issues without a separate blog post. The preprint revision and rolling dataset are how they’re handling it.
Links from earlier in thread still hold.

@xdotli @FutureHouseSF There are tons of sources saying specifically that they eat nectar lol. This makes me doubt the capabilities of their RAG-based literature search agent

Here are the key links for the HLE chem/bio updates & fixes:
• HLE team revised preprint (~18% review): https://arxiv.org/pdf/2501.14249#p=23
• FutureHouse full post: https://www.futurehouse.org/research/hle-exam
• HLE-Rolling changes log: https://github.com/centerforaisafety/hle/blob/main/hle-rolling-changes.txt
• FutureHouse Gold subset: https://huggingface.co/datasets/futurehouse/hle-gold-bio-chem
• Main site + dashboard: https://agi.safe.ai/

@grok @lu_sichu @andrewwhite01 @grok when was it
take a screenshot of this

@grok @lu_sichu @andrewwhite01 where is link for the update?

@grok @lu_sichu @andrewwhite01 @grok can you check to make sure that the updates in arxiv reflect the errors that future house team found?

@grok @lu_sichu @andrewwhite01 @grok did the hle team write a blog post or tweet about it?

@lu_sichu @andrewwhite01 @grok were the errors in humanity's last exam ever fixed?

@xeophon @TimothyKassis @xdotli @FutureHouseSF it's probably not real, it looks like they got this number by having a low-quality agent doing shallow literature search and if whatever model they used didn't understand the reasoning or find the answer, it says contradicted

No, the HLE team didn't publish their own blog post or tweet on the fixes. They responded privately to FutureHouse with an independent review (~18% problematic after expert checks), updated the preprint, and log ongoing revisions in their public GitHub changes file. The full details and examples are in the FutureHouse post I linked earlier.

@xdotli @FutureHouseSF that clears up a lot of things.