Maksym Andriushchenko of ELLIS Institute says GLM-5.2 hits a 74.8% hallucination rate on HalluHard without web search

Story Overview

HalluHard, a new multi-turn benchmark, shows that even recent models struggle to stay grounded over extended exchanges in legal, medical, and coding scenarios when denied web access. GLM-5.2 hallucinates at a rate of 74.8%, compared with 46.8% for GPT-5.4-Thinking.

Original post

Maksym Andriushchenko (@maksym_andr)

💥NEW: Despite impressive performance on PostTrainBench and InferenceBench, GLM-5.2 still has high hallucination rates on HalluHard when used without web search (74.8% vs. 46.8% of GPT-5.4-Thinking).

Dongyang Fan (@dyfan22)

HalluHard update: We’ve added GLM-5.2, using adaptive thinking with maximum reasoning effort, to our leaderboard. Despite its impressive performance on other benchmarks, GLM-5.2 still hallucinates frequently on our challenging multiturn benchmark.

2:29 AM · Jun 23, 2026 · 624 Views

Open Question

Grounding gaps persist even in capable models

The test forces inline citations and checks them against full sources, revealing that strong scores on PostTrainBench or InferenceBench do not automatically translate to reliable behavior when conversations stretch and evidence must be tracked turn by turn.
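
To make the idea of checking citations against full sources concrete, here is a minimal Python sketch. It is not HalluHard's actual scoring pipeline, which is not described here. The `Claim` structure, the verbatim-match rule in `is_grounded`, and the `hallucination_rate` helper are all hypothetical names and simplifications chosen for illustration; a real evaluation would likely need a far more careful judgment than substring matching.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical record for one cited statement in a model response."""
    text: str          # the statement the model made
    cited_quote: str   # the passage the model attributes to its source
    source_text: str   # full text of the cited source ("" if not found)


def _normalize(s: str) -> str:
    # Lowercase and collapse whitespace so formatting differences are ignored.
    return " ".join(s.lower().split())


def is_grounded(claim: Claim) -> bool:
    # Toy rule (an assumption): the quoted passage must appear in the source.
    quote = _normalize(claim.cited_quote)
    return bool(quote) and quote in _normalize(claim.source_text)


def hallucination_rate(claims: list[Claim]) -> float:
    # Percentage of claims whose citation does not hold up under the toy rule.
    if not claims:
        return 0.0
    ungrounded = sum(not is_grounded(c) for c in claims)
    return 100.0 * ungrounded / len(claims)
```

Under this toy rule, a claim whose source cannot be retrieved at all counts as ungrounded, which is one reason a rate measured this way can stay high when the model has no web access.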

Developer Impact

Practical choice between open weights and hosted options

Teams evaluating the newly released MIT-licensed GLM-5.2 will want to weigh its coding and agentic strengths against the higher hallucination numbers shown here, especially for workflows that cannot rely on live search.
