AI chatbots can answer questions about fresh news well, but their worst failures hide inside their confidence.
The paper shows that the best systems are surprisingly good at recent news when the question is clean and multiple choice.
But it also shows that this success is fragile, because the same systems get worse when they must answer freely, when the news is in Hindi, or when the user’s question contains a false assumption.
The best systems crossed 90% accuracy on multiple-choice questions about events reported only hours earlier, which means retrieval-augmented AI has moved from stale encyclopedia mode toward live information work.
That accuracy is not the same thing as reliability, because the systems were far worse when answers had to be produced freely.
These models usually do not fail because they cannot “think,” but because they land on the wrong evidence.
More than 70% of errors came from retrieval failures or source divergence, where the system found something nearby but not exact, then answered faithfully from the wrong article, wrong language, wrong scope, or wrong timestamp.
----
Paper Link – arxiv.org/abs/2605.22785
Paper Title: "Evaluating Commercial AI Chatbots as News Intermediaries"