OpenAI Admits Newer Models Hallucinate Even More

In a technical report, the company said “more research is needed” to explain why hallucinations increase as reasoning capabilities scale

According to OpenAI’s internal benchmarks, its newer models, o3 and o4-mini, hallucinate more often than older reasoning models like o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.

In fact, on OpenAI’s PersonQA benchmark, o3 hallucinated on 33% of queries, more than double the rate of o1 and o3-mini. The o4-mini model performed even worse, hallucinating 48% of the time.

Adding to the concern, OpenAI acknowledges it doesn’t fully understand the cause. In a technical report, the company said, "We also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate and more inaccurate/hallucinated claims."

The report adds that “more research is needed” to explain why hallucinations increase as reasoning capabilities scale.
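The observation that a model making more claims can produce both more accurate and more hallucinated claims is easy to see with simple arithmetic. The sketch below is illustrative only: the `ClaimTally` class and the "cautious" and "talkative" tallies are hypothetical, and the numbers are made up rather than taken from OpenAI's report.

```python
from dataclasses import dataclass


@dataclass
class ClaimTally:
    """Counts of claims a model made on a benchmark run (hypothetical)."""
    correct: int
    hallucinated: int

    @property
    def total(self) -> int:
        return self.correct + self.hallucinated

    @property
    def hallucination_rate(self) -> float:
        # Fraction of all claims that were hallucinated.
        return self.hallucinated / self.total


# Made-up numbers for illustration only; not taken from OpenAI's report.
cautious = ClaimTally(correct=50, hallucinated=10)
talkative = ClaimTally(correct=60, hallucinated=30)

for name, tally in [("cautious", cautious), ("talkative", talkative)]:
    print(
        f"{name}: {tally.correct} correct, {tally.hallucinated} hallucinated, "
        f"rate {tally.hallucination_rate:.0%}"
    )
```

In this toy case the "talkative" model gets more claims right in absolute terms, yet its hallucination rate is also higher, because the extra claims it ventures are wrong more often than its baseline.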

While AI hallucinations — where a model fabricates information — are a known challenge, previous iterations of OpenAI’s models had gradually reduced this issue. However, that has not been the case with o3 and o4-mini.

AI researcher Gary Marcus has long warned about the hallucinatory behavior of large language models, and recent developments seem to validate his concerns.

Explaining hallucinations in LLMs in an earlier post on X (formerly Twitter), he said, "LLM 'hallucinations' arise, regularly, because (a) they literally don't know the difference between truth and falsehood, (b) they don't have reliably reasoning processes to guarantee that their inferences are correct and (c) they are incapable of fact-checking their own work. Instead, everything that LLMs say – true or false – comes from the same process of statistically reconstructing what words are likely in some context."

Separately, OpenAI is reportedly in discussions to acquire Windsurf, the maker of a popular AI-powered coding assistant, in a deal valued at around $3 billion.

If finalised, the acquisition would position OpenAI in direct competition with other AI coding tool providers like Cursor.