AI Startups

OpenAI Admits Newer Models Hallucinate Even More

In a technical report, the company said “more research is needed” to explain why hallucinations increase as reasoning capabilities scale

The Left Shift Bureau

20 Apr 2025 — 2 min read

According to OpenAI’s internal benchmarks, its newer models, o3 and o4-mini, hallucinate more often than older reasoning models like o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.

In fact, on OpenAI’s PersonQA benchmark, o3 hallucinated on 33% of queries, more than double the rate of o1 and o3-mini. The o4-mini model performed even worse, hallucinating 48% of the time.

Adding to the concern, OpenAI acknowledges it doesn’t fully understand the cause. In a technical report, the company said, "We also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate and more inaccurate/hallucinated claims."

It adds that “more research is needed” to explain why hallucinations increase as reasoning capabilities scale.
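The report's point that making more claims can raise both the accurate and the hallucinated counts can be shown with a minimal Python sketch. The two models and all counts below are hypothetical, chosen only for illustration. They are not OpenAI's figures and do not reflect how PersonQA is scored.

```python
from dataclasses import dataclass


@dataclass
class ClaimTally:
    """Graded claims for one model (illustrative numbers, not OpenAI data)."""
    correct: int
    hallucinated: int

    @property
    def total(self) -> int:
        return self.correct + self.hallucinated

    @property
    def hallucination_rate(self) -> float:
        # Share of all claims that were graded as hallucinated.
        return self.hallucinated / self.total if self.total else 0.0


# Hypothetical: a model that makes fewer claims vs. one that makes more.
cautious = ClaimTally(correct=80, hallucinated=20)
talkative = ClaimTally(correct=120, hallucinated=40)

for name, tally in [("cautious", cautious), ("talkative", talkative)]:
    print(
        f"{name}: {tally.correct} correct, "
        f"{tally.hallucinated} hallucinated, "
        f"rate {tally.hallucination_rate:.0%}"
    )
```

In this made-up case the second model gets more claims right and more claims wrong at the same time, which is the pattern the report describes for o3 relative to o1. Whether the rate also rises depends on how the extra claims split between correct and incorrect.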

While AI hallucinations — where a model fabricates information — are a known challenge, previous iterations of OpenAI’s models had gradually reduced this issue. However, that has not been the case with o3 and o4-mini.

AI researchers such as Gary Marcus have long warned about the hallucinatory behavior of large language models, and recent developments seem to validate those concerns.

Explaining hallucinations in LLMs in an earlier post on X (formerly Twitter), he said, "LLM "hallucinations" arise, regularly, because (a) they literally don't know the difference between truth and falsehood, (b) they don't have reliably reasoning processes to guarantee that their inferences are correct and (c) they are incapable of fact-checking their own work. Instead, everything that LLMs say – true or false – comes from the same process of statistically reconstructing what words are likely in some context."

So many people are confused about the relation between human cognitive errors and LLM hallucinations that I wrote this short explainer:

Humans say things that aren't true for many different reasons
• Sometimes they lie
• Sometimes they misremember things
• Sometimes they fail…
— Gary Marcus (@GaryMarcus) April 21, 2024

Separately, OpenAI is reportedly in discussions to acquire Windsurf, the maker of a popular AI-powered coding assistant, in a deal valued at around $3 billion.

If finalised, the acquisition would position OpenAI in direct competition with other AI coding tool providers like Cursor.
