Discrepancies Emerge Between OpenAI’s and Third-Party Benchmark Results

The difference appears to stem from variations in model configurations and testing environments

A discrepancy between OpenAI’s internal benchmarks and third-party testing of its o3 model is raising concerns about transparency in AI model evaluation.

When OpenAI unveiled o3 in December, it claimed the model could solve over 25% of problems on FrontierMath, a challenging benchmark for mathematical reasoning. However, recent independent testing by Epoch AI found o3 scored closer to 10%.

The gap appears to stem from differences in model configuration and testing environment. OpenAI’s headline score was likely achieved with an internal version of o3 that had access to more compute, whereas the public version released last week is optimized for speed and product use.

This distinction was confirmed by both Epoch AI and the ARC Prize Foundation, which benchmarked an earlier, more powerful version of o3.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” Epoch wrote in a blog post.

Moreover, OpenAI staff acknowledged during a livestream that the production version of o3 is tuned for faster response times and real-world utility, which may result in lower benchmark performance.

Despite the gap, OpenAI maintains that the released o3 delivers a significantly better user experience. Similar benchmark controversies have recently involved xAI and Meta, underscoring the need for greater transparency and consistency in model evaluation across the industry.

According to OpenAI’s own internal benchmarks, its newer models, o3 and o4-mini, hallucinate more often than older reasoning models such as o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.

In fact, on OpenAI’s PersonQA benchmark, o3 hallucinated in response to 33% of queries, more than double the rate of o1 and o3-mini, while o4-mini performed even worse, hallucinating 48% of the time.
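
For context on how such figures are produced, here is a minimal sketch of the underlying arithmetic: a hallucination rate is simply the share of benchmark answers judged to contain a fabricated claim. The function and data below are hypothetical illustrations and are not drawn from OpenAI’s actual evaluation code or the PersonQA dataset.

    # Minimal sketch of how a benchmark hallucination rate is computed.
    # NOTE: illustrative only -- PersonQA's real format and grading pipeline
    # are not described in the article, so the data below is hypothetical.

    def hallucination_rate(judgements: list[bool]) -> float:
        """Fraction of benchmark answers judged to contain a fabricated claim."""
        return sum(judgements) / len(judgements)

    # Hypothetical per-question judgements (True = answer was hallucinated),
    # scaled to match the reported headline figures.
    o3_judgements = [True] * 33 + [False] * 67        # ~33% reported for o3
    o4_mini_judgements = [True] * 48 + [False] * 52   # ~48% reported for o4-mini

    print(f"o3:      {hallucination_rate(o3_judgements):.0%}")       # -> 33%
    print(f"o4-mini: {hallucination_rate(o4_mini_judgements):.0%}")  # -> 48%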

Adding to the concern, OpenAI acknowledges it doesn’t fully understand the cause. In a technical report, the company said, “We also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate and more inaccurate/hallucinated claims.”

Disclaimer: This story was first reported by TechCrunch.