GPT-5.5 Tops AI Benchmarks Despite 20% API Cost Increase and High Hallucination Rate

GPT-5.5's API costs have risen by approximately 20% compared to GPT-5.4. Although the nominal price has doubled, to $5 per million input tokens and $30 per million output tokens, benchmarking service Artificial Analysis finds that GPT-5.5 uses about 40% fewer tokens, which brings the effective price hike down to roughly 20%. That is still more modest than the effective increase for Anthropic's Opus 4.7, which keeps its listed price unchanged but consumes 35% to 40% more tokens. The release also returns OpenAI to the top of the AI rankings: GPT-5.5 leads the Artificial Analysis Intelligence Index with 60 points, three ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, both tied at 57.
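The net-price arithmetic can be checked with a quick back-of-the-envelope calculation. The sketch below assumes an illustrative workload of one million input and one million output tokens, and infers GPT-5.4's prices ($2.50 in / $15 out per million tokens) from the statement that GPT-5.5's prices "doubled" to $5 / $30:

```python
def workload_cost(input_mtok, output_mtok, in_price, out_price):
    """Cost in dollars for a workload measured in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

# GPT-5.4: assumed baseline workload at the inferred pre-doubling prices.
old_cost = workload_cost(1.0, 1.0, 2.50, 15.00)
# GPT-5.5: ~40% fewer tokens for the same work, at doubled per-token prices.
new_cost = workload_cost(0.6, 0.6, 5.00, 30.00)

increase = new_cost / old_cost - 1
print(f"Net cost change: {increase:+.0%}")  # → +20%
```

Doubled prices times 60% of the tokens gives a factor of 1.2, matching the roughly 20% net increase cited above.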

On price-performance, GPT-5.5 matches Claude Opus 4.7's score at maximum compute at roughly a quarter of the cost: around $1,200 versus $4,800. Google's Gemini 3.1 Pro Preview posts comparable numbers at an even lower cost of approximately $900. Benchmarks do not tell the whole story, however: our tests and developer feedback suggest that Gemini excels mainly at everyday versatility across Google products and at vision tasks, while the latest OpenAI and Anthropic models tend to outperform it on coding and agentic work.

The primary weakness of OpenAI's new model is its tendency to hallucinate. On Artificial Analysis' AA Omniscience benchmark, which rewards factual recall and penalizes incorrect answers, GPT-5.5 posts the highest accuracy of any model at 57%. Yet its hallucination rate stands at a high 86%, far above Claude Opus 4.7's 36% and Gemini 3.1 Pro Preview's 50%. The 14-point jump over GPT-5.4 on this benchmark was driven predominantly by improved factual recall, with only modest gains in reducing hallucinations. A crucial trait for an AI model is knowing when to pass or admit uncertainty, and by that measure GPT-5.5 looks like a step backward rather than forward in factual reliability.
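The accuracy-versus-hallucination tradeoff can be illustrated with a small sketch. The article does not spell out the benchmark's exact scoring, so the definition below is an assumption: hallucination rate is taken to be the share of questions a model does not answer correctly that it answers wrongly rather than abstaining on, so saying "I don't know" lowers it while guessing wrong raises it. The question tallies are made up, chosen only to mirror the reported figures:

```python
def accuracy(correct, incorrect, abstained):
    """Fraction of all questions answered correctly."""
    return correct / (correct + incorrect + abstained)

def hallucination_rate(incorrect, abstained):
    """Assumed definition: of the questions not answered correctly,
    the fraction that were confident wrong answers rather than passes."""
    if incorrect + abstained == 0:
        return 0.0
    return incorrect / (incorrect + abstained)

# Hypothetical tallies out of 100 questions. The "guesser" recalls a lot but
# almost never abstains; the "cautious" model passes when unsure.
guesser = dict(correct=57, incorrect=37, abstained=6)
cautious = dict(correct=40, incorrect=22, abstained=38)

print(accuracy(**guesser))                                  # high accuracy...
print(hallucination_rate(37, 6))                            # ...but ~86% hallucination
print(hallucination_rate(22, 38))                           # cautious model: ~37%
```

Under this assumed metric, a model can raise its accuracy and its hallucination rate at the same time simply by guessing more, which is consistent with the pattern described above: better recall, little progress on knowing when to pass.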
