A recent developer-led comparison set out to evaluate Google's Gemini 3.1 Pro and OpenAI's GPT-5.4 beyond standard benchmarks, focusing on real-world performance across four task categories. The evaluation tracked quality, speed, and actual per-task operational costs.
Test Setup
The test involved a total of 500 identical tasks distributed across four main categories:
- Coding: 150 tasks
- Reasoning/Math: 100 tasks
- Document Analysis: 150 tasks
- Creative Writing: 100 tasks
Both models received identical prompts. Quality was scored on a 1-5 scale via human evaluation (the primary tester plus two colleagues, averaged). Costs were meticulously tracked per-task, including cache hits.
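For concreteness, the scoring aggregation might look like the sketch below; the ratings shown are made up purely to illustrate the shape of the data (three 1-5 ratings per task, averaged per task, then per category).

```python
# Sketch of the quality-score aggregation described above: each task gets
# three human ratings (1-5), averaged per task, then per category.
from statistics import mean

# Hypothetical raw data: {category: [(rater1, rater2, rater3), ...per task]}
ratings = {
    "Coding": [(4, 5, 4), (4, 4, 5)],
    "Reasoning/Math": [(4, 4, 4)],
}

category_scores = {
    cat: mean(mean(task) for task in tasks)
    for cat, tasks in ratings.items()
}
print(category_scores)  # e.g. {'Coding': 4.33..., 'Reasoning/Math': 4.0}
```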
Results Summary
Overall, GPT-5.4 edged out Gemini 3.1 Pro on quality by 0.1 points (4.2 vs. 4.1), a margin small enough to call a tie, which is how the overall row below records it. Gemini 3.1 Pro, meanwhile, delivered a significant cost advantage: 31% lower total spend ($46.80 vs. $67.75).
| Category | GPT-5.4 Quality | Gemini 3.1 Pro Quality | Quality Winner | GPT-5.4 Cost | Gemini 3.1 Pro Cost | Gemini Savings |
|---|---|---|---|---|---|---|
| Coding (150 tasks) | 4.3 | 4.1 | GPT | $18.75 | $13.20 | 30% |
| Reasoning (100 tasks) | 4.1 | 4.2 | Gemini | $14.50 | $10.80 | 26% |
| Document Analysis (150 tasks) | 4.0 | 4.2 | Gemini | $22.50 | $14.40 | 36% |
| Creative Writing (100 tasks) | 4.4 | 4.0 | GPT | $12.00 | $8.40 | 30% |
| Overall | 4.2 | 4.1 | Tie | $67.75 | $46.80 | 31% |
Category Breakdown
Coding: GPT-5.4 Wins (Slightly)
GPT-5.4 scored 4.3 to Gemini's 4.1 on coding tasks. The gap showed up mainly in:
- Multi-file refactoring: GPT showed a better understanding of relationships across multiple files.
- Edge case handling: GPT was more effective at identifying and handling edge cases in generated code.
- Simple functions: Quality was essentially identical, with the gap primarily appearing in complex tasks.
For straightforward coding tasks (e.g., CRUD operations, API integrations, utility functions), the quality difference is negligible, making Gemini the better value at roughly 30% lower cost.
Reasoning: Gemini Wins
Gemini scored 4.2 to GPT's 4.1 on math and logic tasks. A key finding: Gemini's "thinking mode" produced more thorough chain-of-thought reasoning with more predictable billing. Gemini folds reasoning tokens into its standard output price ($12/M), whereas OpenAI's o3 bills its reasoning as hidden output tokens at $8/M; because reasoning traces often run several times longer than the visible answer, those hidden tokens can inflate bills by 3-10x.
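To make that inflation concrete, here is a back-of-the-envelope sketch. The token counts are hypothetical; only the $8/M rate comes from the comparison above.

```python
# Back-of-the-envelope: how hidden reasoning tokens can inflate an
# o3-style bill. Token counts are hypothetical; the rate is from the text.

O3_OUTPUT_RATE = 8.00 / 1_000_000  # $/token; hidden reasoning billed as output

visible_answer = 1_000    # tokens the caller actually sees
hidden_reasoning = 5_000  # reasoning tokens billed but never shown

billed = (visible_answer + hidden_reasoning) * O3_OUTPUT_RATE
expected = visible_answer * O3_OUTPUT_RATE  # cost if only the answer were billed

print(f"billed ${billed:.4f} vs expected ${expected:.4f} "
      f"({billed / expected:.0f}x inflation)")
# -> billed $0.0480 vs expected $0.0080 (6x inflation)
```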
Document Analysis: Gemini Wins Clearly
Gemini 3.1 Pro's 2M context window provided a distinct advantage here. For documents exceeding 200K tokens:
- GPT-5.4: Requests beyond 272K tokens trigger a long-context surcharge, effectively doubling input pricing to $5.00/M.
- Gemini 3.1 Pro: Maintains a flat $2.00/M price up to 2M tokens, without any surcharges on its Pro version.
For instance, processing a 500K-token document costs $1.00 with Gemini, versus $2.50 with GPT-5.4, delivering 60% savings for comparable quality.
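A minimal sketch of that pricing logic, assuming the whole request is billed at the surcharge rate once it crosses the 272K threshold; the function names and exact tiering are assumptions reverse-engineered from the numbers above, not the providers' documented billing.

```python
# Hypothetical long-context input pricing, reverse-engineered from the
# worked example above: the whole request is billed at the surcharge
# rate once it crosses the threshold. Real billing tiers may differ.

def gpt54_input_cost(tokens: int) -> float:
    base, surcharged, threshold = 2.50, 5.00, 272_000  # $/M, $/M, tokens
    rate = surcharged if tokens > threshold else base
    return tokens * rate / 1_000_000

def gemini_input_cost(tokens: int) -> float:
    return tokens * 2.00 / 1_000_000  # flat $2.00/M up to 2M tokens

doc = 500_000  # the 500K-token document from the example
print(f"GPT-5.4: ${gpt54_input_cost(doc):.2f}")   # -> $2.50
print(f"Gemini:  ${gemini_input_cost(doc):.2f}")  # -> $1.00
```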
Creative Writing: GPT-5.4 Wins
GPT-5.4 achieved a score of 4.4 against Gemini's 4.0, marking the largest quality disparity across all categories. GPT generated more natural and varied prose, while Gemini's output, though competent, tended to be slightly formulaic. For applications where creative writing quality is paramount, GPT-5.4's premium is justifiable.
Pricing Math
| Metric | GPT-5.4 | Gemini 3.1 Pro | Difference |
|---|---|---|---|
| Input ($/M tokens) | $2.50 | $2.00 | Gemini 20% cheaper |
| Output ($/M tokens) | $15.00 | $12.00 | Gemini 20% cheaper |
| Cache hit ($/M tokens) | $0.25 | $0.20 | Gemini 20% cheaper |
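As a closing sanity check, here is a small per-task estimator built from the rate card above; the token counts in the example are invented for illustration, and cache-hit tokens are assumed to be billed at the discounted rate in place of the normal input rate.

```python
# Per-task cost estimate from the rate card above. Token counts are
# hypothetical; cached input tokens are billed at the cache-hit rate.

RATES = {  # $/M tokens: (input, output, cache hit)
    "GPT-5.4":        (2.50, 15.00, 0.25),
    "Gemini 3.1 Pro": (2.00, 12.00, 0.20),
}

def task_cost(model: str, fresh_in: int, cached_in: int, out: int) -> float:
    in_rate, out_rate, cache_rate = RATES[model]
    return (fresh_in * in_rate + cached_in * cache_rate + out * out_rate) / 1e6

# Example task: 6K fresh input, 4K cached input, 2K output (hypothetical).
for model in RATES:
    print(f"{model}: ${task_cost(model, 6_000, 4_000, 2_000):.4f}")
# GPT-5.4:        $0.0460
# Gemini 3.1 Pro: $0.0368
```

Since every rate is exactly 20% lower on Gemini's side, any mix of input, output, and cached tokens comes out 20% cheaper; the larger observed savings in the test (26-36%) come from usage patterns such as the long-context surcharge discussed above.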