OpenAI Unveils GPT-5.5: First Fully Retrained Base Model Since GPT-4.5, Elevating AI Agentic Capabilities and Multi-Step Task Handling

OpenAI has officially launched GPT-5.5, codenamed “Spud,” its first fully retrained base model since GPT-4.5. This new model is engineered to complete complex multi-step tasks with minimal human direction. It establishes new benchmarks in agentic coding, computer use, and knowledge work, while matching GPT-5.4’s per-token latency. API access, however, is delayed pending additional safety work.

For months, the AI industry has widely acknowledged Anthropic's dominance in the enterprise market. Internal sources indicated that OpenAI had been in a "Code Red" state since at least December 2025, as it watched Anthropic's Annual Recurring Revenue (ARR) surge from $9 billion to $30 billion, progressively eroding OpenAI's B2B market position.

On Thursday, OpenAI delivered its response. GPT-5.5 is now rolling out to Plus, Pro, Business, and Enterprise users within ChatGPT and Codex. The model is designed to operate with limited human direction, functioning across various applications including email, spreadsheets, and calendars.

The core thesis behind GPT-5.5 is "legibility." While previous models often required meticulously structured prompts and multi-step supervision, OpenAI states that 5.5 can take a "messy, multi-part task" and autonomously plan, utilize tools, verify its work, navigate ambiguities, and persist until the task is successfully completed.

Performance gains are concentrated across four critical areas: agentic coding, computer use, knowledge work, and early scientific research. OpenAI characterizes these as domains "where progress depends on reasoning across context and taking action over time."

The benchmark results are strong. GPT-5.5 achieves 82.7% on Terminal-Bench 2.0, a test of complex command-line workflows demanding planning, iteration, and tool coordination. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution across four programming languages, it scores 58.6%, solving more tasks in a single pass than prior models. On GDPval, which tests agents across 44 knowledge-work occupations, it scores 84.9%. On OSWorld-Verified, which measures autonomous operation in real computer environments, it reaches 78.7%. And on Tau2-bench Telecom, it achieves 98.0% without prompt tuning. Across all of these benchmarks, OpenAI reports that GPT-5.5 improves on GPT-5.4's scores while consuming fewer tokens.

This efficiency claim holds significant commercial importance. Typically, larger, more capable models are slower to serve, presenting a cost-quality trade-off for enterprise clients. OpenAI asserts that GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving, thereby delivering a substantial intelligence upgrade without a corresponding increase in response time. Additionally, it uses significantly fewer tokens to complete equivalent tasks in Codex, directly reducing the cost per task for enterprise deployments. Although GPT-5.5 is priced higher per token than GPT-5.4, OpenAI contends that the net outcome is superior results at a lower overall cost.
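The arithmetic behind that claim is straightforward: cost per task is price per token multiplied by tokens consumed, so a higher per-token price can still yield a cheaper task if token usage falls enough. The sketch below illustrates the trade-off with invented figures; the prices and token counts are hypothetical, not OpenAI's actual pricing or measured usage.

```python
# Illustration of the pricing claim: a model priced higher per token can
# still cost less per task if it completes the task in fewer tokens.
# All numbers below are hypothetical, chosen only for demonstration.

def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Dollar cost to complete one task at a given token price and usage."""
    return price_per_million_tokens * tokens_per_task / 1_000_000

# Hypothetical: older model is cheaper per token but uses more tokens.
old_model = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=120_000)
# Hypothetical: newer model costs 40% more per token but uses ~40% fewer tokens.
new_model = cost_per_task(price_per_million_tokens=14.0, tokens_per_task=70_000)

print(f"old model: ${old_model:.2f} per task")   # $1.20
print(f"new model: ${new_model:.2f} per task")   # $0.98
```

Under these assumed numbers, the newer model comes out roughly 18% cheaper per completed task despite its higher sticker price, which is the shape of the argument OpenAI is making.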
