
Anthropic Unveils Claude Opus 4.7: Prioritizing Reliability for Advanced Engineering, Outperforming Rivals

Anthropic continues its aggressive pace in the AI landscape, recently unveiling Claude Opus 4.7.

Anthropic candidly stated that Opus 4.7 is not its most powerful model; the even stronger Claude Mythos Preview remains under wraps. Nevertheless, Opus 4.7 has drawn significant attention for prioritizing reliability over raw intelligence: rather than simply complying, the model will challenge flawed user proposals and proactively address underlying issues.

Opus 4.7 posts significant gains across benchmarks:

  • Code generation: On the challenging SWE-bench Pro, it improved from 53.4% to 64.3%, outperforming GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%).
  • Visual reasoning: CharXiv rose from 69.1% to 82.1%, aided by new support for images up to 2576 pixels on the long edge, over three times the previous clarity. The added resolution improves detail accuracy in tasks such as interface generation and document layout.
  • Tool calling: On the MCP-Atlas benchmark, 4.7 scored 77.3%, surpassing GPT-5.4 (68.1%) and Gemini (73.9%).
  • Legal AI: On Harvey's BigLaw benchmark, it achieved 90.9%, accurately distinguishing complex concepts such as "assignment clauses" from "change of control clauses."

However, Opus 4.7 slipped in the agentic search benchmark BrowseComp, dropping from 83.7% to 79.3% and trailing GPT-5.4 (89.3%) and Gemini (85.9%). Anthropic attributes this to 4.7's design choice of reporting missing information directly rather than hallucinating an answer, a behavior penalized by metrics that reward always producing a response.

Beyond the benchmark numbers, the practical implications of this "reliability" are significant. Unlike previous code models, which were often limited to "writing functions or finding bugs," Opus 4.7 showed a distinctly "colleague-like" quality in early tests. Replit's head noted, "It challenges me in technical discussions, helping me make better decisions. It truly feels like a better colleague."

Opus 4.7 moves beyond simply "obeying" or fabricating. In Hex's tests it reported missing data directly, where its predecessor might insert incorrect but plausible values. The Hex team also observed that "4.7 at low effort is equivalent to 4.6 at medium effort." This "refusal to be subservient" is a critical trait for advanced software engineering. It does demand more of users, though: vague prompts that older models would "interpret" are now executed literally, so clearer instructions are needed for optimal results.
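
To make the explicitness point concrete, here is a minimal sketch using the Anthropic Python SDK. The model ID `claude-opus-4-7` is a placeholder assumption, since the article does not give the actual API identifier; the prompt content is illustrative only.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical model ID -- the article does not state the real API identifier.
MODEL = "claude-opus-4-7"

# A vague prompt like "clean up this module" may now be executed literally.
# Spelling out scope, constraints, and the definition of "done" avoids that.
response = client.messages.create(
    model=MODEL,
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": (
            "Refactor utils/parser.py to remove the duplicated date-parsing "
            "logic. Keep the public function signatures unchanged, do not "
            "touch other files, and explain any behavior change you make."
        ),
    }],
)
print(response.content[0].text)
```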

Beyond its discerning nature, Opus 4.7 significantly improves task resilience. Historically, multi-step tasks in large models often halted the moment a tool call failed. Notion's tests show that 4.7's tool-call error rate has dropped by two-thirds and, critically, that it can navigate toolchain failures to complete tasks autonomously. This resilience greatly enhances AI's utility in complex workflows.
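
The sketch below shows the kind of agent loop where this matters, using the Messages API's standard tool-use protocol: instead of aborting on a failed tool call, the failure is returned to the model as an error result so it can retry or route around it. The `search_docs` tool and the model ID are assumptions for illustration, not details from the announcement.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative tool definition; neither the tool nor the model ID is from the article.
tools = [{
    "name": "search_docs",
    "description": "Search the internal documentation index.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_search(query: str) -> str:
    # Stub standing in for a real search backend; fails on purpose so the
    # error path below is exercised.
    raise RuntimeError("search backend unreachable")

messages = [{"role": "user", "content": "Summarize our webhook retry policy."}]
while True:
    response = client.messages.create(
        model="claude-opus-4-7",  # hypothetical model ID
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced a final answer
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        try:
            output = run_search(**block.input)
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": output})
        except Exception as exc:
            # Surface the failure instead of crashing the task; the model
            # can retry, rephrase, or work around the broken tool.
            results.append({"type": "tool_result",
                            "tool_use_id": block.id,
                            "content": str(exc),
                            "is_error": True})
    messages.append({"role": "user", "content": results})

print(response.content[-1].text)
```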

Anthropic cited an instance where 4.7 autonomously developed a complete Rust text-to-speech engine from scratch, including neural network models, SIMD kernels, and a browser demo, even self-validating its output via speech recognition. Vercel observed Opus 4.7 performing mathematical proofs before generating system-level code, indicating a shift from mere coding to rigorous engineering design.

To assess its detail rendering, Opus 4.7 was tested across three frontend UI scenarios. For a top-down vinyl record player interface, it achieved a realistic metallic luster and a breathing glow using complex CSS layering rather than simple gradients. When challenged to build an old-fashioned electric fan in pure CSS (no JavaScript), 4.7 adhered strictly to the constraint, rendering a 3D structure, smooth speed transitions, and convincing base perspective and shadows. It also accurately depicted a retro cassette player, complete with vintage noise effects and spinning tape details.

Opus 4.7 is now available on all Claude products and APIs, as well as Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. While base pricing remains $5 per million input tokens and $25 per million output tokens, a new tokenizer means the same text can consume 1.0 to 1.35 times as many tokens as before. Combined with the model's extended "thinking" time on complex tasks, actual usage costs are expected to rise. Anthropic also introduced an "xhigh" effort level, at which 4.7 spends more tokens and time on challenging problems; it is now the default for all Claude Code plans.
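
To put the tokenizer change in concrete terms, here is a back-of-the-envelope cost calculation at the listed prices; the request sizes are made-up inputs for illustration.

```python
# Cost estimate at the listed Opus pricing.
INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int,
                 tokenizer_factor: float = 1.0) -> float:
    """Cost of one request; tokenizer_factor scales token counts
    (the article cites 1.0-1.35x for the new tokenizer)."""
    return (input_tokens * tokenizer_factor * INPUT_PRICE
            + output_tokens * tokenizer_factor * OUTPUT_PRICE)

# Hypothetical request: 20k input tokens, 4k output tokens.
base = request_cost(20_000, 4_000)                       # old tokenizer
worst = request_cost(20_000, 4_000, tokenizer_factor=1.35)
print(f"old: ${base:.2f}, new worst case: ${worst:.2f}")
# old: $0.20, new worst case: $0.27 -- a 35% increase before any extra
# "thinking" tokens are counted.
```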

To align with this workflow, Claude Code introduces two key features:

  • /ultrareview: Gives Pro and Max users three free trials of a deep review session designed to surface deep architectural flaws and subtle bugs, the way a senior code reviewer would.
  • Auto Mode: For Max users, this new permission model enables Claude to make autonomous decisions within authorized boundaries, balancing efficiency for lengthy tasks with enhanced security.

An API public beta for "Task Budgets" is also available, allowing developers to explicitly manage Claude's token expenditure priorities during extended tasks, mitigating unexpected costs.
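
The announcement does not document the beta's request schema, but the underlying idea can be approximated client-side today: track the `usage` field the Messages API already returns and stop issuing calls once a per-task token ceiling is spent. A minimal sketch, with the budget numbers chosen arbitrarily:

```python
import anthropic

client = anthropic.Anthropic()
TOKEN_BUDGET = 200_000  # arbitrary per-task ceiling for this example
spent = 0

def budgeted_call(messages, budget_left):
    # Cap each step's output by what remains of the budget.
    return client.messages.create(
        model="claude-opus-4-7",  # hypothetical model ID
        max_tokens=min(4096, budget_left),
        messages=messages,
    )

messages = [{"role": "user", "content": "Audit this repo's error handling."}]
while spent < TOKEN_BUDGET:
    response = budgeted_call(messages, TOKEN_BUDGET - spent)
    spent += response.usage.input_tokens + response.usage.output_tokens
    if response.stop_reason == "end_turn":
        break  # task finished within budget
    # ...otherwise feed tool results / follow-ups back in and continue.
print(f"tokens spent: {spent}")
```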

Opus 4.7 is not Anthropic's most powerful model. The more advanced Claude Mythos Preview, codenamed "Project Glasswing," has recently been made available to select enterprises for cybersecurity research, but remains unreleased to the general public.
