The UK AI Security Institute (AISI) reports that OpenAI's GPT-5.5 performed on par with Anthropic's Claude Mythos Preview in recent cyberattack evaluations. AISI views this as further evidence of a broader trend in the capabilities of AI-powered attacks.
AISI subjected OpenAI's GPT-5.5 to a series of cyberattack tests, which showed it to be the second model, after Claude Mythos Preview, to fully complete a multi-stage enterprise attack simulation. GPT-5.5 even slightly surpassed Anthropic's model on isolated expert-level security tasks. For AISI, this indicates that the capabilities first observed in Claude Mythos in April are not an outlier, but rather a consequence of broad advances in AI autonomy, reasoning, and coding.
For isolated expert tasks, AISI evaluates AI models using a suite of 95 capture-the-flag tasks across four difficulty levels. These advanced tasks, developed in collaboration with cybersecurity firms Crystal Peak Security and Irregular, cover reverse engineering, exploit development for various memory flaws, cryptographic attacks, and unpacking obfuscated malware. At the highest "Expert" difficulty, GPT-5.5 achieved an average success rate of 71.4%, while Claude Mythos Preview scored 68.6%. Though the gap is within the statistical margin of error, GPT-5.5 may be the strongest model tested to date. For context, GPT-5.4 scored 52.4% and Claude Opus 4.7 achieved 48.6%.
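To see why a 71.4% vs. 68.6% gap can fall "within the statistical margin of error," one can compare confidence intervals for the two pass rates. The sketch below uses the Wilson score interval, a standard method for binomial proportions; the per-tier task count is not given in the article, so the figure of 35 Expert-tier tasks is purely an illustrative assumption, as are the rounded success counts.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2)
    )
    return centre - half, centre + half

# Hypothetical numbers: the article does not state how many of the 95
# tasks are Expert-tier, so 35 trials is an assumption for illustration.
trials = 35
for name, rate in [("GPT-5.5", 0.714), ("Claude Mythos Preview", 0.686)]:
    lo, hi = wilson_interval(round(rate * trials), trials)
    print(f"{name}: {rate:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

With samples this small, the two intervals overlap substantially, which is why a roughly three-point gap does not establish that one model is stronger than the other.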
Beyond isolated tasks, GPT-5.5 also successfully navigated a full network attack simulation. AISI used cyber ranges: simulated network environments with multiple hosts, services, and vulnerabilities. The simulation, named "The Last Ones" (TLO), involved 32 steps across four subnets and approximately 20 hosts. The AI agent began without credentials and had to discover vulnerabilities, steal credentials, move laterally through the network, and ultimately access a protected database. AISI estimates this process would take a human expert about 20 hours.
GPT-5.5 fully solved the TLO simulation in 2 out of 10 attempts, close to Claude Mythos Preview's 3 out of 10. AISI notes that performance continues to scale with inference compute, suggesting that even the best models have not yet plateaued: the more tokens a model spends on "thinking," the higher its likelihood of executing a successful hack.
It's important to note, however, that these tests lacked active defenders, security monitoring, and any real-world consequences for actions that would typically trigger alarms. Whether GPT-5.5 or Mythos could effectively operate against well-defended systems remains an open question.