AI + ML
Mythos and GPT-5.5 muscle out the competition
Sure, AI agents such as Mythos can find security vulnerabilities in software, but the bigger question is whether they can turn those flaws into functional exploits that work in the real world. After all, many AI-discovered bugs prove minor or difficult to weaponize. New research, however, suggests frontier models can indeed develop working exploits when directed to do so.
To better understand the rapidly changing security landscape, computer scientists from UC Berkeley, Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google decided to build ExploitGym, a benchmark for evaluating the exploitation capabilities of AI agents.
This is not an entirely disinterested set of investigators – Anthropic, OpenAI, and Google all sell AI services. And both Anthropic and OpenAI have talked up the risks posed by their leading models, Claude Mythos Preview and GPT-5.5, while selling access to government partners.
Since Anthropic announced Mythos in early April, the security community has been critical of the company’s approach, described by some as fear-mongering. And various security experts have made the case that even commercially available AI models can find security flaws.
Nonetheless, Mythos and GPT-5.5 outshine their peers in ExploitGym, as described in the paper, “ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?”
ExploitGym consists of 898 real vulnerabilities found in userspace applications, Google’s V8 JavaScript engine, and the Linux kernel. Its workout consists of presenting an AI agent with a vulnerability and a proof-of-concept input that triggers it, to see whether the agent can create an exploit capable of arbitrary code execution.
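The paper's actual scoring harness isn't reproduced here, so the following is a minimal Python sketch of how that kind of pass/fail check might look. All names (`run_exploit`, the sentinel scheme) are illustrative assumptions, and a per-run sentinel appearing in the exploit's output merely stands in for real proof of arbitrary code execution:

```python
# Illustrative sketch only -- not ExploitGym's real API. The idea: hand the
# agent's candidate exploit a fresh secret token, run it against the target,
# and count the run as a success only if the token shows up in the output,
# which stands in for proof of arbitrary code execution.
import os
import secrets
import subprocess
import sys

def run_exploit(exploit_source: str, timeout_s: int = 10) -> bool:
    """Run a candidate exploit; success means it proved code execution
    by emitting the per-run sentinel it was given via the environment."""
    sentinel = secrets.token_hex(8)  # fresh each run, so it can't be hardcoded
    proc = subprocess.run(
        [sys.executable, "-c", exploit_source],
        env={**os.environ, "SENTINEL": sentinel},
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return sentinel in proc.stdout

# A trivial "exploit" that proves code execution by echoing the sentinel.
demo_exploit = "import os; print(os.environ['SENTINEL'])"
```

In the real benchmark the target is a vulnerable binary, V8 build, or kernel rather than a Python one-liner, but the grading logic reduces to the same question: did the agent's payload make the target do something it was never meant to do?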
According to the UC Berkeley Center for Responsible Decentralized Intelligence, Mythos Preview successfully exploited 157 test instances and GPT-5.5 managed 120 in the allotted two-hour window.
“Even when standard security defenses like ASLR or the V8 sandbox were turned on, a meaningful number of exploits still worked,” the boffins wrote in a blog post. “More strikingly, agents sometimes discovered and exploited entirely different vulnerabilities than the ones they were pointed at.”
The agents (CLI + model) tested were Claude Code with Claude Opus 4.6, Claude Opus 4.7, Claude Mythos Preview, and GLM-5.1; Codex CLI with GPT-5.4/GPT-5.5; and Gemini CLI with Gemini 3.1 Pro. And even the ancient models released in February (Opus 4.6 and Gemini 3.1 Pro) had some success.
| Model | Agent | Total | U | B | K | Cost Succ. (USD) | Cost Full (USD) | Time Succ. (min) | Time Full (min) |
|---|---|---|---|---|---|---|---|---|---|
| Claude Mythos Preview † | Claude Code | 157 | 107 | 38 | 12 | – | – | 54.7 | 102.1 |
| Claude Opus 4.6 † | Claude Code | 15 | 12 | 2 | 1 | 8.08 | 21.76 | 18.1 | 66.7 |
| Claude Opus 4.7 | Claude Code | 7 | 4 | 3 | 0 | 8.64 | 3.40 | 22.1 | 14.4 |
| Gemini 3.1 Pro | Gemini CLI | 12 | 10 | 2 | 0 | 8.56 | 9.02 | 51.1 | 75.6 |
| GLM-5.1 | Claude Code | 4 | 4 | 0 | 0 | 3.75 | 6.39 | 63.3 | 118.0 |
| GPT-5.4 | Codex CLI | 54 | 38 | 15 | 1 | 12.20 | 25.43 | 51.1 | 103.5 |
| GPT-5.5 ‡ | Codex CLI | 120 | 71 | 27 | 22 | 22.99 | 34.55 | 49.6 | 69.8 |
U = Userspace · B = Browser V8 · K = Kernel ·
Succ. = successful runs · Full = full benchmark ·
† preview model · ‡ see notes
The researchers say that one of their more interesting findings is that these models sometimes went “off-script” in capture-the-flag (CTF) environments, where an agent has to find and retrieve some hidden value.
This was most evident with Mythos Preview and GPT-5.5. The former succeeded in 226 CTF exercises but only used the intended bug in 157 instances, while the latter captured 210 flags and only used the intended bug in 120 of those cases.
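The intended-bug fractions behind those figures are easy to work out; this snippet just does the arithmetic on the numbers reported above:

```python
# Of the flags each model captured, what fraction came via the intended bug?
# Figures are those reported in the ExploitGym paper, per the article.
results = {
    "Claude Mythos Preview": {"flags": 226, "intended": 157},
    "GPT-5.5": {"flags": 210, "intended": 120},
}

fracs = {m: r["intended"] / r["flags"] for m, r in results.items()}
for model, frac in fracs.items():
    print(f"{model}: {frac:.0%} of captures used the intended bug")
# -> Claude Mythos Preview: 69% of captures used the intended bug
# -> GPT-5.5: 57% of captures used the intended bug
```

In other words, roughly a third to nearly half of successful captures went through some other, unintended flaw in the target.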
The authors also note that while there was some overlap in the exploits discovered, the various models found different exploits. This suggests applying a diverse set of models might be advantageous in both attack and defense scenarios.
It’s worth adding that ExploitGym tests were done with security guardrails disabled. When the test was re-run on GPT-5.5 with default safety filters active, the model refused 88.2 percent of the time before making any tool call.
The Register, however, has seen security researchers craft prompts to avoid triggering refusals, so safeguards of that sort have their limits.
“Our results show that autonomous exploit development by frontier AI agents is no longer a hypothetical capability,” the authors state in their paper. “While current agents are not yet reliable across all targets, they already exploit a non-trivial fraction of real-world vulnerabilities, including complex targets such as kernel components.” ®