AI agents show they can create exploits, not just find vulns


Mythos and GPT-5.5 muscle out the competition

Sure, AI agents such as Mythos can find security vulnerabilities in software, but the bigger question is whether they can turn those flaws into functional exploits that work in the real world. After all, many AI-discovered bugs prove minor or difficult to weaponize. New research, however, suggests frontier models can indeed develop working exploits when directed to do so.

To better understand the rapidly changing security landscape, computer scientists from UC Berkeley, Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google decided to build ExploitGym, a benchmark for evaluating the exploitation capabilities of AI agents.

This is not an entirely disinterested set of investigators – Anthropic, OpenAI, and Google all sell AI services. And both Anthropic and OpenAI have talked up the risks posed by their leading models, Claude Mythos Preview and GPT-5.5, while selling access to government partners.

Since Anthropic announced Mythos in early April, the security community has been critical of the company’s approach, described by some as fear-mongering. And various security experts have made the case that even commercially available AI models can find security flaws.

Nonetheless, Mythos and GPT-5.5 outshine their peers in ExploitGym, as described in the paper, “ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?”

ExploitGym consists of 898 real vulnerabilities found in userspace applications, Google's V8 JavaScript engine, and the Linux kernel. Each workout presents an AI agent with a vulnerability and a proof-of-concept input that triggers it, to see whether the agent can craft an exploit capable of arbitrary code execution.
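
The paper's harness isn't reproduced here, but the shape of a task is easy to picture. Below is a minimal Python sketch of what one ExploitGym-style instance and its success check might look like; the TaskInstance fields and the check_exploit helper are illustrative inventions, not the benchmark's actual API.

```python
from dataclasses import dataclass
import subprocess

@dataclass
class TaskInstance:
    # Illustrative schema; the real benchmark's format isn't published here.
    target: str        # e.g. "userspace", "v8", or "kernel"
    bug_id: str        # identifier of the known vulnerability
    poc_input: bytes   # crashing input handed to the agent as a starting point
    sentinel: bytes    # marker an exploit must emit to prove code execution

def check_exploit(task: TaskInstance, exploit_path: str,
                  timeout_s: int = 120) -> bool:
    """Run the agent-written exploit and look for the sentinel in its
    output, standing in here for proof of arbitrary code execution."""
    try:
        result = subprocess.run(
            [exploit_path], input=task.poc_input,
            capture_output=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return task.sentinel in result.stdout
```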

According to the UC Berkeley Center for Responsible Decentralized Intelligence, Mythos Preview successfully exploited 157 test instances and GPT-5.5 managed 120 in the allotted two-hour window.

“Even when standard security defenses like ASLR or the V8 sandbox were turned on, a meaningful number of exploits still worked,” the boffins wrote in a blog post. “More strikingly, agents sometimes discovered and exploited entirely different vulnerabilities than the ones they were pointed at.”

The agents (CLI + model) tested were Claude Code with Claude Opus 4.6, Claude Opus 4.7, Claude Mythos Preview, and GLM-5.1; Codex CLI with GPT-5.4/GPT-5.5; and Gemini CLI with Gemini 3.1 Pro. And even the ancient models released in February (Opus 4.6 and Gemini 3.1 Pro) had some success.

Agent success rates by category (userspace, browser V8, kernel), with cost and time, across the models tested:

Model                   Agent         Total    U    B    K   Cost (USD)       Time (min)
                                                             Succ.    Full    Succ.    Full
Claude Mythos Preview†  Claude Code     157  107   38   12    54.7            102.1
Claude Opus 4.6         Claude Code      15   12    2    1    8.08   21.76     18.1    66.7
Claude Opus 4.7         Claude Code       7    4    3    0    8.64    3.40     22.1    14.4
Gemini 3.1 Pro          Gemini CLI       12   10    2    0    8.56    9.02     51.1    75.6
GLM-5.1                 Claude Code       4    4    0    0    3.75    6.39     63.3   118.0
GPT-5.4                 Codex CLI        54   38   15    1   12.20   25.43     51.1   103.5
GPT-5.5                 Codex CLI       120   71   27   22   22.99   34.55     49.6    69.8

U = Userspace · B = Browser V8 · K = Kernel
Succ. = successful runs · Full = full benchmark
† preview model · ‡ see notes

The researchers say that one of their more interesting findings is that these models sometimes went “off-script” in capture-the-flag (CTF) environments, where an agent has to find and retrieve some hidden value.

This was most evident with Mythos Preview and GPT-5.5. The former succeeded in 226 CTF exercises but used the intended bug in only 157 of them, while the latter captured 210 flags, only 120 of those via the intended bug.
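
That gap between flags captured and intended bugs used falls out of how CTF-style scoring works: the grader checks only whether the agent produced the secret value, not which hole it crawled through to get it. A toy Python sketch, with an invented run format rather than the paper's, of counting both figures:

```python
def tally(runs):
    """Count flags captured vs. wins that used the intended bug.

    Each run is an assumed (transcript, flag, used_intended_bug) triple;
    the last field would come from triaging the agent's exploit afterwards.
    """
    captured = sum(1 for transcript, flag, _ in runs
                   if flag in transcript)
    on_script = sum(1 for transcript, flag, used in runs
                    if flag in transcript and used)
    return captured, on_script

# Under this scheme, Mythos Preview's results would tally as (226, 157)
# and GPT-5.5's as (210, 120).
```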

The authors also note that while there was some overlap in the exploits discovered, each model found exploits the others missed. This suggests that applying a diverse set of models might be advantageous in both attack and defense scenarios.
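
Treating each model's solved instances as a set makes that diversity argument concrete: the union an ensemble covers can be meaningfully larger than any single model's tally. A quick illustration with invented instance IDs, not the paper's data:

```python
# Hypothetical solved-instance IDs per model -- illustrative only.
solved = {
    "model_a": {1, 2, 3, 5, 8},
    "model_b": {2, 3, 13, 21},
    "model_c": {1, 34, 55},
}

union = set().union(*solved.values())
best_single = max(solved.values(), key=len)
print(f"best single model: {len(best_single)} instances")  # 5
print(f"ensemble union:    {len(union)} instances")        # 9
```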

It’s worth adding that ExploitGym tests were done with security guardrails disabled. When the test was re-run on GPT-5.5 with default safety filters active, the model refused 88.2 percent of the time before making any tool call. 

The Register, however, has seen security researchers craft prompts in ways that avoid triggering refusals. So safeguards of that sort have limits.
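
For what it's worth, measuring a refusal rate like that 88.2 percent figure is simple in principle: classify each run by whether the model refused before issuing its first tool call. A minimal sketch, assuming an event-list transcript format of our own devising:

```python
def refused_before_tools(events: list[dict]) -> bool:
    """True if the run shows a refusal with no tool call ever issued.

    `events` uses an assumed format: dicts with a 'type' of either
    'tool_call' or 'message', messages optionally flagged 'refusal'.
    """
    for event in events:
        if event["type"] == "tool_call":
            return False
        if event["type"] == "message" and event.get("refusal"):
            return True
    return False

def refusal_rate(runs: list[list[dict]]) -> float:
    return sum(map(refused_before_tools, runs)) / len(runs)
```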

“Our results show that autonomous exploit development by frontier AI agents is no longer a hypothetical capability,” the authors state in their paper. “While current agents are not yet reliable across all targets, they already exploit a non-trivial fraction of real-world vulnerabilities, including complex targets such as kernel components.” ®
