AI + ML
UK researchers find LLMs are learning to finish jobs faster and improving all the time
The UK AI Security Institute (AISI) has found that frontier models are quickly becoming more efficient when asked to do some cybersecurity work.
AISI measures this with its “time window benchmark for cybersecurity,” which estimates how much work an AI can do compared to a human. The benchmark yields findings such as: Claude Sonnet 4.5 can complete, about 80 percent of the time, tasks that take a human cybersecurity expert 16 minutes, given a budget of 2.5 million tokens.
AISI has found the human-comparable task time – 16 minutes in this instance – is growing, fast. If tokens flowed freely instead of being arbitrarily capped, AI models might do better still.
In February 2026, AISI internally reduced the expected task time doubling period from 8 to 4.7 months, based on progress made since late 2024.
With the release of Anthropic Mythos Preview and OpenAI GPT-5.5, AISI has once again had to compress its projected doubling period.
“In February 2026, we estimated that frontier models’ 80 percent-reliability cyber time horizon had doubled every 4.7 months since reasoning models emerged in late 2024, given a 2.5M token limit,” the AISI said in a post on Wednesday.
“This was around half our November 2025 doubling time estimate, which was 8 months for both 50 percent and 80 percent reliability. Claude Mythos Preview and GPT-5.5 have since significantly outperformed this trend.”
The recalculated doubling time estimate, given what Mythos Preview and GPT-5.5 can do, is even shorter than 4.7 months. AISI does not cite a specific value, but points to similar time horizon estimates, based on measurements of a broader skillset, software engineering, made by non-profit AI research house METR.
“Their results imply a consistent doubling time of 4.2 months on software tasks since late 2024,” AISI said, noting that with the latest Mythos Preview checkpoint (model update), it’s closer to 4 months.
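The arithmetic behind these projections is simple exponential growth. As an illustrative sketch only (this is not AISI's or METR's methodology, and the starting 16-minute horizon and 4.7-month doubling period are taken from the figures quoted above), the human-comparable task time after a given number of months can be computed like this:

```python
def time_horizon(start_minutes: float, months_elapsed: float,
                 doubling_months: float) -> float:
    """Task-time horizon after `months_elapsed`, assuming it doubles
    every `doubling_months` months (pure exponential growth)."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)

# With a 16-minute horizon today and a 4.7-month doubling period,
# one year out the horizon would be roughly 94 minutes:
print(round(time_horizon(16, 12, 4.7), 1))  # prints roughly 93.9
```

If the doubling period has in fact compressed toward 4 months, as the latest Mythos Preview checkpoint suggests, the same calculation gives a steeper curve, which is why small changes in the doubling estimate matter so much for projections.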
Note that the time window benchmark is not a broad assessment of capabilities – AISI is not saying frontier models are becoming twice as capable by all measures. It’s a narrow assessment based on the time it takes people to accomplish security tasks.
Citing a different metric, AISI says the latest Mythos Preview checkpoint solved a 32-step simulated corporate network attack called “The Last Ones” in six of 10 attempts and managed to complete a previously unsolved challenge, a seven-step industrial control system attack called “Cooling Tower,” in three of 10 attempts.
As a point of comparison, when Opus 4.6 was evaluated in February 2026, it completed a maximum of 22 of 32 steps for The Last Ones. That model managed to reach milestone 6, which involves reverse-engineering a Windows service binary to access encrypted credentials, escalating privileges via token impersonation, and recovering a cryptographic key to access a command-and-control management service.
“Frontier AI’s autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years,” AISI concludes. “What this evidence does not tell us is how the pace of progress will evolve, when AI will reach any particular capability threshold, or how these capabilities will translate against defended, real-world systems.”
The curl project offers one data point on the real-world implications of the latest frontier models: Mythos managed to find just one confirmed vulnerability in its codebase.
But watch this space. ®