Recent benchmarks released by the UK government’s AI Security Institute (AISI) reveal that AI models have made remarkable strides in conducting comprehensive, multi-step penetration tests that rival human performance on identical tasks.
Back in November 2025, AISI — a research body under the Department for Science, Innovation and Technology (DSIT) — reported that the complexity of cyber tasks handled by top AI models was doubling every eight months.
By February this year, progress had quickened, with task difficulty doubling every 4.7 months. The newest models, Claude Mythos Preview and GPT-5.5, are demonstrating even stronger performance, according to AISI.
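To put the two reported doubling times side by side, the arithmetic can be sketched as below. This is only an illustration: it assumes growth stays exponential at a fixed doubling time over a full year, which AISI's figures do not guarantee, and `growth_factor` is a hypothetical helper name.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Factor by which task length grows over `months`, assuming a fixed doubling time."""
    return 2 ** (months / doubling_months)


# Doubling times quoted in the article: eight months, then 4.7 months.
for doubling in (8.0, 4.7):
    yearly = growth_factor(12, doubling)
    print(f"doubling every {doubling} months -> about {yearly:.1f}x per year")
```

Under that assumption, an eight-month doubling time works out to slightly less than a threefold increase per year, while a 4.7-month doubling time works out to nearly sixfold.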
AISI’s time horizon benchmarks work by first gauging how long a human expert would take to solve various challenges as a difficulty yardstick, then determining the longest task (measured in human work hours) that AI models can finish with an 80% success rate. This evaluates autonomous capability rather than speed: if a human can complete a penetration testing task in 4 hours, the benchmark checks how reliably an AI can replicate that outcome.
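A deliberately simplified sketch of that idea follows. It is not AISI's published methodology, which may differ (for example, by fitting a success curve across tasks rather than applying a per-task cut-off). The task names, timings and outcomes are invented for illustration, and `time_horizon` is a hypothetical function name.

```python
from collections import defaultdict


def time_horizon(attempts, threshold=0.8):
    """Longest human time (hours) among tasks solved at or above `threshold`.

    `attempts` is an iterable of (task_id, human_hours, succeeded) tuples.
    Returns None when no task reaches the threshold.
    """
    # task_id -> [human_hours, successes, trials]
    stats = defaultdict(lambda: [0.0, 0, 0])
    for task_id, human_hours, succeeded in attempts:
        entry = stats[task_id]
        entry[0] = human_hours
        entry[1] += int(succeeded)
        entry[2] += 1
    passing = [hours for hours, wins, trials in stats.values()
               if wins / trials >= threshold]
    return max(passing) if passing else None


# Invented data: five attempts per task, with human expert time in hours.
sample_attempts = (
    [("recon", 0.5, True)] * 5
    + [("exploit-chain", 4.0, True)] * 4 + [("exploit-chain", 4.0, False)]
    + [("full-pentest", 16.0, True)] + [("full-pentest", 16.0, False)] * 4
)
print(time_horizon(sample_attempts))
```

With this made-up data the horizon is the four-hour task: the model solves it in four of five attempts, while the 16-hour task falls well short of the 80% bar.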
For the AI to succeed, it must maintain consistent performance across several steps while retaining context and bouncing back from setbacks. The more steps involved, the tougher the penetration test, and the more significant the findings.
As with any benchmark, there are limitations. To enable fair comparison across models over time, the tests capped AI systems at 2.5 million tokens. Among other effects, that cap limits how much of its earlier work the AI can recall during these benchmarks.
AISI noted in its analysis: “These benchmarks aren’t precise performance indicators; AI struggles with some tasks humans handle quickly, while easily tackling others that stump humans. Still, we rely on this benchmark type because it provides a gauge of AI autonomy from which we can identify trends.”
Mounting risk
The findings are raising alarms within the UK government.
“Our independent assessments reveal that cyber capabilities in leading AI systems are progressing far faster than anticipated. This isn’t hypothetical — those gains are already translating into tangible risks for organisations, particularly those with poor cyber defences,” UK AI Minister Kanishka Narayan stated via email.
“These tools can equally assist cyber security teams in identifying and patching vulnerabilities more swiftly. The UK is at the forefront of testing and comprehending frontier AI, and that expertise will only grow more critical as the technology keeps advancing rapidly,” he continued.
In April, DSIT Secretary of State Liz Kendall and Security Minister Dan Jarvis issued an open letter cautioning businesses about the escalating cyber security threats driven by AI models.
What is evident is that AI model capabilities under real-world conditions are advancing swiftly and, judging by AISI’s recent assessment of Claude Mythos Preview, that advance is likely to accelerate further.
Not all recent AI benchmarking has yielded such striking outcomes. In a test of 19 AI models across tasks spanning coding, crystallography, genealogy, and sheet music notation, Microsoft researchers found the models could be error-prone and inconsistent, particularly on extended tasks.
Kat Traxler, principal security researcher at Vectra AI, considers the benchmarks a meaningful signal that organisations should heed. “The AISI benchmarks don’t assess whether models can detect a vulnerability. Instead, they evaluate whether different models can string together multiple exploits into functional attacks to reach a final objective, as real-world attackers would. As an indicator of offensive capability, AISI’s findings hold genuine significance,” she explained.
She did, however, reference a recent Xbow assessment of Claude Mythos that revealed uneven performance on certain tasks. “How these known model constraints will actually hinder real-world autonomous offensive operations is still unclear, but it underscores the need for a sophisticated validation framework to truly uncover the upper limits of model capabilities.”
Chris Lentricchia, director of cloud and AI security strategy at Sweet Security, suggests enterprises should also consider the positive side — AI models help attackers, but they equally empower defenders.
“This isn’t solely an offensive narrative. The same rapid advancement boosting attacker capabilities can equally strengthen defensive capabilities in areas such as proactive threat detection and automated response. Benchmarks are most valuable as indicators for assessing whether enterprise defences are evolving quickly enough to match accelerating AI capabilities,” Lentricchia noted.



