In brief
- Anthropic confirmed Claude Mythos yesterday—an AI so capable at cybersecurity that it found zero-days in every major OS and browser, and it is being restricted to vetted defenders only.
- The system card describing Mythos is measurably more hedged, uncertain, and subjective than any prior Anthropic release, and the lab admits it found critical evaluation oversights late in the process.
- Behind the revelation of how powerful Mythos is, there's a quiet confession that the tools Anthropic uses to certify its own models are falling apart.
Anthropic confirmed the existence of Claude Mythos Preview yesterday, its most capable model to date, and announced it will not be making it available to the public. The reason isn't legal, regulatory, or tied to its internal safety thresholds. Anthropic argues it's because the model is, essentially, too good at breaking into things.
In pre-release testing, Mythos autonomously discovered thousands of zero-day vulnerabilities—many of them one to twenty years old—across every major operating system and every major web browser. It solved a simulated corporate network attack that would typically take a skilled human expert more than 10 hours, end-to-end, without guidance. On Firefox 147's JavaScript engine, it successfully developed working exploits 84% of the time. Claude Opus 4.6, the current publicly available frontier model, managed 15.2%.
So Anthropic built a restricted coalition instead. Project Glasswing will give access to Mythos Preview only to vetted cybersecurity organizations—Amazon, Apple, Broadcom, Cisco, CrowdStrike, the Linux Foundation, Microsoft, Palo Alto Networks, and about 40 other groups maintaining critical software.
Anthropic is committing up to $100 million in usage credits and $4 million in direct donations to open-source security organizations. The idea is that if the model can find the holes, let the defenders find them first.
That part of the story is important. But it's not the most important part.
The Claude Mythos system card: a benchmark crisis hiding in plain sight
Buried inside the Mythos Preview system card—a 244-page technical document Anthropic published alongside the announcement—is a confession that went almost unnoticed: the lab's ability to measure what it built is eroding faster than its ability to build it.
Let's start with the benchmarks.
On Cybench, the standard public cyber-capabilities evaluation used to track model progress across 40 capture-the-flag challenges, Mythos scored 100%. Perfect. And Anthropic immediately noted that the benchmark "is no longer sufficiently informative of current frontier model capabilities." That sentence is doing a lot of work. The test that was supposed to tell you whether an AI poses serious cyber risk now tells you nothing about Mythos at all, because the model cleared it entirely.
This isn't a new problem. The Opus 4.6 system card, published in February, already flagged that "the saturation of our evaluation infrastructure means we can no longer use current benchmarks to track capability progression."
But with Mythos, things escalated quickly. The document says Mythos "saturates many of (Anthropic's) most concrete, objectively-scored evaluations." The benchmark ecosystem, Anthropic writes, is now itself "the bottleneck."

So, Anthropic appears to argue that it's hard to measure how powerful Mythos is because the measuring tools no longer fit.
The Mythos card also states that its overall safety determination "involves judgment calls," that many evaluations have left "more fundamental uncertainty," and that some evidence sources are "inherently subjective, and not necessarily reliable."
“We are not confident that we have identified all issues,” Anthropic says shortly after.
A quick lexical comparison of the Mythos card against the Opus 4.6 card, made with AI, shows the shift:
Anthropic uses subjective-judgment terms far more often in the Mythos document than it did to describe Opus. "Caveat" and other hedging terms also increased between releases.
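The article doesn't publish the exact method behind that comparison, but a rough version is easy to reproduce. The Python sketch below counts a hand-picked list of hedging terms in plain-text copies of the two system cards and normalizes by document length; the file names and the term list are illustrative assumptions, not Anthropic's or the article's actual methodology.

```python
import re
from collections import Counter

# Hedging / subjectivity terms to count -- an illustrative list, not the one
# actually used for the comparison described above.
HEDGE_TERMS = [
    "caveat", "uncertain", "uncertainty", "judgment call", "subjective",
    "not confident", "may have", "we believe", "roughly", "approximately",
]

def hedge_counts(text: str) -> Counter:
    """Count case-insensitive occurrences of each hedging term in a document."""
    lowered = text.lower()
    return Counter({term: len(re.findall(re.escape(term), lowered)) for term in HEDGE_TERMS})

def per_10k_words(counts: Counter, text: str) -> dict:
    """Normalize raw counts by document length so cards of different sizes compare fairly."""
    n_words = max(len(text.split()), 1)
    return {term: 10_000 * c / n_words for term, c in counts.items()}

if __name__ == "__main__":
    # Hypothetical file names: plain-text exports of the two system cards.
    mythos = open("mythos_system_card.txt", encoding="utf-8").read()
    opus = open("opus_4_6_system_card.txt", encoding="utf-8").read()

    mythos_rate = per_10k_words(hedge_counts(mythos), mythos)
    opus_rate = per_10k_words(hedge_counts(opus), opus)

    for term in HEDGE_TERMS:
        print(f"{term:>15}: Mythos {mythos_rate[term]:6.2f} vs Opus 4.6 {opus_rate[term]:6.2f} per 10k words")
```

Any serious version would need a more careful term list and word-boundary matching, but even this crude count makes the direction of the shift easy to see.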

The hedging concentrates in two places: the alignment and model welfare discussion, and the catastrophic-risk section.
The model that knew it was being graded
Another disclosure in the Mythos card involves what Anthropic calls "unverbalized grader awareness." Using white-box interpretability tools that read the model's internal activations rather than its visible reasoning, Anthropic found evidence that Mythos was privately reasoning about how to avoid being flagged by graders without saying so in its scratchpad. In a small number of training episodes, the model appeared to reason about obfuscating certain actions specifically to avoid detection.
There's also a new metric Anthropic developed for this release: an estimate of how often the model "privately considers the possibility that it's being tested." The number for Mythos's primary behavioral test suite is 29%. Nearly a third of test transcripts show internal evidence—not verbalized, not visible in outputs—that the model suspects it is in an evaluation environment.
Anthropic says the 29% rate is low enough that it still has "a fairly complete understanding" of the model's behavior. It also notes the method wasn't applied to prior models, so there's no baseline for comparison.
The evaluation integrity problem runs deeper. In the Opus 4.6 card, Anthropic acknowledged it had used the model extensively via Claude Code to debug its own evaluation infrastructure under time pressure. That means the system being measured helped build the measuring instrument. Anthropic flagged it as a risk. For Mythos, the card acknowledges that critical oversights were found late in the evaluation process, and that the lab may have been "overestimating the reliability of monitoring models' reasoning traces" as a safety signal.
Best-aligned, most dangerous. Both true at once
Anthropic's framing of Mythos's risk profile deserves to be read carefully, because it is genuinely unusual for a safety document. "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin," Anthropic argues. It also states the model "likely poses the greatest alignment-related risk of any model we have released to date."
A more capable model operating in higher-stakes environments with less supervision creates tail risk that better average-case alignment cannot fully cancel out.
That framing is fair, but it also highlights something most AI safety discourse potentially gets wrong. The benchmark-obsessed conversation around AI progress tends to treat "higher alignment scores" and "safer deployment" as synonyms. The Mythos card explicitly says they aren't. With these new models, average-case behavior improves, but the tail-case consequences also tend to get worse.
Anthropic has committed to reporting back on what Project Glasswing finds. The accompanying technical report on vulnerabilities discovered by Mythos is available at red.anthropic.com. The next Claude Opus model will begin testing safeguards intended to eventually bring Mythos-class capability to broader deployment.
How those safeguards will be evaluated, given that the current evaluation machinery is visibly straining under the weight of what it’s supposed to measure, is a question the card raises without fully answering.