In brief
- Anthropic has acknowledged that its hidden safety measures for LLM development were “the wrong tradeoff” and plans to swap them out for visible fallbacks to Claude Opus 4.8, beginning this week.
- Requests flagged on the API will now include an explanation for the refusal, instead of quietly returning a lower-quality response.
- Making these safeguards transparent also means they’ll be simpler to circumvent.
Anthropic spent roughly 48 hours as the AI industry’s most criticized company before backing down.
The company released Claude Fable 5 this week, only to face immediate backlash over a safety feature hidden within its 319-page system card: The model—the first in the company’s new Mythos class—would secretly lower the quality of its responses for users it suspected were building rival AI models. There was no warning, no fallback message—just quietly inferior output. By Thursday, Anthropic was issuing an apology.
“Invisible safeguards can be targeted more precisely, letting us ship quickly with very few false positives. We chose invisible safeguards for this reason—and that was the wrong tradeoff,” the company posted on X. “You deserve transparency into the safeguards we’ve put in place, and why.”
“We’re sorry for not striking the right balance.”
Beginning this week, flagged requests will visibly redirect to Claude Opus 4.8, a less powerful model, rather than silently delivering degraded Fable output. API users will get a clear explanation whenever a request is denied. Anthropic says server-side fallback notifications will be rolled out over the next few days.
What was actually going on
For those without a technical background, here’s what the controversy was really about. Claude Fable 5 already had visible safeguards for cybersecurity and biology research—if you asked something that triggered those filters, you’d receive a notification that your request was being rerouted to the older Opus 4.8 model. You’d know something had changed. You could tweak your prompt or switch to a different tool.
That said, some bio researchers pointed out that these safeguards were overly aggressive.
The LLM-development safeguard, on the other hand, operated differently. If Fable 5 detected you were working on tasks like pretraining AI systems, building distributed training infrastructure, or designing machine learning chips, the model would quietly modify its own behavior—through prompt adjustments, steering vectors, or parameter changes—to give you a worse answer without any notification. You’d still get a response. It just wouldn’t be from the Fable 5 you paid for.
Fable 5 is promoted as the public-facing version of Anthropic’s most advanced Mythos-class model, and researchers relying on it for legitimate machine learning work had no way of knowing their results were compromised. A failed experiment looks identical whether your hypothesis was incorrect or the model was secretly instructed to underperform. That’s the reproducibility issue that sent the AI research community into an uproar.
Compounding the problem, the classifier wasn’t very accurate. AI research firm SemiAnalysis was among the first to publicly call the company out after its GPU inference research got flagged.
The catch in the fix
Anthropic’s reversal comes with an honest admission of the tradeoff it’s accepting. Making safeguards visible makes them easier to get around, which means the classifier needs to cast a wider net to stay effective.
More false positives—legitimate machine learning work that gets caught and rerouted—are expected while the company fine-tunes its systems. Anthropic said it’s working to reduce false positives “as quickly as possible” but didn’t provide a specific timeline.
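To see why a wider net means more false positives, consider a toy sketch of a score-based classifier. Everything in it is an assumption made for illustration: the scores, the labels, the thresholds, and the `flag_counts` helper are invented, and nothing here reflects how Anthropic's classifier actually works.

```python
# Toy illustration only: the scores and labels below are invented,
# not data from Anthropic or any real classifier.

def flag_counts(scored_requests, threshold):
    """Return (restricted requests caught, legitimate requests wrongly flagged)."""
    caught = sum(
        1 for score, restricted in scored_requests
        if restricted and score >= threshold
    )
    false_positives = sum(
        1 for score, restricted in scored_requests
        if not restricted and score >= threshold
    )
    return caught, false_positives


# Each entry: (hypothetical classifier score, is the request actually restricted?)
REQUESTS = [
    (0.95, True), (0.80, True), (0.55, True),
    (0.70, False), (0.60, False), (0.30, False), (0.10, False),
]

strict = flag_counts(REQUESTS, threshold=0.75)  # narrow net: flag only high scores
wide = flag_counts(REQUESTS, threshold=0.50)    # wider net: flag lower scores too

print("narrow net (caught, false positives):", strict)
print("wide net (caught, false positives):", wide)
```

In this made-up sample, lowering the threshold catches one more restricted request but also flags two legitimate ones, which is the shape of the tradeoff Anthropic describes.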
The company is also applying the same improvements to its biology and cybersecurity classifiers, which had drawn their own complaints about flagging harmless research prompts.
Even so, the lingering concern is that Anthropic isn’t removing this category of restrictions; it’s only making them visible. For those who believe the restrictions themselves are misguided, Thursday’s apology is only a partial solution.
Fable 5 remains free on Pro, Max, Team, and Enterprise plans until June 22, after which it will shift to API usage credits only.