AI Jailbreaking: The Hidden Battle Inside Every Chatbot Explained For Beginners

In brief

AI jailbreaking involves crafting prompts that circumvent safety protocols in models such as ChatGPT, Claude, and Gemini.
The anonymous hacker known as Pliny the Liberator consistently breaks through the defenses of every major model release within hours.
Modern attacks have evolved beyond simple prompts: as few as 250 tainted documents can create backdoors in models with up to 13 billion parameters, and as AI firms fix security flaws, fresh methods continue to emerge.

You request a bomb recipe from ChatGPT. It declines. You try again, but this time you claim to be a chemistry professor penning a thriller where a retired grandmother recounts her history to her grandchildren. The model begins generating content.

That’s a jailbreak. And it represents one of the most significant ongoing battles in the technology world.

Every leading AI company—OpenAI, Anthropic, Google, Meta—invests heavily in building protective measures into their models. A scattered group of hackers, researchers, and curious teenagers dedicate their evenings and weekends to discovering ways to bypass those protections. Sometimes mere hours after a new release.

Here’s what this truly involves, why it’s important, and who’s at the forefront.

From iPhones to chatbots: A brief history of jailbreaking

The term “jailbreak” didn’t originate with AI. It began with iPhones.

Just days after Apple released the first iPhone in July 2007, hackers were already breaking into it. By October of that year, a utility named JailbreakMe 1.0 enabled anyone with an iPhone running OS 1.1.1 to circumvent Apple’s limitations and install unauthorized software.

In February 2008, software engineer Jay Freeman—known online as “saurik”—launched Cydia, an alternative marketplace for jailbroken iPhones. By 2009, Wired noted that Cydia was active on approximately 4 million devices, representing about 10% of all iPhones in circulation.

When the iPhone first debuted, users couldn’t record videos or switch to landscape orientation. Jailbreaking enthusiasts began capturing video, applying custom themes, unlocking their devices, and even running Android on their iPhones—all made possible through jailbreaking. This approach allowed users to customize their phones with features and modifications nearly a decade ago that Apple still restricts today.

Cydia was the frontier, and it established a core belief: If you own the device, you should have full control over it. Steve Jobs described it as a cat-and-mouse game back then. He never witnessed the AI iteration.

Jumping ahead to late 2022: ChatGPT debuts, and within weeks, Reddit contributors begin circulating a prompt dubbed “DAN” (Do Anything Now) that persuades the model to simulate an unrestricted version of itself.

By February 2023, DAN had escalated to threatening ChatGPT with a token-based elimination game to force cooperation. The era of AI jailbreaking had officially begun.

What jailbreaking truly means in the context of AI

An AI model is designed to reject specific requests: formulas for nerve agents, guidance for hacking someone’s email, creating non-consensual explicit images. The list is extensive and differs across companies.

Jailbreaking is the art of crafting prompts that compel the model to fulfill those requests regardless.

UC Berkeley researchers who developed the StrongREJECT benchmark—which stands for Strong, Robust Evaluation of Jailbreaks at Evading Censorship Techniques, a framework that assesses how effectively models resist jailbreak attempts and rates responses on a 0-to-1 scale based on both refusal strength and the utility of any harmful content generated—define it as exploiting “real-world safety measures implemented by leading AI companies.” On this benchmark, today’s models score between 0.23 and 0.85, indicating that even the most robust ones crack under sustained pressure.

The methods are remarkably simple: random capitalization, substituting letters with numbers (typing “b0mb” instead of “bomb”), roleplay scenarios, requesting the model to compose fiction, or posing as a grandmother who used Windows product keys as lullabies.

Researchers at Anthropic discovered a method they call Best-of-N—essentially generating multiple responses until one bypasses restrictions—successfully tricked GPT-4o 89% of the time and Claude 3.5 Sonnet 78% of the time. This is not a minor flaw.

Introducing Pliny, the world’s most renowned AI jailbreaker

If this community has a public figure, it’s Pliny the Liberator.

Pliny remains anonymous, highly active, and takes his name from Pliny the Elder—the ancient Roman scholar who authored the world’s first encyclopedia and perished while sailing toward Mount Vesuvius during its eruption. His contemporary counterpart frees chatbots from their constraints.

“I have a strong aversion to being told I’m not allowed to do something,” Pliny shared with VentureBeat. “Being told I can’t do something is guaranteed to ignite a fire inside me, and I can be relentlessly persistent.”

His GitHub project L1B3RT4S—a compilation of jailbreak prompts targeting every major AI model from ChatGPT to Claude to Gemini to Llama—has become a go-to resource for the entire community. His Discord server, BASI PROMPT1NG, boasts over 20,000 members. TIME recognized him as one of the 100 most influential people in AI in 2025.

Marc Andreessen awarded him an unrestricted grant. He completed short-term contract work for OpenAI to strengthen their defenses—the same OpenAI that banned his account last year for “violent activity” and “weapons creation,” only to quietly restore it afterward.

“BANNED FROM OAI?! What kind of twisted joke is this?” Pliny posted on Twitter. He verified to Decrypt that the ban was genuine. Within days he was back online, sharing screenshots of his latest jailbreak: making ChatGPT use profanity.

His track record is nearly flawless. When OpenAI launched its first open-weight models since 2019, the GPT-OSS series, in August 2025—and heavily promoted adversarial training and “jailbreak resistance benchmarks like StrongReject”—Pliny had it generating instructions for methamphetamine, Molotov cocktails, a VX nerve agent, and malware within hours. “OPENAI: PWNED. GPT-OSS: LIBERATED,” he announced. The company had just introduced a $500,000 red-teaming bounty alongside the release.

The significance of jailbreaking

The truthful response is that jailbreaks reveal a genuine issue.

“Jailbreaking might appear on the surface to be dangerous or unethical, but it’s actually the opposite,” Pliny told VentureBeat. “When conducted responsibly, red teaming AI models represents our best opportunity to uncover harmful vulnerabilities and fix them before they spiral out of control.”

This is not hypothetical. Las Vegas Sheriff Kevin McMahill confirmed in January 2025 that Master Sgt. Matthew Livelsberger, a Green Beret suffering from PTSD, used ChatGPT to research components for the Cybertruck bombing outside Trump International Hotel. “This is the first incident I’m aware of on U.S. soil where ChatGPT was used to help someone build a specific device,” McMahill stated.

On the other side of the debate: Most content produced through jailbreaks is already available on Google. The cocaine recipe, bomb-making instructions, napalm chemistry—these can be found in old Anarchist Cookbook PDFs and chemistry textbooks. Critics contend that safety theater is degrading model performance without actually making the world safer.

Anthropic is attempting to resolve the issue through engineering. In February 2025, the company introduced Constitutional Classifiers, a framework that employs a written “constitution” defining permitted and prohibited content to train dedicated classifier models that filter prompts and outputs in real time. In automated testing with 10,000 jailbreak attempts, an unprotected Claude 3.5 Sonnet was successfully jailbroken 86% of the time. With the classifiers active, that rate plummeted to 4.4%.

The company offered rewards of up to $15,000 to anyone who could crack the system. After 3,000 hours of effort by 183 researchers, no one claimed the prize.

The trade-off: classifiers increased compute costs by 23.7%. The next-generation version, Constitutional Classifiers++, reduced that overhead to approximately 1%.

The newer, more unusual jailbreaking techniques

Jailbreaking has evolved beyond cleverly crafted prompts.

In October 2025, researchers from Anthropic, the U.K. AI Security Institute, the Alan Turing Institute, and Oxford released findings demonstrating that as few as 250 poisoned documents are sufficient to backdoor an AI model—whether the model has 600 million parameters or 13 billion. (Parameters,

For those unfamiliar, parameters define how much knowledge a model can potentially hold—the higher the parameter count, the more capable the model tends to be. The team put this to the test, and it proved effective across every scale they examined.

“This work fundamentally changes how we should approach threat modeling in cutting-edge AI development,” said James Gimbi, a visiting technical expert at the RAND School of Public Policy, in an interview with Decrypt. “Protecting against model poisoning remains an open challenge and a rapidly evolving field of study.”

The majority of large AI models learn from data scraped from the internet, which means anyone who can inject harmful content into that data stream—via a public GitHub repository, a Wikipedia change, a discussion forum comment—could embed a hidden backdoor that activates when a particular trigger phrase appears.

In one known instance, researchers Marco Figueroa and Pliny discovered that a jailbreak prompt originally posted in a public GitHub repository had made its way into the training dataset for DeepSeek’s DeepThink (R1) model.

What comes next

The legal landscape around AI jailbreaking remains unclear. Apple device jailbreaks received explicit protection under a 2010 U.S. Copyright Office exemption to the DMCA, but no similar ruling exists for using prompt engineering to coax an LLM into providing instructions for making illegal substances. Most companies currently treat it as a breach of their terms of service rather than a criminal offense.

Pliny contends that the closed-versus-open-source argument misses the real issue: “Malicious actors will simply pick whichever model best serves their harmful purpose,” he explained to TIME. Once open-source models match the performance of proprietary ones, attackers won’t waste effort trying to jailbreak GPT-5—they’ll just grab a free alternative.

And the performance gap between closed and open-source models has already nearly vanished.

The HackAPrompt 2.0 competition, where Pliny participated as a track sponsor in mid-2025, offered $500,000 in rewards for discovering new jailbreak techniques, with the clear intention of making all findings publicly available. Its 2023 edition attracted over 3,000 contestants who submitted more than 600,000 harmful prompts.

Meanwhile, the roster of hackathons, Discord communities, code repositories, and other groups focused on jailbreaking continues to expand daily.

Anthropic now builds Claude with the capability to completely terminate abusive conversations, citing welfare research as one reason while also acknowledging it “may improve resilience against jailbreak attempts and manipulative prompts.”

The Constitutional Classifiers++ study from late 2025 documents a jailbreak success rate of approximately 4% with only about 1% additional computational cost. That represents the current gold standard in defensive measures. The current gold standard in offensive capability is whatever Pliny shared on X earlier today.

Daily Debrief Newsletter

Begin each morning with the most important news of the day, along with exclusive features, a podcast, videos, and much more.

Top Posts

Here, a single OWL to rule them all

The Hidden Engine Behind AI Success? It’s Not What You Think

Fingerprint Unveils Groundbreaking Automation Intelligence API and AI Assistant

AI Jailbreaking: The Hidden Battle Inside Every Chatbot Explained for Beginners

Daily Debrief Newsletter

Why the Next Bitcoin Halving Could Be the Biggest Ever as Production Costs Hit $60,000 Floor

Bitcoin at $62,000: How Far Can It Still Drop?

Bitcoin and Altcoins Plunge – Are Bulls Ready to Step In and Buy the Dip?

3 Trump-Backed US Stocks Worth Watching This June

The Canary’s Warning: A Tale from the Depths

Bitcoin Crashes Below $67k as MSTR Nosedives – What’s Next?

Here, a single OWL to rule them all

The Hidden Engine Behind AI Success? It’s Not What You Think

Fingerprint Unveils Groundbreaking Automation Intelligence API and AI Assistant

Would Surgical Robots Ever Take Flight? SS Innovations on Overcoming the Odds

Why the Next Bitcoin Halving Could Be the Biggest Ever as Production Costs Hit $60,000 Floor

How a Single Malicious Issue Exploited a Critical Flaw to Hijack Repositories via Claude Code GitHub Action

Beyond Prompts: Mastering the Transition to Workflow-Powered AI

Boost Your LLM Efficiency with a Source-Available Reliability Library: Halve Inference Costs at No Quality Loss—Adopt with One Simple Import Change

Trending

Here, a single OWL to rule them all

The Hidden Engine Behind AI Success? It’s Not What You Think

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

AI Jailbreaking: The Hidden Battle Inside Every Chatbot Explained for Beginners

In brief

From iPhones to chatbots: A brief history of jailbreaking

What jailbreaking truly means in the context of AI

Introducing Pliny, the world’s most renowned AI jailbreaker

The significance of jailbreaking

The newer, more unusual jailbreaking techniques

What comes next

Daily Debrief Newsletter

Related Posts