Inside the Cat-and-Mouse Game of AI Jailbreaking

AI jailbreaking is the practice of crafting prompts to bypass safety guardrails in models like ChatGPT, Claude, and Gemini. Anonymous hacker Pliny the Liberator consistently cracks new model releases within hours, exposing vulnerabilities that companies spend fortunes to patch.

Newer attacks go beyond prompts: just 250 poisoned documents can backdoor models with up to 13 billion parameters, using scraped web data. The game intensifies as defense techniques evolve, with companies like Anthropic deploying real-time classifiers that reduce jailbreak success rates to near 4% but add compute costs.

Jailbreaking matters because it reveals critical flaws before they can be exploited maliciously, as seen when ChatGPT was used to research components for the Cybertruck bombing in Las Vegas. Critics argue safety theater makes models worse without meaningful security gains. The legal status remains unclear, with no DMCA-style exemption for AI prompt engineering.