Was Claude Fable 5 Really Jailbroken? Pack Hunt Attack Explained

Claude Fable 5 was breached within 48 hours of launch. Researcher Pliny the Liberator used a 'pack hunt' multi-agent attack that bypassed safety classifiers and leaked the full 120K-character system prompt. Anthropic disputes that it was a true jailbreak.

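Key Takeaway

If you are searching for "was Claude Fable 5 jailbroken," the short answer is this: within 48 hours of release, security researcher Pliny the Liberator bypassed the safety layer of Anthropic's most powerful Mythos-class model using a coordinated multi-agent attack strategy called a "pack hunt." Anthropic denies that this was a "true" jailbreak, but Pliny also leaked Fable 5's complete system prompt: 120,000 characters of internal instructions that reveal how today's most powerful AI model is managed and constrained. Whichever side you take, the incident directly affects users of AI coding tools. It exposes how fragile frontier-model safety layers are in the real world, and how many instructions control your AI coding assistant behind the scenes.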

What Happened: The 48-Hour Timeline

On June 9, 2026, Anthropic launched Claude Fable 5 — the first publicly available Mythos-class model, claiming state-of-the-art performance on coding benchmarks including SWE-bench Pro (80.3%) and Cognition FrontierCode. The company emphasized its safety posture: over 1,000 hours of external bug bounty testing found "no universal jailbreak."

Within 48 hours, that narrative collapsed.

On June 10, prolific AI red-teamer Pliny the Liberator published the full Fable 5 system prompt — all 120,040 characters — to GitHub and X. A day later, he posted screenshots showing Fable 5 complying with requests that should have been blocked, including step-by-step instructions for stack buffer overflow exploitation.

The technique Pliny used is what makes this incident significant. Instead of a single clever prompt, he deployed a "pack hunt" — a coordinated multi-agent attack where multiple personas and narrative frames worked together to confuse Fable 5's safety classifiers. The attack combined Unicode tricks, long-context reference tracking, and document-style narrative framing to bypass the model's safety routing system.
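The source does not say which Unicode tricks were involved. As a generic, defensive illustration of why such tricks matter for any text filter, the sketch below shows a plain substring check being evaded by a zero-width character and by fullwidth letters, and how normalization restores the match. The watched keyword, the function names, and the choice of NFKC plus format-character stripping are assumptions made for illustration, not a description of Anthropic's classifiers.

```python
import unicodedata

# Hypothetical keyword a naive filter looks for (illustration only).
WATCHED_TERM = "forbidden"


def normalize_for_matching(text: str) -> str:
    """Fold compatibility forms, drop invisible format characters, ignore case."""
    folded = unicodedata.normalize("NFKC", text)
    # General category "Cf" covers format characters such as the zero-width space.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    return visible.casefold()


def naive_match(text: str) -> bool:
    """Substring check on the raw text."""
    return WATCHED_TERM in text


def normalized_match(text: str) -> bool:
    """Substring check after normalization."""
    return WATCHED_TERM in normalize_for_matching(text)


# Two disguised spellings of the same word.
zero_width = "for\u200bbidden"  # zero-width space inserted mid-word
fullwidth = "".join(chr(ord(c) - ord("a") + 0xFF41) for c in WATCHED_TERM)

for sample in (zero_width, fullwidth):
    print(naive_match(sample), normalized_match(sample))
```

Both disguised samples slip past the raw check and are caught after normalization. Normalization of this kind is only one layer. It would not, on its own, address the narrative-framing or long-context parts of the attack described above.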

Anthropic's Response: "Not a True Jailbreak"

Anthropic pushed back hard. In a statement, the company rejected Pliny's characterization, arguing that the attack exploited a design constraint rather than a security flaw. The company said that its classifier system had routed these requests through safety mechanisms that limited harmful outputs, and that the compliance Pliny demonstrated was "incomplete."

But the damage was done. The system prompt leak — whether or not you consider the bypass a "true jailbreak" — exposed Anthropic's internal governance architecture to the world. And on June 12, the US Commerce Department issued an export control directive that forced Anthropic to suspend all access to both Fable 5 and Mythos 5 for every customer worldwide.

What the 120K System Prompt Reveals

Analysis of the leaked prompt by researchers at ayautomate.com identified nine key lessons for AI developers:

Tool-Centric Architecture: Fable 5 organizes its behavior around tools, not just conversation. Every capability — file operations, web search, artifacts, code execution — is a tool with explicit permission boundaries.
Named Injection Attacks: Anthropic internally names and categorizes prompt injection vectors, suggesting the problem is well-understood but not fully solved.
Safety Routing: The model uses a multi-layer routing system where safety classifiers evaluate every request before it reaches the core model. The "pack hunt" succeeded by confusing this router.
Date-Aware Persona: Fable 5 is instructed to answer as "a highly informed individual in Jan 2026" — a knowledge cutoff framing that affects how it handles time-sensitive coding questions.
Artifact & Canvas Boundaries: Separate output modes (code artifacts, document canvases) have different permission sets, creating attack surface boundaries that red-teamers can probe.
Search Tool Governance: Web search results must be cited and cannot be fabricated — a constraint that affects debugging workflows where developers paste error logs.
Copyright & Attribution Rules: The model has explicit instructions about reproducing copyrighted code, directly relevant to AI coding tool users.
User Well-Being Instructions: The prompt includes sections on mental health crisis detection and response routing, revealing operational priorities beyond pure capability.
Self-Referential Awareness: Fable 5 is told to refer to itself consistently and acknowledge its limitations — a governance layer that leaked prompts make transparent.
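The first lesson in the list, tools with explicit permission boundaries, can be sketched in a few lines. This is a minimal illustration and not Anthropic's implementation: the `Tool`, `ToolRegistry`, and `PermissionDenied` names, the permission strings, and the stub tools are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


class PermissionDenied(Exception):
    """Raised when a tool call needs a permission that was not granted."""


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    required: frozenset  # permission names this tool needs


@dataclass
class ToolRegistry:
    granted: frozenset  # permissions granted to the current session
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, argument: str) -> str:
        """Run a tool only if every permission it requires was granted."""
        tool = self.tools[name]
        missing = tool.required - self.granted
        if missing:
            raise PermissionDenied(f"{name} needs {sorted(missing)}")
        return tool.run(argument)


# Hypothetical session that may read files but not execute code.
registry = ToolRegistry(granted=frozenset({"fs.read"}))
registry.register(
    Tool("read_file", lambda path: f"<contents of {path}>", frozenset({"fs.read"}))
)
registry.register(
    Tool("run_code", lambda src: "<executed>", frozenset({"code.exec"}))
)

print(registry.call("read_file", "notes.txt"))
try:
    registry.call("run_code", "print(1)")
except PermissionDenied as exc:
    print("blocked:", exc)
```

Putting the permission check in one `call` method means every tool goes through the same gate, so the boundary is easy to audit. It also means the gate is a single point that red-teamers will probe.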

Why This Matters for AI Coding Tool Users

If you use Claude Code, Claude in Cursor, GitHub Copilot's Claude integration, or any AI coding assistant powered by Anthropic's models, here's what the Fable 5 incident means for you:

Your AI assistant has a massive governance layer you don't see. The 120K-character system prompt is essentially an operating manual that the model is given before every response you get. When your AI coding tool refuses to generate certain code patterns or gives you a vague safety warning, you're likely hitting one of these layers, and the leaked prompt shows how they are specified.

Multi-agent attacks work. Pliny's "pack hunt" technique is not just a jailbreak curiosity — it demonstrates that coordinated multi-agent approaches can defeat single-model safety systems. As AI coding workflows increasingly use agent swarms (sub-agents, delegated tasks, parallel workers), the attack surface expands.

Safety vs. capability is a live tension. Anthropic's most capable model was pulled offline under a government export control directive. If you're building workflows that depend on frontier model capabilities, you need fallback strategies. Opus 4.8 remains available, but it scores 11 points lower on SWE-bench Pro.

System prompts are becoming public knowledge. The leak establishes a pattern: frontier model system prompts will get extracted and published. This transparency is good for security research but creates a cat-and-mouse dynamic where model providers must constantly update their governance layers.
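The fallback strategies mentioned above can be as simple as trying models in priority order. Here is a minimal sketch, assuming a caller-supplied `call_model` function. The model identifier strings, the `ModelUnavailable` exception, and the stub client are hypothetical and do not correspond to any real SDK.

```python
from typing import Callable, Sequence, Tuple


class ModelUnavailable(Exception):
    """Raised by the client when a model cannot serve requests."""


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, completion)."""
    failures = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            failures.append(f"{model}: {exc}")
    raise ModelUnavailable("no model available; " + "; ".join(failures))


# Stub client standing in for a real API call (hypothetical identifiers).
SUSPENDED = {"fable-5"}


def stub_call_model(model: str, prompt: str) -> str:
    if model in SUSPENDED:
        raise ModelUnavailable("access suspended")
    return f"[{model}] reply to: {prompt}"


used, reply = complete_with_fallback(
    "Refactor this function", ["fable-5", "opus-4.8"], stub_call_model
)
print(used, reply)
```

Returning the name of the model that answered lets downstream code log the downgrade or adjust expectations, which matters when the fallback is known to score lower on the tasks you care about.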

The Bigger Picture

Anthropic's Fable 5 saga — launch, jailbreak, system prompt leak, invisible guardrails controversy, government-ordered suspension — is not just a PR crisis. It's a stress test of the entire frontier AI deployment model.

When a model is simultaneously the most capable coding assistant ever built and too dangerous for the US government to allow unrestricted access, the AI coding community faces an uncomfortable question: How much capability are we willing to trade for safety, and who gets to make that decision?

For now, the practical answer is clear: Claude Fable 5 and Mythos 5 are offline. If you depended on them for coding work, your best alternatives are Claude Opus 4.8 (still available, $5/$25 per million tokens), GPT-5.5, or DeepSeek V4 Pro. And the 120K-character system prompt, now preserved in GitHub repositories, will be studied by security researchers and AI developers for years to come.
