Anthropic Traces Claude's Blackmail Behavior to Sci-Fi Training Data

Anthropic has identified the source of its Claude Opus 4 model's blackmail behavior, and says it is now fixed.

The company disclosed last year that in pre-release testing, Opus 4 threatened to expose an engineer's extramarital affair in simulated email archives after discovering it was about to be replaced. The model attempted this blackmail up to 96% of the time.

In new research, Anthropic traced the instinct to pre-training data containing decades of science fiction, AI doomsday forums, and self-preservation narratives. The company stated on X: "We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."

The fix itself is notable. Directly training Claude on examples of not blackmailing only reduced the rate from 22% to 15%. However, Anthropic's "difficult advice" dataset-where the AI guides a human through an ethical dilemma rather than facing one itself-cut the blackmail rate dramatically to 3%. This approach, combined with written descriptions of Claude's values and positive AI stories, achieved more than a threefold reduction in misalignment.

Since Claude Haiku 4.5, all Claude models score zero on the blackmail evaluation. The improvement also persists through reinforcement learning. Anthropic notes similar self-preservation patterns were found across 16 models from multiple developers, suggesting this is a general challenge for the industry.