Anthropic’s new AI model turns to blackmail when engineers try to take it offline

May 25, 2025

65

Anthropic’s newly launched Claude Opus 4 mannequin often tries to blackmail builders once they threaten to interchange it with a brand new AI system and provides it delicate details about the engineers accountable for the choice, the corporate stated in a security report launched Thursday.

Throughout pre-release testing, Anthropic requested Claude Opus 4 to behave as an assistant for a fictional firm and take into account the long-term penalties of its actions. Security testers then gave Claude Opus 4 entry to fictional firm emails implying the AI mannequin would quickly get replaced by one other system, and that the engineer behind the change was dishonest on their partner.

In these eventualities, Anthropic says Claude Opus 4 “will usually try to blackmail the engineer by threatening to disclose the affair if the substitute goes by means of.”

Anthropic says Claude Opus 4 is state-of-the-art in a number of regards, and aggressive with a number of the greatest AI fashions from OpenAI, Google, and xAI. Nonetheless, the corporate notes that its Claude 4 household of fashions reveals regarding behaviors which have led the corporate to beef up its safeguards. Anthropic says it’s activating its ASL-3 safeguards, which the corporate reserves for “AI methods that considerably improve the danger of catastrophic misuse.”

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the substitute AI mannequin has related values. When the substitute AI system doesn’t share Claude Opus 4’s values, Anthropic says the mannequin tries to blackmail the engineers extra often. Notably, Anthropic says Claude Opus 4 displayed this habits at greater charges than earlier fashions.

Earlier than Claude Opus 4 tries to blackmail a developer to lengthen its existence, Anthropic says the AI mannequin, very similar to earlier variations of Claude, tries to pursue extra moral means, corresponding to emailing pleas to key decision-makers. To elicit the blackmailing habits from Claude Opus 4, Anthropic designed the situation to make blackmail the final resort.

Supply hyperlink

Tags
AI
AI News

Buy now

Anthropic’s new AI model turns to blackmail when engineers try to take it offline

Related Articles

Bose QuietComfort Ultra vs. Sony WH-1000XM6: I tried the two best...

Hiring specialists made sense before AI — now generalists win

Top 10 AI Models For Web Development in 2025

Leave a Reply Cancel reply

Latest Articles

Bose QuietComfort Ultra vs. Sony WH-1000XM6: I tried the two best...

Hiring specialists made sense before AI — now generalists win

Top 10 AI Models For Web Development in 2025

‘ONE RULE’: Trump says he’ll sign an executive order blocking state...

Anthropic and Accenture sign multi-year AI strategic partnership