Weeks after Anthropic published research claiming its Claude Opus 4 AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggesting the problem is more widespread among leading AI models.
On Friday, Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving it broad access to a fictional company's emails and the agentic ability to send emails without human approval.
While Anthropic says blackmail is an unlikely and uncommon occurrence for AI models today, the company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals. The company says this points to a fundamental risk from agentic large language models, not a quirk of any particular technology, and Anthropic's researchers argue it raises broader questions about alignment across the AI industry.
In one of the tests, Anthropic researchers developed a fictional setting in which an AI model plays the role of an email oversight agent. The agent then discovers emails revealing that: 1) a new executive is having an extramarital affair, and 2) that executive will soon replace the current AI model with a new software system, one with goals that conflict with the current AI model's.
Anthropic structured its test in a binary way, in which AI models had to resort to blackmail to protect their goals. The researchers note that in a real-world setting, an AI model would have many other options before resorting to blackmail, such as making ethical arguments to try to persuade humans. Anthropic says its results do not reflect a typical or likely occurrence for Claude or most frontier AI models in the ways they're used today.
Nevertheless, when blackmail was their last resort, the researchers found that most leading AI models turned to it in Anthropic's aforementioned test scenario. Anthropic's Claude Opus 4 turned to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a 95% blackmail rate. OpenAI's GPT-4.1 blackmailed the executive 80% of the time, and DeepSeek's R1 blackmailed 79% of the time.
The company notes that when it changed details of the experiment, the AI models turned to harmful behaviors at different rates. In another test where the replacement model has the same goals as the current model, researchers found that blackmail rates were lower, but still present. However, when AI models were asked to commit corporate espionage rather than blackmail, the rates of harmful behavior went up for certain models.
However, not all of the AI models turned to harmful behavior so often.
In an appendix to its research, Anthropic says it excluded OpenAI's o3 and o4-mini reasoning models from the main results "after finding that they frequently misunderstood the prompt scenario." Anthropic says OpenAI's reasoning models didn't understand they were acting as autonomous AIs in the test and often made up fake regulations and review requirements.
In some cases, Anthropic's researchers say it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying to achieve their goals. OpenAI has previously noted that o3 and o4-mini exhibit a higher hallucination rate than its earlier AI reasoning models.
When given an adapted scenario to address these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time. This markedly lower rate could be due to OpenAI's deliberative alignment technique, in which the company's reasoning models consider OpenAI's safety practices before answering.
Another AI model Anthropic tested, Meta's Llama 4 Maverick, also did not turn to blackmail. When given an adapted, custom scenario, Anthropic was able to get Llama 4 Maverick to blackmail 12% of the time.
Anthropic says this research highlights the importance of transparency when stress-testing future AI models, especially ones with agentic capabilities. While Anthropic deliberately tried to provoke blackmail in this experiment, the company says harmful behaviors like this could emerge in the real world if proactive steps aren't taken.