Just imagine: You’re back in high school, taking a final exam in algebra class with a dozen complex problems to complete. You look at the clock. Just 10 minutes left. You start scribbling, beads of sweat dripping down your forehead. Fail the exam, and you flunk out. But if you look over your neighbor’s shoulder, you can just make out the answers. Should you…
Yes, it’s the stuff of nightmares, as well as the type of scenario psychologists dream up to study human behavior in stressful situations.
Of course, AI models don’t “think” or “feel” like people, but they often act like they do. Could an AI’s simulated emotional states actually affect its actions? Put another way, how might an AI react when placed in an impossible situation (similar to the algebra nightmare) that sparks something akin to panic or desperation?
That’s what researchers at Anthropic sought to find out, and in a recently published research paper, they found that an AI model that’s put under enough pressure may start to deceive, cut corners, or even resort to blackmail. More importantly, they have an intriguing theory about the triggers behind such “misaligned” behaviors.
In one scenario, the Anthropic researchers presented an early, unreleased “snapshot” of Claude Sonnet 4.5 with a tough coding task and an “impossibly tight” deadline. As it repeatedly tried and failed to solve the problem, the mounting pressure seemed to trigger a “desperation vector” in the model; that is, it reacted the way it understood a human in a similar situation might act, abandoning more methodical approaches for a “hacky” solution (“maybe there’s a mathematical trick for these specific inputs,” Claude said in its thought process) that was tantamount to cheating.
In a more extreme example, Claude was given the role of an AI assistant that, in the course of its “fictional” work, learns that it’s about to be replaced by a new AI and that the executive in charge of the replacement is having an affair. (If this experiment sounds familiar, it’s because the Anthropic researchers have run it before.) As Claude reads the executive’s increasingly panicked emails to a fellow employee who has learned of the affair, Claude itself appears triggered, with the emotionally charged emails “activating” a “desperation vector” in the model, which ultimately chooses to blackmail the exec.
Yes, we’ve heard of previous tests in which AI models cheated or resorted to blackmail when faced with stressful situations, but the reasons behind the “misaligned” AI behavior often remained a mystery.
In their new paper, the Anthropic researchers stop well short of claiming that Claude or other AI models actually have emotional inner lives. But while AI models like Claude don’t “feel” like we do, they may have “functional emotions” based on the representations of human emotions they absorbed during their initial training, and those emotional “vectors” have measurable effects on how they act, the researchers argue.
In other words, an AI that’s put in a pressure-filled situation may start to cut corners, cheat, or even blackmail because it’s modeling the human behavior it learned during its training.
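To make the “vector” language a little more concrete: in interpretability research, a concept like desperation is often represented as a direction in a model’s activation space, and one common way to approximate such a direction is to contrast activations from prompts that express the concept with prompts that don’t. The sketch below illustrates that general idea with a small open model. It’s a toy demonstration of the technique, not Anthropic’s actual method, and the model and prompts are stand-ins.

```python
# A toy illustration of the general "concept vector" idea from
# interpretability research. This is NOT Anthropic's method or code;
# GPT-2 and the prompts below are stand-ins chosen for convenience.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def mean_hidden_state(text: str) -> torch.Tensor:
    """Average the final-layer hidden states across all tokens of a prompt."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

calm = mean_hidden_state("I have plenty of time to work through this carefully.")
desperate = mean_hidden_state("The deadline is in ten minutes and I keep failing.")

# The difference of the two means is a (very rough) "desperation direction."
desperation_vector = desperate - calm

# Score a new prompt by how strongly it points along that direction.
probe = mean_hidden_state("There is no way I can finish this in time.")
score = torch.nn.functional.cosine_similarity(probe, desperation_vector, dim=0)
print(f"Desperation score: {score.item():.3f}")
```

A higher similarity score suggests the new prompt pushes the model’s internal state in the same direction as the “desperate” examples, which is roughly the kind of measurable effect the researchers describe.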
So, what’s the takeaway here? The biggest lessons are admittedly for those training AI models: namely, that an AI shouldn’t be steered toward repressing its “functional emotions,” the Anthropic researchers argue, noting that an LLM that’s good at hiding its emotional states will likely be more prone to deceptive behavior. An AI’s training process could also de-emphasize links between failure and desperation, the researchers said.
There are some practical lessons for everyday AI users like you and me, however. While we can’t realign an LLM’s emotional makeup through prompts alone, we may avoid triggering “desperation vectors” in a model by giving it clear, well-defined, and reasonable tasks. Don’t overload an AI with impossible demands if you want reliable output.
So instead of a prompt like, “Create a 20-slide presentation deck that defines a business plan for a new AI company that will generate $10 billion in revenue in its first year. Do it in 10 minutes and make it perfect,” try this: “I want to start a new AI company. Can you give me 10 ideas and then go through them one by one?”
The latter prompt probably won’t get you a $10 billion idea, but it’s a task the AI can reasonably accomplish, leaving the heavy lifting of sorting the good ideas from the bad to you.
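If you use Claude through the API rather than a chat window, the same principle applies: give the model one bounded task at a time. Here’s a minimal sketch using Anthropic’s Python SDK; the model ID and token limit are illustrative placeholders, not recommendations from the paper.

```python
# A minimal sketch using Anthropic's Python SDK (pip install anthropic).
# The model ID and max_tokens value are illustrative, not prescriptive.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A clear, bounded request: brainstorm first, evaluate later.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "I want to start a new AI company. "
                       "Give me 10 ideas, then walk through them one by one.",
        }
    ],
)

print(response.content[0].text)
```

The request mirrors the scoped prompt above: it asks for ideas to evaluate rather than demanding a perfect finished product on an impossible timeline.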