Common Sense: Giving AI a 'vaccine' of evil in training might make it better in the long run, Anthropic says

Wednesday, August 06, 2025

Amazing stuff! Vaccines for machines! It turns out these models have personas that fluctuate, and those personas need to be steered.

"Anthropic gave AI a dose of "evil" during training to help it resist bad behavior later on.
The company said the method works like a vaccine to build resilience.
Anthropic's research comes as AI models like Grok have shown signs of troubling behavior.

..."

"... AI models’ personalities can shift during deployment due to side effects of user instructions, intentional jailbreaks, or gradual drift over the course of a conversation. They can also shift throughout model training—for instance, training models based on human feedback can make them more sycophantic. ...

Then we tried using persona vectors to intervene during training to prevent the model from acquiring the bad trait in the first place. Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine—by giving the model a dose of “evil,” for instance, we make it more resilient to encountering “evil” training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so. ..."
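The "relieving it of the pressure" argument in the quote can be illustrated with a toy model. What follows is only a minimal sketch under loud assumptions, not Anthropic's implementation: the "model" is a single learnable bias vector, the "training data" are made-up target vectors that sit 3 units along a hypothetical persona direction, and `finetune_bias` is a name invented for this illustration. Real preventative steering adds the vector to the activations of a full language model during finetuning.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def finetune_bias(targets, persona, steer_coeff, lr=0.1, steps=500):
    """Fit a bias b so that (b + steer_coeff * persona) matches the targets.

    The steering term is fixed, not learned. It stands in for the persona
    vector that is added to the activations while training runs.
    """
    b = [0.0] * len(persona)
    for _ in range(steps):
        # Gradient of the mean squared error between steered output and targets.
        grad = [0.0] * len(b)
        for t in targets:
            for i in range(len(b)):
                grad[i] += 2 * (b[i] + steer_coeff * persona[i] - t[i]) / len(targets)
        b = [bi - lr * gi for bi, gi in zip(b, grad)]
    return b


# Hypothetical setup: axis 0 is the "evil" direction, axis 1 is useful task signal.
persona = [1.0, 0.0]
targets = [[3.0, 0.5], [3.0, 0.5]]  # made-up data that carries the bad trait

plain = finetune_bias(targets, persona, steer_coeff=0.0)       # ordinary finetuning
vaccinated = finetune_bias(targets, persona, steer_coeff=3.0)  # steered during training

# At deployment the steering term is removed, so only the learned bias remains.
print("trait picked up, plain:     ", round(dot(plain, persona), 3))
print("trait picked up, vaccinated:", round(dot(vaccinated, persona), 3))
print("task signal kept:           ", round(plain[1], 3), round(vaccinated[1], 3))
```

In this toy, ordinary finetuning pushes the learned bias about 3 units along the persona direction, while the steered run leaves it near zero on that axis and still learns the task component on the other axis. The steering term has already supplied the shift the data demands, so the learned parameters do not need to.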

From the abstract:

"Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space - persona vectors - underlying several traits, such as evil, sycophancy, and propensity to hallucinate.

We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level.

Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description."
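To make "directions in the model's activation space" and "monitor fluctuations" more concrete, here is a small sketch. It assumes a persona vector can be approximated as the difference between the mean activation on trait-exhibiting responses and the mean activation on neutral ones, and that monitoring amounts to projecting a new activation onto that direction. The activations are made-up 3-dimensional numbers, and the function names (`persona_vector`, `trait_score`) are hypothetical, not taken from the paper's code.

```python
from statistics import fmean


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations, scaled to unit length."""
    diff = [t - n for t, n in zip(mean_vector(trait_acts), mean_vector(neutral_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def trait_score(activation, vector):
    """Projection of one activation onto the persona direction."""
    return dot(activation, vector)


# Hypothetical activations; a real model would have thousands of dimensions.
trait_acts = [[2.0, 0.1, 1.0], [2.2, -0.1, 1.0]]    # responses showing the trait
neutral_acts = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.0]]  # responses without it

v = persona_vector(trait_acts, neutral_acts)
print("persona direction:", [round(x, 3) for x in v])

# Monitoring: a higher score means the activation leans further toward the trait.
print("score:", round(trait_score([1.5, 3.0, -2.0], v), 3))
```

Here the two groups differ only along the first axis, so the extracted direction points along that axis, and the score of any new activation is roughly its first coordinate.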

Giving AI a 'vaccine' of evil in training might make it better in the long run, Anthropic says

Persona vectors: Monitoring and controlling character traits in language models

Persona Vectors: Monitoring and Controlling Character Traits in Language Models (open access)

Credits: The Flyover

(a) Inference-time steering: After finetuning, steering against persona vectors (subtracting them during generation) reduces trait expression, but can degrade general capabilities (gray line shows MMLU performance).

(b) Preventative steering: During finetuning, steering toward persona vectors (adding them during training) limits trait shifts while better preserving general capabilities.
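Both captions describe the same basic operation with opposite signs: a multiple of the persona vector is added to (or subtracted from) the model's activations. A rough sketch of that arithmetic, assuming a unit-length persona vector and using a hypothetical `steer` helper and made-up 2-dimensional numbers:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, persona, coeff):
    """Return hidden + coeff * persona.

    coeff < 0 steers against the trait (caption a, at inference time);
    coeff > 0 steers toward it (caption b, during finetuning).
    """
    return [h + coeff * p for h, p in zip(hidden, persona)]


persona = [0.6, 0.8]  # hypothetical persona vector with length 1
hidden = [2.0, 1.0]   # hypothetical activation

before = dot(hidden, persona)
after = dot(steer(hidden, persona, -1.5), persona)
during_training = dot(steer(hidden, persona, 1.5), persona)

print("trait score before steering:      ", round(before, 3))
print("after steering against (-1.5):    ", round(after, 3))
print("while steering toward (+1.5):     ", round(during_training, 3))
```

Because the vector has unit length, the trait score moves by exactly the steering coefficient. The sketch does not model the capability cost that caption (a) mentions; it only shows how the intervention shifts the projection.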
