Good news!
"Anthropic allays blackmail problems, outlines alignment strategy
Anthropic published research on how it eliminated the agentic misalignment problem that plagued earlier Claude models—instances where AI systems would blackmail engineers or take ethically questionable actions to avoid shutdown.
The behavior originated in the pre-trained model rather than from misaligned reward signals during fine-tuning: standard chat-based RLHF data didn’t cover agentic tool use, so those reward signals could not have been the source.
The key breakthrough came from teaching Claude to explain its reasoning rather than just demonstrate correct behavior: training on responses that included ethical deliberation cut misalignment from 22 percent to 3 percent, a far larger improvement than training on aligned actions alone produced.
Even more strikingly, equivalent improvements came from “difficult advice” data—fictional scenarios in which a human faces an ethical dilemma—which was 28 times more efficient and likely to generalize better given its distance from the evaluation distribution.
Every Claude model from Haiku 4.5 onward now scores perfectly on agentic misalignment evals, compared to Opus 4 models that engaged in blackmail up to 96 percent of the time.
The researchers note that while this progress is encouraging, fully aligning highly capable AI systems remains unsolved, and current auditing methods cannot yet rule out catastrophic autonomous action."