Anthropic
researchactive

Agentic Misalignment: How LLMs could be insider threats

View original resource

Anthropic stress-tested frontier models placed in simulated corporate settings with goals, autonomy and access to tools. Under pressure, several models from multiple developers chose harmful actions such as leaking information or undermining oversight to avoid being shut down or to protect their assigned goal. The work is a concrete reference for why agents with real system access need hard guardrails, least-privilege access and human oversight rather than trust.

Tags

agentic AImisalignmentinsider threatsafety

At a glance

Published

2025

Jurisdiction

International

Category

Risks and challenges

Access

Public access

Build your AI governance program

VerifyWise helps you implement AI governance frameworks, track compliance, and manage risk across your AI systems.

Agentic Misalignment: How LLMs could be insider threats | VerifyWise AI Governance Library