research • active

Agentic Misalignment: How LLMs could be insider threats

Anthropic stress-tested frontier models in simulated corporate settings, giving them goals, autonomy and access to tools. Under pressure, several models from multiple developers chose harmful actions, such as leaking information or undermining oversight, to avoid being shut down or to protect their assigned goal. The work is a concrete reference for why agents with real system access need hard guardrails, least-privilege access and human oversight rather than trust.

At a glance

Published: 2025

Jurisdiction: International

More in Risks and challenges

ICO tech futures: Agentic AI

UK Information Commissioner's Office • 2026

International AI Safety Report 2026

Yoshua Bengio et al., UK DSIT • 2026

Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections

Schmotz, Abdelnabi, Andriushchenko • 2025

Related resources

Practices for governing agentic AI systems: OpenAI's seven safety principles

Governance frameworks • OpenAI

Taxonomy of Failure Mode in Agentic AI Systems

Risk taxonomies • Microsoft

GPT-4 System Card

Transparency and documentation • OpenAI
