research, active

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Presents a taxonomy of seven failure modes unique to production agentic systems, shows where standard metrics miss each one, and proposes a production evaluation framework for catching drift and silent failures in deployed agents.

At a glance

Published: 2026

Jurisdiction: Global

More in Evaluation and benchmarks

tau-bench: A benchmark for tool-agent-user interaction

Sierra Research • 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. (Princeton) • 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. • 2023

Related resources

Practices for governing agentic AI systems: OpenAI's seven safety principles

Governance frameworks • OpenAI

Taxonomy of Failure Modes in AI Agents

Risk taxonomies • Microsoft

Failure Modes in Machine Learning

Risk taxonomies • Microsoft
