Anthropic
guideline, active

Demystifying evals for AI agents

Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory versus outcome metrics, LLM-judge calibration, and running evals as CI checks. Includes patterns for evaluating tool use, memory, and multi-turn conversations.

Tags

agentic AI, evaluation

At a glance

Published: 2026
Jurisdiction: Global
Category: Evaluation and benchmarks
Access: Public access
