Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory vs outcome metrics, LLM-judge calibration, and using evals as CI. Includes patterns for tool-use, memory, and multi-turn conversation evaluation.
Published: 2026
Jurisdiction: Global
Category: Evaluation and benchmarks
Access: Public access