Evaluation and benchmarks
Benchmarks, measurement methodologies, and evaluation tooling for AI agents.
20 resources
tau-bench: A benchmark for tool-agent-user interaction
Sierra Research's open benchmark for tool-using agents, simulating airline and retail customer-service tasks with an LLM user and rule-based APIs. Measures task success, adherence to policy, and consistency across repeated trials under realistic constraints.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Princeton benchmark of 2,294 real GitHub issues from 12 Python repositories, each paired with the pull request that resolved it. Measures whether agents can read codebases, edit files, and produce patches that pass the hidden tests taken from each issue's fix.
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou et al. introduce a self-hosted web environment covering e-commerce, forums, software development, and CMS apps, with 812 natural-language tasks. Evaluates end-to-end browsing agents on realistic multi-step workflows with verifiable outcomes.
GAIA: A Benchmark for General AI Assistants
Mialon et al. present 466 questions requiring multi-step reasoning, web browsing, multimodal understanding, and tool use. Human respondents answer 92% correctly while GPT-4 with plugins manages 15%, highlighting the gap in general assistant capability.
Demystifying evals for AI agents
Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory versus outcome metrics, LLM-judge calibration, and running evals in CI. Includes patterns for tool-use, memory, and multi-turn conversation evaluation.
How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence
Korbak et al. (UK AISI) propose a methodology for evaluating AI-control measures against increasingly capable LLM agents, using red-team protocols and capability elicitation. Sketches how control evaluations should adapt along a trajectory from current models to hypothetical superintelligent agents.
The 2025 AI Agent Index
Staufer et al. introduce a structured index of 67 deployed agentic AI systems, documenting technical features, safety measures, transparency, and governance for comparative analysis. Raw data released to support policy and procurement decisions.
AI Agent Index
Searchable database of deployed AI agents hosted by MIT FutureTech, with structured fields for developer, capabilities, safety practices, transparency, and governance. Companion site to the arXiv index paper, updated as new agents launch.
Agent Evaluation
Arize AI documentation covering trajectory evaluation, tool-call correctness, and outcome scoring for production agents. Walks through LLM-as-judge evals, custom metrics, and linking evaluation runs back to specific spans in OpenTelemetry traces.
Phoenix: AI Observability and Evaluation
Arize's open-source observability and evaluation toolkit for LLM and agent applications, offering OpenTelemetry tracing, prompt and dataset management, LLM-judge evals, and a local UI. Runs as a library, self-hosted service, or managed platform.
LangSmith agent observability
LangSmith documentation covering tracing, prompt management, datasets, and evaluation for LangChain, LangGraph, and other LLM apps. Includes agent-specific features like trajectory evals, human annotation queues, and A/B experiment workflows.
Strengthening AI Agent Hijacking Evaluations
US AI Safety Institute (NIST) technical blog post on improving hijacking evaluations for agents, proposing adversarial test suites that measure resistance to indirect prompt injection. Publishes methodology and early results from frontier model evaluations.
Prioritizing Real-Time Failure Detection in AI Agents
Partnership on AI paper arguing that offline evaluation is insufficient and detailing methods for real-time failure detection in deployed agents, including anomaly scoring on trajectories, tool-call validation, and escalation triggers. Includes recommendations for deployers and auditors.
OpenEvals
LangChain's open-source evaluation harness offering prebuilt and customisable evaluators for LLM and agent outputs, including trajectory matching, exact match, LLM-as-judge, and safety checks. Integrates with LangSmith or runs standalone.
AgentBench: Evaluating LLMs as Agents
A multi-environment benchmark that tests large language models acting as agents across eight distinct settings, from operating systems and databases to web browsing and digital card games. It measures multi-step reasoning and decision-making rather than single-turn answers, and surfaced a large capability gap between top commercial models and open models on agent tasks.
GAIA: A Benchmark for General AI Assistants
A benchmark of real-world questions that are conceptually simple for humans but require agents to chain together reasoning, web browsing, tool use and multimodality to answer. GAIA is widely used to compare assistant-style agents, and the gap between human and model performance makes it a practical governance reference point for what agents can and cannot yet do reliably.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
Princeton’s proposal for a holistic agent leaderboard, arguing that bare-model, vendor-scaffolded, and full-system results diverge by 30–50 points and that fair agent evaluation needs standardized infrastructure across scaffolds and environments.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
Presents a taxonomy of seven failure modes unique to production agentic systems, shows where standard metrics miss each one, and proposes a production evaluation framework for catching drift and silent failures in deployed agents.
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
Introduces the AgentRx benchmark for attributing the first unrecoverable failure in AI agents across API workflows, incident management, and real-world web/file tasks, supporting trajectory-based diagnosis of why agents fail.
Detecting Silent Failures in Multi-Agentic AI Trajectories
Tackles silent failures in multi-agent trajectories, where systems produce wrong outcomes without obvious errors, and contributes data and methods to capture diverse multi-agent failure scenarios that current public datasets miss.