Evaluation and benchmarks
Benchmarks, measurement methodologies, and evaluation tooling for AI agents.
14 resources
tau-bench: A benchmark for tool-agent-user interaction
Sierra Research's open benchmark for tool-using agents, simulating airline and retail customer-service tasks with an LLM user and rule-based APIs. Measures task success, adherence to policy, and consistency across repeated trials under realistic constraints.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Princeton benchmark of 2,294 real GitHub issues from 12 Python repositories, each paired with expert-written test patches. Measures whether agents can read codebases, edit files, and produce patches that pass the hidden tests associated with each issue's fix.
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou et al. introduce a self-hosted web environment covering e-commerce, forums, software development, and CMS apps, with 812 natural-language tasks. Evaluates end-to-end browsing agents on realistic multi-step workflows with verifiable outcomes.
GAIA: A Benchmark for General AI Assistants
Mialon et al. present 466 questions requiring multi-step reasoning, web browsing, multimodal understanding, and tool use. Designed so humans solve 92% while GPT-4 with plugins solves 15%, highlighting the gap in general assistant capability.
Demystifying evals for AI agents
Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory vs outcome metrics, LLM-judge calibration, and using evals as CI. Includes patterns for tool-use, memory, and multi-turn conversation evaluation.
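The trajectory-versus-outcome distinction the guide draws can be sketched in a few lines. This is a minimal illustration, not code from the guide; the function and field names (`outcome_score`, `trajectory_score`, `tool_calls`) are invented for the example. Outcome metrics check only the final state; trajectory metrics check that the agent took the required steps, in order.

```python
def outcome_score(final_answer: str, expected: str) -> float:
    """Outcome metric: did the agent end in the right state?"""
    return 1.0 if final_answer.strip().lower() == expected.strip().lower() else 0.0

def trajectory_score(tool_calls: list[str], required: list[str]) -> float:
    """Trajectory metric: fraction of required tool calls present, in order.

    Consuming a single iterator enforces ordering: each required step must
    appear after the previously matched one.
    """
    it = iter(tool_calls)
    hits = sum(1 for step in required if step in it)
    return hits / len(required)

run = {
    "tool_calls": ["search_docs", "read_file", "apply_patch", "run_tests"],
    "final_answer": "PASS",
}
print(outcome_score(run["final_answer"], "PASS"))                       # 1.0
print(trajectory_score(run["tool_calls"], ["read_file", "run_tests"]))  # 1.0
```

An agent can score 1.0 on outcome while scoring poorly on trajectory (right answer via a forbidden shortcut), which is why the guide treats the two as separate axes.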
How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence
Korbak et al. (UK AISI) propose a methodology for evaluating AI-control measures against increasingly capable LLM agents, using red-team protocols and capability elicitation. Introduces a trajectory from current models to hypothetical superintelligent agents.
The 2025 AI Agent Index
Staufer et al. introduce a structured index of 67 deployed agentic AI systems, documenting technical features, safety measures, transparency, and governance for comparative analysis. Raw data released to support policy and procurement decisions.
AI Agent Index
Searchable database hosted by MIT FutureTech of deployed AI agents, with structured fields for developer, capabilities, safety practices, transparency, and governance. Companion site to the arXiv index paper, updated as new agents launch.
Agent Evaluation
Arize AI documentation covering trajectory evaluation, tool-call correctness, and outcome scoring for production agents. Walks through LLM-as-judge evals, custom metrics, and linking evaluation runs back to specific spans in OpenTelemetry traces.
Phoenix: AI Observability and Evaluation
Arize's open-source observability and evaluation toolkit for LLM and agent applications, offering OpenTelemetry tracing, prompt and dataset management, LLM-judge evals, and a local UI. Runs as a library, self-hosted service, or managed platform.
LangSmith agent observability
LangSmith documentation covering tracing, prompt management, datasets, and evaluation for LangChain, LangGraph, and other LLM apps. Includes agent-specific features like trajectory evals, human annotation queues, and A/B experiment workflows.
Strengthening AI Agent Hijacking Evaluations
US AI Safety Institute (NIST) technical blog on strengthening agent hijacking evaluations, proposing adversarial test suites that measure resistance to indirect prompt injection. Publishes the methodology and early results from frontier-model evaluations.
Prioritizing Real-Time Failure Detection in AI Agents
Partnership on AI paper arguing that offline evaluation is insufficient and detailing methods for real-time failure detection in deployed agents: anomaly scoring on trajectories, tool-call validation, and escalation triggers. Includes recommendations for deployers and auditors.
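The tool-call validation and escalation triggers the paper describes can be sketched as a small runtime monitor. This is a hypothetical illustration in the spirit of those recommendations, not the paper's implementation; the class name, allow-list, and threshold are all invented for the example.

```python
from collections import Counter

# Assumed policy for this sketch: a tool allow-list plus a simple
# repeated-call threshold as a stand-in for anomaly scoring.
ALLOWED_TOOLS = {"search", "read_file", "send_email"}
MAX_CALLS_PER_TOOL = 5

class TrajectoryMonitor:
    """Validates each tool call at runtime and records escalation alerts."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.alerts: list[str] = []

    def validate(self, tool: str) -> bool:
        """Return True if the call may proceed; otherwise record an alert."""
        if tool not in ALLOWED_TOOLS:
            self.alerts.append(f"blocked: unknown tool {tool!r}")
            return False
        self.counts[tool] += 1
        if self.counts[tool] > MAX_CALLS_PER_TOOL:
            # A tight tool-call loop is a common agent failure mode.
            self.alerts.append(f"escalate: {tool!r} called {self.counts[tool]} times")
            return False
        return True

monitor = TrajectoryMonitor()
print(monitor.validate("search"))     # True: allowed, under threshold
print(monitor.validate("delete_db"))  # False: not on the allow-list
```

In a real deployment the alerts would feed an escalation channel (pause the agent, page an operator) rather than a list, and the threshold would be replaced by a learned anomaly score over the full trajectory.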
OpenEvals
LangChain's open-source evaluation harness offering prebuilt and customizable evaluators for LLM and agent outputs, including trajectory matching, exact match, LLM-as-judge, and safety checks. Integrates with LangSmith or runs standalone.