Evaluation and benchmarks

Benchmarks, measurement methodologies, and evaluation tooling for AI agents.

20 resources

dataset • Sierra Research • 2024

tau-bench: A benchmark for tool-agent-user interaction

Sierra Research's open benchmark for tool-using agents, simulating airline and retail customer-service tasks with an LLM user and rule-based APIs. Measures task success, adherence to policy, and consistency across repeated trials under realistic constraints.

Benchmarks • International
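
The "consistency across repeated trials" idea in the tau-bench entry can be illustrated with a small sketch. It assumes a pass^k-style estimator: for each task, the chance that k trials drawn without replacement from n recorded trials all succeed, averaged over tasks. The function name `pass_hat_k` and the sample counts are hypothetical and are not taken from the benchmark's code.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the number of
    successful trials for a task and n is the number of trials run per task."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count must be between 0 and n_trials")
        # math.comb returns 0 when k > c, so tasks with too few successes add 0.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


# Hypothetical results: three tasks, four trials each, with 4, 2 and 0 successes.
results = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(results, n_trials=4, k=k), 4))
```

With these made-up counts the score falls as k grows, because the task that succeeds only half the time rarely succeeds k times in a row. That drop is what a consistency metric is meant to expose.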

dataset • Jimenez et al. (Princeton) • 2023

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Princeton benchmark of 2,294 real GitHub issues from 12 Python repositories, each paired with the tests from the pull request that resolved it. Measures whether agents can read codebases, edit files, and produce patches that pass those held-out tests.

Benchmarks • International

dataset • Zhou et al. • 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. introduce a self-hosted web environment covering e-commerce, forums, software development, and CMS apps, with 812 natural-language tasks. Evaluates end-to-end browsing agents on realistic multi-step workflows with verifiable outcomes.

Benchmarks • International

dataset • Mialon et al. • 2023

GAIA: A Benchmark for General AI Assistants

Mialon et al. present 466 questions requiring multi-step reasoning, web browsing, multimodal understanding, and tool use. Human respondents score 92% while GPT-4 with plugins scores 15%, highlighting the gap in general assistant capability.

Benchmarks • International

guideline • Anthropic • 2026

Demystifying evals for AI agents

Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory vs outcome metrics, LLM-judge calibration, and using evals as CI. Includes patterns for tool-use, memory, and multi-turn conversation evaluation.

Measurement methodology • Global

research • Korbak et al., UK AI Security Institute • 2025

How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence

Korbak et al. (UK AISI) propose a methodology for evaluating AI-control measures against increasingly capable LLM agents, using red-team protocols and capability elicitation. Introduces a trajectory from current models to hypothetical superintelligent agents.

Measurement methodology • United Kingdom

dataset • Leon Staufer et al. • 2025

The 2025 AI Agent Index

Staufer et al. introduce a structured index of 67 deployed agentic AI systems, documenting technical features, safety measures, transparency, and governance for comparative analysis. Raw data released to support policy and procurement decisions.

Agent indices • International

dataset • MIT FutureTech (AI Agent Index team) • 2025

AI Agent Index

Searchable database of deployed AI agents, hosted by MIT FutureTech, with structured fields for developer, capabilities, safety practices, transparency, and governance. Companion site to the arXiv index paper, updated as new agents launch.

Agent indices • United States

tool • Arize AI • 2025

Agent Evaluation

Arize AI documentation covering trajectory evaluation, tool-call correctness, and outcome scoring for production agents. Walks through LLM-as-judge evals, custom metrics, and linking evaluation runs back to specific spans in OpenTelemetry traces.

Evaluation tooling • Global

tool • Arize AI • 2025

Phoenix: AI Observability and Evaluation

Arize's open-source observability and evaluation toolkit for LLM and agent applications, offering OpenTelemetry tracing, prompt and dataset management, LLM-judge evals, and a local UI. Runs as a library, self-hosted service, or managed platform.

Evaluation tooling • Global

tool • LangChain • 2025

LangSmith agent observability

LangSmith documentation covering tracing, prompt management, datasets, and evaluation for LangChain, LangGraph, and other LLM apps. Includes agent-specific features like trajectory evals, human annotation queues, and A/B experiment workflows.

Evaluation tooling • Global

research • U.S. AI Safety Institute (NIST) • 2025

Strengthening AI Agent Hijacking Evaluations

Technical blog post from the US AI Safety Institute (NIST) on strengthening hijacking evaluations for agents, proposing adversarial test suites that measure resistance to indirect prompt injection. Publishes methodology and early results from frontier model evaluations.

Measurement methodology • United States

research • Madhulika Srikumar et al., Partnership on AI • 2025

Prioritizing Real-Time Failure Detection in AI Agents

Partnership on AI paper arguing that offline evaluation is insufficient and detailing methods for real-time failure detection in deployed agents: anomaly scoring on trajectories, tool-call validation, and escalation triggers. Includes recommendations for deployers and auditors.

Measurement methodology • International

tool • LangChain • 2025

OpenEvals

LangChain's open-source evaluation harness offering prebuilt and customisable evaluators for LLM and agent outputs, including trajectory matching, exact match, LLM-as-judge, and safety checks. Integrates with LangSmith or runs standalone.

Evaluation tooling • Global
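
As a rough illustration of what "trajectory matching" in the OpenEvals entry can mean, the sketch below compares the tool calls an agent made against a reference trajectory. It is plain Python and does not use the OpenEvals API. The step format (a dict with a `tool` key) and the function names are assumptions made for this example only.

```python
from collections import Counter


def tool_names(trajectory):
    """Extract the ordered list of tool names from a list of step dicts."""
    return [step["tool"] for step in trajectory]


def strict_match(actual, reference):
    """True when the same tools were called in the same order."""
    return tool_names(actual) == tool_names(reference)


def unordered_match(actual, reference):
    """True when the same tools were called the same number of times,
    regardless of order."""
    return Counter(tool_names(actual)) == Counter(tool_names(reference))


# Hypothetical trajectories for a customer-service agent.
reference = [{"tool": "lookup_order"}, {"tool": "check_policy"}, {"tool": "issue_refund"}]
actual = [{"tool": "check_policy"}, {"tool": "lookup_order"}, {"tool": "issue_refund"}]

print(strict_match(actual, reference))     # False: the order differs
print(unordered_match(actual, reference))  # True: same calls, different order
```

Which variant is appropriate depends on whether step order matters for the task. A refund flow might require the policy check before the refund, while independent lookups could be accepted in any order.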

research • Liu et al. (Tsinghua University and collaborators) • 2023

AgentBench: Evaluating LLMs as Agents

A multi-environment benchmark that tests large language models acting as agents across eight distinct settings, from operating systems and databases to web browsing and digital card games. It measures multi-step reasoning and decision-making rather than single-turn answers, and surfaced a large capability gap between top commercial models and open models on agent tasks.

Benchmarks • International

research • Mialon et al. (Meta AI, Hugging Face and collaborators) • 2023

GAIA: A Benchmark for General AI Assistants

A benchmark of real-world questions that are conceptually simple for humans but require agents to chain together reasoning, web browsing, tool use and multimodality to answer. GAIA is widely used to compare assistant-style agents, and the gap between human and model performance makes it a practical governance reference point for what agents can and cannot yet do reliably.

Benchmarks • International

research • arXiv • 2025

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Princeton’s proposal for a holistic agent leaderboard, arguing that bare-model, vendor-scaffolded, and full-system results diverge by 30–50 points and that fair agent evaluation needs standardized infrastructure across scaffolds and environments.

Benchmarks • Global

research • arXiv • 2026

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Presents a taxonomy of seven failure modes unique to production agentic systems, shows where standard metrics miss each one, and proposes a production evaluation framework for catching drift and silent failures in deployed agents.

Measurement methodology • Global

research • arXiv • 2026

AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Introduces the AgentRx benchmark for attributing the first unrecoverable failure in AI agents across API workflows, incident management, and real-world web/file tasks, supporting trajectory-based diagnosis of why agents fail.

Measurement methodology • Global

research • arXiv • 2025

Detecting Silent Failures in Multi-Agentic AI Trajectories

Tackles silent failures in multi-agent trajectories, where systems produce wrong outcomes without obvious errors, and contributes data and methods to capture diverse multi-agent failure scenarios that current public datasets miss.

Measurement methodology • Global