
Evaluation and benchmarks

Benchmarks, measurement methodologies, and evaluation tooling for AI agents.

14 resources

dataset · Sierra Research • 2024

tau-bench: A benchmark for tool-agent-user interaction

Sierra Research's open benchmark for tool-using agents, simulating airline and retail customer-service tasks with an LLM user and rule-based APIs. Measures task success, adherence to policy, and consistency across repeated trials under realistic constraints.

Benchmarks · International
dataset · Jimenez et al. (Princeton) • 2023

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Princeton benchmark of 2,294 real GitHub issues from 12 Python repositories paired with expert-written test patches. Measures whether agents can read codebases, edit files, and produce patches passing the hidden tests that fixed each issue.

Benchmarks · International
dataset · Zhou et al. • 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. introduce a self-hosted web environment covering e-commerce, forums, software development, and CMS apps, with 812 natural-language tasks. Evaluates end-to-end browsing agents on realistic multi-step workflows with verifiable outcomes.

Benchmarks · International
dataset · Mialon et al. • 2023

GAIA: A Benchmark for General AI Assistants

Mialon et al. present 466 questions requiring multi-step reasoning, web browsing, multimodal understanding, and tool use. Designed so humans solve 92% while GPT-4 with plugins solves 15%, highlighting the gap in general assistant capability.

Benchmarks · International
guideline · Anthropic • 2026

Demystifying evals for AI agents

Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory vs. outcome metrics, LLM-judge calibration, and running evals in CI. Includes patterns for evaluating tool use, memory, and multi-turn conversation.

Measurement methodology · Global
research · Korbak et al., UK AI Security Institute • 2025
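The trajectory-vs-outcome distinction discussed in guides like this one can be sketched in a few lines. The function names, scoring rules, and sample tool calls below are illustrative assumptions, not taken from the Anthropic guide:

```python
# Minimal sketch: outcome scoring vs. trajectory scoring for an agent eval.
# All names and semantics here are illustrative, not from any specific guide.

def outcome_score(final_answer: str, expected: str) -> float:
    """Outcome metric: only the end result matters."""
    return 1.0 if final_answer.strip().lower() == expected.strip().lower() else 0.0

def trajectory_score(tool_calls: list[str], expected_calls: list[str]) -> float:
    """Trajectory metric: fraction of the expected tool-call sequence
    the agent actually produced, matched in order."""
    matched = 0
    for call in tool_calls:
        if matched < len(expected_calls) and call == expected_calls[matched]:
            matched += 1
    return matched / len(expected_calls) if expected_calls else 1.0

# Example: the agent reaches the right outcome but skips a final step.
calls = ["search_flights", "book_flight"]
expected = ["search_flights", "book_flight", "send_confirmation"]
print(outcome_score("Booked AA100", "booked aa100"))  # 1.0
print(trajectory_score(calls, expected))              # ≈ 0.67
```

An agent can score 1.0 on outcome while scoring poorly on trajectory (or vice versa), which is why evaluation guides recommend tracking both.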

How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence

Korbak et al. (UK AISI) propose a methodology for evaluating AI-control measures against increasingly capable LLM agents, using red-team protocols and capability elicitation. Introduces a trajectory from current models to hypothetical superintelligent agents.

Measurement methodology · United Kingdom
dataset · Leon Staufer et al. • 2025

The 2025 AI Agent Index

Staufer et al. introduce a structured index of 67 deployed agentic AI systems, documenting technical features, safety measures, transparency, and governance for comparative analysis. Raw data released to support policy and procurement decisions.

Agent indices · International
dataset · MIT FutureTech (AI Agent Index team) • 2025

AI Agent Index

MIT FutureTech's hosted, searchable database of deployed AI agents with structured fields for developer, capabilities, safety practices, transparency, and governance. Companion site to the arXiv index paper, updated as new agents launch.

Agent indices · United States
tool · Arize AI • 2025

Agent Evaluation

Arize AI documentation covering trajectory evaluation, tool-call correctness, and outcome scoring for production agents. Walks through LLM-as-judge evals, custom metrics, and linking evaluation runs back to specific spans in OpenTelemetry traces.

Evaluation tooling · Global
tool · Arize AI • 2025
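Tool-call correctness checks of the kind this documentation describes can be sketched generically. The schema format and function names below are assumptions made for illustration, not Arize's API:

```python
# Illustrative sketch of tool-call correctness checking: did the agent call
# a known tool with the required arguments and no undeclared ones?
# The schema layout here is a made-up example, not any vendor's format.

EXPECTED_SCHEMA = {
    "get_weather": {"required": {"city"}, "allowed": {"city", "units"}},
}

def tool_call_correct(call: dict) -> bool:
    """Check a recorded tool call against the expected argument schema."""
    schema = EXPECTED_SCHEMA.get(call["name"])
    if schema is None:
        return False  # agent invoked a tool that does not exist
    args = set(call.get("args", {}))
    return schema["required"] <= args and args <= schema["allowed"]

print(tool_call_correct({"name": "get_weather", "args": {"city": "Oslo"}}))    # True
print(tool_call_correct({"name": "get_weather", "args": {"temp": "celsius"}})) # False
```

In production tooling, each such check is typically attached to the specific span in the trace where the tool call was recorded, so failures can be localized.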

Phoenix: AI Observability and Evaluation

Arize's open-source observability and evaluation toolkit for LLM and agent applications, offering OpenTelemetry tracing, prompt and dataset management, LLM-judge evals, and a local UI. Runs as a library, self-hosted service, or managed platform.

Evaluation tooling · Global
tool · LangChain • 2025

LangSmith agent observability

LangSmith documentation covering tracing, prompt management, datasets, and evaluation for LangChain, LangGraph, and other LLM apps. Includes agent-specific features like trajectory evals, human annotation queues, and A/B experiment workflows.

Evaluation tooling · Global
research · U.S. AI Safety Institute (NIST) • 2025

Strengthening AI Agent Hijacking Evaluations

U.S. AI Safety Institute (NIST) technical blog post on strengthening agent-hijacking evaluations, proposing adversarial test suites that measure resistance to indirect prompt injection. Publishes the methodology and early results from frontier-model evaluations.

Measurement methodology · United States
research · Madhulika Srikumar et al., Partnership on AI • 2025

Prioritizing Real-Time Failure Detection in AI Agents

Partnership on AI paper arguing that offline evaluation is insufficient and detailing methods for real-time failure detection in deployed agents: anomaly scoring on trajectories, tool-call validation, and escalation triggers. Includes recommendations for deployers and auditors.

Measurement methodology · International
tool · LangChain • 2025

OpenEvals

LangChain's open-source evaluation harness offering prebuilt and customisable evaluators for LLM and agent outputs, including trajectory matching, exact match, LLM-as-judge, and safety checks. Integrates with LangSmith or runs standalone.

Evaluation tooling · Global
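The evaluator shapes this kind of harness provides (exact match, LLM-as-judge) can be sketched generically. The function names and judge interface below are illustrative and do not reproduce OpenEvals' actual API:

```python
# Generic sketch of two common evaluator shapes: deterministic exact match
# and a pluggable LLM-as-judge. Interfaces here are invented for illustration.
from typing import Callable

def exact_match(output: str, reference: str) -> dict:
    """Deterministic evaluator: pass only on an exact (whitespace-trimmed) match."""
    return {"key": "exact_match", "score": output.strip() == reference.strip()}

def make_llm_judge(call_model: Callable[[str], str], rubric: str):
    """Build a judge evaluator around any `call_model(prompt) -> text` function."""
    def judge(output: str, reference: str) -> dict:
        prompt = (f"Rubric: {rubric}\nReference: {reference}\n"
                  f"Output: {output}\nAnswer PASS or FAIL.")
        verdict = call_model(prompt)
        return {"key": "llm_judge",
                "score": verdict.strip().upper().startswith("PASS")}
    return judge

# Usage with a stubbed model (a real harness would call an LLM here):
judge = make_llm_judge(lambda p: "PASS", rubric="Answer must cite a source.")
print(exact_match("42", " 42 "))   # {'key': 'exact_match', 'score': True}
print(judge("cited answer", "reference answer"))
```

Keeping the model call behind a plain function makes the judge trivially testable with a stub, which is the usual pattern in evaluation harnesses of this kind.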