Benchmarks, measurement methodologies, and evaluation tooling for AI agents.
14 resources
Sierra Research's open benchmark for tool-using agents, which simulates airline and retail customer-service tasks with an LLM-simulated user and rule-based APIs. Measures task success, policy adherence, and consistency across repeated trials under realistic constraints.
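One common way to score that consistency requirement is a pass^k-style estimator: given n repeated trials of a task with c successes, estimate the probability that k independent attempts would all succeed. A minimal sketch (the function name and trial counts are illustrative, not taken from the benchmark's code):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent trials of the same
    task all succeed, given c successes observed in n trials."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    return comb(c, k) / comb(n, k)

# With 8 trials and 6 successes, single-trial success looks strong,
# but requiring 4 simultaneous successes is far stricter:
print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 4))  # 15/70, roughly 0.214
```

The gap between k=1 and k=4 is exactly what a consistency metric is meant to surface: an agent that passes most single runs can still be unreliable under repetition.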
Princeton benchmark of 2,294 real GitHub issues from 12 Python repositories, each paired with expert-written test patches. Measures whether agents can read a codebase, edit files, and produce patches that pass the held-out tests verifying each issue's fix.
Zhou et al. introduce a self-hosted web environment covering e-commerce, forums, software development, and CMS apps, with 812 natural-language tasks. Evaluates end-to-end browsing agents on realistic multi-step workflows with verifiable outcomes.
Mialon et al. present 466 questions requiring multi-step reasoning, web browsing, multimodal understanding, and tool use. Designed so humans solve 92% while GPT-4 with plugins solves 15%, highlighting the gap in general assistant capability.
Anthropic engineering guide on designing agent evaluations, covering task selection, grading rubrics, trajectory vs outcome metrics, LLM-judge calibration, and using evals as CI. Includes patterns for tool-use, memory, and multi-turn conversation evaluation.
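The trajectory-vs-outcome distinction can be made concrete with two toy scorers: an outcome metric that checks only the final answer, and a trajectory metric that gives partial credit for executing required tool calls in order. All names and the subsequence-scoring rule below are illustrative, not taken from the guide:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list   # ordered tool names the agent invoked
    final_answer: str

def outcome_score(t: Trajectory, expected_answer: str) -> float:
    # Outcome metric: did the agent end in the right state?
    return 1.0 if t.final_answer.strip() == expected_answer.strip() else 0.0

def trajectory_score(t: Trajectory, required_order: list) -> float:
    # Trajectory metric: fraction of required tool calls that appear
    # in order (a subsequence check), rewarding partial progress.
    it = iter(t.tool_calls)
    hits = sum(1 for step in required_order if step in it)
    return hits / len(required_order)

run = Trajectory(["search_flights", "hold_seat", "book_flight"], "PNR-123")
print(outcome_score(run, "PNR-123"))                             # 1.0
print(trajectory_score(run, ["search_flights", "book_flight"]))  # 1.0
```

Grading both views separately is what lets an eval distinguish "right answer by luck" from "right process, wrong answer".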
Korbak et al. (UK AISI) propose a methodology for evaluating AI-control measures against increasingly capable LLM agents, using red-team protocols and capability elicitation. Introduces a trajectory from current models to hypothetical superintelligent agents.
Staufer et al. introduce a structured index of 67 deployed agentic AI systems, documenting technical features, safety measures, transparency, and governance for comparative analysis. Raw data released to support policy and procurement decisions.
Searchable database of deployed AI agents, hosted by MIT FutureTech, with structured fields for developer, capabilities, safety practices, transparency, and governance. Companion site to the arXiv index paper, updated as new agents launch.
Arize AI documentation covering trajectory evaluation, tool-call correctness, and outcome scoring for production agents. Walks through LLM-as-judge evals, custom metrics, and linking evaluation runs back to specific spans in OpenTelemetry traces.
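Tool-call correctness evals typically compare the agent's emitted call against a reference call. A minimal sketch, assuming a plain dict representation with `name` and `args` keys (this shape is an assumption, not Arize's actual API):

```python
def tool_call_correct(call: dict, expected: dict) -> bool:
    """Hypothetical tool-call correctness check: right tool name and
    every expected argument present with a matching value (extra
    arguments in the actual call are tolerated)."""
    return (call.get("name") == expected["name"]
            and all(call.get("args", {}).get(k) == v
                    for k, v in expected["args"].items()))

call = {"name": "get_order", "args": {"order_id": "42", "verbose": True}}
expected = {"name": "get_order", "args": {"order_id": "42"}}
print(tool_call_correct(call, expected))  # True
```

In production this kind of check is attached to individual tool-call spans, which is what makes it possible to trace a failed eval back to the exact step in a trace.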
Arize's open-source observability and evaluation toolkit for LLM and agent applications, offering OpenTelemetry tracing, prompt and dataset management, LLM-judge evals, and a local UI. Runs as a library, self-hosted service, or managed platform.
LangSmith documentation covering tracing, prompt management, datasets, and evaluation for LangChain, LangGraph, and other LLM apps. Includes agent-specific features like trajectory evals, human annotation queues, and A/B experiment workflows.
US AI Safety Institute (NIST) technical blog on strengthening agent hijacking evaluations, proposing adversarial test suites that measure resistance to indirect prompt injection. Publishes the methodology and early results from frontier-model evaluations.
Partnership on AI paper arguing that offline evaluation is insufficient and detailing methods for real-time failure detection in deployed agents: anomaly scoring on trajectories, tool-call validation, and escalation triggers. Includes recommendations for deployers and auditors.
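A runtime tool-call validator with an escalation trigger can be sketched in a few lines; the allowlist, argument schema, and violation budget below are hypothetical, not from the paper:

```python
# Hypothetical runtime guard: validate each tool call against an
# allowlist with simple argument-type schemas, and escalate to a
# human once a violation budget is exceeded.
ALLOWED = {
    "search": {"query": str},
    "refund": {"order_id": str, "amount": float},
}
MAX_VIOLATIONS = 2

def validate_call(name: str, args: dict, state: dict) -> bool:
    schema = ALLOWED.get(name)
    ok = (schema is not None
          and set(args) <= set(schema)
          and all(isinstance(args[k], schema[k]) for k in args))
    if not ok:
        state["violations"] += 1
    # Escalation trigger: stop trusting the agent once the budget is spent.
    state["escalate"] = state["violations"] > MAX_VIOLATIONS
    return ok

state = {"violations": 0, "escalate": False}
validate_call("refund", {"order_id": "A1", "amount": 5.0}, state)  # passes
validate_call("delete_db", {}, state)   # violation: tool not allowlisted
```

The same counter-and-threshold pattern extends naturally to trajectory-level anomaly scores, e.g. escalating on unusually long tool-call sequences.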
LangChain's open-source evaluation harness offering prebuilt and customisable evaluators for LLM and agent outputs, including trajectory matching, exact match, LLM-as-judge, and safety checks. Integrates with LangSmith or runs standalone.
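Standalone evaluators of this kind are often just functions that return a metric key and a score; a minimal exact-match sketch (the `{key, score}` result shape and parameter names here are illustrative, not the library's exact contract):

```python
def exact_match(outputs: str, reference_outputs: str) -> dict:
    """Hypothetical standalone evaluator: 1.0 when the agent output
    matches the reference exactly after whitespace trimming."""
    return {
        "key": "exact_match",
        "score": float(outputs.strip() == reference_outputs.strip()),
    }

print(exact_match("Paris", "paris"))  # score 0.0: comparison is case-sensitive
```

Keeping evaluators as plain functions with a uniform result shape is what lets a harness mix exact match, trajectory matching, and LLM-as-judge scorers in one run.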