Research • Active

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Princeton’s proposal for a holistic agent leaderboard, arguing that bare-model, vendor-scaffolded, and full-system results diverge by 30–50 points and that fair agent evaluation needs standardized infrastructure across scaffolds and environments.

At a glance

Published: 2025

Jurisdiction: Global

More in Evaluation and benchmarks

tau-bench: A benchmark for tool-agent-user interaction

Sierra Research • 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. (Princeton) • 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. • 2023

Related resources

Practices for governing agentic AI systems: OpenAI's seven safety principles

Governance frameworks • OpenAI

Taxonomy of Failure Mode in Agentic AI Systems

Risk taxonomies • Microsoft

EleutherAI LM Evaluation Harness

Assessment and evaluation • EleutherAI

