arXiv
research, active

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation


Princeton’s proposal for a holistic agent leaderboard, arguing that bare-model, vendor-scaffolded, and full-system results diverge by 30–50 points and that fair agent evaluation needs standardized infrastructure across scaffolds and environments.

Tags

agentic AI, evaluation, benchmarks, leaderboard

At a glance

Published: 2025

Jurisdiction: Global

Category: Evaluation and benchmarks

Access: Public access

