Pan et al. propose a measurement framework for production agents covering task success, trajectory quality, cost, latency, and regression detection. The authors argue that offline benchmarks miss drift and tool-call errors, and they outline continuous evaluation for live traffic.
Published: 2025
Jurisdiction: International
Category: Enterprise adoption
Access: Public access