dataset, active
tau-bench: A benchmark for tool-agent-user interaction
Sierra Research
Sierra Research's open benchmark for tool-using agents. It simulates airline and retail customer-service tasks with an LLM user and rule-based APIs, and measures task success, adherence to policy, and consistency across repeated trials under realistic constraints.
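To make "consistency across repeated trials" concrete: the tau-bench paper reports a pass^k metric, the chance that k independent trials of the same task all succeed. The sketch below is one way to estimate it from recorded trial outcomes, assuming each task was run n times and c of those runs succeeded, so that the per-task estimate is C(c, k) / C(n, k). The function names `pass_hat_k` and `mean_pass_hat_k` are hypothetical and are not taken from the benchmark's code.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Hypothetical helper: with c successes out of n trials, the estimate
    is C(c, k) / C(n, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and the number of trials")
    # comb(c, k) is 0 when c < k, so too few successes gives an estimate of 0.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` holds one list of trial outcomes (True = success) per task.
    """
    per_task = [pass_hat_k(sum(trials), len(trials), k) for trials in results]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Two illustrative tasks with four trials each (made-up outcomes).
    outcomes = [
        [True, True, True, False],
        [True, True, True, True],
    ]
    for k in (1, 2, 4):
        print(k, mean_pass_hat_k(outcomes, k))
```

With k = 1 this reduces to the ordinary average success rate. Larger values of k penalize agents that solve a task only some of the time, which is why the metric is read as a consistency measure.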
Tags
agentic AI, evaluation
At a glance
Published: 2024
Jurisdiction: International
Category: Evaluation and benchmarks
Access: Public access