research, active
AgentBench: Evaluating LLMs as Agents
A multi-environment benchmark that tests large language models acting as agents across eight distinct settings, from operating systems and databases to web browsing and digital card games. It measures multi-step reasoning and decision-making rather than single-turn answers, and surfaced a large capability gap between top commercial models and open models on agent tasks.
Tags
agentic AI, evaluation, benchmark
At a glance
Published
2023
Jurisdiction
International
Category
Evaluation and benchmarks
Access
Public access