Liu et al. (Tsinghua University and collaborators)
research, active

AgentBench: Evaluating LLMs as Agents


A multi-environment benchmark that tests large language models acting as agents across eight distinct settings, from operating systems and databases to web browsing and digital card games. It measures multi-step reasoning and decision-making rather than single-turn answers, and surfaced a large capability gap between top commercial models and open models on agent tasks.

Tags

agentic AI, evaluation, benchmark

At a glance

Published

2023

Jurisdiction

International

Category

Evaluation and benchmarks

Access

Public access

