research • active

GAIA: A Benchmark for General AI Assistants

GAIA is a benchmark of real-world questions that are conceptually simple for humans but require agents to chain together reasoning, web browsing, tool use, and multimodal understanding to answer. It is widely used to compare assistant-style agents. The gap between human and model performance (the original paper reported human respondents at 92% versus 15% for GPT-4 equipped with plugins) makes it a practical governance reference point for what agents can and cannot yet do reliably.

At a glance

Published: 2023

Jurisdiction: International

More in Evaluation and benchmarks

tau-bench: A benchmark for tool-agent-user interaction

Sierra Research • 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. (Princeton) • 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou et al. • 2023

Related resources

Practices for governing agentic AI systems: OpenAI's seven safety principles

Governance frameworks • OpenAI

Taxonomy of Failure Modes in Agentic AI Systems

Risk taxonomies • Microsoft

EleutherAI LM Evaluation Harness

Assessment and evaluation • EleutherAI

