Jimenez et al. (Princeton)
dataset, active

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?


Princeton benchmark of 2,294 real GitHub issues drawn from 12 Python repositories, each paired with an expert-written test patch. It measures whether agents can read a codebase, edit files, and produce a patch that passes the hidden tests verifying the fix for each issue.
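The pass/fail criterion described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own harness: the `InstanceResult` structure, the `is_resolved` and `resolved_rate` helpers, and the rule that every hidden test must pass are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of running the hidden tests for one issue (hypothetical structure)."""
    instance_id: str
    test_outcomes: dict[str, bool]  # test name -> True if it passed after the patch


def is_resolved(result: InstanceResult) -> bool:
    # Assumption: an issue counts as resolved only if every hidden test passes.
    # An instance with no recorded test outcomes is treated as unresolved.
    return bool(result.test_outcomes) and all(result.test_outcomes.values())


def resolved_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances whose hidden tests all pass (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Under these assumptions, an agent's score is simply the share of issues for which its patch makes all hidden tests pass; a patch that fixes some tests but breaks others earns no partial credit.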

Tags

agentic AI, evaluation

At a glance

Published: 2023

Jurisdiction: International

Category: Evaluation and benchmarks

Access: Public access
