Jimenez et al. (Princeton)
View original resourcePrinceton benchmark of 2,294 real GitHub issues from 12 Python repositories paired with expert-written test patches. Measures whether agents can read codebases, edit files, and produce patches passing the hidden tests that fixed each issue.
Published
2023
Jurisdiction
International
Category
Evaluation and benchmarks
Access
Public access
VerifyWise helps you implement AI governance frameworks, track compliance, and manage risk across your AI systems.