LLM Evals
Evaluate and benchmark your LLM applications for quality, safety, and performance.
LLM Evals overview
Introduction to the LLM evaluation platform and key concepts.
Running experiments
Create and run evaluation experiments to test your models.
Managing datasets
Upload, browse, and manage evaluation datasets.
Configuring scorers
Set up evaluation metrics and scoring thresholds.
Running bias audits
Run demographic bias audits against compliance frameworks such as NYC LL144 and EEOC guidelines.
Managing models
View and manage the AI models used across your evaluation experiments.
LLM Arena
Compare two models side by side on the same prompt.
CI/CD integration
Run LLM evaluations automatically in GitHub Actions or any CI pipeline and block merges when quality drops.
Playground
Chat with any configured model directly to test prompts before running experiments.
Leaderboard
View model performance rankings based on arena comparison results.
Evaluation reports
Generate structured PDF or CSV reports from experiment results, following the EvalCards standard.
Project configuration
Set the LLM use case type for your evaluation project.
LLM Evals settings
Configure API keys for LLM providers used across all evaluation projects.