LLM Evals
Evaluate and benchmark your LLM applications for quality, safety, and performance.
LLM Evals overview
Introduction to the LLM evaluation platform and key concepts.
Running experiments
Create and run evaluation experiments to test your models.
Managing datasets
Upload, browse, and manage evaluation datasets.
Configuring scorers
Set up evaluation metrics and scoring thresholds.
Running bias audits
Run demographic bias audits against compliance frameworks such as NYC LL144 and EEOC guidelines.
Managing models
View and manage the AI models used across your evaluation experiments.
LLM Arena
Compare two models side by side on the same prompt.
CI/CD integration
Run LLM evaluations automatically in GitHub Actions or any CI pipeline and block merges when quality drops.
Playground
Chat with any configured model directly to test prompts before running experiments.
Leaderboard
View model performance rankings based on arena comparison results.
Evaluation reports
Generate structured PDF or CSV reports from experiment results, following the EvalCards standard.
Project configuration
Set the LLM use case type for your evaluation project.
LLM Evals settings
Configure API keys for LLM providers used across all evaluation projects.