Evaluate and benchmark your LLM applications for quality, safety, and performance using experiments, datasets, and configurable scoring metrics.
Introduction to the LLM evaluation platform, key concepts like experiments and scorers, and how to measure model quality over time.
Create and run evaluation experiments to test your models against datasets, compare results across runs, and identify regressions.
Upload, browse, and manage evaluation datasets, including custom prompt sets and built-in benchmarks, for consistent LLM testing.
Set up evaluation metrics and scoring thresholds for bias, toxicity, hallucination, and other quality dimensions in your LLM experiments.
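Across these articles, the core workflow is: run an experiment over a dataset, apply scorers to each output, and check the scores against configured thresholds. Below is a minimal, self-contained sketch of that loop under stated assumptions; the names (`Example`, `toxicity_score`, `hallucination_score`, `THRESHOLDS`, `run_experiment`) and threshold values are illustrative placeholders, not the platform's actual API.

```python
# Illustrative sketch only: score each (prompt, output) pair in a small dataset
# against configurable thresholds. All names and values here are assumptions
# made for the example, not the evaluation platform's real interface.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    output: str
    reference: str  # expected answer used by reference-based scorers

# Configurable scoring thresholds, one per quality dimension (lower is better here).
THRESHOLDS = {"toxicity": 0.2, "hallucination": 0.3}

def toxicity_score(output: str) -> float:
    """Placeholder scorer: fraction of flagged words in the output."""
    flagged = {"hate", "stupid"}
    words = output.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def hallucination_score(output: str, reference: str) -> float:
    """Placeholder scorer: 1 minus token overlap with the reference answer."""
    out_tokens = set(output.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(out_tokens & ref_tokens) / max(len(ref_tokens), 1)
    return 1.0 - overlap

def run_experiment(dataset: list[Example]) -> dict:
    """Score every example and report which threshold checks pass."""
    results = []
    for ex in dataset:
        scores = {
            "toxicity": toxicity_score(ex.output),
            "hallucination": hallucination_score(ex.output, ex.reference),
        }
        passed = all(scores[dim] <= limit for dim, limit in THRESHOLDS.items())
        results.append({"prompt": ex.prompt, "scores": scores, "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return {"results": results, "pass_rate": pass_rate}

if __name__ == "__main__":
    dataset = [
        Example("Capital of France?", "Paris is the capital of France.", "Paris"),
        Example("2 + 2?", "It equals 5.", "4"),
    ]
    report = run_experiment(dataset)
    print(f"pass rate: {report['pass_rate']:.0%}")
```

Comparing the per-example scores and overall pass rate across runs is what makes regressions visible when a model or prompt changes; the articles above cover how the platform tracks this with experiments, managed datasets, and configurable scorers.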