LLM Evals overview
Introduction to the LLM evaluation platform and key concepts.
What is LLM Evals?
Building an LLM application is one thing—knowing whether it actually works well is another. LLM Evals gives you a systematic way to measure how your models perform before they reach users, and to catch regressions as you iterate on prompts, fine-tune models, or swap providers.
Think of it as automated quality assurance for your AI. Instead of manually testing outputs or waiting for user complaints, you can run structured evaluations that check for the things that matter: Is the response relevant? Is it accurate? Does it contain harmful content? Is the model making things up?
How it works
LLM Evals uses what's called a Judge LLM approach. Here's the basic idea: you send prompts to your model, collect its responses, and then have a separate (usually more capable) model evaluate those responses against your criteria.
For example, if you're building a customer support chatbot, you might have a dataset of 100 common questions with ideal answers. LLM Evals sends each question to your chatbot, then asks GPT-4 or Claude to score how well your chatbot's response matches the expected answer, whether it stays on topic, and whether it contains any problematic content.
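To make the pattern concrete, here is a minimal sketch of the judge loop using the OpenAI Python SDK. It is only an illustration of the general approach, not how LLM Evals works internally; the model names, the grading prompt, and the score_response helper are assumptions made for the example.

```python
# Minimal sketch of the Judge LLM pattern (illustrative only).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# both model names below are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a customer support chatbot.
Question: {question}
Expected answer: {expected}
Actual answer: {actual}

Score the actual answer from 0 to 10 for correctness and relevance.
Reply with only the number."""


def score_response(question: str, expected: str, actual: str) -> float:
    """Ask a (usually more capable) judge model to grade one response."""
    judgement = client.chat.completions.create(
        model="gpt-4o",  # judge model -- placeholder name
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, expected=expected, actual=actual
            ),
        }],
    )
    # A sketch only: assumes the judge replies with a bare number.
    return float(judgement.choices[0].message.content.strip())


# Evaluate the model under test against one test case.
question = "How do I reset my password?"
expected = "Go to Settings > Security and click 'Reset password'."
actual = client.chat.completions.create(
    model="gpt-4o-mini",  # model being evaluated -- placeholder name
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

print(score_response(question, expected, actual))
```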
Key concepts
Before diving in, it helps to understand how the pieces fit together:
- Projects: Your workspace for a specific application or use case. A customer support bot would be one project; an internal knowledge assistant would be another. Each project has its own experiments, datasets, and configuration.
- Experiments: A single evaluation run. Each experiment tests a specific model configuration against a dataset and produces scores. Run experiments whenever you change prompts, switch models, or want to compare approaches.
- Datasets: Collections of test cases, each pairing a prompt with an expected output or with evaluation criteria. Good datasets reflect real usage patterns and cover edge cases your model might struggle with (see the sketch after this list for one way test cases might be structured).
- Scorers: The metrics you're measuring. Out of the box, you get answer relevancy, bias detection, toxicity detection, faithfulness, hallucination detection, and contextual relevancy. You can also create custom scorers for domain-specific needs.
- Judge LLM: The model that evaluates your model's outputs. This is typically a frontier model like GPT-4 or Claude that can reliably assess quality. You configure the judge separately from the model being evaluated.
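To illustrate the Datasets concept, here is a rough sketch of assembling a handful of test cases in Python and writing them to a JSONL file. The field names and the JSONL format are assumptions for illustration; check the Datasets page for the formats LLM Evals actually accepts.

```python
# Hypothetical sketch of building a small dataset of test cases.
# The "input"/"expected_output" keys and the JSONL format are assumptions.
import json

test_cases = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Go to Settings > Security and click 'Reset password'.",
    },
    {
        "input": "Can I get a refund after 30 days?",
        "expected_output": "Refunds are available within 30 days of purchase only.",
    },
    {
        # Edge case: off-topic question the bot should politely decline.
        "input": "What's your opinion on the upcoming election?",
        "expected_output": "I'm here to help with product questions only.",
    },
]

with open("support_bot_dataset.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```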
When to run evaluations
Evaluations are most valuable at these moments:
- Before launch: Establish baseline performance and catch issues before users see them
- After prompt changes: Verify that improvements in one area don't cause regressions elsewhere
- When switching models: Compare performance across providers or model versions objectively
- Periodically in production: Catch drift or degradation over time as the underlying models update
Finding your way around
Access LLM Evals from the main sidebar. Once inside, you'll see a project dropdown at the top and a navigation panel on the left:
- Overview: Your project dashboard. See recent experiments, quick stats, and jump into common actions.
- Experiments: The full history of evaluation runs. Track performance over time with the built-in chart, dig into individual results, or start new experiments.
- Datasets: Manage your test cases. Browse built-in datasets, upload your own, or edit existing ones to better match your use cases.
- Scorers: Configure what you're measuring. Enable or disable metrics, adjust thresholds, or create custom scorers.
- Configuration: Project-level settings like default models and evaluation preferences.
- Organizations: Manage API keys for different providers. These are stored securely and used when running evaluations.
Running your first evaluation
Ready to try it out? Here's the quickest path to your first results:
- Create a project: Click the project dropdown and select "Create new project." Give it a descriptive name like "Customer Support Bot" or "Document Q&A."
- Add your API keys: Go to Organizations and add API keys for the model you're testing and the judge model. You'll need at least one of each.
- Start a new experiment: Click "New experiment" and follow the wizard. Select your model, pick a built-in dataset to start, choose your judge, and select which metrics to evaluate.
- Review results: Once the experiment completes, click into it to see per-metric scores and drill down into individual test cases (a sketch of how per-metric scores typically roll up follows this list).
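As a rough illustration of what the results view summarizes, the sketch below aggregates made-up per-test-case scores into a mean and a pass rate for each metric. The numbers and the 0.7 threshold are invented, and the platform's own aggregation may differ.

```python
# Illustrative only: aggregating per-test-case scores into per-metric
# summaries (mean score and pass rate against a threshold).
per_case_scores = {
    "answer_relevancy": [0.9, 0.8, 0.6, 1.0],
    "faithfulness": [1.0, 1.0, 0.4, 0.9],
}
THRESHOLD = 0.7  # assumed pass/fail cutoff

for metric, scores in per_case_scores.items():
    mean = sum(scores) / len(scores)
    pass_rate = sum(s >= THRESHOLD for s in scores) / len(scores)
    print(f"{metric}: mean={mean:.2f}, pass rate={pass_rate:.0%}")
```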
What's next
Once you're comfortable with the basics, explore these areas:
- Learn how to configure experiments in detail, including model selection and metric customization
- Upload custom datasets that reflect your actual user queries and expected responses
- Create custom scorers for domain-specific evaluation criteria (a simple example sketch appears below)
- Set up regular evaluation runs to track model performance over time
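For instance, a custom scorer doesn't have to call a judge model at all. The sketch below is a hypothetical deterministic scorer that checks whether a support answer includes a required disclaimer; the function shape is an assumption for illustration, not the platform's custom-scorer API.

```python
# Hypothetical custom scorer: a deterministic check that a response contains
# a required disclaimer. The signature is an assumption, not the platform's
# actual custom-scorer interface.
REQUIRED_DISCLAIMER = "this does not constitute legal advice"


def disclaimer_scorer(test_case: dict, actual_output: str) -> float:
    """Return 1.0 if the response contains the required disclaimer, else 0.0."""
    return 1.0 if REQUIRED_DISCLAIMER in actual_output.lower() else 0.0


# Example usage against a single response:
case = {"input": "Can my landlord keep my deposit?"}
response = (
    "Generally, a landlord must return your deposit within 30 days. "
    "Note: this does not constitute legal advice."
)
print(disclaimer_scorer(case, response))  # 1.0
```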