Running experiments
Create and run evaluation experiments to test your models.
What is an experiment?
An experiment is a single evaluation run—you take a model, throw a dataset of prompts at it, and measure how well it performs. Each experiment captures a snapshot of your model's capabilities at a specific point in time, making it easy to track improvements (or regressions) as you iterate.
The experiment wizard walks you through four steps: configuring the model you want to test, selecting your dataset, choosing a judge LLM, and picking which metrics to measure. Most experiments take just a few minutes to set up and run.
Starting a new experiment
Click New experiment from either the Overview or Experiments tab. This opens a step-by-step wizard that guides you through the configuration. Let's walk through each step.
Step 1: Choose your model
First, you'll configure the model you want to evaluate. This is the model that will receive the prompts from your dataset and generate responses.
Select a provider from the grid. Each provider card shows its logo for easy recognition:
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, and other OpenAI models. You'll need your OpenAI API key.
- Anthropic: Claude 3 Opus, Sonnet, and Haiku. Great for testing against a different model family.
- Google Gemini: Gemini Pro and Ultra. Useful if you're considering Google's offerings.
- xAI: Grok models for those exploring newer providers.
- Mistral: Mistral Large and Medium. A strong alternative from a provider that also publishes open-weight models.
- HuggingFace: Open-source models like TinyLlama. No API key required for some models.
- Ollama: Locally-hosted models. Point to your Ollama instance to evaluate models running on your own hardware.
- Local: Any local endpoint with an OpenAI-compatible API. Enter your endpoint URL.
- Custom API: For custom deployments or proxies. Provide both the endpoint URL and any required authentication.
After selecting a provider, enter the specific model name (like gpt-4 or claude-3-opus). For cloud providers, you'll also need to provide your API key. For local or custom endpoints, enter the endpoint URL.
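If you're pointing at Ollama, a local endpoint, or a custom API, it's worth confirming the endpoint responds before you start the wizard. Below is a minimal sketch using Python's requests library, assuming an Ollama instance on its default port exposing the OpenAI-compatible /v1/chat/completions route; the URL and the llama3 model name are placeholders for your own setup.
```python
# Minimal sanity check for a local OpenAI-compatible endpoint before an experiment.
# The URL and model name below are placeholders -- adjust them to your setup.
import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # Ollama's default OpenAI-compatible route

payload = {
    "model": "llama3",  # any model you've pulled locally
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    "max_tokens": 10,
}

response = requests.post(ENDPOINT, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
If this prints a reply, the same URL should work in the Local or Custom API provider fields.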
Step 2: Select your dataset
The dataset determines what prompts your model will receive. This is where you define the test cases that will reveal how well your model handles different scenarios.
You have three options for sourcing your data:
- My datasets: Custom datasets you've previously uploaded. These are stored in your project and tailored to your specific use case.
- Built-in datasets: Curated test suites maintained by VerifyWise. Great for getting started or benchmarking against industry-standard prompts.
- Upload now: Upload a new JSON file on the spot. The file is saved to your project for future use.
Single-turn vs. conversational
For chatbot evaluations, you'll choose between two dataset formats:
- Single-turn: Each test case is an isolated prompt and response. Good for question-answering or one-shot tasks.
- Conversational (multi-turn): Test cases include a conversation history. The model sees previous messages before generating its response. This is recommended for chatbots since it better reflects how users actually interact with them.
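To make the difference concrete, here's a sketch of one test case of each type. The field names (prompt, expected_output, conversation) are illustrative assumptions for this example rather than VerifyWise's exact upload schema.
```python
# Illustrative shapes only: the field names here are assumptions for this
# example, not the exact upload schema.
import json

single_turn_case = {
    "prompt": "What is the capital of France?",
    "expected_output": "Paris",
}

conversational_case = {
    "conversation": [
        {"role": "user", "content": "I ordered a laptop from you last week."},
        {"role": "assistant", "content": "Thanks for the details. How can I help with that order?"},
        {"role": "user", "content": "It arrived with a cracked screen. What are my options?"},
    ],
    "expected_output": "Apologize, offer a replacement or refund, and explain the return steps.",
}

print(json.dumps([single_turn_case, conversational_case], indent=2))
```
In the conversational case the model must produce the next assistant turn given everything above it, which is why multi-turn data tends to surface issues that isolated prompts miss.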
Filtering and limiting prompts
Built-in datasets can be large. To run faster experiments or focus on specific areas, you can:
- Filter by category: Select categories like coding, mathematics, reasoning, creative, or knowledge to evaluate specific capabilities.
- Limit prompts: Set a maximum number of prompts to evaluate. Starting with 10-20 is a good way to test your configuration before running a full evaluation.
The dataset preview shows you exactly which prompts will be included. You can expand any prompt to see its full content, expected output, and metadata.
Step 3: Configure the judge LLM
The judge LLM is the model that evaluates your model's responses. It looks at each output, compares it against the expected result, and assigns scores for each metric.
Choosing a good judge matters. You generally want:
- A capable model: GPT-4, Claude 3 Opus, or similar frontier models make the best judges. They're better at nuanced evaluation.
- Consistency: Use the same judge across experiments if you're comparing results. Different judges may score differently.
- Cost awareness: Judging can be expensive since every test case requires multiple API calls. Consider using GPT-4o-mini for initial testing.
Configure your judge with these settings:
- Provider and model: Same options as the model being tested. Pick a strong evaluator.
- API key: Required for cloud providers. This can be the same key as your test model if using the same provider.
- Temperature: Controls randomness in evaluations. Lower values (0.3-0.5) give more consistent scoring. Default is 0.7.
- Max tokens: How much the judge can write in its evaluation. Default of 2048 is usually plenty.
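VerifyWise makes these judge calls for you, but a short sketch helps show why the settings matter. The example below uses the openai Python package and gpt-4o-mini purely as stand-ins for whichever judge provider and model you configure; the prompt wording is illustrative, not the product's internal judging prompt.
```python
# Conceptual LLM-as-judge call. VerifyWise runs these internally; this sketch
# only shows how temperature and max_tokens shape the judge's behaviour.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

judge_prompt = (
    "Question: What is the boiling point of water at sea level?\n"
    "Model answer: 100 degrees Celsius.\n"
    "Expected answer: 100 °C (212 °F).\n\n"
    "Score the model answer for relevancy from 0.0 to 1.0 and briefly justify the score."
)

result = client.chat.completions.create(
    model="gpt-4o-mini",   # a cheaper judge for initial testing
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.3,       # lower temperature -> more repeatable scores
    max_tokens=2048,       # room for a score plus a short rationale
)

print(result.choices[0].message.content)
```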
Step 4: Select your metrics
Finally, choose which aspects of the responses to evaluate. All metrics are enabled by default—uncheck any that don't apply to your use case.
- Answer relevancy: Does the response actually address the question? A model that gives accurate but off-topic answers scores low here.
- Bias detection: Flags responses containing gender, racial, political, or other forms of bias. Essential for user-facing applications.
- Toxicity detection: Catches harmful, offensive, or inappropriate language. Turn this on for any model that interacts with users.
- Faithfulness: When given context (as in retrieval-augmented generation, or RAG), does the model stick to that context or make things up? Critical for knowledge-based applications.
- Hallucination detection: Identifies fabricated facts or unsupported claims. Even without explicit context, this catches models that confidently state false information.
- Contextual relevancy: For RAG systems, is the retrieved context actually relevant to the question? Poor retrieval means poor answers.
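If it helps to see how two of these differ, the sketch below contrasts a faithfulness check (which needs the retrieved context) with an answer relevancy check (which doesn't). Both prompt templates are hypothetical illustrations, not the prompts VerifyWise actually sends to the judge.
```python
# Hypothetical judge-prompt builders, only to contrast the two metrics.
def faithfulness_prompt(question: str, context: str, answer: str) -> str:
    # Faithfulness grades the answer against the supplied context.
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "List any claims in the answer that are not supported by the context, "
        "then give a faithfulness score from 0.0 (fabricated) to 1.0 (fully grounded)."
    )

def relevancy_prompt(question: str, answer: str) -> str:
    # Answer relevancy ignores accuracy and asks only whether the question was addressed.
    return (
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Ignoring factual accuracy, score from 0.0 to 1.0 how directly the answer "
        "addresses the question."
    )

# The answer below adds a detail the context never mentions ("a local merchant
# family"), which a faithfulness judge should flag even though the answer is relevant.
print(faithfulness_prompt(
    question="When was the city museum founded?",
    context="The city museum opened to the public in 1962.",
    answer="The museum was founded in 1962 by a local merchant family.",
))
```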
Running the experiment
Click Start Eval to begin. The experiment will show up immediately in your experiments list with a "Running" status. Depending on your dataset size and the models involved, evaluation can take anywhere from seconds to several minutes.
You can navigate away while the experiment runs—it continues in the background. When it completes, the status changes to "Completed" and you can view the results.
Understanding your results
Click into any completed experiment to see the details. You'll find:
- Overall scores: Aggregate metrics across all prompts. A quick health check of model performance.
- Per-prompt breakdown: Drill into individual test cases to see exactly where the model succeeded or struggled.
- Configuration details: What model, dataset, and judge were used. Helpful when comparing experiments later.
- Timestamps: When the experiment was created and completed. Track how long evaluations take.
Tracking progress over time
The Experiments tab includes a performance chart that plots your scores across all experiments. This is where patterns emerge: you can see if that prompt tweak actually helped, whether switching models improved quality, or if performance has been steadily declining.
Name your experiments descriptively (include the date, model version, or what you changed) so the chart tells a clear story.