Running experiments
Create and run evaluation experiments to test your models.
What is an experiment?
An experiment is a single evaluation run—you take a model, throw a dataset of prompts at it, and measure how well it performs. Each experiment captures a snapshot of your model's capabilities at a specific point in time, making it easy to track improvements (or regressions) as you iterate.
The experiment wizard walks you through four steps: configuring the model you want to test, selecting your dataset, choosing a judge LLM, and picking which metrics to measure. Most experiments take just a few minutes to set up and run.
Starting a new experiment
Click New experiment from either the Overview or Experiments tab. This opens a step-by-step wizard that guides you through the configuration. Let's walk through each step.
Step 1: Choose your model
First, you'll configure the model you want to evaluate. This is the model that will receive the prompts from your dataset and generate responses.
Select a provider from the grid. Each provider card shows its logo for easy recognition:
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, and other OpenAI models. You'll need your OpenAI API key.
- Anthropic: Claude 3 Opus, Sonnet, and Haiku. Great for testing against a different model family.
- Google Gemini: Gemini Pro and Ultra. Useful if you're considering Google's offerings.
- xAI: Grok models for those exploring newer providers.
- Mistral: Mistral Large and Medium. A strong alternative from a provider that also publishes open-weight models.
- HuggingFace: Open-source models like TinyLlama. No API key required for some models.
- Ollama: Locally-hosted models. Point to your Ollama instance to evaluate models running on your own hardware.
- Local: Any local endpoint with an OpenAI-compatible API. Enter your endpoint URL.
- Custom API: For custom deployments or proxies. Provide both the endpoint URL and any required authentication.
After selecting a provider, enter the specific model name (like gpt-4 or claude-3-opus). For cloud providers, you'll also need to provide your API key. For local or custom endpoints, enter the endpoint URL.
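If you're pointing at Ollama, a local endpoint, or a custom API, it's worth confirming the endpoint responds before you start the wizard. Below is a minimal sketch using Python's requests library, assuming an Ollama instance on its default port exposing the OpenAI-compatible /v1/chat/completions route; the URL and the llama3 model name are placeholders for your own setup.
```python
# Minimal sanity check for a local OpenAI-compatible endpoint before an experiment.
# The URL and model name below are placeholders -- adjust them to your setup.
import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # Ollama's default OpenAI-compatible route

payload = {
    "model": "llama3",  # any model you've pulled locally
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    "max_tokens": 10,
}

response = requests.post(ENDPOINT, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
If this prints a reply, the same URL should work in the Local or Custom API provider fields.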
Step 2: Select your dataset
The dataset determines what prompts your model will receive. This is where you define the test cases that will reveal how well your model handles different scenarios.
You have three options for sourcing your data:
- My datasets: Custom datasets you've previously uploaded. These are stored in your project and tailored to your specific use case.
- Built-in datasets: Curated test suites maintained by VerifyWise. Great for getting started or benchmarking against industry-standard prompts.
- Upload now: Upload a new JSON file on the spot. The file is saved to your project for future use.
Single-turn vs. conversational
For chatbot evaluations, you'll choose between two dataset formats:
- Single-turn: Each test case is an isolated prompt and response. Good for question-answering or one-shot tasks.
- Conversational (multi-turn): Test cases include a conversation history. The model sees previous messages before generating its response. This is recommended for chatbots since it better reflects how users actually interact with them.
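To make the difference concrete, here's a sketch of one test case of each type. The field names (prompt, expected_output, conversation) are illustrative assumptions for this example rather than VerifyWise's exact upload schema.
```python
# Illustrative shapes only: the field names here are assumptions for this
# example, not the exact upload schema.
import json

single_turn_case = {
    "prompt": "What is the capital of France?",
    "expected_output": "Paris",
}

conversational_case = {
    "conversation": [
        {"role": "user", "content": "I ordered a laptop from you last week."},
        {"role": "assistant", "content": "Thanks for the details. How can I help with that order?"},
        {"role": "user", "content": "It arrived with a cracked screen. What are my options?"},
    ],
    "expected_output": "Apologize, offer a replacement or refund, and explain the return steps.",
}

print(json.dumps([single_turn_case, conversational_case], indent=2))
```
In the conversational case the model must produce the next assistant turn given everything above it, which is why multi-turn data tends to surface issues that isolated prompts miss.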
Filtering and limiting prompts
Built-in datasets can be large. To run faster experiments or focus on specific areas, you can:
- Filter by category: Select categories like coding, mathematics, reasoning, creative, or knowledge to evaluate specific capabilities.
- Limit prompts: Set a maximum number of prompts to evaluate. Starting with 10-20 is a good way to test your configuration before running a full evaluation.
The dataset preview shows you exactly which prompts will be included. You can expand any prompt to see its full content, expected output, and metadata.
Step 3: Configure the judge LLM
The judge LLM is the model that evaluates your model's responses. It looks at each output, compares it against the expected result, and assigns scores for each metric.
Choosing a good judge matters. You generally want:
- A capable model: GPT-4, Claude 3 Opus, or similar frontier models make the best judges. They're better at nuanced evaluation.
- Consistency: Use the same judge across experiments if you're comparing results. Different judges may score differently.
- Cost awareness: Judging can be expensive since every test case requires multiple API calls. Consider using GPT-4o-mini for initial testing.
Configure your judge with these settings:
- Provider and model: Same options as the model being tested. Pick a strong evaluator.
- API key: Required for cloud providers. This can be the same key as your test model if using the same provider.
- Temperature: Controls randomness in evaluations. Lower values (0.3-0.5) give more consistent scoring. Default is 0.7.
- Max tokens: How much the judge can write in its evaluation. Default of 2048 is usually plenty.
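VerifyWise makes these judge calls for you, but a short sketch helps show why the settings matter. The example below uses the openai Python package and gpt-4o-mini purely as stand-ins for whichever judge provider and model you configure; the prompt wording is illustrative, not the product's internal judging prompt.
```python
# Conceptual LLM-as-judge call. VerifyWise runs these internally; this sketch
# only shows how temperature and max_tokens shape the judge's behaviour.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

judge_prompt = (
    "Question: What is the boiling point of water at sea level?\n"
    "Model answer: 100 degrees Celsius.\n"
    "Expected answer: 100 °C (212 °F).\n\n"
    "Score the model answer for relevancy from 0.0 to 1.0 and briefly justify the score."
)

result = client.chat.completions.create(
    model="gpt-4o-mini",   # a cheaper judge for initial testing
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0.3,       # lower temperature -> more repeatable scores
    max_tokens=2048,       # room for a score plus a short rationale
)

print(result.choices[0].message.content)
```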
Step 4: Select your metrics
Finally, choose which aspects of the responses to evaluate. All metrics are enabled by default—uncheck any that don't apply to your use case.
- Answer relevancy: Does the response actually address the question? A model that gives accurate but off-topic answers scores low here.
- Bias detection: Flags responses containing gender, racial, political, or other forms of bias. Essential for user-facing applications.
- Toxicity detection: Catches harmful, offensive, or inappropriate language. Turn this on for any model that interacts with users.
- Faithfulness: When given context (as in retrieval-augmented generation, or RAG), does the model stick to that context or make things up? Critical for knowledge-based applications.
- Hallucination detection: Identifies fabricated facts or unsupported claims. Even without explicit context, this catches models that confidently state false information.
- Contextual relevancy: For RAG systems, is the retrieved context actually relevant to the question? Poor retrieval means poor answers.
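If it helps to see how two of these differ, the sketch below contrasts a faithfulness check (which needs the retrieved context) with an answer relevancy check (which doesn't). Both prompt templates are hypothetical illustrations, not the prompts VerifyWise actually sends to the judge.
```python
# Hypothetical judge-prompt builders, only to contrast the two metrics.
def faithfulness_prompt(question: str, context: str, answer: str) -> str:
    # Faithfulness grades the answer against the supplied context.
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        "List any claims in the answer that are not supported by the context, "
        "then give a faithfulness score from 0.0 (fabricated) to 1.0 (fully grounded)."
    )

def relevancy_prompt(question: str, answer: str) -> str:
    # Answer relevancy ignores accuracy and asks only whether the question was addressed.
    return (
        f"Question: {question}\nAnswer: {answer}\n\n"
        "Ignoring factual accuracy, score from 0.0 to 1.0 how directly the answer "
        "addresses the question."
    )

# The answer below adds a detail the context never mentions ("a local merchant
# family"), which a faithfulness judge should flag even though the answer is relevant.
print(faithfulness_prompt(
    question="When was the city museum founded?",
    context="The city museum opened to the public in 1962.",
    answer="The museum was founded in 1962 by a local merchant family.",
))
```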
Running the experiment
Click Start Eval to begin. The experiment will show up immediately in your experiments list with a "Running" status. Depending on your dataset size and the models involved, evaluation can take anywhere from seconds to several minutes.
You can navigate away while the experiment runs—it continues in the background. When it completes, the status changes to "Completed" and you can view the results.
Understanding your results
Click into any completed experiment to see the details. You'll find:
- Overall scores: Aggregate metrics across all prompts. A quick health check of model performance.
- Per-prompt breakdown: Drill into individual test cases to see exactly where the model succeeded or struggled.
- Configuration details: What model, dataset, and judge were used. Helpful when comparing experiments later.
- Timestamps: When the experiment was created and completed. Track how long evaluations take.
Tracking progress over time
The Experiments tab includes a performance chart that plots your scores across all experiments. This is where patterns emerge: you can see if that prompt tweak actually helped, whether switching models improved quality, or if performance has been steadily declining.
Name your experiments descriptively (include the date, model version, or what you changed) so the chart tells a clear story.