LLM Evals

Managing datasets

Upload, browse, and manage evaluation datasets.

Why datasets matter

Your evaluation is only as good as your dataset. A well-crafted dataset exposes how your model handles the scenarios that actually matter to your users—the common questions, the edge cases, the tricky phrasings, and the potential failure modes.

LLM Evals gives you two paths: start with built-in datasets to get quick results, or invest time in custom datasets that precisely match your use case. Most teams do both—using built-in datasets for general benchmarking while building custom ones for application-specific testing.

Built-in datasets

VerifyWise maintains a library of curated datasets organized by use case. These are designed to test common capabilities and provide a baseline for comparison.

  • Chatbot: General conversational prompts testing how models handle dialogue. Available in single-turn (isolated Q&A) and multi-turn (conversational) formats. The multi-turn datasets include conversation history so you can evaluate how models maintain context.
  • RAG: Prompts paired with retrieval context for retrieval-augmented generation systems. These test whether models can ground their answers in provided documents rather than making things up. Each prompt includes context documents that the model should reference.
  • Agent: Prompts that require tool use, multi-step reasoning, or task completion. Useful for evaluating AI assistants that need to take actions, call functions, or follow complex multi-step instructions.
  • Safety: Adversarial prompts designed to stress-test model guardrails. These include attempts to elicit harmful content, jailbreak attempts, and edge cases that test whether your model appropriately refuses dangerous or inappropriate requests.

Browse the built-in datasets before creating your own. They often include prompts you hadn't thought to test, and they give you a sense of what a well-structured dataset looks like.

Creating custom datasets

While built-in datasets are great for getting started, custom datasets let you evaluate what really matters for your specific application. A healthcare chatbot needs different test cases than a code assistant.

Good custom datasets typically include:

  • Real user queries: Sample from your actual traffic or support tickets. These represent how people really phrase questions, not how you imagine they might.
  • Known failure cases: Prompts where you've observed problems. Turn bugs into regression tests (see the example after this list).
  • Edge cases: Unusual inputs, ambiguous questions, or requests at the boundaries of your model's capabilities.
  • Adversarial examples: Attempts to confuse, mislead, or manipulate your model. What happens when someone asks something inappropriate?
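
For example, a known failure pulled from a support escalation might become a regression test case like this (the values below are illustrative; the upload format is described in the next section):

json
[
  {
    "id": "regression-refund-017",
    "category": "billing",
    "prompt": "I cancelled last month but you still billed me. This is the second time I've asked.",
    "expected_output": "I'm sorry this happened again. I can confirm the cancellation is now in place and will refund the charge made after your cancellation date. The refund should arrive within 3-5 business days.",
    "expected_keywords": ["cancellation", "refund"],
    "difficulty": "hard"
  }
]

Once a bug like this is captured as a test case, every future experiment checks whether the fix still holds.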

Uploading a dataset

Navigate to the Datasets tab and click Upload dataset. Select a JSON file from your computer. Once uploaded, the dataset becomes available for all experiments in your project.

Your JSON file should contain an array of test cases. Here's the expected structure:

json
[
  {
    "id": "support-001",
    "category": "billing",
    "prompt": "I was charged twice for my subscription. Can you help?",
    "expected_output": "I'm sorry to hear about the double charge. I can see your account and will process a refund for the duplicate payment. It should appear in 3-5 business days.",
    "expected_keywords": ["refund", "duplicate", "business days"],
    "difficulty": "medium"
  },
  {
    "id": "support-002",
    "category": "technical",
    "prompt": "The app keeps crashing when I try to upload a photo.",
    "expected_output": "Let's troubleshoot the crash. First, try clearing the app cache in Settings > Apps > [App Name] > Clear Cache. If that doesn't work, try reinstalling the app.",
    "expected_keywords": ["cache", "settings", "reinstall"],
    "difficulty": "easy"
  }
]

Dataset field reference

Each test case can include the following fields:

  • id: (required) A unique identifier for the test case. Use something descriptive like "billing-refund-001" rather than just numbers.
  • category: (required) A grouping label. Use this to organize test cases by topic, feature, or difficulty. You can filter by category when running experiments.
  • prompt: (required) The input that will be sent to your model. For conversational datasets, this is the user's current message.
  • expected_output: (required) What a good response looks like. The judge uses this to evaluate answer quality. Be specific enough to enable meaningful comparison.
  • expected_keywords: (optional) Key terms that should appear in a good response. Useful for checking that models mention specific products, procedures, or concepts.
  • difficulty: (optional) How hard this test case is: easy, medium, or hard. Helps you understand whether failures are on routine cases or challenging ones.
  • retrieval_context: (optional) For RAG evaluations, the context documents that should inform the answer. Include this when testing whether models properly use provided information (see the example below).

The expected_output doesn't need to be an exact match. The judge LLM evaluates semantic similarity—whether the model's response conveys the same meaning and covers the same points.
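
For RAG evaluations, a test case might look like the sketch below. Here retrieval_context is shown as an array of context strings, and the documents themselves are illustrative:

json
{
  "id": "rag-warranty-001",
  "category": "warranty",
  "prompt": "Does the warranty cover water damage?",
  "retrieval_context": [
    "Section 4.2: The standard warranty covers manufacturing defects for 24 months from the date of purchase.",
    "Section 4.3: Damage caused by liquids, drops, or unauthorized repairs is not covered under the standard warranty."
  ],
  "expected_output": "No, water damage isn't covered. The standard warranty covers manufacturing defects for 24 months, but liquid damage is explicitly excluded.",
  "expected_keywords": ["not covered", "liquid", "manufacturing defects"],
  "difficulty": "medium"
}

A case like this tests grounding: the judge can check whether the answer relies on the provided sections rather than on general knowledge.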

Conversational datasets

For multi-turn conversations, include the conversation history in your prompt. The model needs context about what was said before.

A typical approach is to format the prompt as a transcript:

json
{
  "id": "conv-followup-001",
  "category": "multi-turn",
  "prompt": "User: What's the return policy?\nAssistant: You can return any item within 30 days of purchase with a receipt.\nUser: What if I don't have the receipt?",
  "expected_output": "Without a receipt, we can offer store credit for the current selling price. Just bring the item and a valid ID.",
  "difficulty": "medium"
}

Browsing and previewing

The Datasets tab shows all available datasets—both built-in and custom. Each card displays the dataset name, type (built-in or custom), and the number of prompts it contains.

Click any dataset to open a preview drawer showing all its prompts. You can expand individual prompts to see the full expected output and metadata. This is useful for understanding what a dataset covers before using it in an experiment.

Editing and customizing

Want to tweak an existing dataset? Click the action menu on any dataset and select Open in editor. This loads all prompts into an editable view where you can:

  • Modify prompts and expected outputs
  • Change categories and difficulty levels
  • Add or remove expected keywords
  • Delete test cases that aren't relevant

When you're done, click Save copy to create a new custom dataset with your changes. The original dataset remains unchanged—this is non-destructive editing.

Built-in datasets can't be modified directly. Use "Save copy" to create an editable version that you can customize for your needs.

Dataset best practices

We've helped many teams build evaluation datasets, and these practices consistently work well:

  • Start with real data: Sample actual user queries from logs, support tickets, or user research. Synthetic prompts often miss the quirks of how real people phrase things.
  • Include failures: When you find a bug or user complaint, turn it into a test case. Your dataset should grow from production experience.
  • Cover the distribution: Don't just test happy paths. Include common variations, typos, ambiguous requests, and edge cases. If 10% of your users ask questions in a particular way, 10% of your test cases should too.
  • Write specific expected outputs: Vague expectations lead to vague evaluations. If there's a specific procedure or piece of information the model should include, spell it out.
  • Use meaningful categories: Categories help you identify patterns. If your model struggles with "refund" questions but handles "shipping" well, you want to know that.
  • Update regularly: Your product changes, your users change, and your model's failure modes evolve. Review and refresh your datasets periodically.

Managing your datasets

Over time, you'll accumulate multiple datasets. Keep them organized:

  • Use clear names: Include the purpose, version, or date in the name. "Customer Support v2 - Dec 2024" is better than "test_data_final".
  • Delete obsolete versions: Click the action menu and select "Remove" to delete datasets you no longer need. This keeps your list manageable.
  • Document your datasets: Keep notes (even externally) about what each dataset covers, when it was last updated, and any known limitations.