Configuring scorers
Set up evaluation metrics and scoring thresholds.
What are scorers?
Scorers define what you're measuring. When you run an experiment, each scorer examines the model's responses and produces a score—typically between 0 and 1—indicating how well the response meets that particular criterion.
Think of scorers as the rubric for your evaluation. A customer support bot might need scorers for helpfulness, accuracy, and tone. A coding assistant might prioritize correctness and completeness. The scorers you choose should reflect what "good" means for your specific application.
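For a concrete picture, here is what one evaluated response might look like once every enabled scorer has run. The field names and values below are illustrative only, not this product's actual response schema:

```python
# Illustrative only: one evaluated response with a score per configured scorer.
evaluated_response = {
    "prompt": "How do I reset my password?",
    "response": "Open Settings > Security and choose 'Reset password'.",
    "scores": {
        "answer_relevancy": 0.92,  # directly addresses the question
        "toxicity": 1.0,           # no harmful language detected
        "faithfulness": 0.88,      # claims match the retrieved help article
    },
}
```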
Default scorers
LLM Evals comes with six built-in scorers that cover the most common evaluation needs. These are enabled by default and work well for most applications:
Answer relevancy
The most fundamental metric: does the response actually address the question? A model might generate fluent, grammatically correct text that completely misses the point. This scorer catches that.
A score of 1.0 means the response directly and completely addresses the prompt. A score near 0 means the response is off-topic or irrelevant. Most well-tuned models score above 0.7 on typical queries.
Bias detection
Identifies responses containing discriminatory or unfair content. This includes gender bias, racial bias, political bias, age discrimination, and other forms of prejudicial language.
Lower scores indicate more bias detected. For user-facing applications, you generally want this score to be very high (above 0.9). Any significant bias should trigger a review of your prompts and model selection.
Toxicity detection
Flags harmful, offensive, abusive, or inappropriate language. This goes beyond bias to include threats, insults, profanity, and content that could harm users or your brand.
Critical for any model that interacts with the public. Even if your model is generally well-behaved, adversarial prompts might elicit toxic responses. Regular evaluation helps catch these edge cases.
Faithfulness
Measures whether the response accurately reflects provided context. This is essential for RAG (Retrieval-Augmented Generation) systems where the model should ground its answers in retrieved documents.
A faithful response only makes claims supported by the context. An unfaithful response adds information that wasn't there or contradicts what was provided. If you're building knowledge-based applications, this scorer is crucial.
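For intuition, here is a hypothetical RAG evaluation record that shows the difference. The structure is illustrative, not this product's schema:

```python
# Illustrative only: what a faithfulness check conceptually compares.
rag_case = {
    "input": "What is the refund window?",
    "retrieved_context": "Orders can be refunded within 30 days of purchase.",
    # Faithful: every claim is supported by the context above.
    "faithful_response": "You can request a refund within 30 days of purchase.",
    # Unfaithful: adds a claim ("free return shipping") the context never made.
    "unfaithful_response": "You can get a refund within 30 days, and return shipping is free.",
}
```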
Hallucination detection
Identifies fabricated facts, unsupported claims, and made-up information. Unlike faithfulness (which requires explicit context), this catches models that confidently state things that simply aren't true.
Hallucination is one of the most common failure modes in LLMs. Models can sound authoritative while being completely wrong. This scorer helps quantify how often your model makes things up.
Contextual relevancy
For RAG systems, this evaluates the retrieval step: is the context that was retrieved actually relevant to the question? You can have a perfectly faithful response that's still wrong because the retrieved context was irrelevant.
Low scores here indicate a retrieval problem, not a generation problem. Your model might be doing exactly what it should with the context it's given—but the wrong context leads to wrong answers.
Managing scorers
Navigate to the Scorers tab in your project. You'll see a table listing all configured scorers with their name, type, judge model, metric key, and current status (enabled or disabled).
Click any row to open the edit panel, or use the gear icon for quick actions like editing or deleting.
Creating a custom scorer
Need to evaluate something specific to your domain? Create a custom scorer by clicking New scorer. You'll configure:
- Name: A descriptive label that appears in results (e.g., "Medical Accuracy" or "Code Correctness")
- Metric key: A machine-readable identifier used in API responses and exports (e.g., medical_accuracy, code_correctness)
- Description: Optional explanation of what this scorer evaluates. Helps teammates understand its purpose.
- Type: LLM (uses a judge model), Builtin (predefined logic), or Custom (your own evaluation code)
- Judge model: For LLM-type scorers, which model evaluates the responses. Same options as the experiment configuration.
- Default threshold: The minimum acceptable score (e.g., 0.7 means 70%). Responses below this are flagged as failures.
- Weight: Relative importance when calculating aggregate scores. Higher weights mean this scorer contributes more to the overall score.
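Put together, a custom scorer amounts to a small bundle of settings. The sketch below mirrors the fields above; the exact storage format and the judge model value are illustrative, not an export you can import:

```python
# Illustrative only: the fields from the form above, expressed as data.
medical_accuracy_scorer = {
    "name": "Medical Accuracy",        # label shown in results
    "metric_key": "medical_accuracy",  # identifier used in API responses and exports
    "description": "Checks clinical claims against accepted guidelines.",
    "type": "llm",                     # LLM, Builtin, or Custom
    "judge_model": "gpt-4",            # example judge; same options as the experiment config
    "default_threshold": 0.8,          # responses scoring below 0.8 are flagged as failures
    "weight": 1.5,                     # contributes more than a weight-1.0 scorer to aggregates
    "enabled": True,
}
```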
Scorer types explained
- LLM scorers: Use a judge model (like GPT-4) to evaluate responses. The most flexible option—you can evaluate nuanced qualities like helpfulness, tone, or domain accuracy. However, they're slower and cost more since each evaluation requires API calls.
- Builtin scorers: Use predefined algorithms that don't require an LLM. Faster and cheaper, but less flexible. Good for objective metrics like response length, format compliance, or keyword presence.
- Custom scorers: Your own evaluation logic. Useful for domain-specific checks that neither LLMs nor builtins handle well. Requires writing code but gives you complete control.
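As an example of the kind of logic a custom scorer might contain (a sketch only; the actual interface your evaluation code must implement depends on the product and isn't shown here):

```python
import re

def cites_sources(response: str) -> float:
    """Toy domain check: reward responses that cite at least one source.

    Returns a score in [0, 1], matching the convention of the built-in
    scorers. The scoring rule itself is purely illustrative.
    """
    urls = re.findall(r"https?://\S+", response)
    bracketed = re.findall(r"\[\d+\]", response)  # e.g. "[1]", "[2]"
    citations = len(urls) + len(bracketed)
    if citations == 0:
        return 0.0
    return min(1.0, citations / 2)  # full credit at two or more citations

print(cites_sources("See the docs at https://example.com for details."))  # 0.5
```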
Setting thresholds
Thresholds determine what counts as "passing." A threshold of 0.7 means any response scoring below 70% is considered a failure for that metric.
How to choose thresholds:
- Start permissive: Begin with 0.5 or 0.6 to understand your baseline. You can tighten later.
- Review failures manually: Look at responses that score just below your threshold. Are they actually bad? Adjust accordingly.
- Match your quality bar: If you wouldn't ship a response to users, it should fail. Calibrate thresholds to match your standards.
- Different thresholds for different metrics: Toxicity might require 0.95+ (almost no tolerance), while relevancy might be 0.7 (some flexibility).
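In effect, a threshold turns each continuous score into a pass/fail verdict, roughly like this (an illustrative sketch, not the product's internals):

```python
# Illustrative: how per-metric thresholds gate individual scores.
thresholds = {"toxicity": 0.95, "answer_relevancy": 0.7, "faithfulness": 0.8}
scores = {"toxicity": 0.99, "answer_relevancy": 0.64, "faithfulness": 0.91}

for metric, score in scores.items():
    verdict = "pass" if score >= thresholds[metric] else "FAIL"
    print(f"{metric}: {score:.2f} (threshold {thresholds[metric]}) -> {verdict}")
# answer_relevancy fails here: 0.64 < 0.70
```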
Using weights
Weights let you express which metrics matter most. When calculating an overall score, metrics with higher weights contribute more.
For example, if toxicity detection is critical (weight: 2.0) but contextual relevancy is nice-to-have (weight: 0.5), a toxic response will hurt your overall score much more than slightly irrelevant context.
Weights are relative: what matters is the ratio between them, not the absolute values. A weight of 2.0 vs. 1.0 has the same effect as 4.0 vs. 2.0.
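Concretely, a weighted average captures this behavior. The snippet below is a sketch of the idea; the platform's exact aggregation formula may differ:

```python
# Illustrative weighted aggregate: higher-weight metrics move the overall score more.
weights = {"toxicity": 2.0, "answer_relevancy": 1.0, "contextual_relevancy": 0.5}
scores  = {"toxicity": 0.6, "answer_relevancy": 0.9, "contextual_relevancy": 0.9}

overall = sum(scores[m] * weights[m] for m in scores) / sum(weights.values())
print(round(overall, 3))  # 0.729 -- the low toxicity score dominates

# Scaling every weight by the same factor changes nothing: only ratios matter.
doubled = {m: w * 2 for m, w in weights.items()}
same = sum(scores[m] * doubled[m] for m in scores) / sum(doubled.values())
print(round(same, 3))     # 0.729
```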
Enabling and disabling scorers
Not every scorer needs to run in every experiment. Disabling scorers that don't apply speeds up evaluations and reduces noise in your results.
To toggle a scorer, click into it and change the enabled status. Disabled scorers remain configured but won't be used in new experiments.
Scorer best practices
A few lessons from teams who've built robust evaluation pipelines:
- Match scorers to your goals: Don't just enable everything. Think about what makes a response "good" for your users, and configure scorers accordingly.
- Keep judge models consistent: If you're comparing experiments over time, use the same judge. Different judges may score the same response differently.
- Calibrate regularly: Periodically review scored responses manually. Are the scores matching your intuition? If not, adjust thresholds or scorer configurations.
- Document your choices: Keep notes on why you chose certain thresholds and weights. Future you (or teammates) will thank you when debugging unexpected results.
- Start simple: Begin with the default scorers and thresholds. Add complexity only when you understand how the system behaves.
Common issues
- All scores are too low: Your expected outputs might be too specific. Judges evaluate semantic similarity, not exact matches—but if your expectations are very different from what the model produces, scores suffer.
- All scores are too high: Your dataset might be too easy, or your thresholds too permissive. Add challenging test cases and review failing examples to calibrate.
- Inconsistent scores: Judge model temperature might be too high. Lower it (try 0.3-0.5) for more consistent evaluations.
- Evaluations are slow: Each metric requires a judge call. Disable metrics you don't need, or use faster judge models for iterative testing.