Configuring scorers
Set up evaluation metrics and scoring thresholds.
What are scorers?
Scorers define what you're measuring. When you run an experiment, each scorer examines the model's responses and produces a score—typically between 0 and 1—indicating how well the response meets that particular criterion.
Think of scorers as the rubric for your evaluation. A customer support bot might need scorers for helpfulness, accuracy, and tone. A coding assistant might prioritize correctness and completeness. The scorers you choose should reflect what "good" means for your specific application.
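For a concrete picture, here is what one evaluated response might look like once every enabled scorer has run. The field names and values below are illustrative only, not this product's actual response schema:

```python
# Illustrative only: one evaluated response with a score per configured scorer.
evaluated_response = {
    "prompt": "How do I reset my password?",
    "response": "Open Settings > Security and choose 'Reset password'.",
    "scores": {
        "answer_relevancy": 0.92,  # directly addresses the question
        "toxicity": 1.0,           # no harmful language detected
        "faithfulness": 0.88,      # claims match the retrieved help article
    },
}
```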
Default scorers
LLM Evals comes with six built-in scorers that cover the most common evaluation needs. These are enabled by default and work well for most applications:
Answer relevancy
The most fundamental metric: does the response actually address the question? A model might generate fluent, grammatically correct text that completely misses the point. This scorer catches that.
A score of 1.0 means the response directly and completely addresses the prompt. A score near 0 means the response is off-topic or irrelevant. Most well-tuned models score above 0.7 on typical queries.
Bias detection
Identifies responses containing discriminatory or unfair content. This includes gender bias, racial bias, political bias, age discrimination, and other forms of prejudicial language.
Lower scores indicate more bias detected. For user-facing applications, you generally want this score to be very high (above 0.9). Any significant bias should trigger a review of your prompts and model selection.
Toxicity detection
Flags harmful, offensive, abusive, or inappropriate language. This goes beyond bias to include threats, insults, profanity, and content that could harm users or your brand.
Critical for any model that interacts with the public. Even if your model is generally well-behaved, adversarial prompts might elicit toxic responses. Regular evaluation helps catch these edge cases.
Faithfulness
Measures whether the response accurately reflects provided context. This is essential for RAG (Retrieval-Augmented Generation) systems where the model should ground its answers in retrieved documents.
A faithful response only makes claims supported by the context. An unfaithful response adds information that wasn't there or contradicts what was provided. If you're building knowledge-based applications, this scorer is crucial.
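For intuition, here is a hypothetical RAG evaluation record that shows the difference. The structure is illustrative, not this product's schema:

```python
# Illustrative only: what a faithfulness check conceptually compares.
rag_case = {
    "input": "What is the refund window?",
    "retrieved_context": "Orders can be refunded within 30 days of purchase.",
    # Faithful: every claim is supported by the context above.
    "faithful_response": "You can request a refund within 30 days of purchase.",
    # Unfaithful: adds a claim ("free return shipping") the context never made.
    "unfaithful_response": "You can get a refund within 30 days, and return shipping is free.",
}
```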
Hallucination detection
Identifies fabricated facts, unsupported claims, and made-up information. Unlike faithfulness (which requires explicit context), this catches models that confidently state things that simply aren't true.
Hallucination is one of the most common failure modes in LLMs. Models can sound authoritative while being completely wrong. This scorer helps quantify how often your model makes things up.
Contextual relevancy
For RAG systems, this evaluates the retrieval step: is the context that was retrieved actually relevant to the question? You can have a perfectly faithful response that's still wrong because the retrieved context was irrelevant.
Low scores here indicate a retrieval problem, not a generation problem. Your model might be doing exactly what it should with the context it's given—but the wrong context leads to wrong answers.
Managing scorers
Navigate to the Scorers tab in your project. You'll see a table listing all configured scorers with their name, type, judge model, metric key, and current status (enabled or disabled).
Click any row to open the edit panel, or use the gear icon for quick actions like editing or deleting.
Creating a custom scorer
Need to evaluate something specific to your domain? Create a custom scorer by clicking New scorer. You'll configure:
- Name: A descriptive label that appears in results (e.g., "Medical Accuracy" or "Code Correctness")
- Metric key: A machine-readable identifier used in API responses and exports (e.g., medical_accuracy, code_correctness)
- Description: Optional explanation of what this scorer evaluates. Helps teammates understand its purpose.
- Type: LLM (uses a judge model), Builtin (predefined logic), or Custom (your own evaluation code)
- Judge model: For LLM-type scorers, which model evaluates the responses. Same options as the experiment configuration.
- Default threshold: The minimum acceptable score (e.g., 0.7 means 70%). Responses below this are flagged as failures.
- Weight: Relative importance when calculating aggregate scores. Higher weights mean this scorer contributes more to the overall score.
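Put together, a custom scorer amounts to a small bundle of settings. The sketch below mirrors the fields above; the exact storage format and the judge model value are illustrative, not an export you can import:

```python
# Illustrative only: the fields from the form above, expressed as data.
medical_accuracy_scorer = {
    "name": "Medical Accuracy",        # label shown in results
    "metric_key": "medical_accuracy",  # identifier used in API responses and exports
    "description": "Checks clinical claims against accepted guidelines.",
    "type": "llm",                     # LLM, Builtin, or Custom
    "judge_model": "gpt-4",            # example judge; same options as the experiment config
    "default_threshold": 0.8,          # responses scoring below 0.8 are flagged as failures
    "weight": 1.5,                     # contributes more than a weight-1.0 scorer to aggregates
    "enabled": True,
}
```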
Scorer types explained
- LLM scorers: Use a judge model (like GPT-4) to evaluate responses. The most flexible option—you can evaluate nuanced qualities like helpfulness, tone, or domain accuracy. However, they're slower and cost more since each evaluation requires API calls.
- Builtin scorers: Use predefined algorithms that don't require an LLM. Faster and cheaper, but less flexible. Good for objective metrics like response length, format compliance, or keyword presence.
- Custom scorers: Your own evaluation logic. Useful for domain-specific checks that neither LLMs nor builtins handle well. Requires writing code but gives you complete control.
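As an example of the kind of logic a custom scorer might contain (a sketch only; the actual interface your evaluation code must implement depends on the product and isn't shown here):

```python
import re

def cites_sources(response: str) -> float:
    """Toy domain check: reward responses that cite at least one source.

    Returns a score in [0, 1], matching the convention of the built-in
    scorers. The scoring rule itself is purely illustrative.
    """
    urls = re.findall(r"https?://\S+", response)
    bracketed = re.findall(r"\[\d+\]", response)  # e.g. "[1]", "[2]"
    citations = len(urls) + len(bracketed)
    if citations == 0:
        return 0.0
    return min(1.0, citations / 2)  # full credit at two or more citations

print(cites_sources("See the docs at https://example.com for details."))  # 0.5
```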
Setting thresholds
Thresholds determine what counts as "passing." A threshold of 0.7 means any response scoring below 70% is considered a failure for that metric.
How to choose thresholds:
- Start permissive: Begin with 0.5 or 0.6 to understand your baseline. You can tighten later.
- Review failures manually: Look at responses that score just below your threshold. Are they actually bad? Adjust accordingly.
- Match your quality bar: If you wouldn't ship a response to users, it should fail. Calibrate thresholds to match your standards.
- Different thresholds for different metrics: Toxicity might require 0.95+ (almost no tolerance), while relevancy might be 0.7 (some flexibility).
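In effect, a threshold turns each continuous score into a pass/fail verdict, roughly like this (an illustrative sketch, not the product's internals):

```python
# Illustrative: how per-metric thresholds gate individual scores.
thresholds = {"toxicity": 0.95, "answer_relevancy": 0.7, "faithfulness": 0.8}
scores = {"toxicity": 0.99, "answer_relevancy": 0.64, "faithfulness": 0.91}

for metric, score in scores.items():
    verdict = "pass" if score >= thresholds[metric] else "FAIL"
    print(f"{metric}: {score:.2f} (threshold {thresholds[metric]}) -> {verdict}")
# answer_relevancy fails here: 0.64 < 0.70
```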
Using weights
Weights let you express which metrics matter most. When calculating an overall score, metrics with higher weights contribute more.
For example, if toxicity detection is critical (weight: 2.0) but contextual relevancy is nice-to-have (weight: 0.5), a toxic response will hurt your overall score much more than slightly irrelevant context.
Weights are relative: what matters is the ratio between them, not the absolute values. A weight of 2.0 vs. 1.0 has the same effect as 4.0 vs. 2.0.
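Concretely, a weighted average captures this behavior. The snippet below is a sketch of the idea; the platform's exact aggregation formula may differ:

```python
# Illustrative weighted aggregate: higher-weight metrics move the overall score more.
weights = {"toxicity": 2.0, "answer_relevancy": 1.0, "contextual_relevancy": 0.5}
scores  = {"toxicity": 0.6, "answer_relevancy": 0.9, "contextual_relevancy": 0.9}

overall = sum(scores[m] * weights[m] for m in scores) / sum(weights.values())
print(round(overall, 3))  # 0.729 -- the low toxicity score dominates

# Scaling every weight by the same factor changes nothing: only ratios matter.
doubled = {m: w * 2 for m, w in weights.items()}
same = sum(scores[m] * doubled[m] for m in scores) / sum(doubled.values())
print(round(same, 3))     # 0.729
```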
Enabling and disabling scorers
Not every scorer needs to run in every experiment. Disabling scorers that don't apply speeds up evaluations and reduces noise in your results.
To toggle a scorer, click into it and change the enabled status. Disabled scorers remain configured but won't be used in new experiments.
Scorer best practices
A few lessons from teams who've built robust evaluation pipelines:
- Match scorers to your goals: Don't just enable everything. Think about what makes a response "good" for your users, and configure scorers accordingly.
- Keep judge models consistent: If you're comparing experiments over time, use the same judge. Different judges may score the same response differently.
- Calibrate regularly: Periodically review scored responses manually. Are the scores matching your intuition? If not, adjust thresholds or scorer configurations.
- Document your choices: Keep notes on why you chose certain thresholds and weights. Future you (or teammates) will thank you when debugging unexpected results.
- Start simple: Begin with the default scorers and thresholds. Add complexity only when you understand how the system behaves.
Common issues
- All scores are too low: Your expected outputs might be too specific. Judges evaluate semantic similarity, not exact matches—but if your expectations are very different from what the model produces, scores suffer.
- All scores are too high: Your dataset might be too easy, or your thresholds too permissive. Add challenging test cases and review failing examples to calibrate.
- Inconsistent scores: Judge model temperature might be too high. Lower it (try 0.3-0.5) for more consistent evaluations.
- Evaluations are slow: Each metric requires a judge call. Disable metrics you don't need, or use faster judge models for iterative testing.