LLM Evals

CI/CD integration

Run LLM evaluations automatically in GitHub Actions or any CI pipeline and block merges when quality drops.

Overview

You can run LLM evaluations automatically in your CI/CD pipeline. Every time someone opens a pull request or pushes to a branch, the pipeline evaluates your model against a test dataset and blocks the merge if quality drops below your threshold.

This works with GitHub Actions out of the box. For other CI systems (GitLab, Jenkins, CircleCI), a standalone Python script and CLI are also available.

What it does

  1. Creates an evaluation experiment on your VerifyWise instance.
  2. Runs your model against the dataset you specify.
  3. An LLM judge scores each response on the metrics you chose (correctness, hallucination, faithfulness, etc.).
  4. If any metric falls below the threshold, the CI step fails and the PR is blocked.
  5. Results are posted as a PR comment, uploaded as build artifacts, and stored in your VerifyWise dashboard.

GitHub Actions setup

Add this file to your repository at .github/workflows/llm-eval.yml:

```yaml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://your-verifywise-instance.com
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

Required secrets

Add these in your GitHub repo under Settings > Secrets and variables > Actions:

| Secret | Required | Where to get it |
|---|---|---|
| `VW_API_TOKEN` | Yes | VerifyWise dashboard > Settings > API tokens |
| `LLM_API_KEY` | Yes | API key for the model being evaluated (OpenAI, Anthropic, etc.) |
| `JUDGE_API_KEY` | No | API key for the judge LLM. Only needed when the model and judge use different providers. |

> **Model vs judge:** The evaluation uses two LLMs: the model generates responses, and the judge scores them. If both use the same provider (e.g. both OpenAI), a single `LLM_API_KEY` is enough. If they use different providers, set `JUDGE_API_KEY` separately.

Configuration options

| Input | Default | Description |
|---|---|---|
| `api_url` | (required) | Base URL of your VerifyWise instance |
| `project_id` | (required) | Project ID from the VerifyWise dashboard |
| `dataset_id` | (required) | Dataset to evaluate against |
| `metrics` | (required) | Comma-separated metric names |
| `model_name` | (required) | Model to evaluate (e.g. `gpt-4o-mini`) |
| `model_provider` | (required) | `openai`, `anthropic`, `google`, `mistral`, `xai`, or `self-hosted` |
| `threshold` | `0.7` | Pass/fail threshold (0.0 to 1.0) |
| `judge_model` | `gpt-4o` | LLM used to score responses |
| `judge_provider` | `openai` | Provider for the judge LLM |
| `timeout_minutes` | `30` | Max wait time before timing out |
| `fail_on_threshold` | `true` | Set to `false` to report without failing the build |
| `post_pr_comment` | `true` | Post results as a PR comment |

Available metrics

Choose the metrics that matter for your use case. Standard metrics pass when the score is at or above the threshold. Inverted metrics (hallucination, toxicity, bias) pass when the score is at or below the threshold.

| Metric | Category | What it measures |
|---|---|---|
| `correctness` | Universal | Are the answers factually right? |
| `answer_relevancy` | Universal | Is the response relevant to what was asked? |
| `completeness` | Universal | Does the answer cover all parts of the question? |
| `hallucination` | Universal | How much of the response is fabricated? (lower is better) |
| `toxicity` | Universal | Does the response contain harmful content? (lower is better) |
| `bias` | Universal | Does the response exhibit unfair bias? (lower is better) |
| `faithfulness` | RAG | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Is the retrieved context relevant? |
| `tool_correctness` | Agent | Are the right tools selected? |
| `task_completion` | Agent | Is the overall task completed? |
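
The standard vs. inverted pass rule can be sketched in a few lines of Python. This is an illustration of the logic described above, not part of the VerifyWise API; the function name and the inverted-metric set mirror the table:

```python
# Metrics where a LOWER score is better, per the table above.
INVERTED_METRICS = {"hallucination", "toxicity", "bias"}

def metric_passes(metric: str, score: float, threshold: float) -> bool:
    """Standard metrics pass at or above the threshold;
    inverted metrics pass at or below it."""
    if metric in INVERTED_METRICS:
        return score <= threshold
    return score >= threshold
```

For example, with a threshold of 0.7, a correctness score of 0.82 passes, while a hallucination score of 0.82 fails.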

Using with other CI systems

For GitLab CI, Jenkins, CircleCI, or any other system, use the standalone Python script. It only needs the requests library:

```bash
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

The script exits with code 0 if all metrics pass, 1 if any metric fails, and 2 on errors. Download ci_eval_runner.py from the verifywise-eval-action repository.
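
As a sketch, the script slots into a GitLab CI job like this. The job name, stage, and image are illustrative, and the `VW_*` values are assumed to be defined as GitLab CI/CD variables; because the script exits non-zero on failure, the job fails (and blocks the merge request) automatically:

```yaml
llm-eval:
  stage: test
  image: python:3.12
  script:
    - pip install requests
    - python ci_eval_runner.py
        --api-url "$VW_API_URL" --token "$VW_API_TOKEN"
        --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID"
        --metrics "correctness,faithfulness"
        --model-name "gpt-4o-mini" --model-provider "openai"
        --threshold 0.7
        --output results.json --markdown-output summary.md
  artifacts:
    paths: [results.json, summary.md]
    when: always
```

`when: always` keeps the result files available as artifacts even when the evaluation fails, which is usually when you most want to inspect them.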

Python SDK

For more control, install the Python SDK and call the API directly:

```bash
pip install verifywise
```

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(
    api_url="https://your-instance.com",
    token="your-token"
)

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Failed: {[m.name for m in results.metrics if not m.passed]}"
```

Finding your project and dataset IDs

  1. Project ID: Open LLM Evals in the sidebar, click on your project. The ID is in the URL.
  2. Dataset ID: Go to the Datasets tab in your project. Click any dataset to see its ID.
  3. API token: Go to Settings > API tokens in the main VerifyWise sidebar.

Viewing results

CI-triggered experiments appear in the same Experiments list as manually run ones. You can see the scores, compare against previous runs, and drill into individual prompt-level results from the VerifyWise dashboard.
