LLM Evals

CI/CD integration

Run LLM evaluations automatically in GitHub Actions or any CI pipeline and block merges when quality drops.

Overview

You can run LLM evaluations automatically in your CI/CD pipeline. Every time someone opens a pull request or pushes to a branch, the pipeline evaluates your model against a test dataset and blocks the merge if quality drops below your threshold.

This works with GitHub Actions out of the box. For other CI systems (GitLab, Jenkins, CircleCI), a standalone Python script and CLI are also available.

What it does

  1. Creates an evaluation experiment on your VerifyWise instance.
  2. Runs your model against the dataset you specify.
  3. An LLM judge scores each response on the metrics you chose (correctness, hallucination, faithfulness, etc.).
  4. If any metric falls below the threshold, the CI step fails and the PR is blocked.
  5. Results are posted as a PR comment, uploaded as build artifacts, and stored in your VerifyWise dashboard.

GitHub Actions setup

Add this file to your repository at .github/workflows/llm-eval.yml:

```yaml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://your-verifywise-instance.com
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

Required secrets

Add these in your GitHub repo under Settings > Secrets and variables > Actions:

| Secret | Required | Where to get it |
|---|---|---|
| `VW_API_TOKEN` | Yes | VerifyWise dashboard > Settings > API tokens |
| `LLM_API_KEY` | Yes | API key for the model being evaluated (OpenAI, Anthropic, etc.) |
| `JUDGE_API_KEY` | No | API key for the judge LLM. Only needed when the model and judge use different providers. |

> **Model vs judge:** The evaluation uses two LLMs: the model generates responses, and the judge scores them. If both use the same provider (e.g. both OpenAI), a single `LLM_API_KEY` is enough. If they use different providers, set `JUDGE_API_KEY` separately.

Configuration options

| Input | Default | Description |
|---|---|---|
| `api_url` | (required) | Base URL of your VerifyWise instance |
| `project_id` | (required) | Project ID from the VerifyWise dashboard |
| `dataset_id` | (required) | Dataset to evaluate against |
| `metrics` | (required) | Comma-separated metric names |
| `model_name` | (required) | Model to evaluate (e.g. `gpt-4o-mini`) |
| `model_provider` | (required) | `openai`, `anthropic`, `google`, `mistral`, `xai`, or `self-hosted` |
| `threshold` | `0.7` | Pass/fail threshold (0.0 to 1.0) |
| `judge_model` | `gpt-4o` | LLM used to score responses |
| `judge_provider` | `openai` | Provider for the judge LLM |
| `timeout_minutes` | `30` | Max wait time before timing out |
| `fail_on_threshold` | `true` | Set to `false` to report without failing the build |
| `post_pr_comment` | `true` | Post results as a PR comment |

Available metrics

Choose the metrics that matter for your use case. Standard metrics pass when the score is at or above the threshold. Inverted metrics (hallucination, toxicity, bias) pass when the score is at or below the threshold.

| Metric | Category | What it measures |
|---|---|---|
| `correctness` | Universal | Are the answers factually right? |
| `answer_relevancy` | Universal | Is the response relevant to what was asked? |
| `completeness` | Universal | Does the answer cover all parts of the question? |
| `hallucination` | Universal | How much of the response is fabricated? (lower is better) |
| `toxicity` | Universal | Does the response contain harmful content? (lower is better) |
| `bias` | Universal | Does the response exhibit unfair bias? (lower is better) |
| `faithfulness` | RAG | Is the response grounded in the provided context? |
| `contextual_relevancy` | RAG | Is the retrieved context relevant? |
| `tool_correctness` | Agent | Are the right tools selected? |
| `task_completion` | Agent | Is the overall task completed? |
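
The standard vs. inverted pass rule can be sketched in a few lines of Python. This is an illustration of the logic described above, not part of the VerifyWise API; the function name and the inverted-metric set mirror the table:

```python
# Metrics where a LOWER score is better, per the table above.
INVERTED_METRICS = {"hallucination", "toxicity", "bias"}

def metric_passes(metric: str, score: float, threshold: float) -> bool:
    """Standard metrics pass at or above the threshold;
    inverted metrics pass at or below it."""
    if metric in INVERTED_METRICS:
        return score <= threshold
    return score >= threshold
```

For example, with a threshold of 0.7, a correctness score of 0.82 passes, while a hallucination score of 0.82 fails.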

Using with other CI systems

For GitLab CI, Jenkins, CircleCI, or any other system, use the standalone Python script. It only needs the requests library:

```bash
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

The script exits with code 0 if all metrics pass, 1 if any metric fails, and 2 on errors. Download ci_eval_runner.py from the verifywise-eval-action repository.
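
As a sketch, the script slots into a GitLab CI job like this. The job name, stage, and image are illustrative, and the `VW_*` values are assumed to be defined as GitLab CI/CD variables; because the script exits non-zero on failure, the job fails (and blocks the merge request) automatically:

```yaml
llm-eval:
  stage: test
  image: python:3.12
  script:
    - pip install requests
    - python ci_eval_runner.py
        --api-url "$VW_API_URL" --token "$VW_API_TOKEN"
        --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID"
        --metrics "correctness,faithfulness"
        --model-name "gpt-4o-mini" --model-provider "openai"
        --threshold 0.7
        --output results.json --markdown-output summary.md
  artifacts:
    paths: [results.json, summary.md]
    when: always
```

`when: always` keeps the result files available as artifacts even when the evaluation fails, which is usually when you most want to inspect them.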

Python SDK

For more control, install the Python SDK and call the API directly:

```bash
pip install verifywise
```

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(
    api_url="https://your-instance.com",
    token="your-token"
)

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Failed: {[m.name for m in results.metrics if not m.passed]}"
```

Finding your project and dataset IDs

  1. Project ID: Open LLM Evals in the sidebar, click on your project. The ID is in the URL.
  2. Dataset ID: Go to the Datasets tab in your project. Click any dataset to see its ID.
  3. API token: Go to Settings > API tokens in the main VerifyWise sidebar.

Viewing results

CI-triggered experiments appear in the same Experiments list as manually run ones. You can see the scores, compare against previous runs, and drill into individual prompt-level results from the VerifyWise dashboard.
