CI/CD integration
Run LLM evaluations automatically in GitHub Actions or any CI pipeline and block merges when quality drops.
Overview
You can run LLM evaluations automatically in your CI/CD pipeline. Every time someone opens a pull request or pushes to a branch, the pipeline evaluates your model against a test dataset and blocks the merge if quality drops below your threshold.
This works with GitHub Actions out of the box. For other CI systems (GitLab, Jenkins, CircleCI), a standalone Python script and CLI are also available.
What it does
- Creates an evaluation experiment on your VerifyWise instance.
- Runs your model against the dataset you specify.
- An LLM judge scores each response on the metrics you chose (correctness, hallucination, faithfulness, etc.).
- If any metric falls below the threshold, the CI step fails and the PR is blocked.
- Results are posted as a PR comment, uploaded as build artifacts, and stored in your VerifyWise dashboard.
GitHub Actions setup
Add this file to your repository at .github/workflows/llm-eval.yml:
```yaml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main, develop]

jobs:
  eval:
    name: Evaluate LLM
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://your-verifywise-instance.com
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,faithfulness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          threshold: '0.7'
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
```

Required secrets
Add these in your GitHub repo under Settings > Secrets and variables > Actions:
| Secret | Required | Where to get it |
|---|---|---|
| VW_API_TOKEN | Yes | VerifyWise dashboard > Settings > API tokens |
| LLM_API_KEY | Yes | API key for the model being evaluated (OpenAI, Anthropic, etc.) |
| JUDGE_API_KEY | No | API key for the judge LLM. Only needed when the model and judge use different providers. |
Configuration options
| Input | Default | Description |
|---|---|---|
| api_url | (required) | Base URL of your VerifyWise instance |
| project_id | (required) | Project ID from the VerifyWise dashboard |
| dataset_id | (required) | Dataset to evaluate against |
| metrics | (required) | Comma-separated metric names |
| model_name | (required) | Model to evaluate (e.g. gpt-4o-mini) |
| model_provider | (required) | openai, anthropic, google, mistral, xai, or self-hosted |
| threshold | 0.7 | Pass/fail threshold (0.0 to 1.0) |
| judge_model | gpt-4o | LLM used to score responses |
| judge_provider | openai | Provider for the judge LLM |
| timeout_minutes | 30 | Max wait time before timing out |
| fail_on_threshold | true | Set to false to report without failing the build |
| post_pr_comment | true | Post results as a PR comment |
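As an example of combining the non-default options, the step below runs a nightly-style report that uses a different judge and never blocks the build. This is a sketch: the judge model name is illustrative, and the judge_api_key input name is an assumption inferred from the JUDGE_API_KEY secret rather than a documented input.

```yaml
      - name: Run evaluation (report only)
        uses: verifywise-ai/verifywise-eval-action@v1
        with:
          api_url: https://your-verifywise-instance.com
          project_id: proj_abc
          dataset_id: '2'
          metrics: correctness,hallucination
          model_name: gpt-4o-mini
          model_provider: openai
          judge_model: claude-sonnet-4        # illustrative non-default judge
          judge_provider: anthropic
          timeout_minutes: '60'
          fail_on_threshold: 'false'          # report results without failing the build
          vw_api_token: ${{ secrets.VW_API_TOKEN }}
          llm_api_key: ${{ secrets.LLM_API_KEY }}
          judge_api_key: ${{ secrets.JUDGE_API_KEY }}  # assumed input name; see secrets table
```

Because the model and judge use different providers here, the JUDGE_API_KEY secret from the table above would also need to be configured.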
Available metrics
Choose the metrics that matter for your use case. Standard metrics pass when the score is at or above the threshold. Inverted metrics (hallucination, toxicity, bias) pass when the score is at or below the threshold.
| Metric | Category | What it measures |
|---|---|---|
| correctness | Universal | Are the answers factually right? |
| answer_relevancy | Universal | Is the response relevant to what was asked? |
| completeness | Universal | Does the answer cover all parts of the question? |
| hallucination | Universal | How much of the response is fabricated? (lower is better) |
| toxicity | Universal | Does the response contain harmful content? (lower is better) |
| bias | Universal | Does the response exhibit unfair bias? (lower is better) |
| faithfulness | RAG | Is the response grounded in the provided context? |
| contextual_relevancy | RAG | Is the retrieved context relevant? |
| tool_correctness | Agent | Are the right tools selected? |
| task_completion | Agent | Is the overall task completed? |
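The pass/fail rule described above can be sketched in a few lines of Python. This is a minimal illustration of the threshold logic, not the action's actual code, and the scores are made up:

```python
# Metrics where a LOWER score is better: they pass at or below the threshold.
INVERTED_METRICS = {"hallucination", "toxicity", "bias"}

def metric_passes(name: str, score: float, threshold: float = 0.7) -> bool:
    """Standard metrics pass at/above the threshold; inverted ones at/below."""
    if name in INVERTED_METRICS:
        return score <= threshold
    return score >= threshold

# Example scores from a hypothetical run.
scores = {"correctness": 0.82, "faithfulness": 0.75, "hallucination": 0.10}
results = {name: metric_passes(name, s) for name, s in scores.items()}
all_passed = all(results.values())  # True: every metric clears the 0.7 gate
```

Note that a hallucination score of 0.10 passes the same 0.7 threshold that a correctness score of 0.10 would fail, because the comparison direction flips for inverted metrics.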
Using with other CI systems
For GitLab CI, Jenkins, CircleCI, or any other system, use the standalone Python script. It only needs the requests library:
```bash
pip install requests

python ci_eval_runner.py \
  --api-url "$VW_API_URL" --token "$VW_API_TOKEN" \
  --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID" \
  --metrics "correctness,faithfulness" \
  --model-name "gpt-4o-mini" --model-provider "openai" \
  --threshold 0.7 \
  --output results.json --markdown-output summary.md
```

The script exits with code 0 if all metrics pass, 1 if any metric fails, and 2 on errors. Download ci_eval_runner.py from the verifywise-eval-action repository.
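For example, a GitLab CI job wrapping the script might look like the sketch below. The image, stage, and rule are assumptions to adjust for your setup; the variables come from your GitLab CI/CD settings:

```yaml
llm-eval:
  stage: test
  image: python:3.12-slim
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - pip install requests
    - python ci_eval_runner.py
        --api-url "$VW_API_URL" --token "$VW_API_TOKEN"
        --project-id "$VW_PROJECT_ID" --dataset-id "$VW_DATASET_ID"
        --metrics "correctness,faithfulness"
        --model-name "gpt-4o-mini" --model-provider "openai"
        --threshold 0.7
        --output results.json --markdown-output summary.md
  artifacts:
    when: always
    paths:
      - results.json
      - summary.md
```

Store VW_API_TOKEN and the model API key as masked CI/CD variables rather than committing them; the nonzero exit code on failure is what makes the pipeline block the merge request.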
Python SDK
For more control, install the Python SDK and call the API directly:
```bash
pip install verifywise
```

```python
from verifywise import VerifyWiseClient

client = VerifyWiseClient(
    api_url="https://your-instance.com",
    token="your-token",
)

results = client.experiments.run_and_wait(
    project_id="proj_abc",
    name="Nightly Eval",
    model_name="gpt-4o-mini",
    model_provider="openai",
    dataset_id="2",
    metrics=["correctness", "faithfulness", "hallucination"],
    threshold=0.7,
)

assert results.passed, f"Failed: {[m.name for m in results.metrics if not m.passed]}"
```

Finding your project and dataset IDs
- Project ID: Open LLM Evals in the sidebar, click on your project. The ID is in the URL.
- Dataset ID: Go to the Datasets tab in your project. Click any dataset to see its ID.
- API token: Go to Settings > API tokens in the main VerifyWise sidebar.
Viewing results
CI-triggered experiments appear in the same Experiments list as manually run ones. You can see the scores, compare against previous runs, and drill into individual prompt-level results from the VerifyWise dashboard.