Running bias audits
Run demographic bias audits against compliance frameworks like NYC LL144 and EEOC.
What is a bias audit?
A bias audit analyzes whether an AI system treats demographic groups fairly. You upload applicant data with demographic categories and outcomes (selected or not), and the system calculates selection rates and impact ratios for each group. If any group's selection rate falls below the threshold compared to the most-selected group, it gets flagged.
This matters for regulatory compliance. NYC Local Law 144, for example, requires annual independent bias audits for any automated employment decision tool. The EU AI Act and EEOC guidelines have similar expectations. VerifyWise supports 15 compliance frameworks out of the box, each with pre-configured categories, thresholds, and reporting requirements.
Accessing bias audits
Open the LLM Evals module from the sidebar and click Bias audits. You'll see a list of all audits for your organization, sortable by date, status, framework, or mode. Audits that are still running refresh their status automatically every few seconds.
Creating a new audit
Click New bias audit to open the setup wizard. It walks you through four steps.
Step 1: Select a compliance framework
Pick the law or standard that applies to your situation. Each framework card shows the jurisdiction and a short description of what it requires. Frameworks are grouped by audit mode:
- Quantitative audit: Computes selection rates and impact ratios with statistical flagging. Used by NYC LL144, EEOC guidelines, and California FEHA.
- Impact assessment: Structured assessment with optional quantitative supplement. Used by Colorado SB 205, the EU AI Act, and the South Korea AI Act.
- Compliance checklist: Checklist-based evaluation with recommended quantitative analysis. Used by Illinois HB 3773, New Jersey AI guidance, Texas TRAIGA, and others.
Selecting a framework auto-fills everything: protected categories, group labels, threshold values, and intersectional analysis settings. You can override any of these in step 4.
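What a framework preset carries can be pictured as a small configuration object. This is an illustrative sketch only; the field names are assumptions, and the values come from the defaults described in this guide (0.80 threshold, 2% small sample exclusion, sex and race/ethnicity categories for NYC LL144):

```python
# Hypothetical preset for NYC Local Law 144. Field names are illustrative;
# values reflect the defaults this guide describes for that framework.
NYC_LL144_PRESET = {
    "mode": "quantitative_audit",
    "categories": ["Sex", "Race/Ethnicity"],  # protected categories to map in step 3
    "threshold": 0.80,                        # four-fifths rule
    "small_sample_exclusion": 0.02,           # groups under 2% of applicants excluded
    "intersectional": True,                   # cross-tabulated Sex x Race/Ethnicity
}
```

Overriding a setting in step 4 amounts to editing one of these values before the audit runs.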
Step 2: Enter system information
Provide details about the AI system being audited. The form adapts based on your framework. For NYC LL144, you'll see fields specific to AEDTs (automated employment decision tools). For other frameworks, the labels adjust accordingly.
- System name: The name of the AI tool or model being audited.
- Description: What the system does and how it's used in decision-making.
- Distribution date: When the tool was first deployed or made available.
- Data source description: Where the demographic and outcome data came from.
Step 3: Upload demographic data
Upload a CSV file where each row represents one applicant. The file needs demographic columns and a binary outcome column.
After uploading, you'll see:
- Column mapping: Dropdowns that map each required demographic category to a column in your CSV. For example, map "Sex" to your CSV's "Gender" column.
- Outcome column: Select the column that indicates selection outcomes. Accepted positive values: 1, true, yes, selected, hired, promoted; values like 0, false, no, rejected, and declined count as non-selection.
- Data preview: A preview of the first five rows so you can confirm the data looks correct before proceeding.
The wizard validates that required categories are mapped, no duplicate mappings exist, and the outcome column isn't reused as a demographic column.
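The three validation rules can be sketched as a short function. This is an illustrative sketch of the checks described above, not VerifyWise's actual code; the function and parameter names are assumptions:

```python
def validate_mapping(mapping, outcome_column, required_categories):
    """Validate a category-to-column mapping as described in step 3.

    mapping: dict of framework category -> CSV column name.
    Illustrative sketch only, not the product's implementation.
    """
    errors = []
    # Every required demographic category must be mapped to some column.
    for category in required_categories:
        if not mapping.get(category):
            errors.append(f"Category '{category}' is not mapped")
    # No two categories may point at the same CSV column.
    columns = [c for c in mapping.values() if c]
    if len(columns) != len(set(columns)):
        errors.append("Duplicate column mappings")
    # The outcome column cannot double as a demographic column.
    if outcome_column in columns:
        errors.append("Outcome column reused as a demographic column")
    return errors
```

For example, mapping "Sex" to a CSV's "Gender" column and "Race" to "Race", with "Selected" as the outcome column, passes all three checks.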
Step 4: Review and run
Review all settings before running the audit. The framework auto-fills these values, but you can adjust them:
- Adverse impact threshold: Groups with an impact ratio below this value are flagged. NYC LL144 and EEOC use 0.80 (the "four-fifths rule").
- Small sample exclusion: Groups representing less than this percentage of total applicants are excluded from impact ratio calculations. Prevents unreliable results from very small groups.
- Intersectional analysis: When enabled, the audit computes cross-tabulated results (e.g., Male + Hispanic, Female + Asian) in addition to per-category results.
Click Run audit to start. The audit runs in the background and typically completes in a few seconds. You'll be redirected to the audit list where the status updates automatically.
Reading audit results
Click into a completed audit to see the results. The detail page has three sections: summary cards, a text summary, and the impact ratio tables.
Summary cards
At the top you'll see cards for total applicants, total selected, overall selection rate, and number of flags. If rows were excluded due to missing demographic data, an "Unknown" card also appears.
Impact ratio tables
Each demographic category gets its own table. For NYC LL144, you'll see separate tables for sex, race/ethnicity, and (if enabled) intersectional categories. Each table shows:
- Group: The demographic group name (e.g., "Female", "Hispanic or Latino").
- Applicants: Number of applicants in this group.
- Selected: Number selected/hired from this group.
- Selection rate: Percentage of the group that was selected.
- Impact ratio: This group's selection rate divided by the highest group's selection rate. A value of 1.000 means the group is selected at the same rate as the top group.
- Status: Pass (above threshold), Flag (below threshold), or N/A (excluded due to small sample size).
Flagged rows are highlighted in red. The table header shows which group had the highest selection rate, since all impact ratios are calculated relative to that group.
Intersectional results
When intersectional analysis is enabled, an additional table shows compound groups like "Male - Hispanic or Latino" or "Female - Asian". This reveals disparities that single-category analysis might miss. For example, a system might treat women and men equally overall, but show significant differences for women of a specific racial group.
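Cross-tabulation itself is simple: concatenate the category values into a compound group label, then count totals and selections per compound group. A minimal sketch, assuming rows are dicts with a boolean "selected" key (the key names and label format are assumptions, though the " - " separator matches the labels shown above):

```python
from collections import Counter

def intersectional_rates(rows, categories):
    """Compute selection rates for compound groups, e.g. "Female - Asian".

    rows: list of dicts like {"Sex": "Female", "Race": "Asian", "selected": True}.
    Illustrative sketch only; key names are assumptions.
    """
    totals, selected = Counter(), Counter()
    for row in rows:
        key = " - ".join(row[c] for c in categories)  # compound group label
        totals[key] += 1
        if row["selected"]:
            selected[key] += 1
    # Selection rate per compound group.
    return {g: selected[g] / totals[g] for g in totals}
```

Impact ratios for the compound groups are then computed exactly as for single categories, relative to the highest-rate compound group.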
Audit actions
From the results page, you can:
- Download JSON: Export the full results as a JSON file for external reporting or record-keeping.
- Delete: Permanently remove the audit and all its results. This requires confirmation.
Supported frameworks
| Framework | Jurisdiction | Mode | Default threshold |
|---|---|---|---|
| NYC Local Law 144 | New York City | Quantitative audit | 0.80 |
| EEOC guidelines | United States | Quantitative audit | 0.80 |
| California FEHA | California | Quantitative audit | 0.80 |
| Colorado SB 205 | Colorado | Impact assessment | 0.80 |
| EU AI Act | European Union | Impact assessment | 0.80 |
| South Korea AI Act | South Korea | Impact assessment | 0.80 |
| Illinois HB 3773 | Illinois | Compliance checklist | 0.80 |
| New Jersey AI guidance | New Jersey | Compliance checklist | — |
| Texas TRAIGA | Texas | Compliance checklist | — |
| UK GDPR & Equality Act | United Kingdom | Compliance checklist | 0.80 |
| Singapore WFA | Singapore | Compliance checklist | — |
| Brazil Bill 2338 | Brazil | Compliance checklist | — |
| NIST AI RMF | International | Impact assessment | 0.80 |
| ISO 42001 | International | Impact assessment | 0.80 |
| Custom | — | Quantitative audit | User-defined |
Preparing your CSV file
Your CSV needs at minimum a demographic column and an outcome column. Here's what a typical file looks like for an NYC LL144 audit:
Gender,Race,Selected
Male,White,1
Female,Hispanic or Latino,0
Male,Black or African American,1
Female,Asian,1
Male,White,0

A few things to keep in mind:
- Column names are flexible: You map them in step 3, so they don't need to match the framework's category names exactly.
- Outcome values: The outcome column accepts 1/true/yes/selected/hired/promoted as positive outcomes. Everything else (0/false/no/rejected/declined) is treated as not selected.
- Missing data: Rows with empty values in any mapped demographic column are excluded and counted separately as "unknown".
- File size: Maximum 50 MB. Quoted fields with commas are supported (RFC 4180).
- Encoding: UTF-8 is preferred. The parser also handles UTF-8 with BOM, Latin-1, and Windows-1252.
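The outcome-normalization and missing-data rules above can be sketched as a small parser. This assumes the CSV text has already been decoded to a string; the function name and the "_selected" key are illustrative, not the product's actual parser:

```python
import csv
import io

# Positive outcome values listed in this guide; everything else is not selected.
POSITIVE = {"1", "true", "yes", "selected", "hired", "promoted"}

def load_rows(text, demographic_columns, outcome_column):
    """Parse decoded CSV text; return (usable_rows, unknown_count).

    Illustrative sketch of the rules described above. Rows with an empty
    value in any mapped demographic column are counted as "unknown".
    """
    usable, unknown = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        if any(not row.get(col, "").strip() for col in demographic_columns):
            unknown += 1  # excluded from per-group tables, shown as "Unknown"
            continue
        outcome = row.get(outcome_column, "").strip().lower()
        row["_selected"] = outcome in POSITIVE  # all other values = not selected
        usable.append(row)
    return usable, unknown
```

Python's `csv` module handles RFC 4180 quoting, so quoted fields containing commas parse correctly.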
How the math works
The core calculation is straightforward. For each demographic group:
- Selection rate = number selected / total applicants in that group
- Impact ratio = this group's selection rate / highest group's selection rate
- If the impact ratio falls below the threshold (typically 0.80), the group is flagged
The 0.80 threshold is the "four-fifths rule" from the EEOC Uniform Guidelines on Employee Selection Procedures. It means a group's selection rate should be at least 80% of the most-selected group's rate. A ratio of 0.75 means that group is selected at 75% the rate of the top group, which falls below the threshold.
Groups that make up less than the small sample exclusion percentage (default 2%) are excluded from the calculation entirely, since small samples produce unreliable ratios.
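The whole calculation fits in a few lines. A minimal sketch of the steps above, with the default 0.80 threshold and 2% exclusion (function and field names are assumptions):

```python
def impact_ratios(groups, threshold=0.80, min_share=0.02):
    """Compute selection rates, impact ratios, and statuses for one category.

    groups: dict of group name -> (applicants, selected).
    Illustrative sketch of the four-fifths calculation described above.
    """
    total = sum(applicants for applicants, _ in groups.values())
    rates = {g: s / a for g, (a, s) in groups.items() if a}
    # Groups below the small-sample share are excluded from ratio comparisons.
    eligible = {g: r for g, r in rates.items()
                if groups[g][0] / total >= min_share}
    top = max(eligible.values())  # highest selection rate among eligible groups
    results = {}
    for g, r in rates.items():
        if g not in eligible:
            results[g] = (r, None, "N/A")  # too small to compare reliably
        else:
            ratio = r / top
            results[g] = (r, ratio, "Flag" if ratio < threshold else "Pass")
    return results
```

With 100 male applicants (60 selected) and 100 female applicants (45 selected), the female selection rate is 0.45, the impact ratio is 0.45 / 0.60 = 0.75, and the group is flagged under the 0.80 threshold.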