Running bias audits
Run demographic bias audits against compliance frameworks like NYC LL144 and EEOC.
What is a bias audit?
A bias audit checks whether an automated decision tool treats demographic groups consistently. You upload records with demographic columns and one of three kinds of outcome data. VerifyWise then calculates per-group rates and cross-group disparities, and flags groups that fall below the configured threshold. The tool isn't LLM-specific; it works for any system that produces a decision, a score or a classification.
NYC Local Law 144 requires annual independent bias audits for any automated employment decision tool. EU AI Act Article 9 and the EEOC guidelines set similar expectations. VerifyWise ships with 15 compliance frameworks out of the box, each pre-configured with the right categories, thresholds and reporting requirements.
Every run gives you an interactive results dashboard, a raw JSON export and a formal PDF report you can hand to a reviewer or procurement team without any additional explanation.
Accessing bias audits
Open LLM Evals from the flask icon in the sidebar and click Bias audits. You'll see a list of all audits for your organization, sortable by date, status, framework or mode. Audits that are still running refresh their status automatically every few seconds.
Choosing a metric
Before uploading your data, you pick an audit metric. The metric determines what your CSV needs to contain and what the results mean. VerifyWise supports three modes:
- Selection rate: The default and the right choice for any tool that produces a binary decision (hire or reject, approve or deny, flag or pass). Your CSV needs one outcome column. The audit computes each group's selection rate and an impact ratio against the highest-rate group. NYC Local Law 144 mandates this metric for binary employment decisions.
- Scoring rate: Use this when your tool outputs a continuous score rather than a yes/no, like a ranker, risk score or suitability score. Your CSV needs one numeric score column. VerifyWise computes each group's "above-median rate" (the share of records whose score beats the overall median) and then applies the same impact ratio logic. Local Law 144 explicitly allows this as an alternative to selection rate for scoring tools.
- Fairness metrics: Pick this when you have both the model's prediction and the real answer. Your CSV needs a prediction column and a ground-truth column. You get a confusion matrix per group (true positive rate, false positive rate, precision, accuracy) plus three standard cross-group differences: equal opportunity (TPR gap), equalized odds and predictive parity. This is the most informative mode but it also requires the most data.
Creating a new audit
Click New bias audit to open the setup wizard. It walks you through four steps.
Step 1: Select a compliance framework
Pick the law or standard that applies to your situation. Each framework card shows the jurisdiction and a short description of what it requires. Frameworks are grouped by audit mode:
- Quantitative audit: Computes selection rates and impact ratios with statistical flagging. Used by NYC LL144, EEOC guidelines and California FEHA.
- Impact assessment: Structured assessment with optional quantitative supplement. Used by Colorado SB 205, EU AI Act and South Korea.
- Compliance checklist: Checklist-based evaluation with recommended quantitative analysis. Used by Illinois HB 3773, New Jersey, Texas TRAIGA and others.
Selecting a framework auto-fills everything: protected categories, group labels, threshold values and intersectional analysis settings. You can override any of these in step 4.
Step 2: Enter system information
Provide details about the AI system being audited. The form adapts based on your framework. For NYC LL144, you'll see fields specific to AEDTs (automated employment decision tools). For other frameworks, the labels adjust accordingly.
- System name: The name of the AI tool or model being audited.
- Description: What the system does and how it's used in decision-making.
- Distribution date: When the tool was first deployed or made available.
- Data source description: Where the demographic and outcome data came from.
Step 3: Upload demographic data
Upload a CSV file where each row represents one record. The file needs demographic columns plus the metric-specific column(s) described in the "Choosing a metric" section above.
Step 3 starts with a metric dropdown. Pick selection rate, scoring rate or fairness metrics and the rest of the step adapts to your choice. After uploading the CSV you'll see:
- Metric selector: Dropdown at the top of the step. Changing the metric swaps the column selectors below it.
- Column mapping: Dropdowns that map each required demographic category to a column in your CSV. For example, map "Sex" to your CSV's "Gender" column.
- Metric column(s): For selection rate, an outcome column. For scoring rate, a numeric score column. For fairness metrics, a prediction column plus a ground-truth column.
- Data preview: A preview of the first five rows so you can confirm the data looks correct before proceeding.
Accepted outcome values for selection-rate and fairness-metrics modes are 1/true/yes/selected/hired/promoted as positive, and 0/false/no/rejected/declined as negative. Scoring-rate columns need to parse as numbers; missing or non-numeric values are counted as unknown.
The wizard validates that required categories are mapped, no duplicate mappings exist and your metric columns aren't reused as demographic columns.
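The three validation rules can be expressed as a short check. This sketch assumes a mapping of category name to CSV column; the function and parameter names are hypothetical, not VerifyWise's API.

```python
# Illustrative version of the wizard's mapping checks: required
# categories mapped, no duplicate mappings, no metric-column reuse.
def validate_mapping(mapping: dict[str, str], metric_columns: list[str],
                     required: list[str]) -> list[str]:
    errors = []
    # Every required demographic category must be mapped to some column.
    for category in required:
        if not mapping.get(category):
            errors.append(f"Category '{category}' is not mapped")
    # No two categories may point at the same CSV column.
    used = list(mapping.values())
    for col in set(used):
        if used.count(col) > 1:
            errors.append(f"Column '{col}' is mapped more than once")
    # Metric columns cannot double as demographic columns.
    for col in metric_columns:
        if col in used:
            errors.append(f"Metric column '{col}' reused as demographic")
    return errors
```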
Step 4: Review and run
Review all settings before running the audit. The framework auto-fills these values, but you can adjust them:
- Adverse impact threshold: Groups with an impact ratio below this value are flagged. NYC LL144 and EEOC use 0.80 (the "four-fifths rule").
- Small sample exclusion: Groups representing less than this percentage of total applicants are excluded from impact ratio calculations. Prevents unreliable results from very small groups.
- Intersectional analysis: When enabled, the audit computes cross-tabulated results (e.g., Male + Hispanic, Female + Asian) in addition to per-category results.
Click Run audit to start. The audit runs in the background and typically completes in a few seconds. You'll be redirected to the audit list where the status updates automatically.
Reading audit results
Click into a completed audit to see the results. The detail page shows the audit title, the compliance framework, headline summary cards, a plain-English summary and the metric-specific results tables underneath.
Naming your audit
Hover over the audit title at the top of the detail page and click the pencil icon to rename it. New audits inherit the compliance framework as their default title ("NYC Local Law 144"), but a descriptive name like "Acme Resume Screener Q1 2026" makes the list view easier to scan. The compliance framework continues to appear as a subtitle.
Summary cards
At the top you'll see cards for total records, the count and rate of positive outcomes and the number of flagged groups. The positive-outcome label depends on the metric: "Total selected" for selection rate, "Above median" for scoring rate, "Predicted positive" for fairness metrics. If rows were excluded due to missing demographic data, an "Unknown" card also appears.
Impact ratio tables
Each demographic category gets its own table. For NYC LL144, you'll see separate tables for sex, race/ethnicity and (if enabled) intersectional categories. Each table shows:
- Group: The demographic group name (e.g., "Female", "Hispanic or Latino").
- Applicants: Number of applicants in this group.
- Selected: Number selected/hired from this group.
- Selection rate: Percentage of the group that was selected.
- Impact ratio: This group's selection rate divided by the highest group's selection rate. A value of 1.000 means equal treatment.
- Status: Pass (above threshold), Flag (below threshold) or N/A (excluded due to small sample size).
Flagged rows are highlighted in red. The table header shows which group had the highest selection rate, since all impact ratios are calculated relative to that group.
Intersectional results
When intersectional analysis is enabled, an additional table shows compound groups like "Male - Hispanic or Latino" or "Female - Asian". This reveals disparities that single-category analysis might miss. For example, a system might treat women and men equally overall, but show significant differences for women of a specific racial group.
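Conceptually, intersectional analysis is a cross-tabulation: each record's values across the enabled categories form one compound group. A minimal sketch (the record layout is an assumption for illustration):

```python
# Count applicants and positive outcomes per compound group by joining
# each record's category values into a single key like "Male - White".
from collections import Counter

def intersectional_counts(records, categories):
    totals, positives = Counter(), Counter()
    for rec in records:
        key = " - ".join(rec[c] for c in categories)
        totals[key] += 1
        if rec["selected"]:
            positives[key] += 1
    return totals, positives
```

The resulting per-compound-group counts feed the same selection-rate and impact-ratio math as single-category groups.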
Fairness metrics tables
When you run the audit with the fairness-metrics mode, the results page renders a confusion-matrix table for each demographic category. Each row shows one group with its true positive rate, false positive rate, false negative rate, precision, accuracy and the raw TP/FP/TN/FN counts.
Above the per-group table you'll find three cross-group differences that summarize the disparity in a single number each:
- Equal opportunity difference: The gap between the highest and lowest true positive rate across groups. A value near zero means every group's qualified members get correctly identified at the same rate. Large values mean some groups are systematically missed.
- Equalized odds difference: The larger of the TPR gap and the FPR gap. A stricter version of equal opportunity that penalizes both missed positives and false alarms.
- Predictive parity difference: The gap in precision across groups. A positive prediction should mean roughly the same thing regardless of which group it applies to.
Score distributions
If your CSV includes a score column, the results page shows a score-distribution section alongside the impact-ratio tables. Each group gets a histogram of its scores plus descriptive statistics (mean, median, standard deviation) and a Kolmogorov-Smirnov statistic comparing the group's distribution to the overall one. Small K-S values mean the group looks like the overall distribution; large values mean it's shifted in some way.
Score distributions are diagnostic. Two groups can have identical impact ratios but very different score distributions, and the distribution view surfaces that difference.
Audit actions
From the results page, you can:
- Download PDF report: Generate a formal PDF audit report. This is the artifact to hand to a reviewer, auditor or procurement team. Described in detail below.
- Download JSON: Export the full raw results as a JSON file for external reporting, record-keeping or downstream tooling.
- Delete: Permanently remove the audit and all its results. This requires confirmation.
The PDF report
The PDF is built so a reader who's never opened VerifyWise can still understand what was audited and what the numbers mean. It starts with a cover page, moves through an executive summary, describes the system and data, walks through the methodology and then presents the results tables with any fairness metrics or score distributions your audit produced. It closes with a limitations section. The PDF doesn't offer mitigation advice; that's a conversation for qualified counsel.
The cover page also shows the auditor's declared independence level. Self-declared audits carry a warning box so readers know the tool vendor or system owner produced the report without third-party oversight. Plenty of legitimate audits are self-declared, but it's the first thing a reviewer should see.
Supported frameworks
| Framework | Jurisdiction | Mode | Default threshold |
|---|---|---|---|
| NYC Local Law 144 | New York City | Quantitative audit | 0.80 |
| EEOC guidelines | United States | Quantitative audit | 0.80 |
| California FEHA | California | Quantitative audit | 0.80 |
| Colorado SB 205 | Colorado | Impact assessment | 0.80 |
| EU AI Act | European Union | Impact assessment | 0.80 |
| South Korea AI Act | South Korea | Impact assessment | 0.80 |
| Illinois HB 3773 | Illinois | Compliance checklist | 0.80 |
| New Jersey AI guidance | New Jersey | Compliance checklist | — |
| Texas TRAIGA | Texas | Compliance checklist | — |
| UK GDPR & Equality Act | United Kingdom | Compliance checklist | 0.80 |
| Singapore WFA | Singapore | Compliance checklist | — |
| Brazil Bill 2338 | Brazil | Compliance checklist | — |
| NIST AI RMF | International | Impact assessment | 0.80 |
| ISO 42001 | International | Impact assessment | 0.80 |
| Custom | — | Quantitative audit | User-defined |
Preparing your CSV file
Your CSV needs at minimum a demographic column and an outcome column. Here's what a typical file looks like for an NYC LL144 audit:
```csv
Gender,Race,Selected
Male,White,1
Female,Hispanic or Latino,0
Male,Black or African American,1
Female,Asian,1
Male,White,0
```

A few things to keep in mind:
- Column names are flexible: You map them in step 3, so they don't need to match the framework's category names exactly.
- Outcome values: The outcome column accepts 1/true/yes/selected/hired/promoted as positive outcomes. Everything else (0/false/no/rejected/declined) is treated as not selected.
- Missing data: Rows with empty values in any mapped demographic column are excluded and counted separately as "unknown".
- File size: Maximum 50 MB. Quoted fields with commas are supported (RFC 4180).
- Encoding: UTF-8 is preferred. The parser also handles UTF-8 with BOM, Latin-1 and Windows-1252.
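A minimal loader matching the UTF-8 parts of the documented behavior can be written with Python's standard `csv` module, which handles RFC 4180 quoting. This sketch only covers UTF-8 (with or without BOM); the Latin-1 and Windows-1252 fallbacks are omitted.

```python
# Decode with utf-8-sig, which strips a BOM if present and is a no-op
# otherwise, then parse rows as dicts keyed by the header line.
import csv
import io

def load_records(raw_bytes: bytes) -> list[dict[str, str]]:
    text = raw_bytes.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))
```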
How the math works
Each metric mode uses a slightly different formula, but the shape of the output is the same: a per-group rate, a ratio against the highest-rate group and a flag when the ratio falls below the threshold.
Selection rate
- Selection rate = number selected / total records in that group
- Impact ratio = this group's selection rate / highest group's selection rate
- If the impact ratio falls below the threshold (typically 0.80), the group is flagged
The 0.80 threshold is the "four-fifths rule" from the EEOC Uniform Guidelines on Employee Selection Procedures. A group's selection rate should be at least 80% of the most-selected group's rate. A ratio of 0.75 means that group is selected at 75% of the rate of the top group, which falls below the threshold.
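The selection-rate math above fits in a few lines. This is a worked sketch with invented counts, not VerifyWise's implementation:

```python
# groups maps name -> (selected, total); flag any group whose impact
# ratio against the highest-rate group falls below the threshold.
def impact_ratios(groups: dict[str, tuple[int, int]], threshold=0.80):
    rates = {g: sel / total for g, (sel, total) in groups.items()}
    top = max(rates.values())
    return {
        g: {"rate": rate, "ratio": rate / top,
            "flagged": rate / top < threshold}
        for g, rate in rates.items()
    }

results = impact_ratios({"Male": (60, 100), "Female": (45, 100)})
# Female: rate 0.45, ratio 0.45 / 0.60 = 0.75, below the 0.80 threshold
```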
Scoring rate
- Compute the overall median of every record's score (one number across all groups)
- Scoring rate for a group = share of that group's records whose score is above the overall median
- Impact ratio = this group's scoring rate / highest group's scoring rate
- The same threshold and flagging rules apply as for selection rate
Scoring rate is the right metric for ranking tools and continuous-score tools because it asks "does this tool place group X above the overall median as often as it places group Y?" rather than collapsing the score into a yes/no first.
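The steps above can be sketched directly: one overall median, then each group's share of scores strictly above it (an illustration, with the "strictly above" tiebreak as an assumption):

```python
# Compute one median across all records, then the above-median share
# per group; impact ratios then follow the selection-rate logic.
from statistics import median

def scoring_rates(scores_by_group: dict[str, list[float]]):
    all_scores = [s for scores in scores_by_group.values() for s in scores]
    overall_median = median(all_scores)
    return {
        g: sum(s > overall_median for s in scores) / len(scores)
        for g, scores in scores_by_group.items()
    }
```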
Fairness metrics
Fairness metrics are computed from a confusion matrix per group. Each record has both a model prediction and a ground-truth label, so every record falls into exactly one of four cells:
- TP (true positive): Predicted positive, actually positive.
- FP (false positive): Predicted positive, actually negative.
- TN (true negative): Predicted negative, actually negative.
- FN (false negative): Predicted negative, actually positive.
From these counts VerifyWise derives the standard per-group rates:
- True positive rate (TPR) = TP / (TP + FN), also called recall or sensitivity
- False positive rate (FPR) = FP / (FP + TN)
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / total
Cross-group differences are reported as the max-minus-min gap across groups: equal opportunity difference is the TPR gap; equalized odds difference is the larger of the TPR and FPR gaps; predictive parity difference is the precision gap.
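The per-group rates and cross-group gaps defined above can be computed from the four confusion-matrix counts. A sketch under the stated definitions (the data layout is invented for illustration):

```python
# Per-group rates from TP/FP/TN/FN counts, guarding against empty
# denominators, then max-minus-min gaps across all groups.
def group_rates(tp, fp, tn, fn):
    return {
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def cross_group_differences(per_group: dict[str, dict[str, float]]):
    def gap(metric):
        vals = [m[metric] for m in per_group.values()]
        return max(vals) - min(vals)
    tpr_gap, fpr_gap = gap("tpr"), gap("fpr")
    return {
        "equal_opportunity": tpr_gap,
        "equalized_odds": max(tpr_gap, fpr_gap),
        "predictive_parity": gap("precision"),
    }
```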
Kolmogorov-Smirnov for score distributions
The score distribution view uses a two-sample Kolmogorov-Smirnov test to compare each group's scores against the overall distribution. The K-S statistic is the largest gap between the two empirical cumulative distribution functions. A statistic of 0.0 means the distributions are identical; 1.0 means they are completely disjoint. The reported p-value uses the standard Kolmogorov asymptotic formula and tells you how surprising that gap would be under the null hypothesis that both samples come from the same distribution.
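The K-S statistic itself is simple to compute from first principles: the largest vertical gap between the two empirical CDFs. This sketch omits the p-value calculation:

```python
# Largest absolute difference between the two empirical CDFs,
# evaluated at every observed value.
import bisect

def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # Fraction of values <= x in the sorted sample.
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```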
Small sample exclusion
Groups that make up less than the small sample exclusion percentage (default 2%) are excluded from the calculation entirely. Small samples produce unreliable ratios and including them can mask real patterns with noise. Excluded groups appear in the results with a grey "N/A" status so it's obvious they were present in the data but set aside.
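The exclusion rule reduces to a share check per group. A minimal sketch, assuming the 2% default and a simple name-to-count input:

```python
# Mark groups below the minimum share as "N/A" so they are shown in
# the results but excluded from impact-ratio calculations.
def classify_groups(sizes: dict[str, int], min_share: float = 0.02):
    total = sum(sizes.values())
    return {
        g: "included" if n / total >= min_share else "N/A"
        for g, n in sizes.items()
    }
```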