1. Purpose
This policy establishes the minimum validation and testing standards for AI models at [Organization Name]. It specifies what must be tested, who performs the testing, when tests are required, and what evidence must be produced. The goal is to catch errors, bias, and performance problems before they reach production, and to detect degradation after deployment.
2. Scope
This policy applies to:
- All AI and machine learning models before initial deployment.
- All model updates, retraining, or fine-tuning before promotion to production.
- All third-party models integrated into organizational systems.
- All models in production (ongoing monitoring and periodic revalidation).
3. Testing dimensions
Every AI model must be evaluated across the following dimensions. The depth of testing is proportional to the risk classification.
3.1 Functional performance
- Accuracy, precision, recall, F1, or equivalent metrics appropriate to the task.
- Performance measured against a held-out test set that was not used during training or hyperparameter tuning.
- Comparison against a baseline (previous model version, simple heuristic, or human performance).
- Acceptance thresholds defined before testing begins, not after reviewing results.
3.2 Bias and fairness
- Performance disaggregated across protected groups (gender, age, ethnicity, disability) where applicable and where data permits.
- Disparate impact analysis: does the model produce materially different outcomes for different groups?
- Statistical fairness metrics (e.g., equalized odds, demographic parity, calibration) selected based on the use case.
- High-risk systems require documented bias testing with results recorded in the model card.
3.3 Security and adversarial testing
- Prompt injection and jailbreak testing for LLM-based systems.
- Adversarial input testing: does the model produce dangerous or unexpected outputs when given deliberately crafted inputs?
- Data poisoning assessment: could the training data have been tampered with?
- Model extraction and inversion risk assessment for high-value models.
- Supply chain review: are model dependencies (libraries, pre-trained weights) from trusted sources?
3.4 Reliability and stress testing
- Behavior under edge cases, unusual inputs, and out-of-distribution data.
- Performance under load (latency, throughput) at expected and peak volumes.
- Graceful degradation: does the system fail safely when it encounters conditions outside its operating envelope?
- Rollback testing: can the system be reverted to the previous version without data loss or service interruption?
3.5 Data quality validation
- Training, validation, and test sets verified for no overlap (data leakage check).
- Data quality metrics (completeness, accuracy, freshness) confirmed against standards in the AI Training Data Sourcing Policy.
- Feature distributions in production compared against training data distributions (drift baseline).
4. Independent validation
For high-risk AI systems, validation must be performed by a party independent of the development team:
Medium and low-risk systems may be validated by the model owner with peer review.
- The validator must not have been involved in model design, development, or training.
- The validator has access to test data, model documentation, and testing infrastructure.
- Validation findings are reported directly to the AI Governance Lead, not filtered through the development team.
- The validator may be an internal team (e.g., risk, audit) or an external assessor.
5. When testing is required
| Trigger | Testing scope |
|---|---|
| Initial deployment (new model) | All 5 dimensions. Independent validation for high-risk. |
| Model retrain or fine-tune | Performance, bias, and data quality. Security if architecture changed. |
| Data pipeline change | Data quality validation and drift check. |
| Environment change (infrastructure, dependencies) | Reliability and stress testing. |
| Periodic revalidation | Quarterly for high-risk, semi-annually for medium, annually for low. |
| Post-incident | Targeted testing based on incident root cause. |
6. Test evidence and documentation
Every validation must produce a test report that includes:
Test reports are stored in the evidence library and linked to the model card in the AI inventory.
- Model identifier and version tested.
- Test date and tester identity.
- Test data description (source, size, split methodology).
- Metrics measured and results achieved.
- Pass/fail determination against pre-defined thresholds.
- Bias testing results with demographic breakdowns (where applicable).
- Security testing results and any vulnerabilities identified.
- Findings, recommendations, and required remediations.
- Sign-off from the validator.
7. Production monitoring
After deployment, ongoing monitoring must track:
Material drift or performance degradation triggers a revalidation cycle per section 5.
- Model performance against agreed metrics (alerting on degradation beyond defined thresholds).
- Input data distribution drift (feature drift, concept drift).
- Output distribution changes that may indicate model behavior shift.
- Fairness metrics over time (are bias patterns emerging post-deployment?).
- Error rates, latency, and availability.
8. Third-party model testing
For third-party models (APIs, foundation models, vendor solutions):
- The organization must conduct its own evaluation, even if the vendor provides test results.
- Evaluate on data representative of the organization's use case, not generic benchmarks.
- Test for bias using the organization's demographic context.
- Assess prompt injection and safety risks for LLM-based services.
- Re-test when the vendor releases model updates (request change notifications contractually).
9. Roles and responsibilities
| Role | Testing responsibilities |
|---|---|
| Model Owner | Defines acceptance criteria, coordinates testing, acts on findings, signs off on medium/low-risk results. |
| Development team | Executes functional, bias, and data quality tests. Documents results. |
| Independent validator | Validates high-risk systems. Reports findings directly to AI Governance Lead. |
| Security team | Conducts adversarial, prompt injection, and supply chain testing. |
| AI Governance Lead | Reviews test reports, tracks revalidation schedules, escalates failures. |
10. Regulatory alignment
- EU AI Act: Article 9 (risk management including testing), Article 10 (data quality), Article 15 (accuracy and reliability).
- ISO/IEC 42001: Clause 8.4 (AI system verification and validation).
- NIST AI RMF: MEASURE function (MS-1 through MS-4: assessment methods and metrics).
- OWASP AI Testing Guide: Security, privacy, and responsible AI testing pillars.
11. Review
This policy is reviewed annually or when triggered by new testing methodologies, regulatory changes, or patterns in validation failures.
Document control
| Field | Value |
|---|---|
| Policy owner | [AI Governance Lead] |
| Approved by | [AI Governance Committee] |
| Effective date | [Date] |
| Next review date | [Date + 12 months] |
| Version | 1.0 |
| Classification | Internal |