
AI Training Data Sourcing Policy

Establishes standards for the provenance, quality, licensing, and bias review of data used to train AI systems.

1. Purpose

This policy defines how [Organization Name] sources, evaluates, and documents data used to train, fine-tune, validate, and test AI models. It ensures that all training data has clear provenance, appropriate licensing, and acceptable quality, and has been reviewed for bias, before it enters any AI pipeline.

2. Scope

This policy applies to:

  • All data used to train, fine-tune, or adapt AI models (including pre-training, instruction tuning, RLHF, and retrieval augmentation).
  • All validation and test datasets used to evaluate model performance.
  • All data sourced internally, purchased from vendors, scraped from the internet, or generated synthetically.
  • Both internally developed models and third-party models fine-tuned by the organization.

3. Definitions

  • Training data: Data used during the model learning process to establish parameters and patterns.
  • Validation data: Data used during development to tune hyperparameters and prevent overfitting. Must not overlap with training data.
  • Test data: Data used after development to evaluate final model performance. Must not overlap with training or validation data.
  • Data provenance: The documented origin, history, and chain of custody of a dataset.
  • Data lineage: The record of how data was collected, transformed, and processed before use.
  • Synthetic data: Artificially generated data that preserves statistical properties of real data without containing actual personal or proprietary information.
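The non-overlap rules in the definitions above can be enforced mechanically. The sketch below is illustrative, not part of the policy: it fingerprints records by normalized content so that near-identical copies are caught across splits. The helper names and hashing choice are assumptions.

```python
import hashlib

def fingerprint(record: str) -> str:
    """Stable content hash so near-identical copies compare by value."""
    return hashlib.sha256(record.strip().lower().encode("utf-8")).hexdigest()

def split_overlaps(train, validation, test):
    """Return fingerprints shared by two or more of the three splits.

    An empty result means training, validation, and test data are
    disjoint, as the definitions above require.
    """
    t, v, s = ({fingerprint(r) for r in split}
               for split in (train, validation, test))
    return (t & v) | (t & s) | (v & s)
```

A non-empty result should block the pipeline until the duplicated records are removed from all but one split.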

4. Data sourcing requirements

Before any dataset is used for AI training, it must pass the following checks:

4.1 Provenance documentation

  • The source of the data must be identified and documented (internal system, vendor, public dataset, web scrape, synthetic generation).
  • The date of collection or acquisition must be recorded.
  • The chain of custody from source to AI pipeline must be traceable.
  • If the data has been preprocessed or transformed, the transformations must be documented.
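As an illustration of the provenance checklist above, a dataset's provenance could be captured in a small structured record. The field names and dataclass shape are assumptions for the sketch, not mandated by this policy.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    # Source type: internal system, vendor, public dataset, web scrape,
    # or synthetic generation.
    source: str
    # Date of collection or acquisition (ISO 8601 string).
    acquired_on: str
    # Hops from the original source to the AI pipeline.
    custody_chain: list[str] = field(default_factory=list)
    # Preprocessing or transformations applied before use.
    transformations: list[str] = field(default_factory=list)

    def is_traceable(self) -> bool:
        """The chain of custody must be documented end to end."""
        return bool(self.source and self.acquired_on and self.custody_chain)
```

A record that fails `is_traceable()` would fall under the "unclear provenance" prohibition in section 8.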

4.2 Licensing and legal review

  • All external data must have a clear license that permits its use for AI training.
  • Open-source datasets must be reviewed for license terms (some prohibit commercial use or derivative models).
  • Purchased data must include explicit contractual permission for AI training purposes.
  • Web-scraped data must be reviewed for terms of service violations, copyright restrictions, and personal data content.
  • Legal review is required before using any dataset in a regulated domain (healthcare, financial services, employment).
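A simple way to operationalize the licensing checks above is a gate that routes each dataset by its license identifier. The license lists below are placeholders for illustration; an organization's approved set would be defined by Legal.

```python
# Illustrative placeholder lists; the real lists are owned by Legal.
APPROVED = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}
NEEDS_LEGAL_REVIEW = {"CC-BY-NC-4.0", "CC-BY-SA-4.0"}

def license_gate(license_id: str) -> str:
    """Route a dataset by license: approved, legal review, or blocked."""
    if license_id in APPROVED:
        return "approved"
    if license_id in NEEDS_LEGAL_REVIEW:
        return "legal-review"
    # Unknown or unlisted licenses never enter the pipeline.
    return "blocked"
```

Non-commercial and share-alike licenses are routed to review here because, as noted above, some open-source license terms prohibit commercial use or derivative models.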

4.3 Personal data assessment

  • All datasets must be scanned for personal data before use.
  • If personal data is present, the lawful basis for processing must be established per the AI Data Use Policy.
  • Anonymization, pseudonymization, or synthetic data generation must be considered to reduce privacy risk.
  • Special category data (health, biometric, financial) requires additional legal review and DPIA.
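A first-pass personal data scan can be sketched as pattern matching over records. The two patterns below (email address, US SSN format) are illustrative assumptions only; a production scan would use dedicated PII-detection tooling and a far broader pattern set.

```python
import re

# Illustrative patterns; real scans need dedicated PII tooling.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_for_personal_data(records):
    """Return (record index, pattern label) pairs for suspected hits."""
    findings = []
    for i, text in enumerate(records):
        for label, pattern in PATTERNS.items():
            if pattern.search(text):
                findings.append((i, label))
    return findings
```

Any hit triggers the lawful-basis assessment required above before the dataset can proceed.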

5. Data quality standards

EU AI Act Article 10 requires that training data for high-risk systems be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete. All training data must meet the following standards:

  • Relevance: Data must be appropriate for the AI system's intended purpose. Verify by domain expert review of sample data.
  • Representativeness: Data must represent the population or context the model will serve. Verify by demographic analysis and geographic distribution checks.
  • Accuracy: Data must be factually correct and free of systematic errors. Verify by spot-check validation and cross-referencing with ground truth.
  • Completeness: Data must not have critical gaps that would bias the model. Verify by missing value analysis and coverage assessment.
  • Temporal relevance: Data must reflect current conditions if the model operates in a changing environment. Verify by date range review and staleness checks.
  • Consistency: Data from multiple sources must be harmonized in format, schema, and semantics. Verify by schema validation and deduplication analysis.
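Two of the quality dimensions above, completeness and consistency, lend themselves to simple automated metrics. The sketch below assumes tabular records as dictionaries; the function names and thresholds are illustrative, not prescribed by this policy.

```python
def completeness(rows, required_fields):
    """Fraction of required fields that are populated across all rows."""
    total = len(rows) * len(required_fields)
    filled = sum(1 for row in rows for f in required_fields
                 if row.get(f) not in (None, ""))
    return filled / total if total else 0.0

def duplicate_rate(rows, key_fields):
    """Share of rows whose key fields duplicate an earlier row."""
    keys = [tuple(row.get(f) for f in key_fields) for row in rows]
    return 1 - len(set(keys)) / len(keys) if keys else 0.0
```

Results such as these would be recorded as the "data quality metrics" field of the dataset record described in section 7.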

6. Bias review

All training data must be reviewed for potential biases before use:

  • Representation bias: Are all relevant demographic groups, geographies, and use cases proportionally represented?
  • Historical bias: Does the data reflect historical discrimination or systemic inequities that the model could amplify?
  • Measurement bias: Are the labels or annotations consistent and free of systematic error?
  • Selection bias: Was the data collected in a way that excludes certain populations or contexts?

Bias review findings must be documented in the dataset record. Material biases that cannot be mitigated must be escalated to the AI Governance Committee before the dataset is approved for use.
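A minimal sketch of the representation check: compare each group's share of the data against a reference population and flag groups that deviate beyond a tolerance. The group key, reference shares, and 5% tolerance are illustrative assumptions; reference populations and thresholds are set by the reviewing team.

```python
from collections import Counter

def representation_gaps(records, group_key, reference, tolerance=0.05):
    """Flag groups whose share of the data deviates from the reference
    population share by more than `tolerance` (absolute difference).

    Assumes `records` is non-empty and `reference` maps each group to
    its expected share. Returns {group: (observed, expected)}.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps
```

Flagged gaps become bias review findings in the dataset record; gaps that cannot be mitigated (for example, by resampling) are escalated as described above.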

7. Dataset documentation

Every dataset used for AI training must have a dataset record (data sheet) that includes:

  • Dataset name and version.
  • Source and provenance information.
  • License type and usage restrictions.
  • Personal data assessment result.
  • Data quality metrics (completeness, accuracy, representativeness).
  • Bias review findings and mitigations applied.
  • Preprocessing and transformation steps.
  • Date of review and reviewer name.
  • Approved use cases (what this data is authorized for).
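The dataset record above maps naturally onto a structured type, which makes records machine-checkable and exportable. The sketch below is one possible shape; field names and types are assumptions for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass
class DatasetRecord:
    name: str                        # dataset name
    version: str                     # dataset version
    source: str                      # source and provenance summary
    license: str                     # license type and usage restrictions
    personal_data_assessment: str    # outcome of the personal data scan
    quality_metrics: dict            # completeness, accuracy, representativeness
    bias_findings: list              # bias review findings and mitigations
    preprocessing_steps: list        # transformations applied before use
    review_date: str                 # date of review (ISO 8601)
    reviewer: str                    # reviewer name
    approved_use_cases: list         # what this data is authorized for
```

Serializing with `asdict` yields a plain dictionary suitable for the dataset inventory maintained by the AI Governance Lead.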

8. Prohibited data sources

The following data sources must not be used for AI training without explicit AI Governance Committee approval:

  • Data scraped in violation of terms of service or applicable law.
  • Data containing personal information without lawful basis.
  • Data from jurisdictions with restrictions on cross-border AI use.
  • Data generated by or about minors without appropriate safeguards.
  • Data from competitors obtained through unauthorized means.
  • Data with unclear provenance where the original source cannot be determined.

9. Third-party model considerations

When using pre-trained third-party models (foundation models, fine-tuned models, API-based services):

  • Request documentation of the vendor's training data governance practices.
  • Assess whether the vendor's training data includes content that could create legal, ethical, or reputational risk for the organization.
  • Contractually require the vendor to notify the organization of material changes to training data composition.
  • Evaluate the vendor's compliance with EU AI Act training data transparency requirements, such as the required public summary of training content.

10. Roles and responsibilities

  • Data Owner: Approves datasets for AI use, ensures provenance documentation, maintains data quality.
  • Model Owner: Ensures training data meets quality standards, documents data in the model card, manages the data-model relationship.
  • Legal: Reviews licensing, assesses lawful basis for personal data, evaluates copyright and terms of service.
  • Data Privacy Officer: Reviews personal data assessments, advises on anonymization, ensures a DPIA is completed when required.
  • AI Governance Lead: Maintains the dataset inventory, tracks compliance, escalates issues to the AI Governance Committee.

11. Regulatory alignment

  • EU AI Act: Article 10 (data and data governance for high-risk systems), Recital 67 (training data quality).
  • GDPR: Articles 5 (data quality principles), 6 (lawful basis), 9 (special categories), 25 (privacy by design).
  • ISO/IEC 42001: Annex B (B.7 — data for AI systems).
  • NIST AI RMF: MAP function (MP-3, AI risks and benefits from third-party resources).

12. Review

This policy is reviewed annually or sooner when triggered by changes to data protection regulations, new training data sources, or audit findings related to data quality or bias.

Document control

  • Policy owner: [AI Governance Lead]
  • Approved by: [AI Governance Committee]
  • Effective date: [Date]
  • Next review date: [Date + 12 months]
  • Version: 1.0
  • Classification: Internal

