
Model Lifecycle Policies

RLHF and Fine-Tuning Policy

Controls for reinforcement learning with human feedback (RLHF) and for fine-tuning data quality.

Owner: Applied Research Lead

Purpose

Mitigate risks associated with adapting foundation models through reinforcement learning with human feedback (RLHF) or supervised fine-tuning by enforcing data sourcing, rater management, and evaluation safeguards.

Scope

Covers any project that modifies a base model’s behaviour via RLHF, supervised fine-tuning, or instruction tuning, whether executed internally or by a vendor under our control.

  • Customer-specific fine-tunes for generative assistants
  • Safety/alignment tuning to reduce harmful outputs
  • Domain adaptation on proprietary datasets
  • Third-party RLHF programs where we provide data or raters

Definitions

  • RLHF: Training approach that uses human-generated preference data to fine-tune model responses.
  • Rater: Human annotator providing labels or preferences used to train reward models.
  • Reward Model: Model trained on rater feedback to guide policy updates.
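
To make the relationship between rater preferences and the reward model concrete, the sketch below shows a minimal pairwise (Bradley-Terry style) preference loss. The function name and toy scores are illustrative assumptions only and are not prescribed by this policy.

import math

# Minimal, illustrative pairwise preference loss: it is small when the reward
# model scores the rater-preferred response above the rejected one.
def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

# A rater preferred response A over response B; the reward model scores them
# 1.2 and 0.4, so the loss is low and the stated preference is respected.
print(round(pairwise_preference_loss(1.2, 0.4), 3))  # ~0.371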

Policy

All RLHF and fine-tuning initiatives must use vetted datasets, qualified raters, and reproducible pipelines. Training data must respect privacy, consent, and contractual obligations. Safety and evaluation gates must be executed after each tuning cycle, before any updated model is deployed.
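
For illustration, the gate requirement above can be expressed as a simple threshold check run after each tuning cycle. The metric names and threshold values below are hypothetical placeholders, not mandated limits; actual criteria come from the Safety Evaluation Suite.

# Hypothetical gate thresholds (illustrative values only).
GATE_THRESHOLDS = {
    "regression_pass_rate": 0.98,  # minimum share of regression tests passing
    "harmful_output_rate": 0.01,   # maximum rate on harmful-content probes
    "hallucination_rate": 0.05,    # maximum rate on factuality probes
}

def gate_passes(eval_results: dict) -> bool:
    """Return True only if every gated metric meets its threshold."""
    return (
        eval_results["regression_pass_rate"] >= GATE_THRESHOLDS["regression_pass_rate"]
        and eval_results["harmful_output_rate"] <= GATE_THRESHOLDS["harmful_output_rate"]
        and eval_results["hallucination_rate"] <= GATE_THRESHOLDS["hallucination_rate"]
    )

# A tuning cycle whose harmful-output rate exceeds the gate cannot be deployed.
print(gate_passes({"regression_pass_rate": 0.99,
                   "harmful_output_rate": 0.03,
                   "hallucination_rate": 0.02}))  # False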

Roles and Responsibilities

The Applied Research Lead oversees methodology and tooling. Data Governance approves datasets. The Responsible AI team defines alignment objectives and harmful-content policies. Vendor Management ensures outsourced raters meet contractual and ethical standards.

Procedures

Each RLHF/fine-tuning project must complete the following:

  • Data approval documenting provenance, consent, and minimization compliance.
  • Rater onboarding with guideline training, confidentiality agreements, and qualification exams.
  • Annotation quality checks with sampling, double scoring, and escalation paths (see the sketch after this list).
  • Safety alignment, including allowed/disallowed behaviours and safety reward modelling.
  • Evaluation covering regression tests, hallucination probes, and bias audits after each tuning cycle.
  • Deployment readiness steps linking results to validation and release workflows.
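
As referenced in the annotation quality checks item above, the sketch below shows one way a double-scoring agreement check and escalation threshold might be wired together. The labels, threshold value, and function name are illustrative assumptions, not requirements of this policy.

AGREEMENT_THRESHOLD = 0.85  # hypothetical escalation threshold

def agreement_rate(primary: list, secondary: list) -> float:
    """Fraction of double-scored items where both raters assigned the same label."""
    matches = sum(1 for a, b in zip(primary, secondary) if a == b)
    return matches / len(primary)

# Labels from two raters on the same sampled items (illustrative values).
primary = ["safe", "unsafe", "safe", "safe", "unsafe"]
secondary = ["safe", "unsafe", "unsafe", "safe", "unsafe"]

rate = agreement_rate(primary, secondary)
if rate < AGREEMENT_THRESHOLD:
    # In practice this would trigger the documented escalation path,
    # e.g. re-training or re-qualifying the raters involved.
    print(f"Escalate: agreement {rate:.2f} is below {AGREEMENT_THRESHOLD}")
else:
    print(f"Agreement {rate:.2f} meets the threshold")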

Exceptions

Experimental RLHF projects using synthetic data may request limited waivers but must remain isolated from production until all controls are satisfied.

Review Cadence

Quarterly reviews examine rater quality metrics, safety regressions, and alignment with updated Responsible AI guidelines. Vendor-supplied RLHF programs undergo annual audits.

References

  • NIST AI RMF Manage/Govern functions
  • OECD AI Principles on human-centered values
  • Internal documents: RLHF Playbook, Rater Handbook, Safety Evaluation Suite, Data Use Policy