Retrieval-augmented generation (RAG)

Retrieval-augmented generation (RAG) is a technique that lets a language model answer questions using documents it retrieves at query time, rather than relying only on what it memorized during training.

When a user asks something, the system first searches a knowledge store (often a vector database holding embeddings of your documents), pulls back the most relevant passages, and inserts them into the prompt. The model then writes its answer grounded in that retrieved text.

The appeal is practical. You can keep the model current without retraining it, point it at private or proprietary content it never saw, and reduce the rate at which it makes things up. For governance teams, RAG also changes the risk picture in ways that are easy to underestimate.

How RAG works

A typical RAG pipeline has a few stages. Documents are split into chunks and converted into embeddings, which are numeric representations stored in an index. At query time the user's question is also embedded, and the system retrieves the chunks whose embeddings are closest to it.
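
The stages above can be sketched in a few lines of Python. This is a toy illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and every name here (`embed`, `chunk`, `build_index`, `retrieve`) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumed stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word windows."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """The index is a list of (chunk text, chunk embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


docs = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
index = build_index(docs)
top = retrieve(index, "How many days do refunds take?", k=1)
print(top)
```

A real embedding model captures meaning rather than word overlap, so it can match a question to a passage that shares no vocabulary with it. The control flow is the same either way: embed once at indexing time, embed the question at query time, and rank by similarity.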

Those retrieved chunks are assembled into a context window alongside the original question and any system instructions. The model reads all of it and produces a response. Many production systems add a reranking step to reorder retrieved passages by relevance, and many cite the source documents back to the user.
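
As a rough sketch of that assembly step, the function below orders passages by score, numbers them, and builds a single prompt string. The `score` and `source` fields, the `build_prompt` name, and the prompt wording are all assumptions made for illustration. Sorting by a precomputed score is only a placeholder for reranking, which in practice usually means scoring each passage against the question with a separate model.

```python
def build_prompt(system: str, question: str, passages: list[dict], top_n: int = 3) -> str:
    """Assemble system instructions, numbered passages, and the question."""
    # Placeholder rerank: highest score first, then keep the top_n passages.
    ranked = sorted(passages, key=lambda p: p["score"], reverse=True)[:top_n]
    # Number each passage and keep its source so the answer can cite it.
    context = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(ranked, start=1)
    )
    return (
        f"{system}\n\n"
        f"Context passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the passages above and cite them by number."
    )


passages = [
    {"source": "holidays.md", "text": "The office is closed on public holidays.", "score": 0.21},
    {"source": "refunds.md", "text": "Refunds are issued within 14 days.", "score": 0.87},
]
prompt = build_prompt("You are a support assistant.", "How long do refunds take?", passages)
print(prompt)
```

Numbering the passages and carrying the source name into the prompt is what makes it possible to cite sources back to the user.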

The quality of a RAG answer depends as much on retrieval as on the model. If the index returns the wrong passages, even a strong model will give a confident wrong answer. This is why teams test retrieval and generation separately as well as end to end.

Why RAG matters for governance

RAG matters to AI governance for four reasons.

Grounding. Answers are tied to specific source documents instead of the model's parametric memory, which makes it easier to check whether a claim is supported.

Hallucination reduction. Giving the model relevant context lowers the chance it invents facts, though it does not eliminate the risk. A model can still misread or contradict the passages it was given.

Data provenance. Because answers trace back to retrieved sources, you can show where information came from. That supports auditability and helps satisfy transparency expectations.

A new attack surface. The retrieval store becomes part of the trust boundary. If an attacker can write to the documents being indexed, they can plant instructions or false facts that the model later retrieves and acts on. Planted instructions are the indirect prompt injection problem, and it arises in any system that places externally sourced content in the model's context.

The retrieval store as a risk surface

The knowledge base is now a security and compliance concern, not just an engineering convenience.

Access control matters at the chunk level. If the index mixes documents with different permissions, a user might retrieve passages they should not see. Many incidents trace back to over-broad indexing rather than a model flaw.
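
One way to picture chunk-level access control is a permission filter applied to retrieved chunks before anything reaches the prompt. The sketch below assumes each chunk carries the set of groups allowed to read it; the `Chunk` type, the `allowed_groups` field, and `filter_by_access` are hypothetical names, and many vector stores would apply an equivalent metadata filter inside the query instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: frozenset[str]  # groups permitted to read this chunk


def filter_by_access(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("Expense claims are due monthly.", "handbook.md", frozenset({"staff", "hr"})),
    Chunk("Salary bands for 2024.", "compensation.xlsx", frozenset({"hr"})),
]

visible = filter_by_access(retrieved, user_groups={"staff"})
print([c.source for c in visible])
```

Filtering after retrieval is simple to reason about, but it can leave a user with fewer passages than requested. Filtering inside the search query avoids that, at the cost of depending on the store's own filtering features.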

Data sensitivity travels with the documents. If you index customer records, health data, or confidential contracts, that data can surface in answers and in logs. Privacy obligations such as GDPR purpose limitation and data minimization apply to what you put in the store.

Poisoning is a real threat. Content that gets ingested from the open web, shared drives, or user uploads can carry hidden instructions. Treat ingested content as untrusted input.

How RAG systems are evaluated

Evaluating RAG means measuring retrieval and generation separately, then together.

Faithfulness measures whether the generated answer is actually supported by the retrieved context, rather than adding unsupported claims. An unfaithful answer is a hallucination even when the retrieval was correct.

Contextual precision and contextual recall measure retrieval quality. Precision asks whether the retrieved passages are relevant and ranked sensibly. Recall asks whether the passages that contained the answer were retrieved at all.
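
A minimal sketch of the two retrieval metrics, assuming a labeled example that lists which chunk IDs actually contain the answer. The function names and chunk IDs are made up for illustration. This version ignores rank order; evaluation tools that report contextual precision often weight higher-ranked passages more heavily.

```python
def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# One labeled question: chunks c1 and c2 contain the answer.
relevant_ids = {"c1", "c2"}
retrieved_ids = ["c3", "c1", "c7"]

p = precision(retrieved_ids, relevant_ids)
r = recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here one of the three retrieved chunks is relevant, and one of the two relevant chunks was found. Averaging these scores over a labeled question set gives a retrieval baseline that is independent of the generation model.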

Answer relevance checks whether the response addresses the actual question. Teams often combine these with human review on a sample, and some use a separate model as a judge to score faithfulness at scale.

Governance implications

RAG does not remove governance work; it relocates it. Document what goes into the index and why, who can access which chunks, and how often the store is refreshed. Keep retrieval logs so you can reconstruct why a given answer was produced, which supports incident investigation and audit.

The expectations around testing, monitoring, and record keeping found in frameworks such as the EU AI Act and ISO/IEC 42001 apply to RAG systems as they do to other AI systems. The retrieval pipeline is part of the system, so its data sources, access rules, and evaluation results belong in your technical documentation.

FAQ

Does RAG stop hallucinations?

No. RAG lowers the rate of fabricated answers by grounding responses in retrieved text, but the model can still misinterpret a passage, blend it with its own assumptions, or answer confidently when retrieval returns nothing relevant. Measuring faithfulness is how you catch these cases. Treat RAG as a strong mitigation, not a guarantee.

What is the difference between RAG and fine-tuning?

Fine-tuning changes the model's weights so it learns a style or domain. RAG leaves the model unchanged and supplies fresh information at query time. RAG is easier to update, since you just change the documents, and it gives you provenance. Fine-tuning is better when you need the model to adopt a behavior or format consistently. Many teams use both.

Is the vector database a security risk?

It can be. The retrieval store holds your indexed content, so weak access controls can leak sensitive passages, and writable sources can be poisoned with hidden instructions. Apply the same access control, data classification, and input validation you would apply to any system holding production data.

What is indirect prompt injection in a RAG system?

It is when malicious instructions are hidden inside documents that later get retrieved and placed in the model's context. The model may treat that text as a command. Because the content arrives through retrieval rather than from the user, it can bypass filters that inspect only the user's input. Sanitizing and isolating retrieved content helps reduce this.

What should I log for a RAG system?

At minimum, the query, which chunks were retrieved, the source documents and versions, and the final answer. These logs let you reconstruct why an answer was produced, investigate incidents, and provide audit evidence. Be careful that logs themselves do not become an unprotected copy of sensitive data.
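
A log entry covering those minimum fields might look like the sketch below. The field names, the `make_log_record` helper, and the choice of JSON are assumptions, not a standard schema. The record stores chunk IDs, sources, and versions rather than the chunk text, which is one way to avoid duplicating sensitive passages in the log. The query and answer can still contain sensitive data, so the log needs its own protection.

```python
import json
from datetime import datetime, timezone


def make_log_record(query: str, retrieved: list[dict], answer: str) -> str:
    """Serialize one RAG interaction as a JSON line for an audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        # Reference chunks by ID, source, and version; do not copy their text.
        "retrieved": [
            {"chunk_id": r["chunk_id"], "source": r["source"], "version": r["version"]}
            for r in retrieved
        ],
        "answer": answer,
    }
    return json.dumps(record)


line = make_log_record(
    query="How long do refunds take?",
    retrieved=[
        {"chunk_id": "c1", "source": "refunds.md", "version": "v3", "text": "Refunds are issued within 14 days."}
    ],
    answer="Refunds are issued within 14 days [1].",
)
print(line)
```

Recording the document version matters because the store changes over time. Without it, you cannot tell which text the model actually saw when an old answer is investigated.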

How do I evaluate retrieval quality?

Use contextual precision and recall against a labeled set of questions with known correct source passages. Precision tells you whether retrieved passages are relevant; recall tells you whether the right passages were found at all. Pair these with faithfulness checks on the generated answers so you separate retrieval failures from generation failures.

Summary

Retrieval-augmented generation grounds model answers in documents fetched at query time, which keeps responses current, reduces fabrication, and gives you provenance. The tradeoff is that the retrieval store becomes part of the trust boundary, carrying access control, data sensitivity, and poisoning risks. Govern RAG by documenting your sources, controlling access at the chunk level, logging retrievals, and evaluating retrieval (contextual precision and recall) separately from generation (faithfulness) so you know whether a wrong answer came from retrieval or generation.
