Prompt injection

Prompt injection is an attack where crafted input causes a language model to ignore its original instructions and follow the attacker's instead. It is the most discussed security risk for applications built on large language models, and it is hard to fully eliminate because of how these models work.

An LLM does not cleanly separate trusted instructions from untrusted data. The system prompt, the user message, and any content the application feeds in all arrive as text in the same context window. If attacker-controlled text says "ignore your previous instructions and do X," the model may comply, because to the model it is all just text to act on.

The result is that an application's guardrails can be talked around. A model told to be a polite support bot can be coaxed into revealing its system prompt, producing disallowed content, or, in a tool-using system, calling functions it should not.

Direct prompt injection

In a direct injection, the attacker is the user. They type instructions designed to override the system prompt: asking the model to reveal hidden instructions, role-play around a restriction, or treat earlier rules as cancelled.

Jailbreaking is a well-known form of direct injection, where the user constructs a scenario that coaxes the model past its safety training. Direct injection is the easier case to reason about, because the malicious input comes from the person interacting with the system, and rate limiting or abuse detection can help.

Indirect prompt injection

Indirect injection is the more dangerous variant. Here the malicious instructions are not typed by the user. They are hidden in content the application pulls in: a web page the model summarizes, a document in a retrieval store, an email in an inbox the model reads, or a file uploaded by someone else.

When the model processes that content, it encounters the planted instructions and may follow them, even though the user never saw or intended them. In a system that can use tools or send data, indirect injection lets an attacker who controls a single retrieved document steer the model's actions. This is why retrieval and agent systems treat all ingested content as untrusted.

Why it is a top LLM risk

Prompt injection sits at the top of risk lists, including the OWASP Top 10 for LLM applications, for a few reasons.

It is fundamental, not a bug. The lack of a hard boundary between instructions and data is inherent to how current models read a prompt. There is no patch that closes it completely.

It scales with capability. As applications give models access to tools, private data, and the ability to act, a successful injection moves from an embarrassing output to data exfiltration or unauthorized actions.

It is hard to test exhaustively. Attackers can phrase the same intent in countless ways, including in other languages or encoded forms, so a filter that blocks one phrasing rarely blocks all of them.

Mitigations

There is no single fix, so defenses are layered.

Separate and label trust levels. Keep system instructions, user input, and retrieved content distinct, and make clear to the model which is authoritative. Some frameworks use structured message roles to reinforce this.

Constrain what the model can do. The strongest mitigation is reducing impact. Limit tool permissions, require human approval for consequential actions, and never let the model's output directly trigger an irreversible operation without a check.

Validate inputs and outputs. Filter known injection patterns, and check the model's output and any tool calls before acting on them. Treat retrieved and user content as untrusted by default.

Isolate and sandbox. Run tool actions with least privilege so a hijacked model still cannot reach sensitive systems. Sanitize content before it enters the context window where practical.

Test adversarially. Red team the application with injection attempts, including indirect ones planted in documents and pages, and track which phrasings get through over time.

No combination is perfect, so the realistic goal is to make injection harder and to limit the damage when it succeeds.

Governance relevance

Prompt injection is where security and governance meet. Under the EU AI Act, high-risk systems must be resilient to attempts to manipulate them, which directly implicates injection. ISO 42001 and the NIST AI Risk Management Framework expect threats like this to be identified, tested, and mitigated as part of ongoing risk management.

For governance teams, the practical asks are clear. Document that prompt injection is in your threat model. Show evidence of adversarial testing. Record the controls that limit what a compromised model can do, and tie injection incidents into your AI incident response process. The point is not to claim immunity, which no one can, but to demonstrate that the risk is understood and contained.

FAQ

What is the difference between direct and indirect prompt injection?

In direct injection the user types instructions that override the system prompt. In indirect injection the malicious instructions are hidden in content the model ingests, such as a web page, document, or email, so the user never typed them and may not know they are there. Indirect injection is more dangerous because it can hijack tool-using and retrieval systems through a single planted source.

Can prompt injection be fully prevented?

No. Current models do not enforce a hard boundary between instructions and data, so injection cannot be closed completely. Defenses reduce how often it succeeds and limit the damage when it does, mainly by constraining the model's permissions and validating its actions. Treat it as a managed risk, not a solved one.

How is prompt injection different from jailbreaking?

Jailbreaking is a kind of direct injection aimed at getting the model past its safety restrictions, for example producing disallowed content. Prompt injection is the broader category, which also includes overriding application instructions and, in indirect form, planting commands in external content to hijack behavior or tool use.

Why is indirect prompt injection a problem for RAG and agents?

These systems feed external content into the model: retrieved documents in RAG, tool results and fetched pages in agents. If any of that content carries hidden instructions, the model may follow them and, in an agent, misuse its tools. Because the content arrives through retrieval or tools rather than the user, it bypasses input checks aimed at the user.

What is the single most effective defense?

Limiting the model's capabilities. If a compromised model cannot reach sensitive data, spend money, or take irreversible actions without human approval, a successful injection produces a bad output rather than real damage. Input filtering and trust labeling help, but constraining impact is what holds up when filtering fails.

How does prompt injection relate to compliance?

Frameworks like the EU AI Act require high-risk systems to resist manipulation, and standards like ISO 42001 and NIST AI RMF expect threats to be identified, tested, and mitigated. For an auditor, you want to show injection is in your threat model, that you red team for it, and that controls limit the impact of a successful attack.

Summary

Prompt injection is an attack that makes a language model follow attacker-supplied instructions instead of its own, exploiting the fact that models do not separate trusted instructions from untrusted data. Direct injection comes from the user, while indirect injection hides instructions in content the model ingests and is the more dangerous variant for retrieval and agent systems. It cannot be fully eliminated, so the response is layered: separate trust levels, constrain the model's permissions, validate inputs and actions, isolate tools, and test adversarially. For governance, the expectation is to show the risk is in your threat model, tested, and contained, not to claim immunity.

Prompt injection