AI Glossary
Prompt injection
prompt injection attack, prompt attack
Prompt injection is an attack in which hidden instructions in the input hijack a model's behavior and coax it into breaking its own rules. It is especially dangerous for agents that read content from external sources.
- The attacker's command is hidden in text the model processes (an email, a web page, a document).
- It can lead to data disclosure, rule bypass, or unwanted use of tools.
- Defenses include separating instructions from data, input/output guardrails, and limiting the agent's permissions.
Prompt injection exploits the fact that a model treats all the text it is given as context. An attacker places a command inside data the model is going to read anyway — for example in the body of a web page, a document, or a message. The model may then treat the hidden instruction as its own and ignore its original rules.
The risk grows in systems with external content retrieval and in agents that have access to tools, because the outcome can be a data leak or the execution of an unwanted operation. There is no single complete safeguard. You limit the impact instead: separating instructions from data, applying guardrails, and narrowing the permissions the model holds.
Related terms
In guides