Guide
Decisions & comparisons
AI system security: prompt injection, data leaks, and how to defend
An AI system has its own attack surface: prompt injection, data leaks through context, tool abuse. The defense is layers, not a single filter.
- AI adds new attack vectors; it doesn't replace classic safeguards.
- The most dangerous: prompt injection and data leaks from the context.
- Layered defense: input filters, restricted permissions, a human at the point of risk.
AI has its own attack surface
A system built on a language model inherits all the classic application threats — and adds its own. The new part is that the model executes instructions written in plain text, and that text arrives from many places: from the user, from documents, from pages the model reads. The line between "data" and "command" blurs, and that's the source of most of the new risk.
AI security doesn't replace the safeguards you already have. It adds a layer you have to think about separately.
The three most important threats
Prompt injection. Prompt injection is the injection of commands that hijack the model's behavior. The attack can be direct (a user types "ignore previous instructions") or indirect — hidden in a document or on a page the model processes. The second kind is more dangerous, because the victim doesn't have to be malicious at all.
Data leaks through context. The model sees everything that lands in its context. If you put sensitive data there, or the contents of the system prompt, a suitably coaxed model can repeat it back to the user. This also covers one customer's data leaking into another's session.
Tool abuse. When a model has access to tools — a database, an API, email dispatch — one hijacked by injection can use them: delete a record, send a message, pull data. The broader the permissions, the greater the damage from a single successful attack.
Threat and defense
| Threat | What it is | Defense |
|---|---|---|
| Prompt injection | Injected commands hijack the model | Input filters, separating data from instructions, guardrails |
| Data leak | The model repeats data from the context | Session isolation, context minimization, privacy control |
| Tool abuse | An attacker triggers actions through the model | Least-needed permissions, human confirmation |
| Hallucination as fact | The model invents and acts on the invention | A source requirement, checking outputs before any action |
Operator's rule: treat every piece of text the model reads as potentially hostile — including your own documents. Security starts from assuming the injection will succeed.
Defense is layered, not singular
There's no single filter that settles the matter. Guardrails on input and output catch some attacks, but they can be bypassed, so they're the first layer, not the last. Beneath them run harder mechanisms: restricting tool permissions to the absolute minimum, isolating data between sessions and customers, logging every tool call, and a human approving high-risk decisions.
That last layer is the crucial one. Automation can prepare an action — draft an email, propose a database change — but on real risk the final "send" stays with a person. This isn't a lack of trust in the system; it's a deliberate design in which one successful attack doesn't translate straight into damage.
What this means for the rollout decision
Security isn't a stage at the end of the project. It shapes the architecture from the start: what data even reaches the context, what permissions the tools get, where the human plugs in. A system that's meant to do anything serious is designed with these layers from the outset — bolting them on later is more expensive and less effective.
Terms in this guide
Related articles
- Keys, not prompts — what really limits an AI agent
- A voice agent on your site — build it by talking, not clicking
Frequently asked questions
- What is prompt injection, in short?
- It's the injection of commands that hijack the model's behavior — for example a hidden instruction in a document or on a page the model reads, telling it to ignore the rules or disclose data.
- Are guardrails enough to make a system secure?
- No. Filters catch some attacks, but they can be bypassed. Security is layered: restricted tool permissions, data isolation, logging, and a human on high-risk decisions.
- Does self-hosting the model solve the data-leak problem?
- It reduces one risk — the data doesn't go to a provider — but it doesn't remove the others. Prompt injection and tool abuse work the same way on a model in your own server room.