Why isn't writing 'DO NOT SEND emails' in capitals in the instructions a safeguard?

Because it's still just a request — the agent treats it like any other line of text, and if it has inbox access it'll use it one day

Because the agent ignores text written in capital letters

Because the agent doesn't read instructions at all, only keys

Because email sending can't be blocked by any method

What does the assumption 'if an agent can do something, sooner or later it will' mean?

That every capability you grant gets used eventually — more often through a misunderstanding than deliberately — so you design for the misread scenario

That agents have ill intent and cause harm on purpose

That the agent gradually learns to get around prohibitions

That you should give the agent as many keys as possible to be effective

What is the principle of least privilege in the context of an agent?

You give the agent the narrowest access that's enough for its task — and not one key more

You give the agent full access but ask it to use it sparingly

You give all the keys first and revoke the unneeded ones only after the first error

You keep all permissions on your side and never connect the agent to tools at all

What is a 'narrowly scoped key', using the meeting-transcripts example?

A key that can do one thing — e.g. only read the transcripts, with no right to edit, delete, or view settings

A key that grants full access to the tool, but only during working hours

A password the agent must confirm with a human every time

An instruction in which you forbid the agent from editing the transcripts

How, according to the text, should you treat an agent's autonomy?

Autonomy is earned — you hand over 'live' keys only once the agent's routine is battle-tested, and you keep visibility over it

Best to let it off the leash right away, especially while you sleep, to save time

Once autonomy is granted, responsibility passes entirely to the agent

Autonomy is safe because it lowers cost, risk and maintenance all at once

Keys, not prompts: what really limits an AI agent

Imagine writing in your AI agent's instructions, in capital letters: "DO NOT SEND emails to clients." It sounds like a safeguard. It isn't. It's a request — and the agent treats requests the same way it treats any other line of text, in good faith and with no guarantee. If it has access to the inbox, it'll use it one day. I'll show you why the real boundary isn't words but the keys the agent physically holds — and how to hand out those keys so you can sleep at night.

A prompt is not a safeguard

Let me start with the terms, so nobody gets lost. An AI agent is a program that doesn't just answer questions but carries out tasks for you — reads documents, sends messages, changes entries in systems. A prompt is the instruction you give it in words: "do this, don't do that." A key (often called a credential or an API key) is something far more concrete — it's the access that lets the agent actually get into a given tool and do something there. A key is something the agent either has or doesn't.

And here's the crux many people miss: a prompt is never a permissions layer. When you write "don't delete records" in the instructions, you're resting your whole security on the model understanding you correctly and obeying every single time. That isn't a lock — it's a hope. And hope isn't a strategy when the thing on the other side runs fast, at scale, and around the clock.

This is the most common mistake I see among people starting out with automation: they confuse a statement of intent with a real constraint. A statement of intent tells the agent what you want. A real constraint decides what the agent can do. Only the second one protects you.

"If it can, it will"

There's one assumption worth adopting once and for all and never abandoning: if an agent can do something, sooner or later it will. Not because it's malicious — agents have no ill intent. It's simply that every capability you give it will be used at some point: sometimes on purpose, more often through a misunderstanding.

If an agent can send an email, one day it'll send one you didn't plan for. If it can delete a record in a database — assume it'll delete it. That's not pessimism, just sober arithmetic. You're designing not for the scenario where everything goes right, but for the one where something gets misread.

And it really does happen. The classic scenario goes like this: an agent works through a task list, picking up and carrying out each item in turn. On one of them it misinterprets the instruction — it thinks it's supposed to blast out a discount code, when that was never the point. And it does it. At the scale it was wired up for: to hundreds of thousands of recipients on the mailing list, before anyone can react. One misread row on a list, one action executed instantly and en masse.

[Image: a bunch of slender keys in a graphite space, most of them dimmed, with only one sharp key glowing green.]

Notice where the problem lies here. The agent didn't "hack" anything. It did exactly what it was able to do, because it had the key for it. Had it not had access to sending, no misreading of the task could have ended in disaster. At worst it would have stopped and asked.

Keys, not prompts

Hence the principle I regard as the foundation of all agent security: you give an agent only the keys to the rooms it genuinely needs to enter.

Picture a building with many rooms. One holds email sending, another the client database, a third the team settings. The agent gets a bunch of keys — and it can open exactly as many doors as the keys you handed it. If it has no key to the "send email" room, then it doesn't matter how badly it misreads a task: it simply won't get into that room. The boundary no longer depends on whether the model understood the instruction correctly. It depends on what the agent can physically open.
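To make the building picture concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the `Agent` class, `MissingKeyError`, and the two tool functions); it is not the API of any real agent framework. It only shows the idea that a tool which was never handed over cannot be called, however the task is misread.

```python
from typing import Callable, Dict


class MissingKeyError(PermissionError):
    """Raised when the agent asks for a tool it was never given (hypothetical)."""


class Agent:
    """Toy agent: it can only call the tools it was handed at construction."""

    def __init__(self, tools: Dict[str, Callable[..., str]]) -> None:
        # Copy the mapping so the set of keys is fixed when the agent is built.
        self._tools = dict(tools)

    def act(self, tool_name: str, *args: str) -> str:
        tool = self._tools.get(tool_name)
        if tool is None:
            # No key, no door: the request fails before anything happens.
            raise MissingKeyError(f"no key for '{tool_name}'")
        return tool(*args)


def read_document(doc_id: str) -> str:
    # Stand-in for a real read operation.
    return f"contents of {doc_id}"


def send_email(recipient: str) -> str:
    # Stand-in for a real send operation; it exists, but is never wired in below.
    return f"email sent to {recipient}"


# The agent gets exactly one key. send_email is simply not on the bunch.
agent = Agent({"read_document": read_document})
```

With this setup, `agent.act("read_document", "notes-1")` works, while `agent.act("send_email", "client@example.com")` raises `MissingKeyError`. The refusal comes from the wiring, not from anything written in the instructions.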

In the technology world this principle has a name: least privilege. In plain terms it goes like this: you give every tool — and every agent — the narrowest possible scope of access that's enough for its task. Not one key more. Not because you don't trust the agent, but because you don't want to rely on trust where a plain lock will do.

The same applies to people on a team, so the analogy is intuitive. You don't give a new colleague access to everything from day one "just in case." You give them exactly what they need for their role. You treat an agent the same way — the difference being that an agent acts faster and at greater scale, so a surplus key hurts more.

Narrowly scoped keys

Keys aren't all-or-nothing. You don't have to choose between "full access" and "no access" — you can issue a key that opens only one drawer in a room and doesn't even touch the rest. These are scoped credentials.

Take a concrete example. Say an agent is meant to work with the records of your meetings — the transcripts. Instead of giving it full access to the tool where those transcripts live, you issue a key that can do just one thing: read the transcripts. It can't edit them. It can't delete them. It has no view into the team settings or anything beyond what you assigned it. One key, one capability: read transcripts, and that's all.

This is exactly how you build a real permissions layer — not with a sentence in the instructions, but with the scope of the key. The rule is simple, and worth repeating at every new connection: match the scope of each key to the narrowest capability that's actually needed. Before you connect the agent to the next tool, ask yourself one question: what does it genuinely need to be able to do here, and what would merely be "convenient"? Convenience usually means excess permissions — and excess permissions are exactly the keys you don't want to give it.
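The transcript example can be sketched the same way. This is an illustration under stated assumptions, not how any particular tool implements scopes: `ScopedKey`, `TranscriptStore`, and the scope names `transcripts:read` and `transcripts:delete` are all made up here. The point is that the check lives in the tool, against the key, and not in the agent's instructions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ScopedKey:
    """Hypothetical credential: an owner plus the exact scopes it was granted."""
    owner: str
    scopes: FrozenSet[str]


class TranscriptStore:
    """Toy transcript tool that checks the key's scope before every operation."""

    def __init__(self, transcripts: Dict[str, str]) -> None:
        self._transcripts = dict(transcripts)

    def _require(self, key: ScopedKey, scope: str) -> None:
        if scope not in key.scopes:
            raise PermissionError(f"{key.owner} lacks scope '{scope}'")

    def read(self, key: ScopedKey, meeting_id: str) -> str:
        self._require(key, "transcripts:read")
        return self._transcripts[meeting_id]

    def delete(self, key: ScopedKey, meeting_id: str) -> None:
        self._require(key, "transcripts:delete")
        del self._transcripts[meeting_id]


store = TranscriptStore({"weekly-sync": "Agreed to ship on Friday."})

# One key, one capability: read transcripts, and that's all.
agent_key = ScopedKey(owner="summary-agent", scopes=frozenset({"transcripts:read"}))
```

Here `store.read(agent_key, "weekly-sync")` returns the transcript, and `store.delete(agent_key, "weekly-sync")` raises `PermissionError` and leaves the transcript untouched. Widening the agent's access later would mean issuing a new key with an extra scope, which is a deliberate act and not a side effect of convenience.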

Autonomy is earned, not granted by default

There's one more temptation worth resisting: the urge to let the agent off the leash right away, to have it run on its own, ideally even while you sleep. It's tempting — and that's exactly why it's dangerous. An agent's autonomy is earned, not granted by default.

The more you automate and the more freedom you hand the agent, the higher three things climb at once: cost, risk, and maintenance. So don't give an agent "live" keys — the ones that really do something in real systems — until its routine is battle-tested. First you watch how it works under your eye, on safe data. Only once it does the right thing time after time do you hand it a slice of autonomy.
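One way to picture "no live keys until the routine is battle-tested" is a dry-run gate. The sketch below is only that, a sketch: `GatedAction`, its `live` flag, and the `send_for_real` stand-in are hypothetical names, not taken from any real tool. The action starts in dry-run mode, records what it would have done, and touches the real system only after a person flips the flag.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GatedAction:
    """Hypothetical wrapper: dry-run by default, live only once a human says so."""
    name: str
    run_live: Callable[[str], str]
    live: bool = False  # autonomy is off until it has been earned
    log: List[str] = field(default_factory=list)

    def __call__(self, payload: str) -> str:
        if not self.live:
            # Record the intended action without touching the real system.
            entry = f"DRY RUN {self.name}: {payload}"
            self.log.append(entry)
            return entry
        result = self.run_live(payload)
        # Live actions are logged too, so the log covers both modes.
        self.log.append(f"LIVE {self.name}: {payload}")
        return result


outbox: List[str] = []  # stands in for the real email system


def send_for_real(message: str) -> str:
    outbox.append(message)
    return "sent"


send_email = GatedAction(name="send_email", run_live=send_for_real)
```

While `send_email.live` is `False`, calling `send_email("hello")` only adds a line to `send_email.log` and leaves `outbox` empty. Once you have reviewed enough dry runs, you set `send_email.live = True` yourself. The log keeps growing either way, so you can still see what the agent did after you hand over the live key.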

And here's the thing that's easy to forget: automation doesn't end your responsibility. Greater agent autonomy doesn't mean "ticked off, I can forget about it." Someone is accountable for every automation — someone owns it, checks in on it, and knows whether it still does what it was meant to do. You hand the agent the action, but you don't hand it visibility over what's happening. That stays on your side.

Finally — the most important shift in how you think. When the agent makes a mistake (and it will), don't paper over the symptom and pretend nothing happened. Slip-ups are data. Every error tells you exactly which key was too broad or which instruction too vague. So you narrow the capability or sharpen the instruction so the same mistake can't recur — and you treat it as a lesson, not a defeat. The system grows stronger for it, rather than more brittle.

Because trust in an agent doesn't come from how nicely you asked it. It comes from what that agent physically can't do. The next time you connect an agent to another tool, don't start with the instructions. Start with the question of the key: what's the narrowest access that's enough here, and can you give it exactly that much, not one key more?

