What is Jailbreak (bypassing safeguards)?

AI Glossary

Jailbreak (bypassing safeguards)

bypassing safeguards, breaking a model's safeguards, jailbreak, jailbreaking

A jailbreak is a prompt crafted to get around a model's rules and safeguards and coax it into answers it would normally refuse. The attacker manipulates the instruction itself, not the data the model is processing.

A jailbreak is a prompt that circumvents a model's rules to extract a forbidden or unsafe answer.
It differs from prompt injection: here the attacker acts directly on the model's instruction rather than hiding instructions inside external data.
Common techniques include role-play, hypothetical scenarios and gradually softening the rules; the defenses are guardrails and red-team testing.

A jailbreak is the deliberate phrasing of an instruction so that the model ignores its own rules and safety guidelines and does what it would normally refuse to do. The attacker does not exploit a flaw in the code; instead they manipulate the very way the question is asked — for example, telling the model to play a fictional character with no limits, framing the situation as purely hypothetical, or step by step softening its rules until it starts producing forbidden answers.

The key distinction is from prompt injection. In a jailbreak the attacker controls the instruction sent to the model directly, and wants it to break its rules toward the attacker themselves. In prompt injection the malicious instruction is hidden inside data the model processes anyway — in the content of a web page, document or message — and it is that data that hijacks the model's behavior. Put simply: a jailbreak works through the user's instruction, while prompt injection works through poisoned input content. Both lead to broken rules, but by different routes, and they call for different defenses.

In an enterprise deployment, no single method eliminates the risk entirely. The usual approach combines guardrails that filter input and output, tight limits on the model's permissions, and systematic red-teaming — controlled attempts to bypass the safeguards that surface vulnerabilities before an outsider does.

Related terms