What is AI red teaming?

AI Glossary

AI red teaming

red teaming, adversarial AI testing, offensive AI testing

AI red teaming is deliberately adversarial testing of a system, meant to find its weak points, safeguard bypasses and harmful outputs before it reaches users.

Involves deliberately attacking the system to surface vulnerabilities before deployment.
Looks for safeguard bypasses, susceptibility to prompt injection and harmful responses.
Complements ordinary quality evaluation, because it probes behavior under pressure rather than typical scenarios.

AI red teaming is a testing method in which a team deliberately acts adversarially toward a system in order to provoke undesirable behavior. Instead of checking whether the model performs well on typical tasks, red teaming probes the edges: attempts to bypass its rules, susceptibility to prompt injection, data leakage, and the generation of harmful content. The name comes from security practice, where the "red team" plays the role of the attacker.

The difference from standard model quality evaluation is significant: evaluation measures effectiveness on planned cases, while red teaming checks how the system behaves under pressure and against a user acting in bad faith. One answers the question "does it work well," the other "how can it be broken."

In a company deployment, red teaming precedes making the system available and is repeated after major changes. Its findings feed directly into the design of guardrails — every gap found points to where additional protection is needed. It is often combined with automated tests and human work, because some vulnerabilities only surface under a creative, non-obvious attack.

Related terms