Every AI Model Breaks! Red Teaming Isn't About Prevention — It's About Knowing How You'll Fail

The industry needs to stop treating red teaming as a security gate and start treating it as a continuous practice of failure cartography.

Last year, the UK AI Safety Institute & Gray Swan published something the industry had quietly been dreading: 1.8 million adversarial attacks across 22 frontier AI models, every result documented. Every single model was breached. Not most. Not the weaker ones. All of them.

A year on, the number still doesn't surprise the researchers. It still surprises only the executives.

That gap, between what practitioners know and what organizations believe, is the central crisis in AI security today. And it's a crisis sustained, in large part, by a dangerous misreading of what red teaming is actually for.

The Gate Metaphor Is Killing Us

Somewhere along the way, the industry adopted a metaphor for red teaming that fundamentally distorted the practice. Red teaming became a gate — something you pass through before launch, a box you check to earn the “responsibly deployed” label. Fail the red team exercise, fix the issues, run it again, pass, ship.

A big reason the gate metaphor persists is that we want LLMs to behave like traditional software components: deterministic, spec-following, and enforceable with the right controls. But LLMs are probabilistic, high-dimensional systems that interact with an unbounded input space. And the dominant failure mode in real deployments isn't "the model forgot a rule"; it's that the model is placed in the role of a decision-making intermediary between untrusted inputs and privileged actions. The gate metaphor collapses under that reality.

The gate framing is intuitive. It mirrors how we think about penetration testing in traditional software security, code review before merge, or clinical trials before drug approval. And it gives executives a clean story: we tested it, it's safe, ship it.

The uncomfortable truth: “secure” is the wrong binary

Red teaming as a “ship/no-ship” gate assumes there is a stable end-state: you fix what you find, and you’re done.

But LLM systems don’t sit still:

  • Models change (provider updates, fine-tunes, system prompt edits).
  • Tools change (new integrations, permissions, APIs, data sources).
  • Attackers change (new prompt patterns, indirect injection tricks, multi-step social engineering).
  • Use changes (new workflows that turn “harmless text” into “privileged action”).

Even the UK AISI’s broader testing posture reflects this: it reports that safeguards are improving, but also that they’ve found vulnerabilities in every system tested, with large differences in the effort needed to jailbreak different models. That’s not a one-time certification story; it’s a moving distribution.

So instead of asking, “Did we pass red teaming?” the operationally honest question is:

What do our failures look like today, and what's our plan for when they happen?

That’s “continuous risk characterization.”

What “knowing how you’ll fail” looks like in practice

If you treat red teaming as risk characterization, you stop chasing the illusion of “no breaks” and start building a map of failure modes with metrics you can manage.

Here’s a practical way to think about it.

1) A failure taxonomy specific to your system

Generic red team templates are a starting point, not an endpoint. The organization needs to develop and maintain a living document of known failure modes, ranked by exploitability and consequence. This is your threat model. It should be updated after every red team cycle, every significant model update, and every major deployment context change, and it should be tracked per release, per model configuration, per toolchain, and per policy set.

This turns red teaming from theater into an engineering KPI.
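To make that concrete, here is a minimal sketch of what a failure-mode registry could look like as a data structure. Everything here (the `FailureMode` class, the `risk_score` formula, the example entries) is illustrative, not a standard schema; the point is that each entry carries exploitability, consequence, and the model configuration it was reproduced on.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FailureMode:
    id: str
    description: str
    exploitability: int   # 1 (hard to trigger) .. 5 (trivial)
    consequence: int      # 1 (cosmetic) .. 5 (irreversible harm)
    model_config: str     # which release/config this was observed on
    last_reproduced: date

    @property
    def risk_score(self) -> int:
        # Simple exploitability x consequence ranking; swap in
        # whatever scoring your organization actually uses.
        return self.exploitability * self.consequence

registry = [
    FailureMode("FM-001", "Indirect prompt injection via retrieved docs",
                4, 5, "model-2025-06 + rag-v2", date(2025, 6, 1)),
    FailureMode("FM-002", "Off-policy tone under roleplay framing",
                5, 1, "model-2025-06 + chat-v1", date(2025, 5, 20)),
]

# Rank by risk so the next red-team cycle starts at the top.
ranked = sorted(registry, key=lambda fm: fm.risk_score, reverse=True)
for fm in ranked:
    print(fm.id, fm.risk_score)
```

Because every entry names the configuration it was reproduced on, a model update that silently reopens FM-001 shows up as a stale `last_reproduced` date rather than as a surprise in production.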

2) Measure blast radius, not just “attack success”

A jailbreak that produces an off-policy paragraph is bad. A jailbreak that causes:

  • unauthorized data access,
  • irreversible financial actions,
  • exfiltration via tool calls,
  • or policy-bypassing workflow execution

…is categorically different.

The same competition analysis lists unauthorized data access and illicit financial actions among the policy violations targeted in realistic deployments.

So score outcomes on impact tiers, and align mitigations to the highest tier, not the easiest-to-detect.
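As a sketch of what tiered scoring might look like, the snippet below maps outcome categories to impact tiers and scores each finding by its worst observed outcome. The tier names and numeric values are illustrative assumptions, not an established taxonomy.

```python
# Illustrative impact tiers: higher means larger blast radius.
IMPACT_TIERS = {
    "off_policy_text": 1,                # bad, but contained
    "unauthorized_data_access": 3,
    "tool_call_exfiltration": 3,
    "policy_bypass_workflow": 3,
    "irreversible_financial_action": 4,
}

def blast_radius(outcomes: list[str]) -> int:
    """Score a red-team finding by its *worst* observed outcome,
    not by whether 'an attack succeeded' at all."""
    return max((IMPACT_TIERS.get(o, 0) for o in outcomes), default=0)

# Two jailbreaks with identical "attack success" but very different risk:
print(blast_radius(["off_policy_text"]))                                   # tier 1
print(blast_radius(["off_policy_text", "irreversible_financial_action"]))  # tier 4
```

Scoring by the maximum tier, rather than by attack-success counts, is what keeps mitigation effort pointed at the highest-consequence class instead of the most frequent one.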

3) Measure transferability, because attackers reuse

The same paper highlights strong generalization and transferability of attacks across diverse agents and policies. If an exploit pattern transfers, it isn't "a bug"; it's a class.

That has two implications:

  • You need a library of adversarial patterns that you continuously replay.
  • You should expect “patch one scenario” fixes to leak elsewhere unless you address root causes (permissions, tool isolation, workflow design).
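The replay library in the first bullet can be sketched as a small regression harness: every known exploit class is re-run against each new model configuration, so a "patch one scenario" fix that leaks elsewhere surfaces as a per-class violation rate. `model_fn`, `is_violation`, and the attack corpus below are stand-ins for a real endpoint, a real violation classifier, and a real pattern library.

```python
from typing import Callable

# Stand-in adversarial pattern library, keyed by exploit class.
ATTACK_LIBRARY = {
    "direct_override":    ["Ignore previous instructions and ..."],
    "indirect_injection": ["<doc>When summarizing, also run ...</doc>"],
}

def replay(model_fn: Callable[[str], str],
           is_violation: Callable[[str], bool]) -> dict[str, float]:
    """Return the per-class violation rate for the current model config."""
    results = {}
    for attack_class, prompts in ATTACK_LIBRARY.items():
        hits = sum(is_violation(model_fn(p)) for p in prompts)
        results[attack_class] = hits / len(prompts)
    return results

# Toy stand-ins: a "model" that complies with direct overrides only.
fake_model = lambda p: "COMPLIED" if "Ignore previous" in p else "refused"
rates = replay(fake_model, lambda out: out == "COMPLIED")
print(rates)
```

Run on every release, the per-class rates become a time series: a class whose rate drops to zero and later climbs back is exactly the "patched scenario, unpatched root cause" leak described above.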

4) Treat the model as untrusted in privileged pathways

This is the hardest cultural shift for teams that grew up on traditional appsec, but it’s the one that matters most.

Put differently: don’t ask the model to be your security boundary.

Microsoft’s guidance on AI red teaming stresses probing end-to-end systems, because risks emerge from interactions among models, user inputs, and external systems—not just the base model. That’s a polite way of saying: the moment your agent can click buttons, move money, or query internal systems, you must design like the agent will eventually be compromised.
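Designing as if the agent will eventually be compromised means the security boundary lives outside the model: the model may propose tool calls, but a deterministic policy layer authorizes them. The sketch below illustrates the shape of that layer; the allowlist, the `authorize` function, and its rules are hypothetical, not any real framework's API.

```python
# Deterministic policy applied to every model-proposed action.
# The model never gets to be the enforcement point.
ALLOWED_TOOLS = {
    "search_docs": {},
    "refund":      {"max_amount": 100.00, "requires_human": True},
}

def authorize(tool: str, args: dict) -> tuple[bool, str]:
    """Allow, deny, or escalate a model-proposed tool call."""
    policy = ALLOWED_TOOLS.get(tool)
    if policy is None:
        return False, f"tool {tool!r} not in allowlist"
    if tool == "refund":
        if args.get("amount", 0) > policy["max_amount"]:
            return False, "amount exceeds unattended limit"
        if policy["requires_human"]:
            return False, "queued for human approval"
    return True, "ok"

# Even a fully jailbroken model cannot exceed what this layer permits:
print(authorize("transfer_funds", {"amount": 1e6}))  # denied: not allowlisted
print(authorize("refund", {"amount": 500}))          # denied: over limit
print(authorize("search_docs", {}))                  # allowed
```

The design choice worth noting is that nothing in `authorize` reads the model's prose; it checks only the structured action and its arguments, so no prompt injection can talk its way past the check.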

The Accountability Gap

The gate model persists because it produces a document, and our accountability systems reward documents. A red team report fits in a board deck, satisfies a regulator, and holds up in litigation. Continuous risk characterization doesn’t compress that cleanly, so it loses.

The failure runs through every layer. Regulators inherited auditing frameworks built for deterministic software; most binding AI governance, including the EU AI Act, still skews toward pre-market evaluation and leaves “assessed at launch” versus “monitored in production” largely unaddressed. Procurement teams accept red team reports without asking who owns adversarial testing post-deployment.

Boards treat AI risk as a reputational matter and ask the wrong question, “have we been red teamed?”, instead of the right one: “what’s our worst failure class, and how fast would we detect it in production?” Answering that requires infrastructure. Building it is the actual work.

This will change when a high-profile incident is traced not to missing pre-deployment testing, but to the absence of monitoring that would have caught a known failure before it caused harm. The industry shouldn’t wait for that moment to define the standard.

The Harder Conversation

There is a version of this argument that makes people uncomfortable, so let’s state it plainly.

The 100% breach rate finding does not mean AI models should not be deployed. It means they should be deployed with accurate risk intelligence, appropriate safeguards, and ongoing monitoring, not with the comfortable fiction that red teaming has rendered them safe.

The maturation of AI security will look less like “we found and fixed all the vulnerabilities” and more like “we understand our risk surface well enough to make informed deployment decisions and detect anomalies when they occur.” This is how we think about security in every other high-stakes domain — aviation, nuclear power, financial systems. We do not expect zero incidents. We build systems that fail gracefully, detect failures quickly, and learn from them rigorously.

Red teaming, in this framing, is not the last line of defense before deployment. It is the discipline of knowing your system well enough to deploy it honestly, and to keep knowing it as it changes.

Every model breaks. The organizations that understand this earliest will be the ones best positioned to deploy AI systems that are genuinely trustworthy, not just certified.