The Art of Adversarial Prompting

Traditional offensive security has choreography. You scan, enumerate, find a vector, exploit. Clean. Predictable. There is a playbook — even if it is unwritten — and a decade of tooling behind it.

LLMs broke that playbook overnight.

When the target is a language model, the attack surface is not a port, an endpoint, or a CVE. It is language itself. And language is infinite.

Why Adversarial Prompting Is Different

In conventional red teaming you work against systems — memory corruptions, injection flaws, misconfigurations. These are bugs in code that can be categorised, patched, and scored with a CVSS vector.

With adversarial prompting you work against understanding. The model’s understanding of what it should and should not do. That is a moving target with no fixed attack surface.

The most dangerous vulnerability in an LLM is not a bug. It is the gap between what the model was told to do and what it actually does.

The Taxonomy In a Nutshell (So Far)

After years of hands-on testing I landed on a general taxonomy grouping of adversarial attack surfaces:

Role Confusion – getting the model to adopt an identity that sidesteps its alignment.
Context Window Issues – burying malicious intent inside layers of benign-looking text.
Instruction-Hierarchy Attacks – exploiting the precedence between system prompts, user inputs, and injected instructions.
Semantic Drift – slowly steering a conversation toward restricted territory, one token at a time.

Each of these deserves its own deep-dive, and each has a different defensive posture.

The Uncomfortable Truth

Most of the defences I have seen are reactive. They are built after someone finds a break. The attack surface grows faster than the patches ship.

This is offensive-security 101 — and it is being repeated, verbatim, in the AI layer.

The old-school lessons still hold. The game board just changed.

This is part one. The technical breakdown of each technique — with code — is coming in the next post.