The Art of Adversarial Prompting
Traditional offensive security has choreography. You scan, enumerate, find a vector, exploit. Clean. Predictable. There is a playbook — even if it is unwritten — and a decade of tooling behind it.
LLMs broke that playbook overnight.
When the target is a language model, the attack surface is not a port, an endpoint, or a CVE. It is language itself. And language is infinite.
Why Adversarial Prompting Is Different
In conventional red teaming you work against systems — memory corruptions, injection flaws, misconfigurations. These are bugs in code that can be categorised, patched, and scored with a CVSS vector.
With adversarial prompting you work against understanding. The model’s understanding of what it should and should not do. That is a moving target with no fixed attack surface.
The most dangerous vulnerability in an LLM is not a bug. It is the gap between what the model was told to do and what it actually does.
The Taxonomy In a Nutshell (So Far)
After years of hands-on testing I landed on a general taxonomy grouping of adversarial attack surfaces:
- Role Confusion – getting the model to adopt an identity that sidesteps its alignment.
- Context Window Issues – burying malicious intent inside layers of benign-looking text.
- Instruction-Hierarchy Attacks – exploiting the precedence between system prompts, user inputs, and injected instructions.
- Semantic Drift – slowly steering a conversation toward restricted territory, one token at a time.
Each of these deserves its own deep-dive, and each has a different defensive posture.
The Uncomfortable Truth
Most of the defences I have seen are reactive. They are built after someone finds a break. The attack surface grows faster than the patches ship.
This is offensive-security 101 — and it is being repeated, verbatim, in the AI layer.
The old-school lessons still hold. The game board just changed.
This is part one. The technical breakdown of each technique — with code — is coming in the next post.