If your system faces users, it faces adversaries. This isn’t theoretical — it’s engineering.
The primary attack surface. Direct injection: user inputs instructions that override your system prompt. Indirect injection: retrieved documents contain hidden instructions. Defences: input sanitisation, instruction hierarchy (system > user), separate parsing from execution, and never trust user input as instructions.
Filter before the model sees it:
Filter before the user sees it:
| Framework | Approach |
|---|---|
| NeMo Guardrails (NVIDIA) | Define conversational rails in a config file |
| Guardrails AI | Validators for specific output properties |
| Anthropic constitutional AI | The model critiques its own output against principles |
LLMs inherit biases from training data. Test your system across demographic groups. Compare outputs for equivalent queries with different names, genders, or cultural contexts. This is testing, not philosophy.
The EU AI Act classifies AI systems by risk level. Know which tier your application falls into. Data privacy laws (GDPR, CCPA) apply to LLM inputs and outputs — especially if you’re storing conversations or fine-tuning on user data. End-user IDs in API calls help with audit trails and abuse detection.
OWASP Top 10 for LLMs · NeMo Guardrails · Guardrails AI · OpenAI Moderation · Anthropic safety · EU AI Act overview