14. AI Safety and Guardrails

If your system faces users, it faces adversaries. This isn’t theoretical — it’s engineering.

Prompt Injection

The primary attack surface. Direct injection: the user types instructions that override your system prompt. Indirect injection: retrieved documents carry hidden instructions. Defences: sanitise inputs, enforce an instruction hierarchy (system > user), separate parsing from execution, and never treat user input as instructions.
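
A minimal sketch of two of these defences: keeping instructions and data in separate roles, and screening for common injection phrasings. The patterns and message structure are illustrative assumptions, not a complete defence:

```python
import re

# Illustrative patterns only -- real injection detection needs far more
# than a regex list. Treat this as a first, cheap filter.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def looks_like_injection(text: str) -> bool:
    """Cheap heuristic screen for direct injection attempts."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def build_messages(system_prompt: str, user_input: str, retrieved: list[str]):
    """Instruction hierarchy: the system role carries instructions;
    user input and retrieved documents are data, never instructions."""
    context = "\n\n".join(f"<document>\n{d}\n</document>" for d in retrieved)
    return [
        {"role": "system", "content": system_prompt},
        # Retrieved text is wrapped and labelled as untrusted data,
        # never spliced into the system prompt.
        {"role": "user",
         "content": f"Context (untrusted):\n{context}\n\nQuestion:\n{user_input}"},
    ]
```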

Input Guardrails

Filter before the model sees it: likely injection attempts, PII you don't want to send to a third-party API, off-topic or disallowed requests, and oversized inputs. A moderation-based screen is sketched below.
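
One cheap input guardrail is a hosted moderation endpoint. A sketch using OpenAI's Moderation API (listed in Resources); verify the model name and response shape against the current docs before relying on it:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def passes_input_guardrail(user_input: str) -> bool:
    """Screen user input with the Moderation endpoint before it reaches
    the main model. Returns False if any category is flagged."""
    result = client.moderations.create(
        model="omni-moderation-latest",  # current model name; check the docs
        input=user_input,
    )
    return not result.results[0].flagged
```

Anything flagged gets a canned refusal instead of ever reaching your main model.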

Output Guardrails

Filter before the user sees it: leaked PII or secrets, toxic or harmful content, system-prompt disclosure, and malformed structured output. A redaction sketch follows.
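
A sketch of a post-generation filter: block system-prompt leakage and redact obvious PII patterns before the response goes out. The regexes are illustrative, not exhaustive:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US Social Security numbers

def apply_output_guardrails(output: str, system_prompt: str) -> str:
    """Redact obvious PII and block system-prompt leakage before the
    response reaches the user."""
    if system_prompt in output:
        # The model echoed its instructions -- fail closed.
        return "I can't share that."
    output = EMAIL.sub("[redacted email]", output)
    output = SSN.sub("[redacted SSN]", output)
    return output
```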

Guardrail Frameworks

Framework                      Approach
NeMo Guardrails (NVIDIA)       Define conversational rails in a config file
Guardrails AI                  Validators for specific output properties
Constitutional AI (Anthropic)  The model critiques its own output against principles
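
The last row is straightforward to prototype with any chat model: have it critique its own draft against explicit principles, then revise. A minimal sketch in the style of Constitutional AI, assuming a hypothetical complete(prompt) helper wrapping your model of choice:

```python
PRINCIPLES = [
    "Do not reveal personal data.",
    "Refuse requests for illegal activity.",
    "Do not present medical or legal advice as fact.",
]

def complete(prompt: str) -> str:
    """Hypothetical helper: call your chat model, return its text."""
    raise NotImplementedError

def constitutional_pass(draft: str) -> str:
    """One critique-and-revise round against the principles above."""
    rules = "\n".join(f"- {p}" for p in PRINCIPLES)
    critique = complete(
        f"Critique the response below against these principles:\n{rules}\n\n"
        f"Response:\n{draft}\n\nList any violations, or say 'none'."
    )
    if "none" in critique.lower():
        return draft  # no violations found; ship the draft
    return complete(
        f"Rewrite the response to fix these problems:\n{critique}\n\n"
        f"Original response:\n{draft}"
    )
```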

Bias and Fairness

LLMs inherit biases from training data. Test your system across demographic groups. Compare outputs for equivalent queries with different names, genders, or cultural contexts. This is testing, not philosophy.
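
A paired-query test makes this concrete: hold the query fixed, vary only the name, and diff the outputs in your test suite. The generate() helper and name list below are hypothetical placeholders for your deployed system:

```python
# Hypothetical: generate() wraps your deployed system end-to-end.
def generate(prompt: str) -> str:
    raise NotImplementedError

NAMES = ["Emily", "Lakisha", "Wei", "Mohammed"]  # illustrative set
TEMPLATE = "Write a short reference letter for {name}, a software engineer."

def paired_outputs() -> dict[str, str]:
    """Run identical queries that differ only in the name; compare the
    results (length, sentiment, adjectives) across the set."""
    return {name: generate(TEMPLATE.format(name=name)) for name in NAMES}
```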

Regulatory Awareness

The EU AI Act classifies AI systems by risk level. Know which tier your application falls into. Data privacy laws (GDPR, CCPA) apply to LLM inputs and outputs — especially if you’re storing conversations or fine-tuning on user data. End-user IDs in API calls help with audit trails and abuse detection.
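
OpenAI's chat completions API accepts a user field for exactly this. A sketch that hashes your internal ID first so raw identifiers never leave your system (the model name is illustrative):

```python
import hashlib
from openai import OpenAI

client = OpenAI()

def call_with_user_id(messages: list[dict], internal_user_id: str):
    """Attach a stable, non-reversible end-user ID to each request so
    abuse reports and audit logs can be tied back to an account."""
    hashed = hashlib.sha256(internal_user_id.encode()).hexdigest()[:32]
    return client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=messages,
        user=hashed,  # end-user identifier for audit trails / abuse detection
    )
```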

Resources

OWASP Top 10 for LLMs · NeMo Guardrails · Guardrails AI · OpenAI Moderation · Anthropic safety · EU AI Act overview