You can’t fix what you can’t see. Production LLM systems need the same observability as any backend service, plus LLM-specific signals: prompt versions, token cost, and quality scores.
Version prompts in code (same repo, same PR). Tag each LLM call with the prompt version. When quality drops, you need to know which prompt change caused it. Some teams use prompt registries — only adopt one if git isn’t enough.
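A minimal sketch of what that tagging looks like. The names here (`PROMPT_VERSION`, `complete`) are illustrative; `complete()` stands in for whatever provider SDK you actually call:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm")

# Prompt text lives in the repo; bump the version in the same PR
# that edits the prompt.
PROMPT_VERSION = "summarize-v3"
SUMMARIZE_PROMPT = "Summarize the following text in three bullets:\n\n{text}"

def complete(prompt: str) -> str:
    """Stand-in for your provider SDK call."""
    return "...model output..."

def summarize(text: str) -> str:
    response = complete(SUMMARIZE_PROMPT.format(text=text))
    # Tag every call with the prompt version, so a quality drop can be
    # traced back to the exact prompt change (and the PR that made it).
    logger.info("llm_call prompt_version=%s", PROMPT_VERSION)
    return response
```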
Route a percentage of traffic to a new prompt or model. Compare quality scores, latency, and cost. Don’t A/B test without evals — you’ll have data but no signal.
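One common way to do the routing is a stable hash of the user ID, so each user stays in one bucket across requests. A sketch, with `run_llm` as a hypothetical dispatch to the control or candidate variant:

```python
import hashlib
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm.ab")

ROLLOUT_PERCENT = 10  # share of traffic on the candidate prompt/model

def variant_for(user_id: str) -> str:
    # Hash rather than random.random(): the same user always lands in
    # the same bucket, so sessions stay consistent across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < ROLLOUT_PERCENT else "control"

def run_llm(variant: str, text: str) -> str:
    """Stand-in: dispatch to the control or candidate prompt/model."""
    return f"[{variant}] ...model output..."

def handle_request(user_id: str, text: str) -> str:
    variant = variant_for(user_id)
    start = time.perf_counter()
    answer = run_llm(variant, text)
    latency_ms = (time.perf_counter() - start) * 1000
    # Log the variant alongside latency; join with eval scores and token
    # cost offline so the comparison yields a signal, not just traffic.
    logger.info("variant=%s latency_ms=%.1f", variant, latency_ms)
    return answer
```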
| Tool | Strengths |
|---|---|
| LangSmith | Tracing, evals, prompt playground |
| Langfuse | Open-source alternative to LangSmith; tracing and evals |
| Arize Phoenix | Open-source, strong on retrieval analysis |
| Portkey | Gateway with built-in observability |
| Weights & Biases | Experiment tracking, extends to LLM eval |
Further reading: LangSmith docs · Langfuse docs · Arize Phoenix · Portkey docs
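To make the table concrete, here is roughly what instrumentation looks like with Langfuse’s `@observe` decorator. This is a sketch, not the full API: the import path shown is the v2 SDK’s (v3 uses `from langfuse import observe`), and `call_model` is a stand-in for your provider call.

```python
# pip install langfuse; credentials are read from the LANGFUSE_PUBLIC_KEY,
# LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables.
from langfuse.decorators import observe  # v2 SDK import path

def call_model(text: str) -> str:
    """Stand-in for your provider SDK call."""
    return "...model output..."

@observe()  # records inputs, outputs, and timing as a trace
def summarize(text: str) -> str:
    # Nested @observe-decorated calls made here land in the same trace.
    return call_model(text)
```

Whichever tool you pick, the shape is the same: wrap the call, capture inputs, outputs, latency, and the prompt version, and send it somewhere queryable.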