Build a golden set of 10–20 queries. Score an early RAG baseline with manual review and an LLM-as-judge. Re-score the full stack and track changes — regressions should be visible before you ship, not after.
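A minimal sketch of what that harness can look like, assuming a hypothetical `judge_client` with a `complete()` method and a `rag_answer(query)` entry point into your pipeline; the cases and judge prompt are placeholders:

```python
# Golden set + LLM-as-judge scoring sketch. `judge_client.complete()` and
# `rag_answer()` are placeholders for your own model client and pipeline.
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query: str
    expected_facts: list[str]  # what a good answer must mention

GOLDEN_SET = [
    GoldenCase("What is the refund window?", ["30 days", "original payment method"]),
    # ... grow this to 10-20 cases that cover your real query mix
]

JUDGE_PROMPT = (
    "Rate the answer from 1 to 5 for faithfulness to the expected facts.\n"
    "Query: {query}\nExpected facts: {facts}\nAnswer: {answer}\n"
    "Reply with the number only."
)

def score_run(rag_answer, judge_client) -> float:
    """Average judge score over the golden set; track this number per version."""
    scores = []
    for case in GOLDEN_SET:
        answer = rag_answer(case.query)
        reply = judge_client.complete(JUDGE_PROMPT.format(
            query=case.query, facts=case.expected_facts, answer=answer))
        scores.append(int(reply.strip()))
    return sum(scores) / len(scores)
```

Scoring the early baseline and the full stack with the same function is what makes a regression show up as a number moving, not a vibe.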
LLM-as-judge is fast but has known failure modes: it favors longer answers, and the same model scoring its own output has self-serving bias. Cross-check scores against your manual judgements before trusting it.
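One way to do that cross-check, assuming you kept per-case manual scores on the same 1–5 scale (a hypothetical helper, not any library's API):

```python
def judge_agreement(judge_scores: list[int], manual_scores: list[int],
                    tolerance: int = 1) -> float:
    """Fraction of cases where the judge lands within `tolerance` of your manual score."""
    hits = sum(abs(j - m) <= tolerance
               for j, m in zip(judge_scores, manual_scores, strict=True))
    return hits / len(manual_scores)

# If agreement is low, fix the judge prompt or rubric before trusting its trend line.
```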
Add “no surprises” engineering tests so you can iterate without burning credits:
pytest basics (assertions, fixtures) or your TS equivalent; a minimal example follows
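A sketch of that kind of test, assuming hypothetical `myapp.chunking.chunk_text` and `myapp.pipeline.answer_query` functions; the completion call is monkeypatched out, so running the suite costs nothing:

```python
# tests/test_pipeline_units.py -- plain pytest over the deterministic parts.
# `myapp.*` names are placeholders for your own modules.
import pytest

from myapp.chunking import chunk_text      # hypothetical chunker
from myapp.pipeline import answer_query    # hypothetical pipeline entry point

@pytest.fixture
def fake_llm(monkeypatch):
    # Swap the real completion call for a canned string so no credits are spent.
    monkeypatch.setattr("myapp.pipeline.complete", lambda prompt: "stub answer")

def test_chunker_respects_max_size():
    chunks = chunk_text("word " * 5000, max_chars=2000)
    assert chunks
    assert all(len(c) <= 2000 for c in chunks)

def test_pipeline_runs_end_to_end_without_real_calls(fake_llm):
    # Assumes answer_query returns whatever the completion call produced.
    assert answer_query("What is the refund window?") == "stub answer"
```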
Then wire everything into one path:

query rewriting → retrieval → agent loop → tools → memory → validated structured output → response
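A sketch of that wiring with each stage passed in as a plain callable, so every step stays independently testable; the stage signatures and the `Answer` schema are assumptions, and the validation step uses Pydantic v2:

```python
from typing import Callable
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):          # the structured output contract
    text: str
    sources: list[str]

def handle(
    query: str,
    rewrite: Callable[[str], str],                # query rewriting
    retrieve: Callable[[str], list[str]],         # retrieval
    run_agent: Callable[[str, list[str]], str],   # agent loop (calls tools internally), returns JSON
    memory: list[tuple[str, str]],                # naive conversation memory
) -> Answer:
    rewritten = rewrite(query)
    docs = retrieve(rewritten)
    raw = run_agent(rewritten, docs)
    memory.append((query, raw))
    try:
        return Answer.model_validate_json(raw)    # validated structured output
    except ValidationError:
        # One retry is a common recovery; here the agent is simply asked again.
        return Answer.model_validate_json(run_agent(rewritten, docs))
```

Because each stage is injected, the same `handle` function runs against fakes in the pytest suite and against real components in the golden-set eval.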
OpenAI Evals · RAGAS · LangSmith · Anthropic eval guide · Braintrust