5. RAG and Retrieval Tuning
Build the full retrieve → filter → generate pipeline. Iterate on chunk size, overlap, top-k, and query rewriting. Poor results almost always trace to chunking or embedding choice — not the model.
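A sliding-window chunker makes the first two knobs concrete. This is a minimal sketch, not a production splitter: it counts characters rather than tokens, and the default sizes are placeholders to sweep, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```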
Document Preprocessing
Before you chunk, you need clean data. Document preprocessing is where most RAG pipelines quietly fail:
- PDFs: use unstructured, PyMuPDF, or pdfplumber; parsers differ in how well they handle tables and headers
- HTML: strip navigation and boilerplate; keep semantic structure
- Tables: flatten to text or extract as structured data — embeddings don’t understand grid layouts
- Code: chunk by function/class, not by token count (see the sketch after this list)
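For the code bullet, Python's standard ast module is enough to sketch the idea: one chunk per top-level function or class, so every chunk is a semantically complete unit. This is an illustration under that assumption; it skips module-level statements between definitions.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact original text of the node
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```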
Production Essentials
Three things that matter in production but rarely appear in tutorials:
- Reranking — retrieve more candidates than needed, then use a cross-encoder or LLM to reorder before passing to the model. Significantly improves precision (a sketch covering this and no-answer handling follows the list).
- No-answer handling — when nothing relevant is retrieved, the model will fabricate. Detect low retrieval confidence and return “I don’t know” explicitly.
- Stale data — know how you’ll re-index when documents change. A static corpus is fine for learning; it breaks in prod.
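Here is one way the first two points can fit together, using the CrossEncoder class from the sentence-transformers library. The model name and the score threshold are assumptions to tune: ms-marco cross-encoders emit raw logits, so the right cutoff depends on your data.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice

def rerank(query: str, candidates: list[str], keep: int = 5,
           min_score: float = 0.0) -> list[str] | None:
    """Rerank over-retrieved candidates; return None when nothing is relevant."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    # No-answer handling: if even the best candidate falls below the
    # threshold, signal "nothing relevant" instead of letting the model guess.
    if not ranked or ranked[0][0] < min_score:
        return None
    return [doc for _, doc in ranked[:keep]]
```

Over-retrieve deliberately (say, 25 candidates when you need 5) so the cross-encoder has something to work with.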
Advanced Patterns
- Graph RAG — combine vector retrieval with knowledge graphs. Entities and relationships give the model structural context that flat chunks miss. See Microsoft’s GraphRAG.
- Agentic RAG — the agent decides when and what to retrieve instead of retrieving on every query. Reduces noise and cost on queries that don’t need context (see the routing sketch after this list).
- RAG vs. fine-tuning — RAG for facts that change, fine-tuning for behaviour that doesn’t. Most teams need RAG; fewer need fine-tuning. Know the tradeoff before reaching for either.
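A routing sketch shows the agentic idea at its simplest: ask the model whether retrieval is needed before paying for it. Both callables are hypothetical stand-ins for your model client and vector store, and the prompt is illustrative, not a recommendation.

```python
from typing import Callable

ROUTER_PROMPT = (
    "Answer YES if this question needs facts from our document corpus, "
    "NO if general knowledge suffices.\nQuestion: {q}"
)

def answer(query: str,
           llm_complete: Callable[[str], str],   # hypothetical: prompt in, text out
           retrieve: Callable[[str], str]) -> str:  # hypothetical: query in, context out
    """Retrieve only when the router says the corpus is needed."""
    decision = llm_complete(ROUTER_PROMPT.format(q=query)).strip().upper()
    if decision.startswith("YES"):
        context = retrieve(query)
        return llm_complete(f"Context:\n{context}\n\nQuestion: {query}")
    # General-knowledge query: skip retrieval entirely, saving tokens and noise.
    return llm_complete(query)
```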
Frameworks
LangChain and LlamaIndex are the dominant RAG frameworks. Haystack and RAGFlow are alternatives. Learn to build RAG from scratch first, then evaluate whether a framework saves you time or hides problems.
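From scratch is smaller than it sounds. Once embeddings are precomputed, retrieval is a cosine-similarity lookup; the sketch below assumes a hypothetical embed function standing in for whatever embedding model you use.

```python
import numpy as np
from typing import Callable

def top_k(query: str, chunks: list[str], embeddings: np.ndarray,
          embed: Callable[[str], np.ndarray], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)  # hypothetical embedding function, returns shape (dim,)
    # Cosine similarity between the query and every chunk embedding at once.
    scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:k]  # indices of the highest-scoring chunks
    return [chunks[i] for i in best]
```

If this version and a framework version behave the same on your data, the framework is buying convenience; if they differ, you now know where to look.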
Resources
Pinecone semantic search · LlamaIndex RAG guide · Anthropic contextual retrieval · Microsoft GraphRAG