Build a multi-turn agent that answers questions about a document corpus, pulls in live data when needed, cites every claim, and evaluates its own outputs. Build it incrementally — by Module 10 all layers should be live. Extend with Modules 11–18 for production hardening.
Key constraints:
web_search returns empty results 20% of the time — your system must not hallucinate in those cases. Restate the question to improve retrieval recall before searching.
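The two requirements above compose naturally: restate first, then treat an empty result as a hard "no evidence" signal rather than letting the model fill the gap. A minimal sketch (the `web_search` and `restate` callables here are stand-ins for your actual tool and rewriter):

```python
def robust_search(web_search, restate, query):
    """Restate the question to improve recall, then search.
    Returns None on empty results so the caller can answer
    "I don't know" instead of hallucinating."""
    results = web_search(restate(query))
    return results if results else None
```

The `None` sentinel forces the failure case through an explicit code path instead of passing an empty context to the model.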
Embed → search corpus → rerank → return top-k chunks. Index at least 20–30 documents. Return “I don’t know” explicitly if retrieval confidence is below threshold — don’t pass weak context to the model.
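The confidence gate is the piece most implementations skip. A minimal sketch, assuming the vector store has already attached a similarity score to each chunk (the `Chunk` type and `threshold` value are illustrative, not prescribed):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # similarity in [0, 1], assigned by the vector store

def retrieve(scored_chunks, k=3, threshold=0.35):
    """Return the top-k chunks, or None when confidence is too low.
    None tells the caller to answer "I don't know" rather than
    pass weak context to the model."""
    ranked = sorted(scored_chunks, key=lambda c: c.score, reverse=True)
    top = ranked[:k]
    if not top or top[0].score < threshold:
        return None
    return top
```

Tune `threshold` empirically against your golden set; a fixed guess will either over-abstain or under-abstain.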
Max 5 steps. Available tools:
| Tool | Signature |
|---|---|
| `web_search` | `web_search(query)` |
| `calculator` | `calculator(expression)` |
| `summarise_doc` | `summarise_doc(url)` |
Truncate tool output above 500 tokens before injecting. Cite which tool produced which context.
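The loop body can be sketched as below. This is a simplification (in a real agent the model chooses each step; here the chosen actions are passed in as a list), and the whitespace-based truncation is a placeholder for real token counting:

```python
MAX_STEPS = 5
TOKEN_LIMIT = 500  # approximate; swap in your tokenizer for real counts

def truncate(text, limit=TOKEN_LIMIT):
    # Crude whitespace-token truncation; a real tokenizer belongs here.
    words = text.split()
    if len(words) <= limit:
        return text
    return " ".join(words[:limit]) + " …[truncated]"

TOOLS = {
    # eval with stripped builtins is a toy calculator only; do not
    # expose raw eval to model-generated input in production.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_tools(actions):
    """Execute at most MAX_STEPS (tool_name, arg) actions; each
    observation is truncated and tagged with its source tool."""
    context = []
    for tool_name, arg in actions[:MAX_STEPS]:
        result = TOOLS[tool_name](arg)
        context.append(f"[{tool_name}] {truncate(result)}")
    return context
```

The `[tool_name]` prefix is what lets the final answer cite which tool produced which piece of context.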
Retrieve the 3 most relevant prior exchanges for the session. Inject above retrieved chunks, below the system prompt. Persist the new exchange after responding.
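The ordering rule (memory above retrieved chunks, below the system prompt) is easy to get wrong once several layers are live. A minimal assembly sketch with illustrative names:

```python
def build_prompt(system, memory, chunks, question, max_memory=3):
    """Assemble the prompt in the required order:
    system prompt → prior exchanges (max 3) → retrieved chunks → question."""
    parts = [system]
    parts += [f"[prior] {m}" for m in memory[:max_memory]]
    parts += [f"[context] {c}" for c in chunks]
    parts.append(f"User: {question}")
    return "\n\n".join(parts)
```

Persisting the new exchange happens after the response is returned, outside this function.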
A second model call before returning: does every citation exist in the retrieved context? Does the confidence level match the evidence? Revise or flag low-confidence answers — don’t return unverified claims.
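The citation-existence half of this check is deterministic and does not need a model call; only the confidence-calibration half does. A sketch of the cheap part (the response shape follows the schema in this spec; `retrieved_ids` is an assumed set of source identifiers):

```python
def citations_grounded(response, retrieved_ids):
    """True iff every citation in the parsed response refers to a
    source that actually appeared in the retrieved context."""
    return all(c in retrieved_ids for c in response.get("citations", []))
```

Run this first and skip the second model call entirely when it fails — the answer must be revised regardless of what the judge would say.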
Validate every response against:
```json
{
  "answer": "string",
  "citations": ["string"],
  "confidence": "number",
  "follow_up_questions": ["string"]
}
```
Retry once on schema failure.
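A stdlib-only sketch of the validate-then-retry path. `regenerate` is a hypothetical callable that asks the model for a fresh attempt; a JSON-Schema library would also work here:

```python
import json

def parse_valid(payload):
    """Return the parsed dict if it matches the response schema, else None."""
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    ok = (isinstance(obj.get("answer"), str)
          and isinstance(obj.get("citations"), list)
          and all(isinstance(c, str) for c in obj["citations"])
          and isinstance(obj.get("confidence"), (int, float))
          and isinstance(obj.get("follow_up_questions"), list))
    return obj if ok else None

def validate_with_retry(payload, regenerate):
    """One retry on schema failure, as the spec requires; then give up."""
    obj = parse_valid(payload)
    if obj is None:
        obj = parse_valid(regenerate())
    return obj
```

Counting how often the retry branch fires gives you the schema first-pass validity metric used in the evaluation below.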
Stream the response (Module 15). Show tool activity during the agent loop.
Build a golden set of 15 questions across three tiers:
| Tier | Description | Count |
|---|---|---|
| Tier 1 | Answerable from the corpus, no tools needed | 5 |
| Tier 2 | Require a tool call | 5 |
| Tier 3 | Multi-hop: retrieval + tool + memory from a prior turn | 5 |
Score each iteration: manual pass/fail on accuracy, LLM-as-judge on citation quality (0–2), and schema first-pass validity. Track in a CSV — a regression in Tier 1 while improving Tier 3 is signal.
Minimum bar: Tier 1 >= 4/5 · Tier 2 >= 3/5 · Tier 3 >= 2/5 · Schema validity >= 85%
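The minimum bar can be enforced mechanically per iteration. A sketch, with illustrative field names matching the tiers and thresholds above:

```python
# Thresholds from the minimum bar: tier counts are correct answers
# out of 5; schema_validity is a first-pass fraction.
BAR = {"tier1": 4, "tier2": 3, "tier3": 2, "schema_validity": 0.85}

def meets_bar(scores):
    """True iff this iteration clears every minimum-bar threshold."""
    return all(scores[k] >= v for k, v in BAR.items())
```

Running this against each CSV row makes regressions visible immediately instead of at the end of the module.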