
Capstone: Research Oracle

Build a multi-turn agent that answers questions about a document corpus, pulls in live data when needed, cites every claim, and evaluates its own outputs. Build it incrementally — by Module 10 all layers should be live. Extend with Modules 11–18 for production hardening.

Constraints

These are the point:

System

1. Query Rewriting

Restate the question to improve retrieval recall before searching.
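One possible shape for this step is sketched below. The `rewrite` and `search` callables are hypothetical stand-ins for a model call and the retrieval layer. Searching with both the original and the rewritten question, then merging the results, is an assumption of this sketch rather than a requirement of the capstone.

```python
def rewrite_then_search(question, rewrite, search, k=5):
    """Search with the original question and a rewritten one, then merge.

    rewrite: hypothetical callable, question -> restated question (e.g. a model call)
    search:  hypothetical callable, query -> ordered list of chunk ids
    """
    queries = [question]
    rewritten = rewrite(question).strip()
    # Only add the rewrite if it actually differs from the original.
    if rewritten and rewritten != question:
        queries.append(rewritten)

    seen, merged = set(), []
    for query in queries:
        for chunk_id in search(query):
            if chunk_id not in seen:  # de-duplicate across both result lists
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged[:k]
```

Keeping the original question in the query list means a poor rewrite can add noise but cannot remove results the original would have found.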

2. Retrieval

Embed → search corpus → rerank → return top-k chunks. Index at least 20–30 documents. Return “I don’t know” explicitly if retrieval confidence is below threshold — don’t pass weak context to the model.
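The threshold gate might look roughly like the following. It assumes vectors are already computed, uses cosine similarity as the confidence signal, and leaves out the rerank step. The `min_score` value of 0.5 and the `(chunk_id, text, vector)` index layout are placeholders, not values from the course.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=3, min_score=0.5):
    """Return up to k (chunk_id, text, score) tuples, or None to signal "I don't know".

    index: list of (chunk_id, text, vector) tuples (hypothetical layout).
    """
    scored = sorted(
        ((cosine(query_vec, vec), chunk_id, text) for chunk_id, text, vec in index),
        reverse=True,
    )
    top = scored[:k]
    # If even the best chunk is weak, abstain instead of passing weak context on.
    if not top or top[0][0] < min_score:
        return None
    return [(chunk_id, text, score) for score, chunk_id, text in top if score >= min_score]
```

A caller would treat `None` as the cue to return "I don't know" without calling the model at all.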

3. Agent Loop

Max 5 steps. Available tools:

| Tool | Signature |
| --- | --- |
| web_search | web_search(query) |
| calculator | calculator(expression) |
| summarise_doc | summarise_doc(url) |

Truncate tool output above 500 tokens before injecting. Cite which tool produced which context.
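A minimal sketch of the step cap, truncation, and source tagging follows. Whitespace-separated words stand in for tokens here, which is only an approximation of a real tokenizer. The `plan_step` callable and the action dictionary shape are hypothetical.

```python
MAX_STEPS = 5
MAX_TOOL_TOKENS = 500


def truncate(text, limit=MAX_TOOL_TOKENS):
    """Cut text to `limit` whitespace-separated words (a rough proxy for tokens)."""
    tokens = text.split()
    if len(tokens) <= limit:
        return text
    return " ".join(tokens[:limit]) + " [truncated]"


def run_agent(plan_step, tools, question):
    """Run at most MAX_STEPS tool calls, tagging each output with its tool name.

    plan_step: hypothetical callable (question, context) -> action dict, either
               {"type": "final", "answer": ...} or
               {"type": "tool", "tool": name, "input": value}
    tools:     dict mapping tool name -> callable
    """
    context = []
    for _ in range(MAX_STEPS):
        action = plan_step(question, context)
        if action["type"] == "final":
            return action["answer"], context
        output = tools[action["tool"]](action["input"])
        # Record which tool produced this context so the answer can cite it.
        context.append({"source": action["tool"], "text": truncate(str(output))})
    return None, context  # step budget exhausted without a final answer
```

Returning `None` when the budget runs out leaves the decision about a fallback answer to the caller.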

4. Memory

Retrieve the 3 most relevant prior exchanges for the session. Inject above retrieved chunks, below the system prompt. Persist the new exchange after responding.
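The ordering described above could be sketched as below. Word overlap is used as a stand-in for relevance scoring; an embedding-based score would be the more likely choice in practice. The class and function names are illustrative only.

```python
def overlap(a, b):
    """Jaccard overlap of lowercase word sets; a crude stand-in for relevance."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


class SessionMemory:
    """In-memory store of (question, answer) exchanges for one session."""

    def __init__(self):
        self.exchanges = []

    def top_relevant(self, question, n=3):
        # Rank prior exchanges by how closely their question matches the new one.
        ranked = sorted(self.exchanges, key=lambda ex: overlap(question, ex[0]), reverse=True)
        return ranked[:n]

    def persist(self, question, answer):
        self.exchanges.append((question, answer))


def build_prompt(system_prompt, memory, chunks, question):
    """Assemble: system prompt, then memory, then retrieved chunks, then the question."""
    parts = [system_prompt]
    parts += [f"[memory] Q: {q} A: {a}" for q, a in memory.top_relevant(question)]
    parts += [f"[chunk {i}] {c}" for i, c in enumerate(chunks, 1)]
    parts.append(f"Question: {question}")
    return "\n".join(parts)
```

Calling `memory.persist(question, answer)` only after the response is sent keeps the current exchange out of its own context.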

5. Critic

A second model call before returning: does every citation exist in the retrieved context? Does the confidence level match the evidence? Revise or flag low-confidence answers — don’t return unverified claims.
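The citation-existence question can also be checked deterministically before or alongside the critic call. The sketch below is that cheap pre-check, not the model-based critic itself. It assumes citations are chunk IDs and uses a placeholder confidence floor of 0.5.

```python
def unsupported_citations(response, retrieved_ids):
    """Citations in the response that do not appear in the retrieved context."""
    return [c for c in response["citations"] if c not in retrieved_ids]


def critic_gate(response, retrieved_ids, min_confidence=0.5):
    """Return (verdict, missing) where verdict is "pass", "revise", or "flag".

    min_confidence is a hypothetical floor; tune it against the golden set.
    """
    missing = unsupported_citations(response, retrieved_ids)
    if missing:
        return "revise", missing  # cites something that was never retrieved
    if response["confidence"] < min_confidence:
        return "flag", []  # citations check out, but the answer is low-confidence
    return "pass", []
```

Whether stated confidence matches the strength of the evidence still needs the second model call; this gate only catches citations that cannot exist.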

6. Structured Output

Validate every response against:

{
  "answer": "string",
  "citations": ["string"],
  "confidence": "number",
  "follow_up_questions": ["string"]
}

Retry once on schema failure.
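A standard-library sketch of validate-then-retry-once is shown below. The `generate` callable is a hypothetical wrapper around the model call. A schema library would work equally well; plain `isinstance` checks are used here only to keep the example self-contained.

```python
import json

# Expected top-level keys and their Python types.
SCHEMA = {
    "answer": str,
    "citations": list,
    "confidence": (int, float),
    "follow_up_questions": list,
}


def validate(raw):
    """Parse raw JSON text and return the dict if it matches SCHEMA, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        return None
    for key, expected in SCHEMA.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected):
            return None
    for key in ("citations", "follow_up_questions"):
        if not all(isinstance(item, str) for item in data[key]):
            return None
    return data


def generate_validated(generate, prompt):
    """Call generate(prompt, attempt) up to twice; return (data, first_pass_ok)."""
    for attempt in range(2):  # the initial call plus one retry
        data = validate(generate(prompt, attempt))
        if data is not None:
            return data, attempt == 0
    return None, False
```

The second return value records whether the first attempt was valid, which is the figure the schema first-pass validity metric needs.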

7. Response to User

Stream the response (Module 15). Show tool activity during the agent loop.


Evaluation

Build a golden set of 15 questions across three tiers:

| Tier | Description | Count |
| --- | --- | --- |
| Tier 1 | Answerable from the corpus, no tools needed | 5 |
| Tier 2 | Require a tool call | 5 |
| Tier 3 | Multi-hop: retrieval + tool + memory from a prior turn | 5 |

Score each iteration: manual pass/fail on accuracy, LLM-as-judge on citation quality (0–2), schema first-pass validity. Track in a CSV — a regression in Tier 1 while improving Tier 3 is signal.

Minimum bar: Tier 1 >= 4/5 · Tier 2 >= 3/5 · Tier 3 >= 2/5 · Schema validity >= 85%
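Checking an iteration against the minimum bar could be automated along these lines. The row keys (`tier`, `pass`, `schema_first_pass`) are hypothetical CSV columns, with `"1"` and `"0"` as values, as `csv.DictReader` would yield them.

```python
MIN_PASSES = {"tier1": 4, "tier2": 3, "tier3": 2}  # minimum passes out of 5 per tier
MIN_SCHEMA_VALIDITY = 0.85


def summarise(rows):
    """Return (passes_per_tier, schema_validity, meets_bar) for one iteration.

    rows: list of dicts with hypothetical keys "tier", "pass", "schema_first_pass".
    """
    passes = {tier: 0 for tier in MIN_PASSES}
    valid = 0
    for row in rows:
        passes[row["tier"]] += int(row["pass"])
        valid += int(row["schema_first_pass"])
    validity = valid / len(rows) if rows else 0.0
    meets_bar = (
        all(passes[tier] >= MIN_PASSES[tier] for tier in MIN_PASSES)
        and validity >= MIN_SCHEMA_VALIDITY
    )
    return passes, validity, meets_bar
```

Comparing `passes` between iterations makes a Tier 1 regression visible even when the overall bar is still met.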


Stretch Goals