{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5d87aa4e",
   "metadata": {},
   "source": [
    "# Chapter 6 — Information Retrieval and Knowledge Agents\n",
    "\n",
    "**Book:** *Agents* by Imran Ahmad (Packt, 2026)\n",
    "**Author:** Imran Ahmad\n",
    "**Chapter pages:** 145–171\n",
    "\n",
    "> *\"The difference between the right word and the almost right word is the difference between lightning and a lightning bug.\"* — Mark Twain\n",
    "\n",
    "---\n",
    "\n",
    "## Introduction\n",
    "\n",
    "This notebook is the companion code for **Chapter 6** (pp. 145–171) of *Agents* by Imran Ahmad (Packt, 2026). In a world where information is doubling at a staggering pace, the ability to locate reliable, current, and contextually relevant knowledge is a decisive factor for innovation and decision-making. LLMs have revolutionized natural language understanding, yet their power is limited by a **static training snapshot** and the risk of producing unverifiable or outdated answers.\n",
    "\n",
    "**Knowledge agents** close this gap by blending intelligent reasoning with live, authoritative data sources — transforming LLMs from static archives into **dynamic, evidence-grounded collaborators**. They do not merely respond to questions; they search, verify, and ground their outputs in trusted data.\n",
    "\n",
    "### Three Agent Categories\n",
    "\n",
    "| Section | Agent Type | Chapter Reference | Book Pages | Key Concepts |\n",
    "|---------|-----------|-------------------|:----------:|--------------|\n",
    "| 1. Knowledge Retrieval Agent | §6.1 | Knowledge Retrieval Agents | 146–153 | RAG pipeline, FAISS, provenance tracking |\n",
    "| 2. Chunking Strategies Deep Dive | §6.1 | Chunking Strategies | 151 | Fixed, recursive, semantic chunking |\n",
    "| 3. Document Intelligence Agent | §6.2 | Document Intelligence Agents | 153–160 | OCR, confidence scoring, schema extraction |\n",
    "| 4. Scientific Research Agent | §6.3 | Scientific Research Agents | 161–168 | Literature synthesis, clustering, evidence tables |\n",
    "| 5. Knowledge Agent Spectrum | §Summary | The Knowledge Agent Spectrum | 168–170 | Capability comparison (Table 6.1) |\n",
    "\n",
    "### Key Figures\n",
    "- **Figure 6.1** (p. 148) — Modular architecture of a Knowledge Retrieval agent\n",
    "- **Figure 6.2** (p. 159) — Document intelligence pipeline (five-stage)\n",
    "- **Table 6.1** (p. 169) — Comparison of knowledge agent types\n",
    "\n",
    "### The Knowledge Pipeline\n",
    "\n",
    "Together, these agents form a **complete knowledge pipeline**: finding relevant information (Retrieval), extracting and structuring it (Document Intelligence), and synthesizing insights for decision-making (Scientific Research). Each type represents increasing sophistication in the **Agentic AI Progression Framework**, from Level 2 (Tool-Using) through Level 4 (Learning agent).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12a74ff1",
   "metadata": {},
   "source": [
    "## 0. Setup & Configuration\n",
    "\n",
    "This section initializes the environment, detects API keys, and sets the execution mode.\n",
    "\n",
    "**Execution Modes:**\n",
    "- **Simulation Mode** (default): No API key required. All outputs use chapter-derived mocks from `agent_utils.py`. Pedagogically equivalent to live output.\n",
    "- **Live Mode**: Requires a valid `OPENAI_API_KEY` in `.env`. Makes real API calls to OpenAI, live arXiv queries, and real OCR processing.\n",
    "\n",
    "> **Ref:** See `AGENTS.md` for the full capability declaration and persona prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8de89cb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 0.1 Dependency Check ─────────────────────────────────────────\n",
    "# Ref: requirements.txt\n",
    "# Verify core packages are available before proceeding.\n",
    "\n",
    "import importlib\n",
    "import sys\n",
    "\n",
    "REQUIRED = [\n",
    "    \"dotenv\", \"numpy\", \"pandas\", \"langchain\", \"langchain_openai\",\n",
    "    \"langchain_community\", \"langchain_text_splitters\", \"faiss\",\n",
    "    \"PIL\", \"rapidfuzz\", \"sklearn\",\n",
    "]\n",
    "\n",
    "missing = []\n",
    "for pkg in REQUIRED:\n",
    "    try:\n",
    "        importlib.import_module(pkg)\n",
    "    except ImportError:\n",
    "        missing.append(pkg)\n",
    "\n",
    "if missing:\n",
    "    print(f\"⚠️  Missing packages: {missing}\")\n",
    "    print(\"   Run: pip install -r requirements.txt\")\n",
    "else:\n",
    "    print(\"✅ All core dependencies available.\")\n",
    "\n",
    "print(f\"   Python {sys.version}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2ec2f99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multi-provider LLM support (OpenAI / Anthropic / Google Gemini)\n",
    "# Set LLM_PROVIDER in .env to choose: openai | anthropic | google | auto\n",
    "# Auto-detection uses the first available key.\n",
    "# See supporting/llm_provider.py for details.\n",
    "\n",
    "import sys, os\n",
    "sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath('.')), ''))\n",
    "sys.path.insert(0, '..')\n",
    "\n",
    "try:\n",
    "    from supporting.llm_provider import detect_provider, get_llm, PROVIDER_MODELS, print_provider_banner\n",
    "    _PROVIDER, _PROVIDER_KEY, _PROVIDER_MODE = detect_provider()\n",
    "    print_provider_banner(_PROVIDER, _PROVIDER_MODE)\n",
    "except ImportError:\n",
    "    print('[INFO] supporting/llm_provider.py not found — using default OpenAI path')\n",
    "    _PROVIDER, _PROVIDER_KEY, _PROVIDER_MODE = 'openai', os.getenv('OPENAI_API_KEY'), 'LIVE' if os.getenv('OPENAI_API_KEY') else 'SIMULATION'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "843e944b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 0.2 Import Shared Utilities ──────────────────────────────────\n",
    "# Ref: agent_utils.py — ColorLogger, fail_gracefully, MockLLM, etc.\n",
    "\n",
    "import os\n",
    "import sys\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
    "\n",
    "from agent_utils import (\n",
    "    ColorLogger, log, fail_gracefully, get_api_key,\n",
    "    MockLLM, MockRetrievalQAResult, MockEmbeddings,\n",
    "    MockOcrToken, MOCK_INVOICE_TOKENS, MOCK_EXTRACTED_FIELDS,\n",
    "    mock_pytesseract_output, MOCK_ARXIV_PAPERS, mock_search_arxiv,\n",
    ")\n",
    "\n",
    "log.success(\"agent_utils loaded — all shared utilities available.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a87dbe50",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 0.3 API Key Detection & Mode Selection ───────────────────────\n",
    "# Ref: Zero-Hardcode Policy\n",
    "# Cascade: .env → os.getenv → getpass → SIMULATION MODE\n",
    "\n",
    "api_key = get_api_key(\"OPENAI_API_KEY\")\n",
    "SIMULATION_MODE = api_key is None\n",
    "\n",
    "# ── Mode Banner ──────────────────────────────────────────────────\n",
    "if SIMULATION_MODE:\n",
    "    print()\n",
    "    print(\"=\" * 65)\n",
    "    print(\"  🔬  SIMULATION MODE ACTIVE\")\n",
    "    print(\"  All outputs are chapter-derived mocks from agent_utils.py.\")\n",
    "    print(\"  To enable Live Mode, add OPENAI_API_KEY to .env\")\n",
    "    print(\"=\" * 65)\n",
    "else:\n",
    "    print()\n",
    "    print(\"=\" * 65)\n",
    "    try:\n",
    "        sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath('.')), '..'))\n",
    "        from supporting.llm_provider import detect_provider, print_provider_banner\n",
    "        _prov = os.environ.get(\"LLM_PROVIDER\", \"auto\")\n",
    "        if _prov == \"auto\":\n",
    "            _prov, _, _ = detect_provider()\n",
    "        print_provider_banner(_prov, \"LIVE\")\n",
    "    except ImportError:\n",
    "        print(\"  🌐  LIVE MODE ACTIVE\")\n",
    "        print(\"  Real API calls enabled.\")\n",
    "    print(\"=\" * 65)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1fdbb9ed",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 1. Knowledge Retrieval Agent (pp. 146–153)\n",
    "\n",
    "**Ref:** §6.1 — Knowledge Retrieval Agents (pp. 146–153)\n",
    "\n",
    "A Knowledge Retrieval agent is the lifeline connecting an LLM's static training data to the living, ever-changing world of information. By linking to live sources such as databases and APIs, these agents directly address two critical LLM weaknesses: **knowledge cutoff** and **hallucination risk**, anchoring outputs in verifiable evidence.\n",
    "\n",
    "### Modular Architecture (Figure 6.1, p. 148)\n",
    "\n",
    "The agent operates through four sequential stages with parallel provenance tracking:\n",
    "\n",
    "1. **Query Understanding** — Parse intent, disambiguate, reformulate into a precise search query\n",
    "2. **Retrieval** — Execute the search plan against vector databases or search APIs (lexical, semantic, or hybrid)\n",
    "3. **Preprocessing** — Chunk documents, generate embeddings, filter irrelevant results\n",
    "4. **Synthesis** — Integrate retrieved content into the LLM prompt; generate a grounded answer with provenance\n",
    "\n",
    "> **Architecture:** See Figure 6.1 (p. 148) for the full modular architecture diagram showing the parallel **Provenance** component that collects citations, metadata, and confidence metrics throughout the pipeline.\n",
    "\n",
    "### Implementation Patterns (pp. 148–149)\n",
    "\n",
    "Three retrieval workflow patterns exist, each suited to different query complexity:\n",
    "- **Single-stage retrieval** — Direct query to one source. Low latency, limited recall.\n",
    "- **Multi-stage retrieval** — Broad search refined through targeted filters. Higher latency, better for exploratory queries.\n",
    "- **Hybrid retrieval** — Combines keyword (lexical/BM25) and vector similarity (semantic) search. Best recall for mixed-content corpora.\n",
    "\n",
    "> **Note — Agent Capability Level** (p. 146): A Knowledge Retrieval agent typically operates at **Level 2** (Tool-Using agent), parsing requests and chaining tool operations. More advanced agents that decompose high-level goals and maintain memory across steps can exhibit **Level 3** (Planning agent) behaviors.\n",
    "\n",
    "In this section, we implement an end-to-end RAG pipeline using LangChain, OpenAI embeddings (or mock equivalents), and FAISS as the vector store.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "908eb8f8",
   "metadata": {},
   "source": [
    "### Figure 6.1 — Modular Architecture of a Knowledge Retrieval Agent (p. 148)\n",
    "\n",
    "```\n",
    "                    ┌────────────────────────────────────┐\n",
    "  User Query ──────▶│  Query Understanding Layer         │\n",
    "                    │  Intent parsing, disambiguation,   │\n",
    "                    │  query reformulation               │\n",
    "                    └────────────────┬───────────────────┘\n",
    "                                     ▼\n",
    " ┌─────────────┐   ┌────────────────────────────────────┐   ┌──────────────────┐\n",
    " │ Vector DB   │──▶│  Retriever Module                  │   │  Provenance       │\n",
    " │ Search API  │──▶│  Lexical, semantic, or hybrid      │   │  ┌──────────────┐│\n",
    " │ Relational  │──▶│  retrieval from multiple sources   │   │  │ Citations    ││\n",
    " └─────────────┘   └────────────────┬───────────────────┘   │  │ Metadata     ││\n",
    "                                     ▼                       │  │ Traceability ││\n",
    "                    ┌────────────────────────────────────┐   │  │ Confidence   ││\n",
    "                    │  Preprocessing                     │◀──│  │ metrics      ││\n",
    "                    │  Chunking, embedding generation,   │   │  └──────────────┘│\n",
    "                    │  filtering, deduplication          │   └──────────────────┘\n",
    "                    └────────────────┬───────────────────┘\n",
    "                                     ▼\n",
    "                    ┌────────────────────────────────────┐\n",
    "                    │  Reasoning and Generation          │\n",
    "                    │  Synthesis within LLM context      │──────▶ Grounded Answer\n",
    "                    │  using only retrieved sources      │\n",
    "                    └────────────────────────────────────┘\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9074010a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.1 Load Documents ────────────────────────────────────────────\n",
    "# Ref: §6.1, RAG Pipeline Step 2 — DirectoryLoader (p. 149)\n",
    "#\n",
    "# We load from the docs/ directory which contains:\n",
    "#   - knowledge_base_rag.txt: RAG concepts, strategies, limitations\n",
    "#   - compliance_policy.txt:  Corporate policy (data retention, refunds)\n",
    "\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "log.info(\"Loading documents from docs/ directory...\")\n",
    "\n",
    "doc_dir = \"docs\"\n",
    "documents = []\n",
    "\n",
    "for fname in sorted(os.listdir(doc_dir)):\n",
    "    fpath = os.path.join(doc_dir, fname)\n",
    "    if os.path.isfile(fpath) and fname.endswith(\".txt\"):\n",
    "        with open(fpath, \"r\", encoding=\"utf-8\") as f:\n",
    "            content = f.read()\n",
    "        documents.append({\"content\": content, \"source\": fpath})\n",
    "        log.info(f\"  Loaded: {fname} ({len(content):,} chars)\")\n",
    "\n",
    "log.success(f\"Loaded {len(documents)} documents from {doc_dir}/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "395dca77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.2 Split Documents into Chunks ───────────────────────────────\n",
    "# Ref: §6.1, Chunking Strategies (p. 151)\n",
    "#\n",
    "# Parameters from the chapter's RAG pipeline example (p. 149):\n",
    "#   chunk_size=1000, chunk_overlap=200\n",
    "#\n",
    "# RecursiveCharacterTextSplitter splits on natural boundaries\n",
    "# (paragraphs → sentences → words) in descending order.\n",
    "\n",
    "splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,\n",
    "    chunk_overlap=200,\n",
    ")\n",
    "\n",
    "all_chunks = []\n",
    "all_metadatas = []\n",
    "\n",
    "for doc in documents:\n",
    "    chunks = splitter.split_text(doc[\"content\"])\n",
    "    for chunk in chunks:\n",
    "        all_chunks.append(chunk)\n",
    "        all_metadatas.append({\"source\": doc[\"source\"]})\n",
    "\n",
    "log.success(\n",
    "    f\"Split {len(documents)} documents into {len(all_chunks)} chunks \"\n",
    "    f\"(chunk_size=1000, overlap=200)\"\n",
    ")\n",
    "\n",
    "# Preview first chunk\n",
    "print(f\"\\n--- Chunk 0 preview (first 200 chars) ---\")\n",
    "print(all_chunks[0][:200] + \"...\")\n",
    "print(f\"Source: {all_metadatas[0]['source']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f158beb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.3 Create Embeddings & FAISS Vector Store ────────────────────\n",
    "# Ref: §6.1, Step 3 — OpenAIEmbeddings + FAISS.from_texts (pp. 149–150)\n",
    "#\n",
    "# In Simulation Mode: MockEmbeddings produces deterministic 256-dim\n",
    "# vectors via seeded hashing — no API key needed.\n",
    "# In Live Mode: OpenAIEmbeddings with text-embedding-3-large.\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "if SIMULATION_MODE:\n",
    "    log.info(\"[SIMULATION MODE] Using MockEmbeddings (256-dim, hash-seeded)\")\n",
    "    embeddings = MockEmbeddings()\n",
    "else:\n",
    "    from langchain_openai import OpenAIEmbeddings\n",
    "    embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n",
    "    log.info(\"[LIVE MODE] Using OpenAIEmbeddings (text-embedding-3-large)\")\n",
    "\n",
    "# Build FAISS index\n",
    "from langchain_community.vectorstores import FAISS\n",
    "\n",
    "@fail_gracefully(fallback_return=None, section_ref=\"6.1\")\n",
    "def build_faiss_index(chunks, metadatas, embed_model):\n",
    "    \"\"\"Create FAISS vector store from document chunks.\"\"\"\n",
    "    vectorstore = FAISS.from_texts(\n",
    "        texts=chunks,\n",
    "        embedding=embed_model,\n",
    "        metadatas=metadatas,\n",
    "    )\n",
    "    return vectorstore\n",
    "\n",
    "vectorstore = build_faiss_index(all_chunks, all_metadatas, embeddings)\n",
    "\n",
    "if vectorstore is not None:\n",
    "    log.success(f\"FAISS index built with {len(all_chunks)} vectors.\")\n",
    "else:\n",
    "    log.error(\"FAISS index creation failed — will use MockRetrievalQAResult.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c8c30e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.4 Build Retrieval + Generation Chain ────────────────────────\n",
    "# Ref: §6.1, Step 4 — RetrievalQA.from_chain_type (p. 150)\n",
    "#\n",
    "# The retriever returns the top k=3 most similar chunks.\n",
    "# These chunks are passed to the LLM as context for grounded generation.\n",
    "\n",
    "@fail_gracefully(fallback_return=None, section_ref=\"6.1\")\n",
    "def build_qa_chain(vstore, simulation_mode):\n",
    "    \"\"\"Build RetrievalQA chain with real or mock LLM.\"\"\"\n",
    "    if simulation_mode or vstore is None:\n",
    "        return None  # Will use MockRetrievalQAResult instead\n",
    "\n",
    "    from langchain_openai import ChatOpenAI\n",
    "try:\n",
    "    sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath('.')), '..'))\n",
    "    from supporting.llm_provider import get_llm\n",
    "except ImportError:\n",
    "    get_llm = None\n",
    "    from langchain.chains import RetrievalQA\n",
    "\n",
    "    retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
    "    _prov = os.environ.get(\"LLM_PROVIDER\", \"openai\")\n",
    "if get_llm is not None and not SIMULATION_MODE:\n",
    "    try:\n",
    "        llm = get_llm(provider=_prov, temperature=0)\n",
    "        log.success(f\"LLM initialized: {_prov}\")\n",
    "    except Exception:\n",
    "        llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n",
    "        log.success(\"LLM initialized: OpenAI (fallback)\")\n",
    "elif not SIMULATION_MODE:\n",
    "    llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n",
    "    log.success(\"LLM initialized: OpenAI\")\n",
    "    qa_chain = RetrievalQA.from_chain_type(\n",
    "        llm=llm,\n",
    "        retriever=retriever,\n",
    "        return_source_documents=True,\n",
    "    )\n",
    "    return qa_chain\n",
    "\n",
    "qa_chain = build_qa_chain(vectorstore, SIMULATION_MODE)\n",
    "\n",
    "if qa_chain is not None:\n",
    "    log.success(\"RetrievalQA chain ready (Live Mode).\")\n",
    "else:\n",
    "    log.info(\"[SIMULATION MODE] Using MockRetrievalQAResult for queries.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3050fe82",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.5 Run a Query — Grounded Answer with Provenance ─────────────\n",
    "# Ref: §6.1, Step 5 — Query execution (p. 150)\n",
    "#\n",
    "# Query: \"What are the main limitations of retrieval-augmented generation?\"\n",
    "# This matches the exact query from the chapter's RAG pipeline example.\n",
    "\n",
    "query = \"What are the main limitations of retrieval-augmented generation?\"\n",
    "log.info(f\"Query: {query}\")\n",
    "print()\n",
    "\n",
    "if qa_chain is not None:\n",
    "    # Live Mode: use real RetrievalQA chain\n",
    "    result = qa_chain({\"query\": query})\n",
    "    answer = result[\"result\"]\n",
    "    sources = result.get(\"source_documents\", [])\n",
    "else:\n",
    "    # Simulation Mode: use chapter-derived mock\n",
    "    mock_result = MockRetrievalQAResult(query).run()\n",
    "    answer = mock_result[\"result\"]\n",
    "    sources = mock_result[\"source_documents\"]\n",
    "\n",
    "# ── Display Answer ────────────────────────────────────────────────\n",
    "print(\"=\" * 65)\n",
    "print(\"ANSWER:\")\n",
    "print(\"=\" * 65)\n",
    "print(answer)\n",
    "\n",
    "# ── Display Sources (Provenance) ──────────────────────────────────\n",
    "print()\n",
    "print(\"SOURCES:\")\n",
    "print(\"-\" * 40)\n",
    "for i, doc in enumerate(sources, 1):\n",
    "    if isinstance(doc, dict):\n",
    "        src = doc.get(\"metadata\", {}).get(\"source\", \"unknown\")\n",
    "    else:\n",
    "        src = getattr(doc, \"metadata\", {}).get(\"source\", \"unknown\")\n",
    "    print(f\"  [{i}] {src}\")\n",
    "\n",
    "log.success(\"Knowledge Retrieval Agent query completed with provenance.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34e924a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 1.6 Diagnostic Query — Refund Policy Scenario ─────────────────\n",
    "# Ref: §6.1, Diagnosing Retrieval Failures (p. 152)\n",
    "#\n",
    "# The chapter describes a scenario: a user asks \"What is our refund\n",
    "# policy for subscriptions?\" and gets generic billing terms instead of\n",
    "# the specific subscription clause. This demonstrates why source\n",
    "# inspection and metadata filtering matter.\n",
    "\n",
    "diag_query = \"What is our refund policy for subscriptions?\"\n",
    "log.info(f\"Diagnostic query: {diag_query}\")\n",
    "print()\n",
    "\n",
    "if qa_chain is not None:\n",
    "    result = qa_chain({\"query\": diag_query})\n",
    "    answer = result[\"result\"]\n",
    "    sources = result.get(\"source_documents\", [])\n",
    "else:\n",
    "    mock_result = MockRetrievalQAResult(diag_query).run()\n",
    "    answer = mock_result[\"result\"]\n",
    "    sources = mock_result[\"source_documents\"]\n",
    "\n",
    "print(\"=\" * 65)\n",
    "print(\"ANSWER:\")\n",
    "print(\"=\" * 65)\n",
    "print(answer)\n",
    "print()\n",
    "print(\"SOURCES:\")\n",
    "print(\"-\" * 40)\n",
    "for i, doc in enumerate(sources, 1):\n",
    "    if isinstance(doc, dict):\n",
    "        src = doc.get(\"metadata\", {}).get(\"source\", \"unknown\")\n",
    "    else:\n",
    "        src = getattr(doc, \"metadata\", {}).get(\"source\", \"unknown\")\n",
    "    print(f\"  [{i}] {src}\")\n",
    "\n",
    "log.success(\"Diagnostic query completed — inspect sources for retrieval quality.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3155a380",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2. Chunking Strategies Deep Dive (p. 151)\n",
    "\n",
    "**Ref:** §6.1 — Chunking Strategies (p. 151)\n",
    "\n",
    "Chunking is the **most consequential configuration decision** in a RAG system. The chunk is the atomic unit retrieved by the vector index; its size determines how much text the LLM receives as context for each match.\n",
    "\n",
    "### Three Strategies (p. 151)\n",
    "\n",
    "1. **Fixed-size chunking** — Splits at a fixed character/token boundary. Simplest approach; suits uniform documents.\n",
    "2. **Recursive chunking** — Splits on natural boundaries (paragraphs → sentences → words) in descending order. **Recommended default** for mixed-content corpora.\n",
    "3. **Semantic chunking** — Uses embedding similarity to detect topic shifts. Highest retrieval fidelity for narrative text; higher computational cost at ingestion time.\n",
    "\n",
    "### The Size-Overlap Trade-Off (p. 151)\n",
    "\n",
    "- **Smaller chunks** (200–500 chars): better precision, but risk losing surrounding context\n",
    "- **Larger chunks** (1,000–2,000 chars): richer context, but diluted embedding signal reducing recall\n",
    "- **Overlap** (e.g., 200 chars on 1,000-char chunks): ensures boundary sentences are captured by at least one chunk\n",
    "\n",
    "> **Production Warning** (p. 151): Misconfiguring chunk size and overlap is the most common source of retrieval-quality degradation in production. Overly large chunks introduce irrelevant context, overly small chunks produce incomplete answers, and insufficient overlap creates boundary artifacts where key facts fall between chunks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "089e0639",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 2.1 Sample Text for Chunking Comparison ───────────────────────\n",
    "# Ref: §6.1, Chunking Strategies (p. 151)\n",
    "#\n",
    "# We use the first document from our corpus to demonstrate all three\n",
    "# chunking strategies side by side.\n",
    "\n",
    "with open(\"docs/knowledge_base_rag.txt\", \"r\") as f:\n",
    "    sample_text = f.read()\n",
    "\n",
    "log.info(f\"Sample text length: {len(sample_text):,} characters\")\n",
    "print(f\"First 200 chars: {sample_text[:200]}...\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "143f2b86",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 2.2 Fixed-Size Chunking ────────────────────────────────────────\n",
    "# Ref: §6.1, \"Fixed-size chunking divides text at a fixed character\n",
    "#       or token boundary\" (p. 151)\n",
    "\n",
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "fixed_splitter = CharacterTextSplitter(\n",
    "    separator=\"\",           # Pure character boundary\n",
    "    chunk_size=500,\n",
    "    chunk_overlap=0,        # No overlap for contrast\n",
    ")\n",
    "fixed_chunks = fixed_splitter.split_text(sample_text)\n",
    "\n",
    "log.info(f\"Fixed-size chunking: {len(fixed_chunks)} chunks (size=500, overlap=0)\")\n",
    "for i, chunk in enumerate(fixed_chunks[:3]):\n",
    "    print(f'  Chunk {i}: {len(chunk)} chars | \"{chunk[:60]}...\"')\n",
    "\n",
    "print(f\"  ... ({len(fixed_chunks)} total chunks)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b3386fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 2.3 Recursive Chunking (Recommended Default) ──────────────────\n",
    "# Ref: §6.1, \"Recursive chunking attempts to split on natural\n",
    "#       boundaries (paragraphs, sentences, words)\" (p. 151)\n",
    "#\n",
    "# Parameters match the chapter's RAG pipeline: chunk_size=1000, overlap=200\n",
    "\n",
    "recursive_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,\n",
    "    chunk_overlap=200,\n",
    ")\n",
    "recursive_chunks = recursive_splitter.split_text(sample_text)\n",
    "\n",
    "log.info(f\"Recursive chunking: {len(recursive_chunks)} chunks (size=1000, overlap=200)\")\n",
    "for i, chunk in enumerate(recursive_chunks[:3]):\n",
    "    print(f'  Chunk {i}: {len(chunk)} chars | \"{chunk[:60]}...\"')\n",
    "\n",
    "print(f\"  ... ({len(recursive_chunks)} total chunks)\")\n",
    "\n",
    "# Demonstrate overlap preservation\n",
    "if len(recursive_chunks) >= 2:\n",
    "    tail = recursive_chunks[0][-50:]\n",
    "    head = recursive_chunks[1][:50]\n",
    "    overlap_found = any(\n",
    "        tail[i:i+20] in recursive_chunks[1][:250]\n",
    "        for i in range(len(tail) - 20)\n",
    "    )\n",
    "    print(f\"\\n  Overlap check (chunk 0 tail → chunk 1 head):\")\n",
    "    print(f\"    Chunk 0 ends:   ...{repr(recursive_chunks[0][-60:])}\")\n",
    "    print(f\"    Chunk 1 starts: {repr(recursive_chunks[1][:60])}...\")\n",
    "    print(f\"    Overlap preserved: {'Yes ✓' if overlap_found else 'Minimal (natural boundary split)'}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7934a5aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 2.4 Semantic Chunking (Simulated) ─────────────────────────────\n",
    "# Ref: §6.1, \"Semantic chunking uses embedding similarity to detect\n",
    "#       natural topic shifts before splitting\" (p. 151)\n",
    "#\n",
    "# In a production system, this would use an embedding model to compute\n",
    "# similarity between adjacent sentences and split where similarity\n",
    "# drops below a threshold. Here we simulate the concept.\n",
    "\n",
    "log.info(\"Semantic chunking (simulated via paragraph boundaries)\")\n",
    "\n",
    "# Approximate semantic chunking by splitting on double-newlines (paragraphs)\n",
    "# then merging adjacent paragraphs that are semantically related (by length heuristic)\n",
    "paragraphs = [p.strip() for p in sample_text.split(\"\\n\\n\") if p.strip()]\n",
    "\n",
    "semantic_chunks = []\n",
    "current_chunk = \"\"\n",
    "for para in paragraphs:\n",
    "    if len(current_chunk) + len(para) < 1200:\n",
    "        current_chunk += (\"\\n\\n\" + para if current_chunk else para)\n",
    "    else:\n",
    "        if current_chunk:\n",
    "            semantic_chunks.append(current_chunk)\n",
    "        current_chunk = para\n",
    "if current_chunk:\n",
    "    semantic_chunks.append(current_chunk)\n",
    "\n",
    "log.info(f\"Semantic chunking: {len(semantic_chunks)} chunks (paragraph-aligned)\")\n",
    "for i, chunk in enumerate(semantic_chunks[:3]):\n",
    "    print(f'  Chunk {i}: {len(chunk)} chars | \"{chunk[:60]}...\"')\n",
    "\n",
    "print(f\"  ... ({len(semantic_chunks)} total chunks)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04b6bb67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 2.5 Chunking Comparison Summary ────────────────────────────────\n",
    "# Ref: §6.1, Chunking Strategies (p. 151)\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "comparison = pd.DataFrame({\n",
    "    \"Strategy\": [\"Fixed-size\", \"Recursive (recommended)\", \"Semantic (simulated)\"],\n",
    "    \"Chunks\": [len(fixed_chunks), len(recursive_chunks), len(semantic_chunks)],\n",
    "    \"Avg Size (chars)\": [\n",
    "        int(sum(len(c) for c in fixed_chunks) / max(len(fixed_chunks), 1)),\n",
    "        int(sum(len(c) for c in recursive_chunks) / max(len(recursive_chunks), 1)),\n",
    "        int(sum(len(c) for c in semantic_chunks) / max(len(semantic_chunks), 1)),\n",
    "    ],\n",
    "    \"Min Size\": [\n",
    "        min(len(c) for c in fixed_chunks),\n",
    "        min(len(c) for c in recursive_chunks),\n",
    "        min(len(c) for c in semantic_chunks),\n",
    "    ],\n",
    "    \"Max Size\": [\n",
    "        max(len(c) for c in fixed_chunks),\n",
    "        max(len(c) for c in recursive_chunks),\n",
    "        max(len(c) for c in semantic_chunks),\n",
    "    ],\n",
    "    \"Overlap\": [\"None\", \"200 chars\", \"Natural\"],\n",
    "    \"Best For\": [\n",
    "        \"Uniform documents\",\n",
    "        \"Mixed-content corpora (default)\",\n",
    "        \"Narrative text (higher cost)\",\n",
    "    ],\n",
    "})\n",
    "\n",
    "print(\"=\" * 80)\n",
    "print(\"CHUNKING STRATEGY COMPARISON\")\n",
    "print(\"=\" * 80)\n",
    "print(comparison.to_string(index=False))\n",
    "print()\n",
    "log.success(\"Chunking deep dive complete — recursive chunking with 1000/200 is the recommended default.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92d40d6c",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 3. Document Intelligence Agent (pp. 153–160)\n",
    "\n",
    "**Ref:** §6.2 — Document Intelligence Agents (pp. 153–160)\n",
    "\n",
    "While Knowledge Retrieval agents find the right file, Document Intelligence agents cross the document boundary to **extract, parse, and transform** messy content into structured, machine-readable data. In regulated industries where accuracy and traceability are paramount, these agents can shrink contract review cycles, automate claims processing, and feed analytics platforms with reliable data.\n",
    "\n",
    "### The Five-Stage Pipeline (Figure 6.2, p. 159)\n",
    "\n",
    "1. **Ingestion & Triage** — Classify document type (invoice, lab report, contract), detect language, route to specialized workflow. Uses MIME-type detection as first pass, with lightweight classifiers for ambiguous cases.\n",
    "2. **OCR & Preprocessing** — Deskew, denoise, convert images to text with **confidence scores** (0–100). Low-confidence tokens are flagged for human review downstream.\n",
    "3. **Structural Segmentation** — Identify headings, tables, key-value pairs; reconstruct reading order essential for understanding context.\n",
    "4. **Information Extraction** — **Schema-driven extraction** of entities and relationships (e.g., Invoice Number, Total Amount). Output is structured JSON with confidence scores and provenance per field.\n",
    "5. **Validation & Integration** — Route high-confidence results to downstream systems (ERPs, CRMs); flag low-confidence for **human-in-the-loop (HITL)** review.\n",
    "\n",
    "> **Architecture:** See Figure 6.2 (p. 159) for the full five-stage pipeline diagram showing confidence scoring, schema-driven extraction, and the HITL review loop.\n",
    "\n",
    "**Key parameters from the chapter:**\n",
    "- `CONFIDENCE_THRESHOLD = 60` — Tesseract confidence score (0–100) below which tokens are flagged for review (p. 154)\n",
    "- Schema fields: `invoice_number`, `invoice_date`, `total_amount` (p. 155)\n",
    "- Target accuracy: **95%+ on critical fields**; human review **< 8%** (p. 159)\n",
    "\n",
    "> **Note — Agent Capability Level** (p. 153): A typical Document Intelligence agent operates at **Level 2** (Tool-Using), orchestrating OCR engines, layout parsers, and extraction models. More advanced agents that dynamically re-plan based on document complexity demonstrate **Level 3** (Planning) capabilities.\n",
    "\n",
    "> **Note — Development Best Practices** (p. 160): The Agent Development Lifecycle (ADL) emphasizes **curated datasets** for training/testing, **resilient design** with cascading extraction strategies for low-confidence fields, **HITL integration** from day one, and **full provenance** preserving page numbers, bounding box coordinates, and token indices for every extracted field.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "004389ad",
   "metadata": {},
   "source": [
    "### Figure 6.2 — Document Intelligence Pipeline (p. 159)\n",
    "\n",
    "```\n",
    "  Confidence                                                       \n",
    "   scoring         ┌──────────────────────┐                        \n",
    "      │            │  1. Ingest & Triage   │  Classify, route      \n",
    "      │            │     by document type  │                       \n",
    "      │            └──────────┬───────────┘                        \n",
    "      │                       ▼                                     \n",
    "      ├───────────▶┌──────────────────────┐                        \n",
    "      │            │  2. OCR & Preprocess  │  Clean, deskew,       \n",
    "      │            │     extract text      │  confidence scores    \n",
    "  Schema           └──────────┬───────────┘                        \n",
    "  driven                      ▼                                     \n",
    "      ├───────────▶┌──────────────────────┐                        \n",
    "      │            │  3. Layout Parse      │  Tables, blocks,      \n",
    "      │            │     reading order     │                       \n",
    "      │            └──────────┬───────────┘   95%+ accuracy        \n",
    "      │                       ▼               target               \n",
    "      └───────────▶┌──────────────────────┐                        \n",
    "                   │  4. Extract Data      │  Entities, relations, \n",
    "                   │     provenance        │                       \n",
    "                   └──────────┬───────────┘   HITL loop            \n",
    "                              ▼               <8% review           \n",
    "                   ┌──────────────────────┐                        \n",
    "                   │  5. Integrate         │  ERP, CRM,            \n",
    "                   │     human review      │  downstream systems   \n",
    "                   └──────────────────────┘                        \n",
    "```\n"
   ]
  },
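  {
   "cell_type": "markdown",
   "id": "7c1e4b9a",
   "metadata": {},
   "source": [
    "**Sketch: Stage 1 triage by MIME type.** The first-pass routing described in Stage 1 could look roughly like the following. This is a minimal sketch, not the chapter's implementation: the `ROUTES` table, the workflow names, and the `triage` helper are hypothetical, and only the standard-library `mimetypes` module is assumed.\n",
    "\n",
    "```python\n",
    "import mimetypes\n",
    "\n",
    "# Hypothetical routing table: MIME type -> workflow name (assumption, not from the chapter)\n",
    "ROUTES = {\n",
    "    'application/pdf': 'pdf_workflow',\n",
    "    'image/png': 'ocr_workflow',\n",
    "    'image/jpeg': 'ocr_workflow',\n",
    "    'text/plain': 'text_workflow',\n",
    "}\n",
    "\n",
    "def triage(path):\n",
    "    # Guess the MIME type from the file extension (first pass only)\n",
    "    mime, _ = mimetypes.guess_type(path)\n",
    "    # Unknown or unmapped types are handed to a lightweight classifier stage\n",
    "    return mime, ROUTES.get(mime, 'needs_classifier')\n",
    "\n",
    "print(triage('samples/sample_invoice.png'))\n",
    "```\n",
    "\n",
    "Extension-based detection is cheap but easy to fool, which is why ambiguous or unmapped files are passed to a classifier rather than rejected."
   ]
  },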
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a148046",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 3.1 Generate Synthetic Invoice Image ──────────────────────────\n",
    "# Ref: §6.2, Strategy Item #4 (p. 155)\n",
    "#\n",
    "# We use Pillow to create a deterministic invoice PNG with:\n",
    "#   - Header: \"INVOICE\"\n",
    "#   - Invoice No: INV-2026-00142, Date: 2026-03-15\n",
    "#   - Line items table (3 rows)\n",
    "#   - Total Due: $4,750.00\n",
    "#   - One smudged region to demonstrate confidence thresholding\n",
    "\n",
    "import os\n",
    "from PIL import Image, ImageDraw, ImageFont\n",
    "\n",
    "@fail_gracefully(fallback_return=\"samples/sample_invoice.png\", section_ref=\"6.2\")\n",
    "def generate_invoice_image(output_path=\"samples/sample_invoice.png\"):\n",
    "    \"\"\"Generate a synthetic invoice PNG for OCR demonstration.\"\"\"\n",
    "    os.makedirs(os.path.dirname(output_path), exist_ok=True)\n",
    "\n",
    "    # Canvas setup\n",
    "    width, height = 600, 500\n",
    "    img = Image.new(\"RGB\", (width, height), \"white\")\n",
    "    draw = ImageDraw.Draw(img)\n",
    "\n",
    "    # Use default font (available everywhere)\n",
    "    try:\n",
    "        font_large = ImageFont.truetype(\"/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf\", 28)\n",
    "        font_med = ImageFont.truetype(\"/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf\", 16)\n",
    "        font_small = ImageFont.truetype(\"/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf\", 13)\n",
    "    except (OSError, IOError):\n",
    "        font_large = ImageFont.load_default()\n",
    "        font_med = ImageFont.load_default()\n",
    "        font_small = ImageFont.load_default()\n",
    "\n",
    "    # Header\n",
    "    draw.text((40, 30), \"INVOICE\", fill=\"black\", font=font_large)\n",
    "    draw.line([(40, 65), (560, 65)], fill=\"black\", width=2)\n",
    "\n",
    "    # Invoice details\n",
    "    draw.text((40, 80), \"Invoice No:\", fill=\"gray\", font=font_med)\n",
    "    draw.text((170, 80), \"INV-2026-00142\", fill=\"black\", font=font_med)\n",
    "    draw.text((40, 105), \"Date:\", fill=\"gray\", font=font_med)\n",
    "    draw.text((170, 105), \"2026-03-15\", fill=\"black\", font=font_med)\n",
    "    draw.text((40, 130), \"Bill To:\", fill=\"gray\", font=font_med)\n",
    "    draw.text((170, 130), \"Acme Corporation\", fill=\"black\", font=font_med)\n",
    "\n",
    "    # Line items header\n",
    "    y_table = 175\n",
    "    draw.line([(40, y_table), (560, y_table)], fill=\"black\", width=1)\n",
    "    draw.text((40, y_table + 5), \"Description\", fill=\"black\", font=font_med)\n",
    "    draw.text((320, y_table + 5), \"Qty\", fill=\"black\", font=font_med)\n",
    "    draw.text((390, y_table + 5), \"Unit Price\", fill=\"black\", font=font_med)\n",
    "    draw.text((490, y_table + 5), \"Amount\", fill=\"black\", font=font_med)\n",
    "    draw.line([(40, y_table + 28), (560, y_table + 28)], fill=\"gray\", width=1)\n",
    "\n",
    "    # Line items\n",
    "    items = [\n",
    "        (\"AI Consulting Services\", \"10\", \"$350.00\", \"$3,500.00\"),\n",
    "        (\"Data Pipeline Setup\", \"1\", \"$750.00\", \"$750.00\"),\n",
    "        (\"Documentation Package\", \"2\", \"$250.00\", \"$500.00\"),\n",
    "    ]\n",
    "    y = y_table + 35\n",
    "    for desc, qty, unit, amount in items:\n",
    "        draw.text((40, y), desc, fill=\"black\", font=font_small)\n",
    "        draw.text((330, y), qty, fill=\"black\", font=font_small)\n",
    "        draw.text((390, y), unit, fill=\"black\", font=font_small)\n",
    "        draw.text((490, y), amount, fill=\"black\", font=font_small)\n",
    "        y += 25\n",
    "\n",
    "    # Total\n",
    "    draw.line([(40, y + 5), (560, y + 5)], fill=\"black\", width=2)\n",
    "    draw.text((390, y + 12), \"Total Due:\", fill=\"black\", font=font_med)\n",
    "    draw.text((490, y + 12), \"$4,750.00\", fill=\"black\", font=font_med)\n",
    "\n",
    "    # Smudged region (simulates low-confidence OCR area)\n",
    "    smudge_y = y + 55\n",
    "    for sy in range(smudge_y, smudge_y + 20):\n",
    "        for sx in range(40, 180):\n",
    "            import random\n",
    "            random.seed(sx * 1000 + sy)\n",
    "            if random.random() < 0.4:\n",
    "                gray = random.randint(100, 200)\n",
    "                draw.point((sx, sy), fill=(gray, gray, gray))\n",
    "    draw.text((45, smudge_y + 2), \"Smudged region\", fill=(150, 150, 150), font=font_small)\n",
    "\n",
    "    # Save\n",
    "    img.save(output_path)\n",
    "    return output_path\n",
    "\n",
    "invoice_path = generate_invoice_image()\n",
    "log.success(f\"Invoice image saved to: {invoice_path}\")\n",
    "\n",
    "# Display the image in notebook\n",
    "from IPython.display import display, Image as IPImage\n",
    "display(IPImage(filename=invoice_path, width=450))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5d66413",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 3.2 OCR Processing with Confidence Scoring ───────────────────\n",
    "# Ref: §6.2, Stage 2 — Preprocessing and OCR (p. 154)\n",
    "#\n",
    "# CONFIDENCE_THRESHOLD = 60 (from chapter, p. 12)\n",
    "# Tokens below this threshold are flagged for human review.\n",
    "#\n",
    "# In Live Mode:  pytesseract.image_to_data() on the invoice image\n",
    "# In Sim Mode:   mock_pytesseract_output() from agent_utils.py\n",
    "\n",
    "CONFIDENCE_THRESHOLD = 60  # Tesseract confidence: 0-100\n",
    "\n",
    "@fail_gracefully(fallback_return=lambda: mock_pytesseract_output(), section_ref=\"6.2\")\n",
    "def run_ocr(image_path):\n",
    "    \"\"\"Run OCR on invoice image, returning pytesseract-format dict.\"\"\"\n",
    "    if SIMULATION_MODE:\n",
    "        raise RuntimeError(\"Simulation Mode — bypassing Tesseract\")\n",
    "\n",
    "    import pytesseract\n",
    "    from PIL import Image\n",
    "    img = Image.open(image_path)\n",
    "    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)\n",
    "    return data\n",
    "\n",
    "log.info(f\"Running OCR on {invoice_path} (threshold={CONFIDENCE_THRESHOLD})...\")\n",
    "ocr_data = run_ocr(invoice_path)\n",
    "\n",
    "# Filter tokens by confidence\n",
    "all_tokens = []\n",
    "high_conf_tokens = []\n",
    "low_conf_tokens = []\n",
    "\n",
    "for i in range(len(ocr_data[\"text\"])):\n",
    "    text = ocr_data[\"text\"][i].strip()\n",
    "    if not text:\n",
    "        continue\n",
    "    conf = int(ocr_data[\"conf\"][i])\n",
    "    token_info = {\n",
    "        \"text\": text,\n",
    "        \"confidence\": conf,\n",
    "        \"x\": ocr_data[\"left\"][i],\n",
    "        \"y\": ocr_data[\"top\"][i],\n",
    "        \"line_id\": ocr_data[\"line_num\"][i],\n",
    "    }\n",
    "    all_tokens.append(token_info)\n",
    "    if conf >= CONFIDENCE_THRESHOLD:\n",
    "        high_conf_tokens.append(token_info)\n",
    "    else:\n",
    "        low_conf_tokens.append(token_info)\n",
    "\n",
    "log.success(f\"OCR complete: {len(all_tokens)} tokens total\")\n",
    "print(f\"  High confidence (>={CONFIDENCE_THRESHOLD}): {len(high_conf_tokens)} tokens\")\n",
    "print(f\"  Low confidence  (<{CONFIDENCE_THRESHOLD}):  {len(low_conf_tokens)} tokens (flagged for review)\")\n",
    "\n",
    "print()\n",
    "print(\"All OCR tokens:\")\n",
    "print(f\"  {'Token':<20} {'Conf':>5}  {'Status'}\")\n",
    "print(f\"  {'-'*20} {'-'*5}  {'-'*12}\")\n",
    "for t in all_tokens:\n",
    "    status = \"✓ accepted\" if t[\"confidence\"] >= CONFIDENCE_THRESHOLD else \"⚠ flagged\"\n",
    "    print(f'  {t[\"text\"]:<20} {t[\"confidence\"]:>5}  {status}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08e81bec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 3.3 Schema-Driven Field Extraction ────────────────────────────\n",
    "# Ref: §6.2, Stage 4 — Information Extraction (p. 155)\n",
    "#\n",
    "# SCHEMA from the chapter (p. 155):\n",
    "#   invoice_number: [\"invoice no\", \"invoice number\", \"inv no\"]\n",
    "#   invoice_date:   [\"date\", \"invoice date\"]\n",
    "#   total_amount:   [\"total\", \"amount due\", \"balance due\", \"total due\"]\n",
    "\n",
    "SCHEMA = {\n",
    "    \"invoice_number\": [\"invoice no\", \"invoice number\", \"inv no\"],\n",
    "    \"invoice_date\": [\"date\", \"invoice date\"],\n",
    "    \"total_amount\": [\"total\", \"amount due\", \"balance due\", \"total due\"],\n",
    "}\n",
    "\n",
    "@fail_gracefully(fallback_return=lambda: MOCK_EXTRACTED_FIELDS.copy(), section_ref=\"6.2\")\n",
    "def extract_fields(tokens, schema):\n",
    "    \"\"\"\n",
    "    Extract fields using keyword proximity matching on OCR tokens.\n",
    "    Strategy: find a line containing a schema cue keyword, then\n",
    "    return the value token(s) to the right of / below the keyword.\n",
    "    Ref: §6.2, extract_near_keyword pattern (pp. 156–158)\n",
    "    \"\"\"\n",
    "    # Group tokens by line\n",
    "    lines = {}\n",
    "    for t in tokens:\n",
    "        lid = t[\"line_id\"]\n",
    "        lines.setdefault(lid, []).append(t)\n",
    "    # Sort tokens within each line by x position\n",
    "    for lid in lines:\n",
    "        lines[lid] = sorted(lines[lid], key=lambda t: t[\"x\"])\n",
    "\n",
    "    results = {}\n",
    "    for field_name, cue_keywords in schema.items():\n",
    "        results[field_name] = \"\"\n",
    "        for lid, line_tokens in sorted(lines.items()):\n",
    "            line_text = \" \".join(t[\"text\"].lower() for t in line_tokens)\n",
    "            matched = any(kw in line_text for kw in cue_keywords)\n",
    "            if not matched:\n",
    "                continue\n",
    "            # Find value: tokens after the cue keyword on the same line\n",
    "            # Heuristic: skip tokens that are part of the keyword itself\n",
    "            cue_token_count = max(len(kw.split()) for kw in cue_keywords if kw in line_text)\n",
    "            value_tokens = line_tokens[cue_token_count:]\n",
    "            value = \" \".join(\n",
    "                t[\"text\"] for t in value_tokens\n",
    "                if t[\"text\"].lower() not in (\":\", \"-\")\n",
    "            ).strip(\": \")\n",
    "            if value:\n",
    "                results[field_name] = value\n",
    "                break\n",
    "    return results\n",
    "\n",
    "extracted = extract_fields(high_conf_tokens, SCHEMA)\n",
    "\n",
    "# Display results\n",
    "print(\"=\" * 55)\n",
    "print(\"EXTRACTED FIELDS (Schema-Driven)\")\n",
    "print(\"=\" * 55)\n",
    "\n",
    "import json\n",
    "print(json.dumps(extracted, indent=2))\n",
    "\n",
    "print()\n",
    "print(\"PROVENANCE:\")\n",
    "print(f\"  Source image: {invoice_path}\")\n",
    "print(f\"  OCR tokens used: {len(high_conf_tokens)} (above threshold {CONFIDENCE_THRESHOLD})\")\n",
    "print(f\"  Low-confidence tokens flagged: {len(low_conf_tokens)}\")\n",
    "print(f\"  Schema fields: {list(SCHEMA.keys())}\")\n",
    "\n",
    "# Validate against expected values\n",
    "expected = MOCK_EXTRACTED_FIELDS\n",
    "all_match = True\n",
    "for field, expected_val in expected.items():\n",
    "    actual = extracted.get(field, \"\")\n",
    "    match = expected_val in actual or actual in expected_val\n",
    "    symbol = \"✓\" if match else \"✗\"\n",
    "    print(f\"  {symbol} {field}: extracted='{actual}' expected='{expected_val}'\")\n",
    "    if not match:\n",
    "        all_match = False\n",
    "\n",
    "if all_match:\n",
    "    log.success(\"All fields extracted correctly — Document Intelligence pipeline complete.\")\n",
    "else:\n",
    "    log.info(\"Some fields may differ in Simulation vs Live mode — review above.\")"
   ]
  },
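  {
   "cell_type": "markdown",
   "id": "3e9d52f0",
   "metadata": {},
   "source": [
    "**Sketch: Stage 5 routing by confidence.** The cells above stop at extraction, so the following minimal sketch illustrates how Stage 5 (Validation & Integration) might split fields between automatic acceptance and HITL review. It is an assumption-laden illustration: the `route_fields` helper and the per-field confidence values are hypothetical, and a real pipeline would derive each field's confidence from its OCR tokens.\n",
    "\n",
    "```python\n",
    "# Hypothetical per-field results; the confidence numbers are invented for illustration\n",
    "fields = {\n",
    "    'invoice_number': {'value': 'INV-2026-00142', 'confidence': 93},\n",
    "    'invoice_date': {'value': '2026-03-15', 'confidence': 88},\n",
    "    'total_amount': {'value': '$4,750.00', 'confidence': 47},\n",
    "}\n",
    "\n",
    "def route_fields(fields, threshold=60):\n",
    "    # Split fields into an auto-accepted dict and a human-review dict\n",
    "    accepted, review = {}, {}\n",
    "    for name, info in fields.items():\n",
    "        target = accepted if info['confidence'] >= threshold else review\n",
    "        target[name] = info\n",
    "    return accepted, review\n",
    "\n",
    "accepted, review = route_fields(fields)\n",
    "print('auto-accepted:', sorted(accepted))\n",
    "print('needs review: ', sorted(review))\n",
    "```\n",
    "\n",
    "Routing per field rather than per document keeps the review queue small: a reviewer only sees the values that fell below the threshold."
   ]
  },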
  {
   "cell_type": "markdown",
   "id": "9793d1c1",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 4. Scientific Research Agent (pp. 161–168)\n",
    "\n",
    "**Ref:** §6.3 — Scientific Research Agents (pp. 161–168)\n",
    "\n",
    "Scientific Research agents represent a sophisticated evolution in agent capabilities, demonstrating a layered approach to information synthesis that goes beyond simple retrieval. They integrate knowledge from multiple databases, reconcile conflicting results, and produce synthesis reports highlighting consensus, divergence, and gaps in knowledge.\n",
    "\n",
    "### Three-Phase Research Workflow (pp. 162–166)\n",
    "\n",
    "1. **Broad Literature Scanning** — Query academic databases (PubMed, arXiv, IEEE Xplore, Scopus) using semantic search to capture conceptually relevant studies (p. 162)\n",
    "2. **Thematic Clustering & Summarization** — Group retrieved papers by shared themes (methodology, findings, domain) using embeddings and clustering (pp. 164–165)\n",
    "3. **Synthesis & Insight Generation** — Produce structured outputs: comparative tables, evidence maps, summaries highlighting consensus, divergence, and gaps (p. 166)\n",
    "\n",
    "### Cognitive Loop Mapping (p. 162)\n",
    "\n",
    "- *Perception*: User query (e.g., \"find new treatments for a rare disease\")\n",
    "- *Reasoning*: Translate into broad semantic search strategy\n",
    "- *Planning*: Multi-phase research strategy with dependencies\n",
    "- *Action*: Query databases, cluster papers, traverse citation graphs\n",
    "- *Learning*: Produce synthesis reports, identify promising directions\n",
    "\n",
    "### Advanced Technical Architecture (p. 166)\n",
    "\n",
    "- **Multi-database querying** — Parallel searches across diverse repositories\n",
    "- **Citation graph traversal** — Following citation chains to discover related studies\n",
    "- **Entity linking** — Unifying related concepts across sources\n",
    "- **Multi-hop reasoning** — Drawing connections between findings from separate studies\n",
    "- **Multi-vector retrieval** — Capturing different aspects of papers (methodology, findings, implications)\n",
    "\n",
    "> **Note — Citation Graph Traversal** (p. 162): Citation graph traversal maps papers as nodes and their citations as edges, allowing the agent to identify influential works, discover clusters of related research, and track how ideas evolve over time.\n",
    "\n",
    "> **Note — MCP and A2A Interoperability** (p. 167): Scientific Research agents rarely operate in isolation. **MCP** (Model Context Protocol) enables dynamic interfacing with academic database APIs without hardcoded logic. **A2A** (Agent-to-Agent) protocols support multi-agent setups where one agent specializes in searching and another in synthesizing.\n",
    "\n",
    "> **Note — Limitations** (pp. 167–168): These agents have no true scientific understanding (they process text statistically), carry hallucination risk even with RAG grounding, cannot generate genuinely new knowledge, and face context window constraints that force trade-offs between breadth and depth.\n"
   ]
  },
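  {
   "cell_type": "markdown",
   "id": "b48a06d7",
   "metadata": {},
   "source": [
    "**Sketch: citation graph traversal.** The note on citation graphs (papers as nodes, citations as edges) can be illustrated with a small breadth-first walk. This is a minimal sketch under stated assumptions: the `CITES` graph and its paper IDs are toy data, and `traverse_citations` and `citation_counts` are hypothetical helper names, not functions from the chapter.\n",
    "\n",
    "```python\n",
    "from collections import deque\n",
    "\n",
    "# Toy citation graph (hypothetical paper IDs): paper -> papers it cites\n",
    "CITES = {\n",
    "    'A': ['B', 'C'],\n",
    "    'B': ['D'],\n",
    "    'C': ['D'],\n",
    "    'D': [],\n",
    "}\n",
    "\n",
    "def traverse_citations(graph, seed, max_hops=2):\n",
    "    # Breadth-first walk returning {paper: hop distance from the seed}\n",
    "    hops = {seed: 0}\n",
    "    queue = deque([seed])\n",
    "    while queue:\n",
    "        paper = queue.popleft()\n",
    "        if hops[paper] == max_hops:\n",
    "            continue  # do not expand beyond the hop limit\n",
    "        for cited in graph.get(paper, []):\n",
    "            if cited not in hops:\n",
    "                hops[cited] = hops[paper] + 1\n",
    "                queue.append(cited)\n",
    "    return hops\n",
    "\n",
    "def citation_counts(graph):\n",
    "    # In-degree: how often each paper is cited within the graph\n",
    "    counts = {paper: 0 for paper in graph}\n",
    "    for cited_list in graph.values():\n",
    "        for cited in cited_list:\n",
    "            counts[cited] = counts.get(cited, 0) + 1\n",
    "    return counts\n",
    "\n",
    "print(traverse_citations(CITES, 'A'))\n",
    "print(citation_counts(CITES))\n",
    "```\n",
    "\n",
    "The hop limit bounds how far the agent wanders from the seed paper, and the in-degree count is a crude proxy for influence within the retrieved set."
   ]
  },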
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37223e52",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 4.1 Broad Literature Scanning (Phase 1) ───────────────────────\n",
    "# Ref: §6.3, Phase 1 — search_arxiv (pp. 163–164)\n",
    "#\n",
    "# Query: \"large language models retrieval augmented generation evaluation\"\n",
    "# MAX_RESULTS = 12 (matches strategy mock data count)\n",
    "#\n",
    "# In Simulation Mode: mock_search_arxiv() returns 12 chapter-derived papers\n",
    "# In Live Mode: arxiv.Search() queries the real arXiv API\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "QUERY = \"large language models retrieval augmented generation evaluation\"\n",
    "MAX_RESULTS = 12\n",
    "CLUSTERS = 4\n",
    "\n",
    "@fail_gracefully(\n",
    "    fallback_return=lambda: mock_search_arxiv(QUERY, MAX_RESULTS),\n",
    "    section_ref=\"6.3\"\n",
    ")\n",
    "def search_literature(query, max_results):\n",
    "    \"\"\"Search arXiv for papers matching the research query.\"\"\"\n",
    "    if SIMULATION_MODE:\n",
    "        return mock_search_arxiv(query, max_results)\n",
    "\n",
    "    import arxiv\n",
    "    results = []\n",
    "    for r in arxiv.Search(\n",
    "        query=query,\n",
    "        max_results=max_results,\n",
    "        sort_by=arxiv.SortCriterion.Relevance,\n",
    "    ).results():\n",
    "        results.append({\n",
    "            \"title\": r.title.strip(),\n",
    "            \"summary\": r.summary.strip(),\n",
    "            \"authors\": \", \".join(a.name for a in r.authors),\n",
    "            \"published\": r.published.strftime(\"%Y-%m-%d\"),\n",
    "            \"url\": r.entry_id,\n",
    "        })\n",
    "    return pd.DataFrame(results)\n",
    "\n",
    "df = search_literature(QUERY, MAX_RESULTS)\n",
    "\n",
    "# Combine title + abstract as the unit of meaning (per chapter, p. 23)\n",
    "df[\"text\"] = df[\"title\"].astype(str) + \". \" + df[\"summary\"].astype(str)\n",
    "\n",
    "log.success(f\"Phase 1 complete: {len(df)} papers retrieved\")\n",
    "print(f\"Query: '{QUERY}'\")\n",
    "print(f\"\\nFirst 3 papers:\")\n",
    "for i, row in df.head(3).iterrows():\n",
    "    print(f'  [{i}] {row[\"title\"]}')\n",
    "    print(f'      {row[\"authors\"]} | {row[\"published\"]}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f430e4ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 4.2 Thematic Clustering (Phase 2) ─────────────────────────────\n",
    "# Ref: §6.3, Phase 2 — SentenceTransformer + KMeans (pp. 164–165)\n",
    "#\n",
    "# Model: all-MiniLM-L6-v2 (~80MB, produces 384-dim embeddings)\n",
    "# Clustering: KMeans with k=4 clusters (matching CLUSTERS constant from chapter)\n",
    "\n",
    "import numpy as np\n",
    "import re\n",
    "from collections import Counter\n",
    "\n",
    "# ── Compute embeddings ────────────────────────────────────────────\n",
    "@fail_gracefully(fallback_return=None, section_ref=\"6.3\")\n",
    "def compute_embeddings(texts):\n",
    "    \"\"\"Generate sentence embeddings using SentenceTransformer.\"\"\"\n",
    "    from sentence_transformers import SentenceTransformer\n",
    "    model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n",
    "    return model.encode(texts, normalize_embeddings=True)\n",
    "\n",
    "log.info(\"Computing embeddings for paper abstracts...\")\n",
    "emb = compute_embeddings(df[\"text\"].tolist())\n",
    "\n",
    "if emb is None:\n",
    "    # Fallback: use MockEmbeddings for clustering demo\n",
    "    log.info(\"[FALLBACK] Using MockEmbeddings for clustering demonstration\")\n",
    "    mock_emb = MockEmbeddings()\n",
    "    emb = np.array(mock_emb.embed_documents(df[\"text\"].tolist()))\n",
    "    log.success(f\"Mock embeddings generated: shape {emb.shape}\")\n",
    "else:\n",
    "    log.success(f\"Embeddings computed: shape {emb.shape}\")\n",
    "\n",
    "# ── KMeans clustering ─────────────────────────────────────────────\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "kmeans = KMeans(n_clusters=CLUSTERS, n_init=\"auto\", random_state=42)\n",
    "labels = kmeans.fit_predict(emb)\n",
    "df[\"cluster\"] = labels\n",
    "\n",
    "log.success(f\"Phase 2 complete: {len(df)} papers clustered into {CLUSTERS} groups\")\n",
    "print(f\"Cluster distribution: {dict(Counter(labels))}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "220a644d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 4.3 Cluster Labeling & Extractive Summarization ───────────────\n",
    "# Ref: §6.3, label_cluster + summarize_cluster (pp. 164–166)\n",
    "#\n",
    "# Label: top frequent terms from titles nearest to cluster centroid\n",
    "# Summary: most informative sentences from central abstracts\n",
    "\n",
    "def clean_tokens(s):\n",
    "    \"\"\"Tokenize and filter stopwords for cluster labeling.\"\"\"\n",
    "    s = s.lower()\n",
    "    s = re.sub(r\"[^a-z0-9\\s\\-]\", \" \", s)\n",
    "    toks = [t for t in s.split() if len(t) > 2]\n",
    "    stop = set(\n",
    "        \"the and for with from into using among toward towards on in of to via \"\n",
    "        \"model models data paper study approach results method methods novel new \"\n",
    "        \"large language based task tasks text query queries augment retrieval\".split()\n",
    "    )\n",
    "    return [t for t in toks if t not in stop]\n",
    "\n",
    "\n",
    "def label_cluster(c_idx, k=6, sample_n=6):\n",
    "    \"\"\"Generate a descriptive label from top terms near the cluster centroid.\"\"\"\n",
    "    centroid = kmeans.cluster_centers_[c_idx].reshape(1, -1)\n",
    "    sims = cosine_similarity(emb, centroid).ravel()\n",
    "    top_ids = sims.argsort()[-sample_n:]\n",
    "    words = []\n",
    "    for i in top_ids:\n",
    "        words.extend(clean_tokens(df.iloc[i][\"title\"]))\n",
    "    top = [w for w, _ in Counter(words).most_common(k)]\n",
    "    return \", \".join(top) if top else \"mixed theme\"\n",
    "\n",
    "\n",
    "def summarize_cluster(c_idx, sentences=3):\n",
    "    \"\"\"Extractive summary from abstracts closest to the cluster centroid.\"\"\"\n",
    "    centroid = kmeans.cluster_centers_[c_idx].reshape(1, -1)\n",
    "    sims = cosine_similarity(emb, centroid).ravel()\n",
    "    ids = np.argsort(sims)[-8:]\n",
    "    cand_sentences = []\n",
    "    for i in ids:\n",
    "        text = df.iloc[i][\"summary\"]\n",
    "        for s in re.split(r\"(?<=[.!?])\\s+\", text):\n",
    "            if 40 < len(s) < 300:\n",
    "                cand_sentences.append((s, i))\n",
    "    if not cand_sentences:\n",
    "        return \"Cluster summary not available.\"\n",
    "    s_emb_local = compute_embeddings([s for s, _ in cand_sentences])\n",
    "    if s_emb_local is None:\n",
    "        # Fallback\n",
    "        mock_e = MockEmbeddings()\n",
    "        s_emb_local = np.array(mock_e.embed_documents([s for s, _ in cand_sentences]))\n",
    "    s_sims = cosine_similarity(s_emb_local, centroid).ravel()\n",
    "    top_idx = np.argsort(s_sims)[-sentences:]\n",
    "    picked = [cand_sentences[i][0] for i in top_idx]\n",
    "    return \" \".join(picked)\n",
    "\n",
    "log.info(\"Generating cluster labels and extractive summaries...\")\n",
    "cluster_data = {}\n",
    "for c in sorted(df[\"cluster\"].unique()):\n",
    "    cluster_data[c] = {\n",
    "        \"label\": label_cluster(c),\n",
    "        \"summary\": summarize_cluster(c),\n",
    "    }\n",
    "log.success(\"Cluster labeling and summarization complete.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e448b89",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 4.4 Synthesis Report (Phase 3) ────────────────────────────────\n",
    "# Ref: §6.3, Phase 3 — Synthesis and Reporting (pp. 165–166)\n",
    "#\n",
    "# For each cluster: label, extractive summary, representative papers\n",
    "\n",
    "print(\"=\" * 70)\n",
    "print(f\"SCIENTIFIC RESEARCH SYNTHESIS REPORT\")\n",
    "print(f\"=\" * 70)\n",
    "print(f\"Query: {QUERY}\")\n",
    "print(f\"Papers retrieved: {len(df)} | Clusters: {CLUSTERS}\")\n",
    "print()\n",
    "\n",
    "for c in sorted(df[\"cluster\"].unique()):\n",
    "    info = cluster_data[c]\n",
    "    print(f\"{'─' * 70}\")\n",
    "    print(f\"CLUSTER {c}: {info['label']}\")\n",
    "    print(f\"{'─' * 70}\")\n",
    "    print(f\"Synthesis: {info['summary']}\")\n",
    "    print()\n",
    "\n",
    "    # Representative papers (most recent in cluster)\n",
    "    reps = df[df[\"cluster\"] == c].sort_values(\"published\", ascending=False).head(3)\n",
    "    for _, r in reps.iterrows():\n",
    "        print(f'  • {r[\"title\"]}')\n",
    "        print(f'    {r[\"authors\"]} | {r[\"published\"]}')\n",
    "        print(f'    {r[\"url\"]}')\n",
    "    print()\n",
    "\n",
    "log.success(\"Phase 3 complete — synthesis report generated.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd689539",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 4.5 Evidence Table ────────────────────────────────────────────\n",
    "# Ref: §6.3, \"Optional: simple evidence table\" (p. 166)\n",
    "\n",
    "print(\"=\" * 70)\n",
    "print(\"EVIDENCE TABLE (top 3 per cluster)\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "for c in sorted(df[\"cluster\"].unique()):\n",
    "    cluster_df = df[df[\"cluster\"] == c].sort_values(\"published\", ascending=False).head(3)\n",
    "    print(f\"\\n--- Cluster {c}: {cluster_data[c]['label']} ---\")\n",
    "    for _, r in cluster_df.iterrows():\n",
    "        print(f'  {r[\"published\"]} | {r[\"title\"][:65]}...' if len(r[\"title\"]) > 65 else f'  {r[\"published\"]} | {r[\"title\"]}')\n",
    "        print(f'           | {r[\"authors\"]}')\n",
    "\n",
    "print()\n",
    "log.success(\"Evidence table complete — Scientific Research Agent workflow finished.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16c54231",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 5. Knowledge Agent Spectrum (pp. 168–170)\n",
    "\n",
    "**Ref:** §Summary — The Knowledge Agent Spectrum (pp. 168–170, Table 6.1)\n",
    "\n",
    "The three agent types explored in this chapter form a **complete knowledge pipeline** — from discovery to decision-ready insights:\n",
    "\n",
    "| Agent Type | Primary Role | Key Capabilities | Typical Use Cases | Capability Level |\n",
    "|---|---|---|---|---|\n",
    "| **Knowledge Retrieval** (§6.1) | Connect LLMs to live, authoritative sources | RAG, structured + unstructured retrieval, provenance | Legal research, market intelligence, enterprise KB assistants | Level 2–3: Tool-using to early planning |\n",
    "| **Document Intelligence** (§6.2) | Convert unstructured documents into structured data | OCR, layout parsing, entity extraction, HITL validation | Healthcare claims, financial contracts, supply-chain docs | Level 2–3: Tool-using to early planning |\n",
    "| **Scientific Research** (§6.3) | Synthesize information across databases for discovery | Clustering, citation graph traversal, multi-hop reasoning, hypothesis generation | Drug discovery, climate science, policy analysis | Level 4: Learning agent with cross-domain synthesis |\n",
    "\n",
    "### Progressive Capability (pp. 168–169)\n",
    "\n",
    "Each agent type represents increasing sophistication in the **Agentic AI Progression Framework**:\n",
    "\n",
    "- **Knowledge Retrieval agents** operate primarily at **Level 2** (Tool-Using), orchestrating search APIs and vector databases to ground responses in evidence. Advanced implementations with multi-stage retrieval begin exhibiting **Level 3** (Planning) behaviors.\n",
    "\n",
    "- **Document Intelligence agents** similarly span **Level 2–3**, orchestrating OCR engines, layout parsers, and extraction models. Agents that dynamically re-plan based on document complexity demonstrate planning capabilities.\n",
    "\n",
    "- **Scientific Research agents** reach **Level 4** (Learning), capable of independent cross-domain synthesis, hypothesis generation, and continuous improvement from new publications.\n",
    "\n",
    "### Key Takeaways\n",
    "\n",
    "1. **Provenance is non-negotiable** — every generated answer must trace back to its source with citations, metadata, and confidence metrics\n",
    "2. **Chunking determines RAG quality** — the size-overlap trade-off is the most consequential configuration decision\n",
    "3. **Confidence scoring enables trust** — route high-confidence results automatically; flag low-confidence for human review\n",
    "4. **Schema-driven extraction** ensures structured, auditable outputs from unstructured documents\n",
    "5. **Multi-phase synthesis** (scan → cluster → synthesize) scales to large research corpora\n",
    "\n",
    "> **Future Directions** (p. 160): Key trends include self-improving agents that learn from human corrections, long-context multimodal transformers reasoning across entire documents, and autonomous enterprise workflows where Document Intelligence agents act as first-class participants in business processes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "589aff64",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ── 5.1 Summary — All Agents Complete ─────────────────────────────\n",
    "\n",
    "print(\"=\" * 65)\n",
    "print(\"  CHAPTER 6 — NOTEBOOK EXECUTION COMPLETE\")\n",
    "print(\"=\" * 65)\n",
    "print()\n",
    "\n",
    "summary = {\n",
    "    \"§6.1 Knowledge Retrieval Agent\": \"RAG pipeline with FAISS + provenance\",\n",
    "    \"§6.1 Chunking Deep Dive\": \"Fixed, recursive, semantic comparison\",\n",
    "    \"§6.2 Document Intelligence Agent\": \"OCR + schema extraction pipeline\",\n",
    "    \"§6.3 Scientific Research Agent\": \"Literature clustering + synthesis\",\n",
    "    \"§Summary Knowledge Spectrum\": \"Capability comparison (Table 6.1)\",\n",
    "}\n",
    "\n",
    "for section, description in summary.items():\n",
    "    log.success(f\"{section}: {description}\")\n",
    "\n",
    "print()\n",
    "mode_label = \"SIMULATION\" if SIMULATION_MODE else \"LIVE\"\n",
    "print(f\"  Execution mode: {mode_label}\")\n",
    "print(\"  All outputs are pedagogically equivalent in both modes.\")\n",
    "print()\n",
    "print(\"  Author: Imran Ahmad\")\n",
    "print(\"  Book: 30 Agents Every AI Engineer Must Build (Packt, 2026)\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}