{ "cells": [ { "cell_type": "markdown", "id": "a7ebf531", "metadata": {}, "source": [ "# Chapter 11: Multi-Modal Perception Agents\n", "\n", "**Book:** *30 Agents Every AI Engineer Must Build* \n", "**Author:** Imran Ahmad \n", "**Publisher:** Packt Publishing, 2026 \n", "**Chapter Pages:** 307–327\n", "\n", "> *\"Intelligence is the ability to achieve goals in a wide range of environments.\"* \n", "> — **Demis Hassabis**, co-founder and CEO of DeepMind\n", "\n", "---\n", "\n", "## Introduction\n", "\n", "The evolution of intelligent agents has been predominantly a story of language — from the earliest chatbots to today's sophisticated reasoning systems. Yet human intelligence does not operate within such constraints. We navigate the world through a continuous stream of sensory input, where visual cues inform spatial reasoning, auditory signals alert us to environmental changes, and tactile feedback guides physical interactions.\n", "\n", "**Multi-modal perception** represents a fundamental expansion in what autonomous agents can understand and accomplish. This chapter examines the architectural foundations and practical implementations of multi-modal perception in agent systems. Building upon the cognitive architecture introduced in Chapter 1 and the tool orchestration patterns explored in Chapter 7, we address how agents interpret and act upon information that arrives not as structured text, but as **pixel arrays**, **audio waveforms**, and **sensor readings**.\n", "\n", "This notebook implements the three multi-modal perception domains covered in Chapter 11:\n", "\n", "1. **Vision-Language Agents** — Pair a visual encoder with a large language model to reason jointly over images and natural language questions. Demonstrates Chain-of-Thought prompting for systematic visual analysis.\n", "\n", "2. **Audio Processing Agents** — Transcribe speech with mode-aware normalization (verbatim vs. 
clean), and analyze vocal emotion using the Valence-Arousal-Dominance (VAD) model through prosodic feature extraction.\n", "\n", "3. **Physical World Sensing Agents** — Fuse heterogeneous sensor streams (temperature, CO₂, occupancy) into coherent zone state, detect anomalies via pattern matching, and issue proportional control commands with deadband hysteresis.\n", "\n", "All three domains follow the **Sense → Model → Plan → Act** loop.\n", "\n", "### Key Concepts at a Glance\n", "\n", "| Concept | Domain | Description |\n", "|---------|--------|-------------|\n", "| **Modality Alignment** | Vision | Ensuring different data types (pixels, tokens) can be compared and combined |\n", "| **Grounding** | Vision | Anchoring abstract language in concrete sensory evidence |\n", "| **Chain-of-Thought (CoT)** | Vision/Audio | Step-by-step reasoning before committing to an answer |\n", "| **Prosody** | Audio | The rhythm, stress, and intonation of speech — the \"music\" of language |\n", "| **VAD Model** | Audio | Continuous 3D emotional representation: Valence, Arousal, Dominance |\n", "| **Digital Twin** | Physical | A coherent, real-time internal model of the physical environment |\n", "| **Proportional Control** | Physical | Corrective action proportional to the error (target − actual) |\n", "| **Deadband Hysteresis** | Physical | Buffer zone around set-points to prevent equipment short-cycling |\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0d77b679", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §0 — Environment Detection\n", "# Ref: Technical Requirements (p.307)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Detects GPU availability and Hugging Face token to determine\n", "# whether to run in Simulation Mode (mock backends) or Live Mode\n", "# (real model inference with transformers + torch).\n", "# 
============================================================\n", "\n", "import os\n", "import sys\n", "\n", "# --- Load .env if present (cascading fallback) ---\n", "try:\n", " from dotenv import load_dotenv\n", " load_dotenv()\n", "except ImportError:\n", " pass # python-dotenv not installed; fall through to env vars\n", "\n", "# --- Detect environment capabilities ---\n", "SIMULATION_REASONS = []\n", "\n", "# Check 1: CUDA GPU availability\n", "try:\n", " import torch\n", " HAS_CUDA = torch.cuda.is_available()\n", " if not HAS_CUDA:\n", " SIMULATION_REASONS.append(\"No CUDA GPU detected\")\n", "except ImportError:\n", " HAS_CUDA = False\n", " SIMULATION_REASONS.append(\"torch not installed\")\n", "\n", "# Check 2: Hugging Face token\n", "HF_TOKEN = os.environ.get(\"HUGGINGFACE_TOKEN\", \"\").strip()\n", "if not HF_TOKEN or HF_TOKEN == \"your_hugging_face_token_here\":\n", " HAS_TOKEN = False\n", " SIMULATION_REASONS.append(\"No valid HUGGINGFACE_TOKEN in environment\")\n", "else:\n", " HAS_TOKEN = True\n", "\n", "# --- Final mode decision ---\n", "SIMULATION_MODE = not (HAS_CUDA and HAS_TOKEN)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f90c8247", "metadata": {}, "outputs": [], "source": [ "# Multi-provider LLM support (OpenAI / Anthropic / Google Gemini)\n", "# Set LLM_PROVIDER in .env to choose: openai | anthropic | google | auto\n", "# Auto-detection uses the first available key.\n", "# See supporting/llm_provider.py for details.\n", "\n", "import sys, os\n", "sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath('.')), ''))\n", "sys.path.insert(0, '..')\n", "\n", "try:\n", " from supporting.llm_provider import detect_provider, get_llm, PROVIDER_MODELS, print_provider_banner\n", " _PROVIDER, _PROVIDER_KEY, _PROVIDER_MODE = detect_provider()\n", " print_provider_banner(_PROVIDER, _PROVIDER_MODE)\n", "except ImportError:\n", " print('[INFO] supporting/llm_provider.py not found — using default OpenAI path')\n", " _PROVIDER, 
_PROVIDER_KEY, _PROVIDER_MODE = 'openai', os.getenv('OPENAI_API_KEY'), 'LIVE' if os.getenv('OPENAI_API_KEY') else 'SIMULATION'\n" ] }, { "cell_type": "code", "execution_count": null, "id": "04c1bf9c", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §0 — Mode Banner\n", "# Ref: Technical Requirements (p.307)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "\n", "from agent_logger import AgentLogger\n", "\n", "if SIMULATION_MODE:\n", " reasons = \"; \".join(SIMULATION_REASONS) if SIMULATION_REASONS else \"Forced by configuration\"\n", " AgentLogger.info(f\"Simulation Mode active. Reasons: {reasons}\")\n", " AgentLogger.info(\n", " \"All agents will use mock backends from mock_backends.py. \"\n", " \"No GPU or API token required.\"\n", " )\n", "else:\n", " AgentLogger.success(\"Live Mode active. CUDA GPU and Hugging Face token detected.\")\n", " AgentLogger.info(\"Agents will use real model inference via transformers + torch.\")\n", "\n", "AgentLogger.info(f\"Python {sys.version.split()[0]} | CUDA: {HAS_CUDA} | HF Token: {HAS_TOKEN}\")\n" ] }, { "cell_type": "markdown", "id": "da19f027", "metadata": {}, "source": [ "---\n", "\n", "# Part 1: Vision-Language Agents\n", "\n", "> *Ref: Architecture of Vision-Language Agents (p.308-309), Building a Vision Question-Answering Agent (p.310-312)*\n", "\n", "Vision-Language agents pair a **visual encoder** (typically a Vision Transformer / ViT) with a **large language model** through an **alignment mechanism** that projects visual embeddings into the language model's token space. 
This three-component architecture enables the agent to reason jointly over image content and natural language inputs.\n", "\n", "The key architectural insight: unlike systems that process textual descriptions of images (where information loss is inevitable), Vision-Language agents ingest **raw visual data** alongside natural language instructions. This direct access to pixel-level information enables capabilities that would otherwise be impossible — counting partially occluded objects, recognizing subtle emotional expressions, or identifying spatial relationships that resist verbal description.\n", "\n", "### Figure 11.1 — Vision-Language Agent Architecture\n", "\n", "The diagram below shows the three foundational components working in concert:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3bfaba15", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# Figure 11.1 — Vision-Language Agent Architecture (SVG)\n", "# Ref: Architecture of Vision-Language Agents (p.308-309)\n", "# ============================================================\n", "\n", "from IPython.display import SVG, display, HTML\n", "\n", "vl_architecture_svg = \"\"\"\n", "\n", " \n", " \n", " \n", " \n", " \n", " Vision-Language Agent Architecture\n", " \n", " \n", " Image Input\n", " (pixel arrays)\n", " \n", " \n", " Text Input\n", " (natural language)\n", " \n", " \n", " Visual Encoder\n", " (ViT / SigLIP / DINOv2)\n", " \n", " \n", " Alignment\n", " Mechanism\n", " (token-space bridge)\n", " \n", " \n", " Large Language Model\n", " (with cross-modal attention)\n", " \n", " \n", " Output Generation\n", " (Text, Tool Calls, Action Plans)\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\"\"\"\n", "display(SVG(vl_architecture_svg))\n" ] }, { "cell_type": "markdown", "id": "68002b55", "metadata": {}, "source": [ "> **📝 Note — SigLIP vs. 
DINOv2 (p.309)** \n", "> **SigLIP** is generally preferred for tasks requiring language-aligned retrieval, as it is trained with a contrastive image-text objective that tightly couples visual and linguistic representations. \n", "> **DINOv2**, by contrast, uses self-supervised learning without text supervision, producing spatially rich features that excel in dense prediction tasks such as depth estimation and semantic segmentation. \n", "> The choice of encoder is therefore the first and most consequential architectural decision in a Vision-Language system.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6027e7e1", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.1 — Generate Test Image (Programmatic)\n", "# Ref: Building a Vision Question-Answering Agent (p.310)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Creates a synthetic workspace image for the Vision agent demos.\n", "# No external downloads required.\n", "# ============================================================\n", "\n", "import numpy as np\n", "from PIL import Image, ImageDraw, ImageFont\n", "import os\n", "\n", "def create_test_workspace_image(path: str = \"assets/sample_workspace.png\") -> Image.Image:\n", " \"\"\"\n", " Generate a synthetic workspace image with identifiable objects\n", " for the VisionQuestionAnsweringAgent demos.\n", "\n", " Objects drawn: desk surface, laptop, coffee cup, papers, desk lamp,\n", " two stylized person silhouettes (one partially occluded).\n", "\n", " Ref: Building a Vision Question-Answering Agent (p.310-311)\n", " \"\"\"\n", " os.makedirs(os.path.dirname(path), exist_ok=True)\n", " img = Image.new(\"RGB\", (640, 480), color=(245, 240, 230)) # Warm beige background\n", " draw = ImageDraw.Draw(img)\n", "\n", " # Desk surface (brown rectangle)\n", " draw.rectangle([50, 250, 590, 460], fill=(139, 90, 43), outline=(100, 65, 30), 
width=2)\n", "\n", " # Laptop (gray rectangle with blue screen)\n", " draw.rectangle([220, 180, 420, 320], fill=(80, 80, 85), outline=(60, 60, 65), width=2)\n", " draw.rectangle([235, 190, 405, 280], fill=(30, 100, 180)) # Screen\n", "\n", " # Coffee cup (right side, on papers — \"precariously balanced\")\n", " # Papers first\n", " draw.rectangle([440, 260, 530, 310], fill=(255, 255, 255), outline=(180, 180, 180))\n", " draw.rectangle([445, 265, 535, 315], fill=(255, 255, 250), outline=(180, 180, 180))\n", " # Cup\n", " draw.ellipse([460, 240, 510, 270], fill=(200, 50, 50), outline=(150, 30, 30))\n", " draw.rectangle([465, 255, 505, 290], fill=(200, 50, 50), outline=(150, 30, 30))\n", "\n", " # Desk lamp (upper left)\n", " draw.rectangle([80, 160, 95, 260], fill=(60, 60, 60)) # Pole\n", " draw.polygon([(60, 140), (115, 140), (95, 170), (80, 170)], fill=(255, 220, 50)) # Shade\n", "\n", " # Person 1 (seated, center-left)\n", " draw.ellipse([160, 100, 200, 140], fill=(210, 180, 140)) # Head\n", " draw.rectangle([165, 140, 195, 200], fill=(50, 100, 150)) # Torso\n", "\n", " # Person 2 (partially occluded by bookshelf, right background)\n", " # Bookshelf\n", " draw.rectangle([520, 60, 600, 250], fill=(120, 70, 40), outline=(90, 50, 30), width=2)\n", " draw.rectangle([525, 70, 595, 110], fill=(100, 60, 35)) # Shelf\n", " draw.rectangle([525, 120, 595, 160], fill=(100, 60, 35)) # Shelf\n", " # Person behind bookshelf (partially visible)\n", " draw.ellipse([500, 80, 530, 110], fill=(190, 160, 130)) # Head\n", " draw.rectangle([505, 110, 525, 160], fill=(80, 130, 80)) # Torso (partially hidden)\n", "\n", " # Window light indicator (left side, yellowish glow)\n", " draw.rectangle([0, 0, 50, 480], fill=(255, 250, 220))\n", "\n", " # Label for educational clarity\n", " try:\n", " font = ImageFont.truetype(\"/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf\", 12)\n", " except (OSError, IOError):\n", " font = ImageFont.load_default()\n", " draw.text((55, 465), \"Synthetic test 
image — Chapter 11\", fill=(120, 120, 120), font=font)\n", "\n", " img.save(path)\n", " AgentLogger.success(f\"Test image saved to {path} ({img.size[0]}x{img.size[1]})\")\n", " return img\n", "\n", "test_image = create_test_workspace_image()\n", "test_image\n" ] }, { "cell_type": "code", "execution_count": null, "id": "193b83d2", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.2 — VisionQuestionAnsweringAgent\n", "# Ref: Building a Vision Question-Answering Agent (p.310-312)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Implements the VQA agent with Chain-of-Thought prompting.\n", "# Uses MockVLM + MockProcessor in Simulation Mode, or the real\n", "# LLaVA 1.5 pipeline in Live Mode.\n", "#\n", "# Architecture (Ref: Architecture of Vision-Language Agents, p.308-309):\n", "# Image Input → Visual Encoder (ViT) → Alignment Mechanism →\n", "# Large Language Model (with cross-modal attention) → Output\n", "# ============================================================\n", "\n", "from agent_logger import AgentLogger, graceful_fallback\n", "\n", "class VisionQuestionAnsweringAgent:\n", " \"\"\"\n", " A vision-language agent capable of answering questions about images.\n", " Implements Chain-of-Thought reasoning for improved accuracy.\n", "\n", " In Simulation Mode, uses MockProcessor and MockVLM from\n", " mock_backends.py. In Live Mode, uses the real LLaVA 1.5\n", " pipeline from Hugging Face transformers.\n", "\n", " Ref: Building a Vision Question-Answering Agent (p.310-312)\n", " \"\"\"\n", "\n", " def __init__(self, model_id: str = \"llava-hf/llava-1.5-7b-hf\"):\n", " \"\"\"\n", " Initialize the agent with a pre-trained Vision-Language Model.\n", "\n", " Args:\n", " model_id: HuggingFace model identifier. 
LLaVA 1.5 provides\n", " a good balance of capability and resource requirements.\n", "\n", " Ref: Initialization and Model Loading (p.310-311)\n", " \"\"\"\n", " AgentLogger.info(f\"Initializing Vision-Language Agent (model: {model_id})...\")\n", "\n", " if SIMULATION_MODE:\n", " from mock_backends import MockProcessor, MockVLM\n", " self.processor = MockProcessor.from_pretrained(model_id)\n", " self.model = MockVLM.from_pretrained(model_id)\n", " AgentLogger.success(\"Agent ready (Simulation Mode)\")\n", " else:\n", " import torch\n", " from transformers import AutoProcessor, LlavaForConditionalGeneration\n", "\n", " self.processor = AutoProcessor.from_pretrained(model_id)\n", " # Load with mixed precision (float16) for efficiency.\n", " # device_map=\"auto\" distributes across available GPUs and\n", " # falls back to CPU when necessary.\n", " # Ref: Initialization and Model Loading (p.310-311)\n", " self.model = LlavaForConditionalGeneration.from_pretrained(\n", " model_id,\n", " torch_dtype=torch.float16,\n", " low_cpu_mem_usage=True,\n", " device_map=\"auto\",\n", " )\n", " AgentLogger.success(\"Agent ready (Live Mode)\")\n", "\n", " self.conversation_history = []\n", "\n", " @graceful_fallback(max_retries=2, base_delay=0.5)\n", " def answer_question(\n", " self,\n", " image: \"Image.Image\",\n", " question: str,\n", " use_chain_of_thought: bool = True,\n", " ) -> dict:\n", " \"\"\"\n", " Answer a question about the provided image.\n", "\n", " Args:\n", " image: PIL Image object to analyze.\n", " question: Natural language question about the image.\n", " use_chain_of_thought: Whether to use step-by-step reasoning.\n", "\n", " Returns:\n", " Dictionary containing 'answer' and optionally 'reasoning'.\n", "\n", " Ref: Building a Vision Question-Answering Agent (p.310-312)\n", " \"\"\"\n", " AgentLogger.info(f\"answer_question called: '{question[:60]}...'\")\n", "\n", " # Validate input\n", " if image is None:\n", " raise TypeError(\"NoneType image received — 
cannot process a null image input\")\n", "\n", " # Construct prompt based on reasoning strategy.\n", " # LLaVA 1.5 expects an <image> placeholder marking where the\n", " # visual tokens are spliced into the prompt.\n", " if use_chain_of_thought:\n", " cot_instruction = self._build_cot_prompt(question)\n", " prompt = f\"USER: <image>\\n{cot_instruction}\\nASSISTANT:\"\n", " else:\n", " prompt = f\"USER: <image>\\n{question}\\nASSISTANT:\"\n", "\n", " # Process inputs: tokenize text and preprocess image\n", " inputs = self.processor(\n", " text=prompt,\n", " images=image,\n", " return_tensors=\"pt\",\n", " ).to(self.model.device)\n", "\n", " # Generate response with deterministic decoding\n", " if SIMULATION_MODE:\n", " outputs = self.model.generate(**inputs, max_new_tokens=512)\n", " else:\n", " import torch\n", " with torch.no_grad():\n", " outputs = self.model.generate(\n", " **inputs,\n", " max_new_tokens=512,\n", " do_sample=False, # Greedy decoding for reproducibility\n", " )\n", "\n", " # Decode and parse\n", " full_response = self.processor.decode(outputs[0], skip_special_tokens=True)\n", " response_text = full_response.split(\"ASSISTANT:\")[-1].strip()\n", " result = self._parse_response(response_text, use_chain_of_thought)\n", "\n", " AgentLogger.success(\n", " f\"answer_question completed. \"\n", " f\"{'CoT reasoning extracted.' if 'reasoning' in result else 'Direct answer.'}\"\n", " )\n", " return result\n", "\n", " def _build_cot_prompt(self, question: str) -> str:\n", " \"\"\"\n", " Construct a Chain-of-Thought prompt for visual reasoning.\n", "\n", " The structured format guides the model through systematic\n", " analysis rather than pattern-matching to likely answers.\n", "\n", " Ref: Chain-of-Thought prompting pattern (p.315-316)\n", " \"\"\"\n", " return (\n", " f'Analyze the image carefully and answer the following question: '\n", " f'\"{question}\"\\n\\n'\n", " f'Please think step by step:\\n'\n", " f'1. First, identify the relevant objects or features visible in the image.\\n'\n", " f'2. Then, examine the specific details needed to answer the question.\\n'\n", " f'3. 
Finally, provide your answer based on the visual evidence.\\n\\n'\n", " f'Format your response as:\\n'\n", " f'Reasoning: [Your step-by-step analysis]\\n'\n", " f'Therefore, the answer is: [Your final answer]'\n", " )\n", "\n", " def _parse_response(self, response: str, has_reasoning: bool) -> dict:\n", " \"\"\"\n", " Extract structured output from model response.\n", "\n", " When CoT is enabled, extracts the reasoning trace separately\n", " from the final answer. This separation enables explainability\n", " and supports debugging.\n", "\n", " Ref: Parsing and Structured Output (p.316-317)\n", " \"\"\"\n", " if has_reasoning and \"Therefore, the answer is:\" in response:\n", " parts = response.split(\"Therefore, the answer is:\")\n", " return {\n", " \"reasoning\": parts[0].replace(\"Reasoning:\", \"\").strip(),\n", " \"answer\": parts[1].strip().rstrip(\".\"),\n", " }\n", " return {\"answer\": response.strip()}\n", "\n", "AgentLogger.success(\"VisionQuestionAnsweringAgent class defined.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0ddaf384", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.3 — Initialize the Vision-Language Agent\n", "# Ref: Building a Vision Question-Answering Agent (p.310-311)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "\n", "vqa_agent = VisionQuestionAnsweringAgent()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "008a0881", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.4 — Demo: Describe Workspace\n", "# Ref: Building a Vision Question-Answering Agent (p.310-312)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Ask the agent to describe the synthetic workspace image.\n", "# Tests the CoT reasoning pipeline end-to-end.\n", "# 
============================================================\n", "\n", "result_describe = vqa_agent.answer_question(\n", " image=test_image,\n", " question=\"Describe this workspace in detail.\",\n", " use_chain_of_thought=True,\n", ")\n", "\n", "print(\"\\n--- Vision Agent: Describe Workspace ---\")\n", "if \"reasoning\" in result_describe:\n", " print(f\"Reasoning: {result_describe['reasoning'][:200]}...\")\n", "print(f\"Answer: {result_describe['answer']}\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "167005ad", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.5 — Demo: Count People\n", "# Ref: Building a Vision Question-Answering Agent (p.310-312)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Count partially occluded objects — a task that\n", "# demonstrates why direct visual access matters versus textual\n", "# descriptions (Ref: Architecture of VL Agents, p.309).\n", "# ============================================================\n", "\n", "result_count = vqa_agent.answer_question(\n", " image=test_image,\n", " question=\"How many people are visible in this image? 
Count carefully.\",\n", " use_chain_of_thought=True,\n", ")\n", "\n", "print(\"\\n--- Vision Agent: Count People ---\")\n", "if \"reasoning\" in result_count:\n", " print(f\"Reasoning: {result_count['reasoning'][:200]}...\")\n", "print(f\"Answer: {result_count['answer']}\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "05f6ab92", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.6 — Demo: Spatial Relationships\n", "# Ref: Integration Patterns and Production Considerations (p.311-312)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Analyze spatial layout — demonstrates cross-modal\n", "# attention grounding linguistic reasoning in visual evidence.\n", "# ============================================================\n", "\n", "result_spatial = vqa_agent.answer_question(\n", " image=test_image,\n", " question=\"Describe the spatial relationship between the laptop, coffee cup, and desk lamp.\",\n", " use_chain_of_thought=True,\n", ")\n", "\n", "print(\"\\n--- Vision Agent: Spatial Relationships ---\")\n", "if \"reasoning\" in result_spatial:\n", " print(f\"Reasoning: {result_spatial['reasoning'][:200]}...\")\n", "print(f\"Answer: {result_spatial['answer']}\")\n" ] }, { "cell_type": "markdown", "id": "a65531fd", "metadata": {}, "source": [ "### Integration Patterns and Production Considerations\n", "\n", "> *Ref: Integration Patterns and Production Considerations (p.311-312)*\n", "\n", "The `VisionQuestionAnsweringAgent` above uses a **direct integration** pattern where the model is loaded locally. In production, three architectural patterns are common:\n", "\n", "- **Adapter-based integration:** Lightweight projection layers between frozen encoder and frozen LLM. Less than 1% of parameters trained. 
Efficient but limited on out-of-distribution visual concepts.\n", "- **Cross-attention integration:** Dedicated attention layers allow language tokens to attend specifically to visual encoder outputs. Models like **Flamingo** exemplify this pattern, demonstrating strong few-shot visual learning capabilities.\n", "- **Early fusion:** Concatenates visual and textual tokens into a single unified sequence. Maximizes cross-modal reasoning potential but increases sequence length and computational overhead.\n", "\n", "> **📝 Note — Latency Management Techniques (p.312)** \n", "> When deploying Vision-Language agents, latency is a critical bottleneck. Practical techniques include: \n", "> - **Image resolution scaling:** Reducing from 1024×1024 to 336×336 can cut visual tokens by roughly 89%, since token count scales with pixel area at a fixed patch size \n", "> - **Patch pruning:** Dynamically removing uninformative patches (uniform regions, backgrounds) \n", "> - **Speculative decoding:** A smaller draft model proposes tokens that the larger model then verifies \n", "> - **Caching:** Storing visual embeddings for frequently accessed images\n", "\n", "> **📝 Note — Accuracy Validation (p.312)** \n", "> Robust production systems implement accuracy validation, cross-checking agent outputs against deterministic computer vision models. If a VL agent claims an image contains three people, a dedicated object detector can verify this count. 
Such validation layers are essential for mitigating hallucination risks.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7bd603cf", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §1.7 — Error Demo: Graceful Failure on Invalid Input\n", "# Ref: Building a Vision Question-Answering Agent (p.310-312)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Demonstrates the @graceful_fallback decorator in action.\n", "# Passing None as the image triggers a TypeError, which the\n", "# decorator catches, logs in RED, retries, and returns a\n", "# structured fallback dictionary.\n", "# ============================================================\n", "\n", "AgentLogger.info(\"Demonstrating error handling with None image input...\")\n", "\n", "result_error = vqa_agent.answer_question(\n", " image=None,\n", " question=\"Describe this image.\",\n", " use_chain_of_thought=True,\n", ")\n", "\n", "print(\"\\n--- Vision Agent: Error Demo ---\")\n", "print(f\"Error response: {result_error}\")\n", "assert result_error.get(\"error\") is True, \"Expected fallback error response\"\n", "AgentLogger.success(\"Error handling demo complete — @graceful_fallback worked as expected.\")\n" ] }, { "cell_type": "markdown", "id": "6a5c28ac", "metadata": {}, "source": [ "---\n", "\n", "# Part 2: Audio Processing Agents\n", "\n", "> *Ref: Architecture of Audio Processing Agents (p.312-313), Building a Speech Recognition Agent (p.316-319), Voice Sentiment Analysis (p.319-320)*\n", "\n", "Sound occupies a dimension of experience that vision cannot capture. While images freeze moments in static frames, audio unfolds continuously through time, carrying information encoded in pitch, rhythm, timbre, and the subtle interplay of overlapping signals. 
Human speech conveys not merely words, but emotion, emphasis, and social context through **prosodic features** (the \"music\" of speech) that text transcriptions inevitably discard.\n", "\n", "Audio Processing agents extend perception into the temporal, layered acoustic domain. Unlike vision, which allows parallel processing of a scene, audio demands architectures that capture **temporal dependencies** and separate overlapping sources from background noise.\n", "\n", "### Figure 11.2 — Audio Processing Agent Architecture\n", "\n", "The pipeline begins with Audio Encoding via the **Short-Time Fourier Transform (STFT)**, followed by parallel processing paths:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0ff98bd0", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# Figure 11.2 — Audio Processing Agent Architecture (SVG)\n", "# Ref: Architecture of Audio Processing Agents (p.312-313)\n", "# ============================================================\n", "\n", "from IPython.display import SVG, display\n", "\n", "audio_architecture_svg = \"\"\"\n", "\n", " \n", " \n", " \n", " \n", " Audio Processing Agent\n", " \n", " \n", " Audio Input\n", " (Waveform)\n", " \n", " Feature Extraction\n", " (STFT / Spectrogram)\n", " \n", " Audio Encoder\n", " (Whisper / Wav2Vec)\n", " \n", " \n", " \n", " \n", " \n", " Speech Recognition\n", " (Transcription)\n", " \n", " Speaker Analysis\n", " (Diarization)\n", " \n", " Emotion / Tone\n", " (Detection)\n", " \n", " \n", " \n", " \n", " \n", " \n", " Large Language Model\n", " (Reasoning & Response Generation)\n", " \n", " \n", " \n", " \n", " \n", " \n", " Output Generation\n", " (Text, Actions, Tool Calls)\n", " \n", "\n", "\"\"\"\n", "display(SVG(audio_architecture_svg))\n" ] }, { "cell_type": "markdown", "id": "f6c55be8", "metadata": {}, "source": [ "> **📝 Note — Whisper Encoder Architecture (p.313)** \n", "> Leading architectures like OpenAI's 
Whisper employ a two-stage process: convolutional layers first downsample the input spectrogram to reduce dimensionality, followed by transformer layers that process the sequence using self-attention over the full temporal context. This allows the model to capture long-range dependencies, such as intonation patterns that span an entire sentence.\n", "\n", "The audio pipeline follows the same **Sense → Model → Plan → Act** loop:\n", "1. **Audio Encoding** — Raw waveforms → spectrogram → encoder embeddings \n", "2. **Feature Analysis** — Parallel paths: transcription, diarization, emotion detection \n", "3. **Reasoning** — LLM integrates acoustic evidence with textual context \n", "4. **Action** — Structured output generation\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8751eed0", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.1 — Audio Data Structures\n", "# Ref: Building a Speech Recognition Agent (p.317-318)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# TranscriptionMode captures a critical design decision: legal\n", "# transcription demands verbatim accuracy including every\n", "# disfluency, while meeting notes benefit from cleaned output.\n", "# ============================================================\n", "\n", "from dataclasses import dataclass\n", "from enum import Enum\n", "from typing import List, Dict, Any, Optional\n", "import re\n", "\n", "class TranscriptionMode(Enum):\n", " \"\"\"\n", " Controls text normalization strategy for transcription output.\n", "\n", " - VERBATIM: Preserves all fillers (um, uh) — required for legal transcription\n", " - CLEAN: Removes disfluencies — preferred for meeting notes\n", " - NORMALIZED: Standardizes dates/numbers — useful for data extraction\n", "\n", " Ref: Building a Speech Recognition Agent (p.317)\n", " \"\"\"\n", " VERBATIM = \"verbatim\"\n", " CLEAN = \"clean\"\n", " 
NORMALIZED = \"normalized\"\n", "\n", "\n", "@dataclass\n", "class TranscriptionSegment:\n", " \"\"\"\n", " A segment of transcribed speech with temporal metadata.\n", "\n", " Ref: Building a Speech Recognition Agent (p.317-318)\n", " \"\"\"\n", " text: str\n", " start_time: float\n", " end_time: float\n", " confidence: float\n", "\n", " @property\n", " def duration(self) -> float:\n", " return self.end_time - self.start_time\n", "\n", "\n", "@dataclass\n", "class TranscriptionResult:\n", " \"\"\"\n", " Complete processing result including metadata.\n", "\n", " Ref: Building a Speech Recognition Agent (p.318)\n", " \"\"\"\n", " segments: List[TranscriptionSegment]\n", " full_text: str\n", " language: str\n", " metadata: Dict[str, Any]\n", "\n", "\n", "AgentLogger.success(\"Audio data structures defined (TranscriptionMode, TranscriptionSegment, TranscriptionResult).\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dc83349a", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.2 — SpeechRecognitionAgent\n", "# Ref: Building a Speech Recognition Agent (p.318-319)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Orchestrates the Sense-Model-Plan-Act loop for audio:\n", "# 1. Sense: RMS normalization ensures consistent input levels\n", "# 2. Model: Execute ASR backend (Whisper or mock)\n", "# 3. Plan: Apply normalization strategy based on TranscriptionMode\n", "# 4. 
Act: Return structured TranscriptionResult\n", "# ============================================================\n", "\n", "class SpeechRecognitionAgent:\n", " \"\"\"\n", " Orchestrates the Sense-Model-Plan-Act loop for audio transcription.\n", "\n", " Supports three transcription modes (verbatim, clean, normalized)\n", " and produces structured output with temporal metadata and\n", " confidence scores.\n", "\n", " Ref: Building a Speech Recognition Agent (p.318-319)\n", " \"\"\"\n", "\n", " def __init__(\n", " self,\n", " backend,\n", " default_mode: TranscriptionMode = TranscriptionMode.CLEAN,\n", " ):\n", " \"\"\"\n", " Initialize with an ASR backend.\n", "\n", " Args:\n", " backend: ASR backend with a .transcribe(audio) method\n", " returning (full_text, segments, language).\n", " default_mode: Default transcription normalization strategy.\n", " \"\"\"\n", " self.backend = backend\n", " self.default_mode = default_mode\n", " AgentLogger.info(f\"SpeechRecognitionAgent initialized (default mode: {default_mode.value})\")\n", "\n", " @graceful_fallback(max_retries=2, base_delay=0.5)\n", " def transcribe_audio(\n", " self,\n", " audio: np.ndarray,\n", " mode: Optional[TranscriptionMode] = None,\n", " scenario_key: Optional[str] = None,\n", " ) -> TranscriptionResult:\n", " \"\"\"\n", " Transcribe audio with mode-aware normalization.\n", "\n", " Implements the Sense-Model-Plan-Act loop:\n", " 1. Sense: RMS normalization for consistent input levels\n", " 2. Model: Backend ASR transcription\n", " 3. Plan: Apply mode-specific text normalization\n", " 4. 
Act: Return structured TranscriptionResult\n", "\n", " Args:\n", " audio: NumPy array containing audio waveform.\n", " mode: Override the default transcription mode.\n", " scenario_key: Mock scenario key (Simulation Mode only).\n", "\n", " Returns:\n", " TranscriptionResult with segments, full text, and metadata.\n", "\n", " Ref: Building a Speech Recognition Agent (p.318-319)\n", " \"\"\"\n", " mode = mode or self.default_mode\n", " AgentLogger.info(f\"transcribe_audio called (mode: {mode.value})\")\n", "\n", " # 1. Sense: RMS normalization ensures consistent input levels\n", " # Ref: Feature Extraction and Normalization (p.318)\n", " rms = np.sqrt(np.mean(audio ** 2))\n", " audio_normalized = audio * (0.1 / max(rms, 1e-10))\n", "\n", " # 2. Model: Execute ASR backend (Whisper or MockWhisperBackend)\n", " if scenario_key and hasattr(self.backend, 'transcribe'):\n", " full_text, raw_segments, language = self.backend.transcribe(\n", " audio_normalized, scenario_key=scenario_key\n", " )\n", " else:\n", " full_text, raw_segments, language = self.backend.transcribe(audio_normalized)\n", "\n", " # 3. Plan: Apply normalization strategy based on mode\n", " segments = []\n", " for seg in raw_segments:\n", " text = seg[\"text\"]\n", "\n", " if mode == TranscriptionMode.CLEAN:\n", " # Remove fillers like \"um\", \"uh\", \"er\", \"mm\", together with a\n", " # comma or period attached to them, so no stray punctuation remains\n", " # Ref: Building a Speech Recognition Agent (p.319)\n", " text = re.sub(r'\\b(um|uh|er|mm)\\b[,.]?', '', text, flags=re.IGNORECASE)\n", " text = re.sub(r'\\s+', ' ', text).strip()\n", "\n", " if text:\n", " segments.append(TranscriptionSegment(\n", " text=text,\n", " start_time=seg.get(\"start\", 0.0),\n", " end_time=seg.get(\"end\", 0.0),\n", " confidence=seg.get(\"confidence\", 1.0),\n", " ))\n", "\n", " # 4. 
Act: Return structured result\n", " result = TranscriptionResult(\n", " segments=segments,\n", " full_text=\" \".join(s.text for s in segments),\n", " language=language,\n", " metadata={\"mode\": mode.value, \"segment_count\": len(segments)},\n", " )\n", "\n", " AgentLogger.success(\n", " f\"transcribe_audio completed. {len(segments)} segments, \"\n", " f\"mode={mode.value}\"\n", " + (\", fillers removed.\" if mode == TranscriptionMode.CLEAN else\n", " \", fillers preserved.\" if mode == TranscriptionMode.VERBATIM else \".\")\n", " )\n", " return result\n", "\n", "\n", "AgentLogger.success(\"SpeechRecognitionAgent class defined.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "81d9ef66", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.3 — Initialize Audio Agent with Synthetic Waveform\n", "# Ref: Building a Speech Recognition Agent (p.316-319)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Creates a synthetic audio array (sine wave) as placeholder\n", "# input. 
In Simulation Mode, the MockWhisperBackend ignores\n", "# the actual audio content and returns scenario-keyed responses.\n", "# ============================================================\n", "\n", "# Generate synthetic audio waveform (440Hz sine wave, 10 seconds)\n", "SAMPLE_RATE = 16000\n", "DURATION = 10.0\n", "t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)\n", "synthetic_audio = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)\n", "\n", "AgentLogger.info(f\"Synthetic audio generated: {len(synthetic_audio)} samples, {DURATION}s at {SAMPLE_RATE}Hz\")\n", "\n", "# Initialize backend\n", "if SIMULATION_MODE:\n", " from mock_backends import MockWhisperBackend\n", " whisper_backend = MockWhisperBackend()\n", " AgentLogger.success(\"MockWhisperBackend loaded (Simulation Mode)\")\n", "else:\n", " # In Live Mode, this would be a real Whisper pipeline\n", " # backend = WhisperBackend(model_size=\"base\")\n", " AgentLogger.info(\"Live Mode: Real Whisper backend would be initialized here\")\n", " whisper_backend = None # Placeholder\n", "\n", "speech_agent = SpeechRecognitionAgent(backend=whisper_backend)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0733aa68", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.4 — Demo: CLEAN Mode Transcription (Customer Complaint)\n", "# Ref: Building a Speech Recognition Agent (p.318-319)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Customer complaint with fillers (\"um\", \"uh\").\n", "# CLEAN mode removes disfluencies for readability.\n", "# Strategy §6 row 7: 4 segments, fillers removed.\n", "# ============================================================\n", "\n", "result_clean = speech_agent.transcribe_audio(\n", " audio=synthetic_audio,\n", " mode=TranscriptionMode.CLEAN,\n", " scenario_key=\"customer_complaint\",\n", ")\n", "\n", "print(\"\\n--- Audio 
Agent: CLEAN Mode (Customer Complaint) ---\")\n", "print(f\"Full text: {result_clean.full_text}\")\n", "print(f\"Segments: {len(result_clean.segments)}\")\n", "for i, seg in enumerate(result_clean.segments):\n", " print(f\" [{seg.start_time:.1f}s - {seg.end_time:.1f}s] \"\n", " f\"(conf: {seg.confidence:.2f}) {seg.text}\")\n", "print(f\"Metadata: {result_clean.metadata}\")\n", "\n", "# Verify fillers were removed (whole-word match, so \"um,\" is caught too)\n", "assert not re.search(r'\\b(um|uh)\\b', result_clean.full_text, flags=re.IGNORECASE), \\\n", " \"CLEAN mode should remove 'um'/'uh' fillers\"\n", "AgentLogger.success(\"Verified: fillers removed in CLEAN mode.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "21cc28dc", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.5 — Demo: VERBATIM Mode Transcription (Meeting Notes)\n", "# Ref: Building a Speech Recognition Agent (p.318-319)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Meeting notes with fillers preserved for accuracy.\n", "# VERBATIM mode keeps all disfluencies intact.\n", "# Strategy §6 row 8: fillers preserved.\n", "# ============================================================\n", "\n", "result_verbatim = speech_agent.transcribe_audio(\n", " audio=synthetic_audio,\n", " mode=TranscriptionMode.VERBATIM,\n", " scenario_key=\"meeting_notes\",\n", ")\n", "\n", "print(\"\\n--- Audio Agent: VERBATIM Mode (Meeting Notes) ---\")\n", "print(f\"Full text: {result_verbatim.full_text}\")\n", "print(f\"Segments: {len(result_verbatim.segments)}\")\n", "for i, seg in enumerate(result_verbatim.segments):\n", " print(f\" [{seg.start_time:.1f}s - {seg.end_time:.1f}s] \"\n", " f\"(conf: {seg.confidence:.2f}) {seg.text}\")\n", "print(f\"Metadata: {result_verbatim.metadata}\")\n", "\n", "# Verify fillers were preserved. Match whole words only, so that a word\n", "# such as \"summary\" is not mistaken for the filler \"um\".\n", "has_fillers = bool(\n", " re.search(r'\\b(um|uh)\\b', result_verbatim.full_text, flags=re.IGNORECASE)\n", ")\n", "assert 
has_fillers, \"VERBATIM mode should preserve fillers\"\n", "AgentLogger.success(\"Verified: fillers preserved in VERBATIM mode.\")\n" ] }, { "cell_type": "markdown", "id": "7c22be8e", "metadata": {}, "source": [ "### Voice Sentiment Analysis\n", "\n", "> *Ref: Voice Sentiment Analysis (p.319-320)*\n", "\n", "Transcription captures *what* was said, but not *how* it was said. To perceive emotion, agents must analyze **prosody** — the rhythm, stress, and intonation of speech.\n", "\n", "We use the **Valence-Arousal-Dominance (VAD)** model, which provides a continuous 3D representation:\n", "\n", "- **Valence:** Positive vs. negative affect\n", "- **Arousal:** Activation level (calm vs. excited)\n", "- **Dominance:** Sense of control (submissive vs. dominant)\n", "\n", "> **📝 Note — Prosodic Correlates of Emotion (p.319)** \n", "> Acoustic analyses have established that specific acoustic features strongly correlate with emotional expression in speech. For example, \"happy\" speech typically exhibits higher pitch, greater pitch variability, and a faster speaking rate, whereas \"sad\" speech shows the opposite pattern. 
This is why the `VoiceSentimentAgent` maps normalized pitch and speaking rate to emotion profiles.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d4a0c8f6", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.6 — VoiceSentimentAgent\n", "# Ref: Voice Sentiment Analysis (p.319-320)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Analyzes emotional content via acoustic correlates.\n", "# Uses the Valence-Arousal-Dominance (VAD) model with prosodic\n", "# feature extraction and heuristic emotion profile matching.\n", "#\n", "# Normalization formulas (Ref: p.319):\n", "# norm_pitch = clip((pitch_mean - 100) / 200, 0, 1)\n", "# norm_rate = clip(speaking_rate / 8, 0, 1)\n", "#\n", "# Emotion profiles (Ref: p.319):\n", "# happy: pitch=0.7, rate=0.7\n", "# sad: pitch=0.3, rate=0.3\n", "# angry: pitch=0.8, rate=0.8\n", "# neutral: pitch=0.5, rate=0.5\n", "# ============================================================\n", "\n", "@dataclass\n", "class ProsodicFeatures:\n", " \"\"\"\n", " Acoustic features correlated with emotional expression.\n", "\n", " Ref: Prosodic Feature Extraction (p.319)\n", " \"\"\"\n", " pitch_mean: float\n", " pitch_variability: float\n", " intensity_mean: float\n", " speaking_rate: float\n", "\n", "\n", "class VoiceSentimentAgent:\n", " \"\"\"\n", " Analyzes emotional content via acoustic correlates.\n", "\n", " Maps prosodic features to the Valence-Arousal-Dominance (VAD)\n", " model using heuristic profile matching. 
In production, this\n", " would be replaced by a trained classifier.\n", "\n", " Ref: Voice Sentiment Analysis (p.319-320)\n", " \"\"\"\n", "\n", " # Emotion profiles: heuristic mapping from normalized features\n", " # Ref: Voice Sentiment Analysis (p.319)\n", " EMOTION_PROFILES = {\n", " \"happy\": {\"pitch\": 0.7, \"rate\": 0.7},\n", " \"sad\": {\"pitch\": 0.3, \"rate\": 0.3},\n", " \"angry\": {\"pitch\": 0.8, \"rate\": 0.8},\n", " \"neutral\": {\"pitch\": 0.5, \"rate\": 0.5},\n", " }\n", "\n", " def __init__(self):\n", " AgentLogger.info(\"VoiceSentimentAgent initialized (VAD model)\")\n", "\n", " @graceful_fallback(max_retries=2, base_delay=0.5)\n", " def analyze_sentiment(\n", " self,\n", " audio: np.ndarray,\n", " override_features: Optional[ProsodicFeatures] = None,\n", " ) -> dict:\n", " \"\"\"\n", " Analyze emotional content of audio via prosodic features.\n", "\n", " Args:\n", " audio: NumPy array containing audio waveform.\n", " override_features: Inject specific features (for testing/demo).\n", "\n", " Returns:\n", " Dictionary with primary_emotion, confidence, features,\n", " and normalized values.\n", "\n", " Ref: Voice Sentiment Analysis (p.319-320)\n", " \"\"\"\n", " AgentLogger.info(\"analyze_sentiment called\")\n", "\n", " # Extract prosodic features\n", " features = override_features or self._extract_features(audio)\n", "\n", " # Normalize features to 0-1 scale based on typical speech ranges\n", " # Ref: Voice Sentiment Analysis (p.319)\n", " norm_pitch = np.clip((features.pitch_mean - 100) / 200, 0, 1)\n", " norm_rate = np.clip(features.speaking_rate / 8, 0, 1)\n", "\n", " # Find closest emotion profile via distance matching\n", " best_emotion = \"neutral\"\n", " min_dist = float('inf')\n", "\n", " for emotion, profile in self.EMOTION_PROFILES.items():\n", " dist = abs(norm_pitch - profile[\"pitch\"]) + abs(norm_rate - profile[\"rate\"])\n", " if dist < min_dist:\n", " min_dist = dist\n", " best_emotion = emotion\n", "\n", " # Confidence is 
inverse of distance (closer = more confident)\n", " confidence = max(0.0, 1.0 - min_dist)\n", "\n", " AgentLogger.success(f\"analyze_sentiment completed. Primary emotion: {best_emotion}\")\n", "\n", " return {\n", " \"primary_emotion\": best_emotion,\n", " \"confidence\": round(confidence, 3),\n", " \"features\": {\n", " \"pitch_mean\": features.pitch_mean,\n", " \"pitch_variability\": features.pitch_variability,\n", " \"intensity_mean\": features.intensity_mean,\n", " \"speaking_rate\": features.speaking_rate,\n", " },\n", " \"normalized\": {\n", " \"pitch\": round(norm_pitch, 3),\n", " \"rate\": round(norm_rate, 3),\n", " },\n", " }\n", "\n", " def _extract_features(self, audio: np.ndarray) -> ProsodicFeatures:\n", " \"\"\"\n", " Extract prosodic features from audio waveform.\n", "\n", " In production, this would use autocorrelation for pitch\n", " detection and energy envelope for intensity. Here we return\n", " fixed placeholder values; the audio argument is not analyzed.\n", "\n", " Ref: Prosodic Feature Extraction (p.319-320)\n", " \"\"\"\n", " # Placeholder: constant features typical of calm, neutral speech\n", " return ProsodicFeatures(\n", " pitch_mean=150.0,\n", " pitch_variability=20.0,\n", " intensity_mean=-20.0,\n", " speaking_rate=3.5,\n", " )\n", "\n", "\n", "sentiment_agent = VoiceSentimentAgent()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6373e0d3", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.7 — Demo: Voice Sentiment Analysis (Frustrated Caller)\n", "# Ref: Voice Sentiment Analysis (p.319-320)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Scenario: Frustrated customer with high pitch (260 Hz) and\n", "# fast speaking rate (6.2 syllables/sec).\n", "#\n", "# Normalization: norm_pitch = clip((260-100)/200, 0, 1) = 0.8\n", "# norm_rate = clip(6.2/8, 0, 1) = 0.775\n", "#\n", "# Distance to profiles:\n", "# angry: |0.8-0.8| + 
|0.775-0.8| = 0.025 → closest\n", "# happy: |0.8-0.7| + |0.775-0.7| = 0.175\n", "# neutral: |0.8-0.5| + |0.775-0.5| = 0.575\n", "# sad: |0.8-0.3| + |0.775-0.3| = 0.975\n", "#\n", "# For comparison, a milder caller at 210 Hz and 5.8 syllables/sec\n", "# normalizes to (0.55, 0.725) and lands closest to \"happy\"\n", "# (distance 0.175) rather than \"angry\" (distance 0.325), which is\n", "# why this demo uses the more extreme values.\n", "#\n", "# Strategy §6 row 9: primary emotion = \"angry\"\n", "# ============================================================\n", "\n", "# Inject features simulating a frustrated/angry caller\n", "frustrated_features = ProsodicFeatures(\n", " pitch_mean=260.0, # High pitch (elevated frustration)\n", " pitch_variability=45.0, # High variability (emotional speech)\n", " intensity_mean=-12.0, # Louder than normal\n", " speaking_rate=6.2, # Fast speech (urgency)\n", ")\n", "\n", "result_sentiment = sentiment_agent.analyze_sentiment(\n", " audio=synthetic_audio,\n", " override_features=frustrated_features,\n", ")\n", "\n", "print(\"\\n--- Audio Agent: Voice Sentiment Analysis ---\")\n", "print(f\"Primary emotion: {result_sentiment['primary_emotion']}\")\n", "print(f\"Confidence: {result_sentiment['confidence']}\")\n", "print(f\"Prosodic features: {result_sentiment['features']}\")\n", "print(f\"Normalized values: {result_sentiment['normalized']}\")\n", "\n", "assert result_sentiment[\"primary_emotion\"] == \"angry\", \\\n", " f\"Expected 'angry', got '{result_sentiment['primary_emotion']}'\"\n", "AgentLogger.success(f\"Verified: frustrated caller detected as '{result_sentiment['primary_emotion']}'\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1dadfca5", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §2.8 — Combined Insight: Transcription + Sentiment\n", "# Ref: Voice 
Sentiment Analysis (p.320)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# By combining transcription with sentiment analysis, agents\n", "# can detect not just the user's command but also their urgency\n", "# or frustration level — enabling empathetic, context-aware responses.\n", "# ============================================================\n", "\n", "print(\"\\n--- Combined Audio Intelligence ---\")\n", "print(f\"What the caller said: \\\"{result_clean.full_text}\\\"\")\n", "print(f\"How they said it: {result_sentiment['primary_emotion']} \"\n", " f\"(confidence: {result_sentiment['confidence']})\")\n", "print(f\"Recommended action: Route to senior agent with priority escalation\")\n", "\n", "AgentLogger.success(\n", " \"Audio Processing section complete. \"\n", " \"Demonstrated CLEAN/VERBATIM transcription and VAD-based sentiment analysis.\"\n", ")\n" ] }, { "cell_type": "markdown", "id": "75d3c5e8", "metadata": {}, "source": [ "---\n", "\n", "# Part 3: Physical World Sensing Agents\n", "\n", "> *Ref: Physical World Sensing Agents (p.320-321), Smart Building Management Architecture (p.321-322)*\n", "\n", "While vision and audio enable agents to perceive specific sensory domains, complex real-world applications often require the integration of diverse, heterogeneous data streams. In industrial, agricultural, and smart city environments, agents must synthesize inputs from temperature sensors, humidity monitors, motion detectors, air quality gauges, and power meters. Unlike text or images, which are static or sequential, physical world data is **continuous, noisy, and often asynchronous**.\n", "\n", "These agents operate by creating a **\"digital twin\"** — a coherent, real-time internal model of the physical environment. 
This requires a shift from simple data logging to **active state estimation**, where the agent fuses disparate sensor readings to filter out noise and reconstruct the true state of the world.\n", "\n", "### Figure 11.3 — Smart Building Agent Architecture\n", "\n", "The diagram below illustrates the separation of static zone configuration from dynamic environmental state, and shows how both feed into event detection and proportional control:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a308c8cb", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# Figure 11.3 — Smart Building Agent Architecture (SVG)\n", "# Ref: Smart Building Management Architecture (p.321-322)\n", "# ============================================================\n", "\n", "from IPython.display import SVG, display\n", "\n", "building_architecture_svg = \"\"\"\n", "\n", " \n", " \n", " \n", " \n", " Smart Building Agent\n", " \n", " \n", " Multi-Zone Sensor Layer\n", " \n", " \n", " Zone A\n", " Office\n", " \n", " Zone B\n", " Meeting\n", " \n", " Zone C\n", " Lab\n", " \n", " Zone D\n", " Server Room\n", " \n", " Common\n", " Areas\n", " \n", " \n", " Sensor Fusion Engine\n", " (Per-zone state estimation)\n", " \n", " \n", " \n", " \n", " Event Processor\n", " (Pattern Matching)\n", " \n", " Control Manager\n", " (Proportional + Deadband)\n", " \n", " Reporting\n", " (Dashboard)\n", " \n", " \n", " \n", " \n", " \n", " \n", " Alert Manager\n", " (Notifications)\n", " \n", " Actuator Commands\n", " (HVAC, Lighting)\n", " \n", " Analytics Engine\n", " (Trends & Reports)\n", " \n", " \n", " \n", " \n", " \n", " Figure 11.3 — Zone configuration and state modeling in a smart building agent architecture\n", " Static configuration (ZoneConfig) defines constraints · Dynamic state (ZoneState) captures current readings\n", " Sensor fusion uses 5-minute temporal averaging window for noise smoothing\n", "\n", "\"\"\"\n", 
"display(SVG(building_architecture_svg))\n" ] }, { "cell_type": "markdown", "id": "5f93cc2b", "metadata": {}, "source": [ "> **📝 Note — The Digital Twin Concept (p.321)** \n", "> Physical World Sensing agents operate by creating a \"digital twin\" — a coherent, real-time internal model of the physical environment. The foundation of this architecture is the **separation of static configuration from dynamic state**. Configuration defines the *invariant* constraints of a space (an office requires different thermal bounds than a server room). State captures the *variant* reality of that space at a specific moment. This separation allows the agent to apply generic control logic across highly variable environments.\n", "\n", "> **📝 Note — Consequences of Actuation (p.321)** \n", "> Unlike purely digital agents, the consequences of actuation in physical systems — such as turning off a cooling system in a server room — can have **immediate physical impacts**. The architecture must handle the sense-model-plan-act loop with high reliability.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9658cde4", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.1 — Zone Configuration and State Data Structures\n", "# Ref: Smart Building Management Architecture (p.322)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# ZoneConfig defines invariant constraints; ZoneState captures\n", "# the variant reality. 
This separation allows the agent to apply\n", "# generic control logic across highly variable environments.\n", "# ============================================================\n", "\n", "from dataclasses import dataclass, field\n", "from datetime import datetime, timedelta\n", "from typing import Optional, List, Tuple\n", "from collections import defaultdict, deque\n", "\n", "class ZoneType(Enum):\n", " \"\"\"\n", " Classification of building zones with distinct environmental\n", " requirements.\n", "\n", " Ref: Smart Building Management Architecture (p.322)\n", " \"\"\"\n", " OFFICE = \"office\"\n", " SERVER_ROOM = \"server_room\"\n", " MEETING = \"meeting\"\n", " LAB = \"lab\"\n", " STORAGE = \"storage\"\n", "\n", "\n", "@dataclass\n", "class ZoneConfig:\n", " \"\"\"\n", " Static configuration defining a zone's constraints and targets.\n", "\n", " Ref: Smart Building Management Architecture (p.322)\n", " \"\"\"\n", " zone_id: str\n", " zone_type: ZoneType\n", " target_temp_range: Tuple[float, float] = (68.0, 76.0)\n", " max_co2: float = 1000.0\n", " occupied_hours: Tuple[int, int] = (8, 18)\n", "\n", " def is_occupied_time(self, hour: int) -> bool:\n", " return self.occupied_hours[0] <= hour < self.occupied_hours[1]\n", "\n", "\n", "@dataclass\n", "class ZoneState:\n", " \"\"\"\n", " Dynamic state capturing the current environmental reality.\n", "\n", " Ref: Smart Building Management Architecture (p.322)\n", " \"\"\"\n", " zone_id: str\n", " timestamp: datetime\n", "\n", " # Fused environmental data\n", " temperature: Optional[float] = None\n", " co2_level: Optional[float] = None\n", " occupancy_probability: float = 0.0\n", "\n", " # Derived metrics\n", " comfort_score: float = 100.0\n", " anomalies: List[str] = field(default_factory=list)\n", "\n", "\n", "@dataclass\n", "class ActuatorCommand:\n", " \"\"\"\n", " A command to be sent to a physical actuator (HVAC, lighting, etc.).\n", "\n", " Ref: Control Management and Feedback Loops (p.324)\n", " \"\"\"\n", " 
zone_id: str\n", " actuator_type: str\n", " value: float\n", "\n", "\n", "AgentLogger.success(\n", " \"Zone data structures defined \"\n", " \"(ZoneType, ZoneConfig, ZoneState, ActuatorCommand).\"\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ca2c4907", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.2 — Event Detection Through Pattern Matching\n", "# Ref: Event Detection Through Pattern Matching (p.323)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Rather than hardcoding if-statements, we encapsulate detection\n", "# logic into EventPattern objects. This modularity facilitates\n", "# \"hot-reloading\" of rules in production without restarting the\n", "# core agent process.\n", "#\n", "# Patterns from the chapter (p.323):\n", "# critical_temp: temperature > 95 or < 50\n", "# unexpected_occupancy: occupancy > 0.7 outside occupied_hours\n", "# ============================================================\n", "\n", "class EventPattern:\n", " \"\"\"\n", " Encapsulates a condition and its resulting alert.\n", "\n", " Each pattern evaluates a (state, config) pair and returns\n", " an alert message if the condition is met, or None otherwise.\n", "\n", " Ref: Event Detection Through Pattern Matching (p.323)\n", " \"\"\"\n", "\n", " def __init__(\n", " self,\n", " name: str,\n", " severity: str,\n", " condition: callable,\n", " msg_template: str,\n", " ):\n", " self.name = name\n", " self.severity = severity\n", " self.condition = condition\n", " self.message_template = msg_template\n", "\n", " def check(self, state: ZoneState, config: ZoneConfig) -> Optional[str]:\n", " \"\"\"Evaluate the pattern against current state and config.\"\"\"\n", " if self.condition(state, config):\n", " return self.message_template.format(**state.__dict__)\n", " return None\n", "\n", "\n", "class EventProcessor:\n", " \"\"\"\n", " Evaluates all 
registered event patterns against zone state.\n", "\n", " Ref: Event Detection Through Pattern Matching (p.323)\n", " \"\"\"\n", "\n", " def __init__(self):\n", " # Initialize with chapter-defined patterns (p.323)\n", " self.patterns = [\n", " EventPattern(\n", " name=\"critical_temp\",\n", " severity=\"critical\",\n", " condition=lambda s, c: (\n", " s.temperature is not None\n", " and (s.temperature > 95 or s.temperature < 50)\n", " ),\n", " msg_template=\"CRITICAL temp in {zone_id}: {temperature:.1f}°F\",\n", " ),\n", " EventPattern(\n", " name=\"unexpected_occupancy\",\n", " severity=\"warning\",\n", " condition=lambda s, c: (\n", " s.occupancy_probability > 0.7\n", " and not c.is_occupied_time(s.timestamp.hour)\n", " ),\n", " msg_template=\"Unexpected occupancy in {zone_id} outside hours\",\n", " ),\n", " EventPattern(\n", " name=\"high_co2\",\n", " severity=\"warning\",\n", " condition=lambda s, c: (\n", " s.co2_level is not None\n", " and s.co2_level > c.max_co2\n", " ),\n", " msg_template=\"High CO2 in {zone_id}: {co2_level:.0f} ppm (limit: exceeded)\",\n", " ),\n", " ]\n", "\n", " def process(self, state: ZoneState, config: ZoneConfig) -> List[str]:\n", " \"\"\"Check all patterns and return triggered alerts.\"\"\"\n", " alerts = []\n", " for pattern in self.patterns:\n", " msg = pattern.check(state, config)\n", " if msg:\n", " if pattern.severity == \"critical\":\n", " AgentLogger.error(msg)\n", " else:\n", " AgentLogger.info(f\"WARNING: {msg}\")\n", " alerts.append(msg)\n", " return alerts\n", "\n", "\n", "AgentLogger.success(\"EventPattern and EventProcessor defined (3 patterns registered).\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "39821128", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.3 — Control Management and Feedback Loops\n", "# Ref: Control Management and Feedback Loops (p.324-325)\n", "# Author: Imran Ahmad\n", "# 
============================================================\n", "# Implements Proportional Control: the corrective action is\n", "# proportional to the error (actual - target), so a positive error\n", "# calls for cooling and a negative error for heating.\n", "#\n", "# Deadband (Hysteresis): the abs(error) > 1.0 check creates a buffer\n", "# zone of 1.0°F on either side of the target where the system\n", "# remains idle. Without this, the system would oscillate rapidly\n", "# (short-cycling), causing wear on mechanical equipment and\n", "# wasting energy.\n", "#\n", "# Key formulas from chapter (p.324):\n", "# target_avg = sum(target_temp_range) / 2\n", "# error = temperature - target_avg\n", "# intensity = min(100, abs(error) * 20) [proportional gain: 20% per °F, saturates at 5°F]\n", "# excess = co2_level - max_co2\n", "# ventilation = min(100, 50 + excess/10) [CO2 control]\n", "# ============================================================\n", "\n", "class ControlManager:\n", " \"\"\"\n", " Translates state discrepancies into physical actuator commands.\n", "\n", " Uses proportional control with deadband hysteresis for\n", " temperature and threshold-based ventilation for CO2.\n", "\n", " Ref: Control Management and Feedback Loops (p.324-325)\n", " \"\"\"\n", "\n", " def compute_commands(\n", " self, state: ZoneState, config: ZoneConfig\n", " ) -> List[ActuatorCommand]:\n", " \"\"\"\n", " Compute actuator commands based on state-config discrepancy.\n", "\n", " Args:\n", " state: Current zone state (fused sensor data).\n", " config: Zone configuration (target ranges, limits).\n", "\n", " Returns:\n", " List of ActuatorCommand objects for HVAC/ventilation.\n", " \"\"\"\n", " commands = []\n", "\n", " # Temperature Control Loop with Deadband\n", " # Ref: Control Management and Feedback Loops (p.324)\n", " if state.temperature is not None:\n", " target_avg = sum(config.target_temp_range) / 2\n", " error = state.temperature - target_avg\n", "\n", " # Deadband of 1.0 degrees prevents short-cycling\n", " if abs(error) > 1.0:\n", " action_type = \"cooling\" if error > 0 else \"heating\"\n", " intensity = min(100, abs(error) * 20) # Proportional gain\n", " 
commands.append(ActuatorCommand(\n", " zone_id=state.zone_id,\n", " actuator_type=f\"hvac_{action_type}\",\n", " value=intensity,\n", " ))\n", " AgentLogger.info(\n", " f\"{action_type.capitalize()} command for {state.zone_id}: \"\n", " f\"intensity {intensity:.0f}% \"\n", " f\"(error: {error:+.1f}°F from target {target_avg:.0f}°F)\"\n", " )\n", "\n", " # Ventilation Control Loop (CO2)\n", " # Ref: Control Management and Feedback Loops (p.324-325)\n", " if state.co2_level is not None and state.co2_level > config.max_co2:\n", " excess = state.co2_level - config.max_co2\n", " vent_intensity = min(100, 50 + excess / 10)\n", " commands.append(ActuatorCommand(\n", " zone_id=state.zone_id,\n", " actuator_type=\"hvac_ventilation\",\n", " value=vent_intensity,\n", " ))\n", " AgentLogger.info(\n", " f\"Ventilation command for {state.zone_id}: \"\n", " f\"intensity {vent_intensity:.0f}% \"\n", " f\"(CO2: {state.co2_level:.0f} ppm, excess: {excess:.0f} ppm)\"\n", " )\n", "\n", " return commands\n", "\n", "\n", "AgentLogger.success(\"ControlManager defined (proportional control + deadband).\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a1889c26", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.4 — SmartBuildingAgent: Sensor Fusion and Orchestration\n", "# Ref: Smart Building Agent Integration and Sensor Fusion (p.325-326)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Orchestrates the full Sense-Model-Plan-Act cycle:\n", "# 1. Sense: Ingest sensor readings into per-zone buffers\n", "# 2. Model: Fuse readings via temporal averaging (5-min window)\n", "# 3. Plan: Detect events via pattern matching\n", "# 4. 
Act: Compute control commands via proportional control\n", "#\n", "# Sensor fusion smooths transient spikes (sensor noise) while\n", "# ensuring control logic acts on reliable data.\n", "# ============================================================\n", "\n", "class SmartBuildingAgent:\n", " \"\"\"\n", " Orchestrates sensor fusion, event detection, and control\n", " for a multi-zone building environment.\n", "\n", " Ref: Smart Building Agent Integration and Sensor Fusion (p.325-326)\n", " \"\"\"\n", "\n", " def __init__(self, zones: Dict[str, ZoneConfig]):\n", " \"\"\"\n", " Initialize with zone configurations.\n", "\n", " Args:\n", " zones: Mapping of zone_id to ZoneConfig.\n", " \"\"\"\n", " self.zones = zones\n", " self.sensor_buffers = defaultdict(lambda: deque(maxlen=100))\n", " self.event_processor = EventProcessor()\n", " self.control_manager = ControlManager()\n", " AgentLogger.info(\n", " f\"SmartBuildingAgent initialized with {len(zones)} zones: \"\n", " f\"{list(zones.keys())}\"\n", " )\n", "\n", " def ingest_readings(self, readings: list) -> None:\n", " \"\"\"\n", " Add sensor readings to the appropriate zone buffer.\n", "\n", " Args:\n", " readings: List of SensorReading objects.\n", " \"\"\"\n", " for r in readings:\n", " self.sensor_buffers[r.zone_id].append(r)\n", "\n", " def update_zone_state(\n", " self,\n", " zone_id: str,\n", " override_timestamp: Optional[datetime] = None,\n", " ) -> ZoneState:\n", " \"\"\"\n", " Fuse recent sensor readings into a coherent zone state.\n", "\n", " Applies temporal filtering: only readings from the last\n", " 5 minutes are included in the fusion window. 
This smooths\n", " out transient spikes while keeping the state current.\n", "\n", " Ref: Smart Building Agent Integration and Sensor Fusion (p.325-326)\n", " \"\"\"\n", " ts = override_timestamp or datetime.now()\n", " readings = self.sensor_buffers[zone_id]\n", "\n", " # Filter for temporal relevance (last 5 minutes)\n", " cutoff = ts - timedelta(minutes=5)\n", " valid_readings = [r for r in readings if r.timestamp > cutoff]\n", "\n", " # Fuse heterogeneous sensors via averaging\n", " state = ZoneState(zone_id=zone_id, timestamp=ts)\n", "\n", " temps = [r.value for r in valid_readings if r.sensor_type == \"temperature\"]\n", " if temps:\n", " state.temperature = sum(temps) / len(temps)\n", "\n", " co2s = [r.value for r in valid_readings if r.sensor_type == \"co2\"]\n", " if co2s:\n", " state.co2_level = sum(co2s) / len(co2s)\n", "\n", " occs = [r.value for r in valid_readings if r.sensor_type == \"occupancy\"]\n", " if occs:\n", " state.occupancy_probability = sum(occs) / len(occs)\n", "\n", " return state\n", "\n", " @graceful_fallback(max_retries=2, base_delay=0.5)\n", " def process_zone(\n", " self,\n", " zone_id: str,\n", " override_timestamp: Optional[datetime] = None,\n", " ) -> Tuple[ZoneState, List[str], List[ActuatorCommand]]:\n", " \"\"\"\n", " The main cognitive loop for a physical zone.\n", "\n", " Implements the full Sense-Model-Plan-Act cycle:\n", " 1. Sense & Model: Fuse sensor readings into zone state\n", " 2. Plan: Detect events via pattern matching\n", " 3. Act: Compute control commands\n", "\n", " Args:\n", " zone_id: Identifier of the zone to process.\n", " override_timestamp: Override current time (for testing).\n", "\n", " Returns:\n", " Tuple of (ZoneState, alerts, commands).\n", "\n", " Ref: Smart Building Agent Integration and Sensor Fusion (p.325-326)\n", " \"\"\"\n", " AgentLogger.info(f\"process_zone called for '{zone_id}'\")\n", "\n", " config = self.zones[zone_id]\n", "\n", " # 1. 
Sense & Model\n", " state = self.update_zone_state(zone_id, override_timestamp)\n", "\n", " # 2. Plan (Event Detection)\n", " alerts = self.event_processor.process(state, config)\n", "\n", " # 3. Act (Control)\n", " commands = self.control_manager.compute_commands(state, config)\n", "\n", " AgentLogger.success(\n", " f\"process_zone completed for '{zone_id}'. \"\n", " f\"{len(alerts)} alert(s), {len(commands)} command(s)\"\n", " + (\" (within deadband).\" if len(commands) == 0 and len(alerts) == 0 else \".\")\n", " )\n", " return state, alerts, commands\n", "\n", "\n", "AgentLogger.success(\"SmartBuildingAgent class defined.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ceedf60c", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.5 — Initialize Building Zones and Agent\n", "# Ref: Smart Building Management Architecture (p.321-322)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "\n", "from mock_backends import MockSensorStream\n", "\n", "# Define zone configurations matching chapter examples\n", "zone_configs = {\n", " \"zone_a_office\": ZoneConfig(\n", " zone_id=\"zone_a_office\",\n", " zone_type=ZoneType.OFFICE,\n", " target_temp_range=(68.0, 76.0),\n", " max_co2=1000.0,\n", " occupied_hours=(8, 18),\n", " ),\n", " \"zone_b_meeting\": ZoneConfig(\n", " zone_id=\"zone_b_meeting\",\n", " zone_type=ZoneType.MEETING,\n", " target_temp_range=(68.0, 76.0),\n", " max_co2=1000.0,\n", " occupied_hours=(8, 18),\n", " ),\n", " \"zone_c_lab\": ZoneConfig(\n", " zone_id=\"zone_c_lab\",\n", " zone_type=ZoneType.LAB,\n", " target_temp_range=(68.0, 76.0),\n", " max_co2=1000.0,\n", " occupied_hours=(8, 18),\n", " ),\n", " \"zone_d_server\": ZoneConfig(\n", " zone_id=\"zone_d_server\",\n", " zone_type=ZoneType.SERVER_ROOM,\n", " target_temp_range=(64.0, 72.0), # Cooler range for server rooms (same 8°F width)\n", " max_co2=2000.0, # Higher tolerance 
(no 
occupants)\n", " occupied_hours=(0, 0), # Never \"occupied\" in the human sense\n", " ),\n", "}\n", "\n", "building_agent = SmartBuildingAgent(zones=zone_configs)\n", "sensor_stream = MockSensorStream()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6b13d4e5", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.6 — Scenario Runner Helper\n", "# Ref: Smart Building Management Architecture (p.321-326)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "\n", "def run_scenario(\n", " scenario_key: str,\n", " zone_id: str,\n", " description: str,\n", " override_hour: Optional[int] = None,\n", "):\n", " \"\"\"\n", " Run a sensor scenario through the SmartBuildingAgent pipeline.\n", "\n", " Args:\n", " scenario_key: Key into MockSensorStream scenarios.\n", " zone_id: Zone to process.\n", " description: Human-readable scenario description.\n", " override_hour: Override the hour for time-based patterns.\n", " \"\"\"\n", " print(f\"\\n{'='*60}\")\n", " print(f\"Scenario: {description}\")\n", " print(f\"{'='*60}\")\n", "\n", " # Load sensor readings\n", " readings = sensor_stream.get_readings(scenario_key)\n", "\n", " # Override timestamp hour if needed (for after-hours testing)\n", " if override_hour is not None:\n", " now = datetime.now().replace(hour=override_hour, minute=0)\n", " for r in readings:\n", " r.timestamp = now - timedelta(minutes=1)\n", " override_ts = now\n", " else:\n", " override_ts = None\n", "\n", " # Ingest into agent buffers\n", " # Clear previous readings for this zone to isolate scenarios\n", " building_agent.sensor_buffers[zone_id].clear()\n", " building_agent.ingest_readings(readings)\n", "\n", " # Process\n", " state, alerts, commands = building_agent.process_zone(\n", " zone_id, override_timestamp=override_ts\n", " )\n", "\n", " # Summary\n", " print(f\"\\nZone State:\")\n", " print(f\" Temperature: 
{state.temperature:.1f}°F\" if state.temperature is not None else \" Temperature: N/A\")\n", " print(f\" CO2: {state.co2_level:.0f} ppm\" if state.co2_level is not None else \" CO2: N/A\")\n", " print(f\" Occupancy: {state.occupancy_probability:.1%}\")\n", " print(f\" Alerts: {len(alerts)}\")\n", " for a in alerts:\n", " print(f\" → {a}\")\n", " print(f\" Commands: {len(commands)}\")\n", " for c in commands:\n", " print(f\" → {c.actuator_type}: {c.value:.0f}%\")\n", " return state, alerts, commands\n", "\n", "\n", "AgentLogger.success(\"Scenario runner helper defined.\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4ff70283", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.7 — Scenario: Normal Office (Within Deadband)\n", "# Ref: Control Management and Feedback Loops (p.324)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# 72°F average — target avg is (68+76)/2 = 72°F.\n", "# Error = 72 - 72 = 0, which is within the ±1.0°F deadband.\n", "# Expected: 0 alerts, 0 commands.\n", "# Strategy §6 row 10.\n", "# ============================================================\n", "\n", "state_n, alerts_n, cmds_n = run_scenario(\n", " scenario_key=\"normal_office\",\n", " zone_id=\"zone_a_office\",\n", " description=\"Normal Office — 72°F, CO2 650 ppm, occupied (within deadband)\",\n", " override_hour=10, # During occupied hours\n", ")\n", "\n", "assert len(alerts_n) == 0, f\"Expected 0 alerts, got {len(alerts_n)}\"\n", "assert len(cmds_n) == 0, f\"Expected 0 commands, got {len(cmds_n)}\"\n", "AgentLogger.success(\"Normal scenario verified: 0 alerts, 0 commands (within deadband).\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b90b1654", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.8 — Scenario: Server Room Overheat (Critical Alert)\n", "# Ref: 
Event Detection Through Pattern 
Matching (p.323)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# 96.5°F average — exceeds critical_temp threshold of 95°F.\n", "# EventPattern triggers: \"CRITICAL temp in zone_d_server: 96.5°F\"\n", "# ControlManager: error = 96.5 - 68 = 28.5 → intensity = min(100, 28.5*20) = 100%\n", "# (Server room target_avg = (64+72)/2 = 68)\n", "# Expected: 1 critical alert, 1 cooling command at 100%.\n", "# Strategy §6 row 11.\n", "# ============================================================\n", "\n", "state_oh, alerts_oh, cmds_oh = run_scenario(\n", " scenario_key=\"server_room_overheat\",\n", " zone_id=\"zone_d_server\",\n", " description=\"Server Room Overheat — 96.5°F (critical threshold: 95°F)\",\n", ")\n", "\n", "assert len(alerts_oh) >= 1, f\"Expected ≥1 alert, got {len(alerts_oh)}\"\n", "assert any(\"CRITICAL\" in a for a in alerts_oh), \"Expected a CRITICAL alert\"\n", "assert len(cmds_oh) >= 1, f\"Expected ≥1 command, got {len(cmds_oh)}\"\n", "assert any(c.actuator_type == \"hvac_cooling\" for c in cmds_oh), \\\n", " \"Expected an hvac_cooling command\"\n", "cooling_cmd = [c for c in cmds_oh if c.actuator_type == \"hvac_cooling\"][0]\n", "assert cooling_cmd.value == 100, f\"Expected 100% intensity, got {cooling_cmd.value}\"\n", "AgentLogger.success(\n", " f\"Overheat scenario verified: {len(alerts_oh)} alert(s), \"\n", " f\"cooling at {cooling_cmd.value:.0f}%.\"\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d8aebb4b", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.9 — Scenario: After-Hours Intrusion (Unexpected Occupancy)\n", "# Ref: Event Detection Through Pattern Matching (p.323)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# Occupancy probability 0.9 at 23:00 — outside occupied_hours\n", "# (8-18). 
EventPattern triggers: \"Unexpected occupancy in\n", "# zone_b_meeting outside hours\"\n", "# Expected: 1 warning alert, 0 HVAC commands (temp is normal).\n", "# Strategy §6 row 12.\n", "# ============================================================\n", "\n", "state_in, alerts_in, cmds_in = run_scenario(\n", " scenario_key=\"after_hours_intrusion\",\n", " zone_id=\"zone_b_meeting\",\n", " description=\"After-Hours Intrusion — occupancy 0.9 at 23:00\",\n", " override_hour=23, # Outside occupied_hours (8-18)\n", ")\n", "\n", "assert len(alerts_in) >= 1, f\"Expected ≥1 alert, got {len(alerts_in)}\"\n", "assert any(\"Unexpected occupancy\" in a for a in alerts_in), \\\n", " \"Expected an unexpected occupancy alert\"\n", "AgentLogger.success(\n", " f\"Intrusion scenario verified: {len(alerts_in)} alert(s) — \"\n", " f\"unexpected occupancy detected.\"\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d3e996c0", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §3.10 — Scenario: High CO2 in Occupied Lab\n", "# Ref: Control Management and Feedback Loops (p.324-325)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "# CO2 at 1350 ppm — exceeds max_co2 of 1000 ppm.\n", "# Excess = 1350 - 1000 = 350\n", "# Ventilation intensity = min(100, 50 + 350/10) = min(100, 85) = 85%\n", "# Temperature ~72.9°F — target avg 72°F, error=0.9 within deadband.\n", "# Expected: 1 CO2 warning, 1 ventilation command at 85%, no HVAC.\n", "# Strategy §6 row 13.\n", "# ============================================================\n", "\n", "state_co2, alerts_co2, cmds_co2 = run_scenario(\n", " scenario_key=\"high_co2_occupied\",\n", " zone_id=\"zone_c_lab\",\n", " description=\"High CO2 in Occupied Lab — 1350 ppm (limit: 1000 ppm)\",\n", " override_hour=14, # During occupied hours\n", ")\n", "\n", "assert len(alerts_co2) >= 1, f\"Expected ≥1 alert, got 
{len(alerts_co2)}\"\n", "vent_cmds = [c for c in cmds_co2 if c.actuator_type == \"hvac_ventilation\"]\n", "assert len(vent_cmds) == 1, f\"Expected 1 ventilation command, got {len(vent_cmds)}\"\n", "assert 84 <= vent_cmds[0].value <= 86, \\\n", " f\"Expected ~85% intensity, got {vent_cmds[0].value}\"\n", "AgentLogger.success(\n", " f\"CO2 scenario verified: ventilation command at {vent_cmds[0].value:.0f}%.\"\n", ")\n" ] }, { "cell_type": "markdown", "id": "2f548d69", "metadata": {}, "source": [ "### Lessons from Production Deployments\n", "\n", "> *Ref: Lessons from Production Deployments (p.326)*\n", "\n", "Deploying physical sensing agents reveals challenges that purely digital agents rarely face:\n", "\n", "- **Occupancy is the primary variable:** Energy savings depend on accurate occupancy prediction. Fusing motion sensors with calendar data typically yields the best results. Systems relying solely on schedules often waste energy heating empty rooms.\n", "\n", "- **The \"Human-in-the-Loop\" reality:** Facility managers often override automated decisions if they do not understand them. Transparency is essential — the agent must log *why* it turned on the AC (e.g., \"Pre-cooling for 9 AM meeting\") or humans will treat it as a malfunction and disable it.\n", "\n", "- **Model drift:** A building's thermal properties change over time (filters clog, seasons change). Agents using static physics models often degrade. 
Production systems increasingly use **online learning** to recalibrate their thermal models continuously based on observed feedback.\n", "\n", "> **📝 Key Insight from the Chapter** \n", "> Across all three implementations — vision-language reasoning, speech recognition and prosodic analysis, and sensor-driven control — a consistent pattern emerges: effective agents are defined not just by their ability to perceive images, audio, or environmental signals, but by how reliably they **ground that perception in structured state** and translate it into **controlled, context-aware actions**.\n" ] }, { "cell_type": "markdown", "id": "9a1675be", "metadata": {}, "source": [ "---\n", "\n", "# Summary: Cross-Domain Comparison\n", "\n", "> *Ref: Summary (p.326-327)*\n", "\n", "This chapter extended the perceptual architecture of intelligent agents across three domains:\n", "\n", "- **Vision-Language Agents** — The foundational triad of visual encoder, alignment mechanism, and LLM determines whether an agent genuinely grounds its reasoning in pixel-level evidence. **CoT prompting** amplifies accuracy on complex queries by forcing step-by-step analysis.\n", "\n", "- **Audio Agents** — The **VAD model** provides a continuous emotional representation that captures prosodic nuance beyond categorical labels, enabling detection of caller urgency and frustration.\n", "\n", "- **Physical World Sensing Agents** — **Pattern-based event detection** separates rule logic from agent code for hot-reloadable policies, while **proportional control with deadbands** prevents short-cycling that degrades hardware and energy efficiency.\n", "\n", "Across all three domains, the **Sense → Model → Plan → Act** loop structures the pipeline: stable state estimation precedes reasoning, and reasoning precedes actuation.\n", "\n", "> **What's Next:** The multi-modal capabilities introduced in this chapter amplify both the potential and the risks of agent systems. 
**Chapter 12** explores how intelligent systems can explain their decisions to human stakeholders, detect and mitigate bias in their reasoning, and maintain accountability in high-stakes applications.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a53e2fee", "metadata": {}, "outputs": [], "source": [ "# ============================================================\n", "# §4.1 — Cross-Domain Comparison Table\n", "# Ref: Summary (p.326-327)\n", "# Author: Imran Ahmad\n", "# ============================================================\n", "\n", "from IPython.display import Markdown, display\n", "\n", "comparison_table = \"\"\"\n", "| Dimension | Vision-Language | Audio Processing | Physical World Sensing |\n", "|-----------|----------------|-----------------|----------------------|\n", "| **Input Type** | Images (pixel arrays) | Audio waveforms | Sensor streams (temp, CO2, motion) |\n", "| **Encoding Method** | Vision Transformer (ViT) patches | STFT spectrogram + Whisper encoder | Temporal averaging (5-min fusion window) |\n", "| **Alignment Strategy** | Linear projection / MLP to LLM token space | Prompt template with `