{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a8b61957",
   "metadata": {
    "papermill": {
     "duration": 0.004021,
     "end_time": "2026-05-28T02:14:53.257233+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:53.253212+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# 21 · Self-Consistency — sample N reasoning paths, majority-vote the answer\n",
    "\n",
    "> **TL;DR.** Sample `N` independent chain-of-thought reasoning paths at non-zero temperature, extract each path's final answer, and let **Python** majority-vote. The LLM never sees the votes. Simple, deterministic, and a textbook deterministic-picker application.\n",
    ">\n",
    "> **Reach for it when** the task has a single discrete answer (a number, a name, a category) and the LLM produces correct reasoning *some* of the time but slips on a fraction of paths.\n",
    "> **Avoid when** the answer is free text (no clean way to tally), or when one wrong sample is catastrophic (safety-critical decisions).\n",
    "\n",
    "| Property | Value |\n",
    "|---|---|\n",
    "| Origin | Wang et al., *Self-Consistency Improves Chain of Thought Reasoning* (Google 2022). [arXiv:2203.11171](https://arxiv.org/abs/2203.11171) |\n",
    "| Picker | `collections.Counter(answers).most_common(1)` — Python, deterministic |\n",
    "| Sampling | `N` paths at `sample_temperature` (default 0.8 — high to maximise path diversity) |\n",
    "| LLM-as-Scorer? | **None** — Python counts the votes; LLMs only produce paths |\n",
    "| Default LLM | Llama-3.3-70B (we *want* variance across samples; Qwen-Thinking would mostly produce identical paths) |\n",
    "| Cost | N structured-output calls per task |\n",
    "\n",
    "**Why this is the canonical deterministic-picker pattern.** Each sample emits a structured `(reasoning, answer)` pair. The reasoning is free text (any number of paths). The answer field is normalised (lowercased, stripped) and counted by Python. The flat-scoring pathology that infects single-LLM-as-Judge architectures (Mental Loop nb 10 §11, Ensemble nb 13 §11) has no surface here — the deciding signal is a Python `Counter`, not an LLM number."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16d4ae0e",
   "metadata": {
    "papermill": {
     "duration": 0.005773,
     "end_time": "2026-05-28T02:14:53.269034+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:53.263261+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 2 · Architecture at a glance\n",
    "\n",
    "```mermaid\n",
    "flowchart LR\n",
    "    A([task]) --> S[SAMPLE × N<br/><sub>N independent CoT calls<br/>at high temperature</sub>]\n",
    "    S --> V[VOTE<br/><sub>Counter on normalised<br/>answer strings</sub>]\n",
    "    V --> Z([modal answer])\n",
    "\n",
    "    style S fill:#e3f2fd,stroke:#1976d2\n",
    "    style V fill:#e8f5e9,stroke:#388e3c\n",
    "```\n",
    "\n",
    "Two nodes. SAMPLE produces a list of `(reasoning, answer)` records. VOTE is pure Python — normalise each answer, tally, return the modal one."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "630c3b8b",
   "metadata": {
    "papermill": {
     "duration": 0.004019,
     "end_time": "2026-05-28T02:14:53.280224+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:53.276205+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 3 · Theory\n",
    "\n",
    "### 3.0 · Why temperature ≠ 0?\n",
    "\n",
    "Chain-of-thought at greedy decoding (temperature 0) is deterministic — every run produces the same path. Self-Consistency needs *path diversity* to be useful. We bind `sample_temperature` (default 0.8) per call so each sample explores a different reasoning trajectory. Across N samples, even if any individual path has a (say) 20% chance of being wrong, the *modal* answer across diverse paths is much more likely to be right.\n",
    "\n",
    "This is essentially a small-N ensemble where the ensemble members are *the same model at different random seeds*.\n",
    "\n",
    "### 3.1 · Why Python tallies the votes\n",
    "\n",
    "If we asked the LLM to \"pick the best of these 5 answers\", we'd be back to LLM-as-Scorer flatness. The whole point of Self-Consistency is to make the picker deterministic and explainable. `collections.Counter` is the simplest deterministic picker imaginable — count occurrences, take the max. If there's a tie, `most_common(1)` breaks it by insertion order (first-seen wins), which is what we want.\n",
    "\n",
    "### 3.2 · Why a non-reasoning LLM (Llama) here, not Qwen-Thinking\n",
    "\n",
    "Counter-intuitive call: every Phase-3 reasoning notebook (19, 20, 22) defaults to Qwen-Thinking. Self-Consistency is the **exception**. A reasoning model with `<think>` tokens is highly deterministic within its private deliberation — across N samples it tends to produce the SAME answer because it's converging on the same chain inside the thinking phase. There's nothing for majority-vote to rescue.\n",
    "\n",
    "A less-deterministic model (Llama-3.3-70B at temperature 0.9) actually shows the pattern Self-Consistency exists to handle: *most* paths land on the correct answer, but *some* slip on classic traps. That's where the modal vote earns its keep.\n",
    "\n",
    "### 3.3 · Where this sits\n",
    "\n",
    "| Pattern | Hallucination strategy |\n",
    "|---|---|\n",
    "| Plain CoT (single sample) | Hope the one sample is right |\n",
    "| [CoVe (nb 20)](./20_chain_of_verification.ipynb) | Decompose into atomic verification questions |\n",
    "| **Self-Consistency (this nb)** | **Sample N paths, majority-vote** |\n",
    "| [Ensemble (nb 13)](./13_ensemble.ipynb) | N *different* specialists, weighted-vote / aggregator |\n",
    "| [LATS (nb 22)](./22_lats.ipynb) | MCTS search with explicit value estimates |\n",
    "\n",
    "Ensemble (nb 13) uses N *distinct* roles; Self-Consistency uses N *identical* draws of the same model. Both use majority/weighted vote — both apply the deterministic-picker pattern.\n",
    "\n",
    "### 3.4 · Failure modes preview\n",
    "\n",
    "1. **All samples agree, all are wrong.** If the model has a systematic bias on a task, N=∞ won't help. Mitigation: pair with CoVe or RAG.\n",
    "2. **High disagreement (high entropy in `tally`).** If the modal answer has <50% of the votes, treat the result as low-confidence. The architecture exposes `agreement_fraction` in metadata.\n",
    "3. **Sample failures.** A structured-output failure on one sample drops that vote silently. The architecture continues with the remaining samples; metadata reports `n_samples` actually counted."
   ]
  },
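  {
   "cell_type": "markdown",
   "id": "sc-sketch-lowconf",
   "metadata": {},
   "source": [
    "Failure mode 2 in § 3.4 can be checked in a few lines. The sketch below recomputes an agreement fraction and the Shannon entropy of a tally, and flags results where the modal answer holds less than 50% of the votes. `vote_confidence` and its `threshold` parameter are hypothetical. The library reports `agreement_fraction` in its metadata, but how it computes that value internally is not shown in this notebook.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def vote_confidence(answers: list[str], threshold: float = 0.5) -> dict:\n",
    "    # Assumes answers are already normalised and the list is non-empty.\n",
    "    tally = Counter(answers)\n",
    "    total = sum(tally.values())\n",
    "    winner, count = tally.most_common(1)[0]\n",
    "    fraction = count / total\n",
    "    # Shannon entropy in bits: 0.0 when every sample agrees.\n",
    "    entropy = sum(-(c / total) * math.log2(c / total) for c in tally.values())\n",
    "    return {\n",
    "        'winner': winner,\n",
    "        'agreement_fraction': fraction,\n",
    "        'entropy_bits': entropy,\n",
    "        'low_confidence': fraction < threshold,\n",
    "    }\n",
    "\n",
    "\n",
    "print(vote_confidence(['1'] * 7))\n",
    "print(vote_confidence(['1', '2', '3', '2', '4']))\n",
    "```\n",
    "\n",
    "The first call reports full agreement and is not flagged. The second has a modal answer with 2 of 5 votes (40%), so `low_confidence` is `True`."
   ]
  },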
  {
   "cell_type": "markdown",
   "id": "16097c83",
   "metadata": {
    "papermill": {
     "duration": 0.004018,
     "end_time": "2026-05-28T02:14:53.290271+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:53.286253+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 4 · Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7bc28569",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:14:53.302062Z",
     "iopub.status.busy": "2026-05-28T02:14:53.302062Z",
     "iopub.status.idle": "2026-05-28T02:14:56.011815Z",
     "shell.execute_reply": "2026-05-28T02:14:56.011815Z"
    },
    "papermill": {
     "duration": 2.717529,
     "end_time": "2026-05-28T02:14:56.011815+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:53.294286+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">LLM: meta-llama/Llama-</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3.3</span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">-70B-Instruct</span> <span style=\"color: #00ff00; text-decoration-color: #00ff00\">────────────────────────────────────────────────────────────────────────────</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[1;36mLLM: meta-llama/Llama-\u001b[0m\u001b[1;36m3.3\u001b[0m\u001b[1;36m-70B-Instruct\u001b[0m \u001b[92m────────────────────────────────────────────────────────────────────────────\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from agentic_architectures import get_llm, enable_langsmith, settings\n",
    "from agentic_architectures.architectures import SelfConsistency\n",
    "from agentic_architectures.ui import print_md, print_header, print_step\n",
    "\n",
    "enable_langsmith()\n",
    "\n",
    "# Llama at high temperature for path variance — see § 3.2 for why not Qwen-Thinking.\n",
    "llm = get_llm(provider=\"nebius\", model=\"meta-llama/Llama-3.3-70B-Instruct\", temperature=0.4)\n",
    "print_header(f\"LLM: {llm.model}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e5ab778",
   "metadata": {
    "papermill": {
     "duration": 0.007959,
     "end_time": "2026-05-28T02:14:56.019774+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.011815+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 5 · Library walkthrough\n",
    "\n",
    "Source: [`src/agentic_architectures/architectures/self_consistency.py`](../src/agentic_architectures/architectures/self_consistency.py).\n",
    "\n",
    "One Pydantic schema (`_ReasoningSample`) constrains every sample to a `(reasoning, answer)` pair. The `_sample_all` node binds `temperature=self.sample_temperature` per call and loops N times. The `_vote` node normalises (lowercase, strip, trailing-period-removed) and runs `Counter.most_common(1)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b272f2c2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:14:56.027490Z",
     "iopub.status.busy": "2026-05-28T02:14:56.027490Z",
     "iopub.status.idle": "2026-05-28T02:14:56.043682Z",
     "shell.execute_reply": "2026-05-28T02:14:56.043682Z"
    },
    "papermill": {
     "duration": 0.023908,
     "end_time": "2026-05-28T02:14:56.043682+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.019774+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- _ReasoningSample schema ---\n",
      "{\n",
      "  \"description\": \"One sampled chain-of-thought.\",\n",
      "  \"properties\": {\n",
      "    \"reasoning\": {\n",
      "      \"description\": \"The step-by-step reasoning. Be explicit about each arithmetic or logical step. Don't skip steps.\",\n",
      "      \"title\": \"Reasoning\",\n",
      "      \"type\": \"string\"\n",
      "    },\n",
      "    \"answer\": {\n",
      "      \"description\": \"JUST the final answer in the requested format \\u2014 no units, no explanation, no punctuation ...\n"
     ]
    }
   ],
   "source": [
    "from agentic_architectures.architectures.self_consistency import _ReasoningSample\n",
    "import json\n",
    "print('--- _ReasoningSample schema ---')\n",
    "print(json.dumps(_ReasoningSample.model_json_schema(), indent=2)[:400] + '...')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d209ea1",
   "metadata": {
    "papermill": {
     "duration": 0.00104,
     "end_time": "2026-05-28T02:14:56.053186+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.052146+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 6 · State"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02d79d6f",
   "metadata": {
    "papermill": {
     "duration": 0.0,
     "end_time": "2026-05-28T02:14:56.061345+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.061345+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "| Field | Set by |\n",
    "|---|---|\n",
    "| `task` / `n_samples` | caller |\n",
    "| `samples` (`Annotated[..., operator.add]`) | `_sample_all` — list of `{sample_index, reasoning, answer}` |\n",
    "| `tally` (dict from normalised answer → count) | `_vote` |\n",
    "| `final_answer` (raw form of modal) | `_vote` |\n",
    "| `history` (per-stage event log) | both nodes (`Annotated[..., operator.add]`) |"
   ]
  },
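  {
   "cell_type": "markdown",
   "id": "sc-sketch-state",
   "metadata": {},
   "source": [
    "The table above can be read as a typed state. A minimal sketch, assuming a `TypedDict` whose list fields carry `operator.add` as a reducer annotation. The class name `SelfConsistencyState` and the element types are assumptions, not copied from the library source.\n",
    "\n",
    "```python\n",
    "import operator\n",
    "from typing import Annotated, TypedDict\n",
    "\n",
    "\n",
    "class SelfConsistencyState(TypedDict, total=False):\n",
    "    task: str                                     # set by the caller\n",
    "    n_samples: int                                # set by the caller\n",
    "    samples: Annotated[list[dict], operator.add]  # appended by _sample_all\n",
    "    tally: dict[str, int]                         # set by _vote\n",
    "    final_answer: str                             # set by _vote\n",
    "    history: Annotated[list[dict], operator.add]  # appended by both nodes\n",
    "\n",
    "\n",
    "# operator.add on two lists concatenates them, so a node can return only\n",
    "# its new items and have them appended to what is already in the state.\n",
    "existing = [{'sample_index': 0, 'reasoning': 'path A', 'answer': '1'}]\n",
    "update = [{'sample_index': 1, 'reasoning': 'path B', 'answer': '2'}]\n",
    "merged = operator.add(existing, update)\n",
    "print([s['sample_index'] for s in merged])\n",
    "```\n",
    "\n",
    "Fields without a reducer annotation (`tally`, `final_answer`) would simply be overwritten by the node that sets them."
   ]
  },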
  {
   "cell_type": "markdown",
   "id": "28880de3",
   "metadata": {
    "papermill": {
     "duration": 0.017595,
     "end_time": "2026-05-28T02:14:56.078940+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.061345+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 7 · Build the graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dcd426d8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:14:56.092071Z",
     "iopub.status.busy": "2026-05-28T02:14:56.092071Z",
     "iopub.status.idle": "2026-05-28T02:14:56.474467Z",
     "shell.execute_reply": "2026-05-28T02:14:56.474467Z"
    },
    "papermill": {
     "duration": 0.392985,
     "end_time": "2026-05-28T02:14:56.478789+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.085804+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAHoAAAFNCAIAAABAM+wSAAAQAElEQVR4nOydB3wUVR7H3+xsTds00ntCCAQIPYAoLQRQzxPk6F0OFFA60kEQUIp6gKicKF06CHqCKCBIkSYgEISQkJACpNftM/efnbBskt3AbnbGPJivfOLOm5m3s79583//V+b9xTRNIwG+ECMBHhHk5hVBbl4R5OYVQW5eEeTmFa7kTr9VcvtSWf4DrVZF0xQyGGiRiKCox04nSYoMBoogEOOIEoSIQOxeMSnSGyj4YDre9IEUEwZ9RQ5sIpxFixCiCfpRzkxGNKIrMkW02TdWnPI4N5FBT5lfs1hCwD+FqyggQt4qwRtxAOFYvzvpXNHFowWFD/TwWSwlpDIklpHML9Qh44+HLwR1jIeSBDLQFZsEYnQ3/naRWEQZVag43qggZaDhAJEYUfpH1y1iJKYhkWA2acqUbrx/NJOrSPQ4ncmZZPJBcHsqvoig9JV+OyllbrlOY9CoaIMeyRREQKT8lZGByHE4TO7bV0qO7Xio09Be/pKmLyobxbsjnFGptCd356XfUmnKqYAIWa9xwcgROEbuLR/dLbyvj2ji9PLIAPRskX6r9JdtOepyQ88RfmENXVDtcIDca6cmu7qTQ+aEo2eX80fyzh8uiIpzSRzih2pBbeX+fFpydGunrn2ftUJtkS/eS+7Szye6hRuyl1rJDVo37+LeticnlXjd5MuZySENnHoOt7N4iZC9rJt5p2G823OlNTBmaVTaDdXFI7nILuyUe/vKNLkz2amPD3r+eP2dgN8PFyK7sEfu7JSy3Ezd0Dlh6LnEL1jhFSjdtDgV2Y49cv+w/n5gpBw9x/SbFFKSb8hKKUM2YrPc2ellajXda1wQer7x9Jf8vPUhshGb5T6xK8/Ng0TPPS+94V1SYEA2YrPc+fe1kXG1bVzZyowZM7777jtkI3fu3Hn11VcRNwSGO0tkxKmDthVw2+TWarXQd/PCP+ohfrlx4wayHfvOenqc3Mj0myqbTrGtmXPxWN65HwveXhaFuOHUqVObNm26fv26t7d3XFzcO++8Ax9atWrF7nVxcTl+/HhpaemWLVvOnDkDhRf2duzY8e2335bLmaq7a9euo0aNOnr06B9//DFkyJDNmzezJ06aNGnQoEHI0RzakJV5R/3mooinP8W20p2boZVI7W8Z1czNmzcnTJjQunXr3bt3T58+/datWwsWLEDGewB/586dC1rDh+3bt2/YsAHU/PTTT+H4I0eOrFu3js1BIpHs27evQYMGn3322bhx44YOHern53fhwgUutAY8/KV6rW1tctuGF1RlFCkhEDdcvnwZCunIkSNFIhHI1KhRo+Tk5OqHDR48GEpxeHhFj9iVK1dOnz797rvvImZAgVAqlVOnTkW84OYpMVBcyg2d9gTNldzNmjVTq9UTJ06Mj49/6aWXgoODTWbEHCjCYEnmz58PxV+vZ4YbPD09TXvhJiG+gNElZGOHk22WAQY49JUHnBxITEzMqlWr6tWrt3r16l69eo0dOxZKbvXDYC9YDzhg//79YChGjBhhvlcqlSK+KM7TIRvLnm1ye/hJ9Fqu5Abat28PNvrgwYNgtYuKiqCks+XXBFTse/bs6devH8gNBgdSSkpK0N9EbpaWtLEFYpvc0a1c9TrEERcvXgQrDB+ggIO/PGXKFJAyOzvb/BidTqdSqXx8KrrGwDE9ceIE+pvIzVA7udimt21ye3rLYez14i95iAPAdIBDsnfv3oKCgmvXroEHArr7+/vLZDLQ9+zZs2A6oBYNCws7cOBARkZGYWHhwoULweIXFxeXlVnovggJCcnNzQV/Ji0tDXFAYa4hIMq2viObvTo3TzLpHCfPL7gcYCJWrFjRrVu30aNHOzs7g40Wi5nKHNyV8+fPQ3mHor1kyRJwYPr06fP666+3adNm/PjxsJmQkJCVlVUlww4dOsDNAEfl8OHDyNGo
VQaoJ7v297fpLJtHc/66UHxk68Pxn3DV0sGFPaszoD/j34ttaOMgO0p3g1ZuEinx0+b76PkmO1Ud39PT1rPsmUXV9hWPk/vzE4dY3gvVV2JiorVd4DVDY6T6roiIiK+//hpxwwYjFndBxwD0CljcBV4/WDaLu/Z/fo8Uo6YdbJ5LY+fQ8NcLUl1cyb5TQizuteacaTQaqPcs7oJ7AL8ccQN8L9xpi7sg3ZqrTpKkk5OTxV1rJiUPnxfi4mGzj2//SPzaqcldB/g0aGn/LABM+Wr2ncD6CvsG4+3vbxq1IOTnbTYPZ+DOhkUpTm5iuyc+1GqeiarUsH5eap8JAX6hTug5YP3clNBGTgkD7J9IVdtZVGXF2m/mp4fFKl4d5ciZonWN8iLt1o/uOXuIB04LRbXAMVMy1826A9l0eM07tp0SPXPsWXXvfpomprVL1/61miCIHDjh+Mi27OQ/ykRiIrKJU8JA29padZO/LhRc/KW44KHOxZ0cNtcxE04dPJ3+8MbstJvlWjUtIpGrh1juTDg5kxIZqafMfW3aOHve+PXM5Hei4h0GqLiZOfIVm6ZExk2nCeMEeUSKCLZHn2Bm4D96vYFApl5+SGaPNCWSImSgjJkw0+8Jdja9qOKtCeMXsX+Y6fe0Tk2Vl+jLivRqFZON0lucMNDXN0SBHISD5WbRaHSn9ufnZmqKC7QGHSMvVUXuR/3EVZU1vm8AOtKVUhm52ZceTK96UDRFknBziEcHPPo9po+PTheRIop9K4V69JqE8Q7T7CsOZudAy4WUiqQy5O4tjWjqHNvW8W8EcCI3DwwbNmzatGmNGzdGWIHrm2cw7MB2FuKFIDevCHLziiA3rwhy84ogN68IcvMKrnLrdDoYGEK4IZRuXhHk5hUs5YaOB4qCPhP83lnBUm5MizbCVG5M60kklG6eEeTmFUFuXhFsN68IpZtXBLl5RZCbVwS5eUWoKnkFS7kNBoNQunnFy8sLYQiWcotEoocPsZxajucjKRZXeZsYFwS5eUWQm1cEuXlFkJtXBLl5RZCbVwS5eUWQm1cEuXlFkJtXBLl5haslLzmFIAjopYJuWIQbWMqNsC3ggty8guvwAqZyY/bWcFxcHEmS7FvcAGvBBw4cOH36dIQDmBmTmJgYkJitKlndQ0JCQG6ECZjJ3b9/f2dnZ/OUdu3aBQVhE3cDM7l79eoVGvp4vRxfX9++ffsifMDPMwHTYVpvrmnTppGRkQgf8JO7R48e7NL0Xl5egwcPRljxZM8k/VbZ7UslGrVZ0uPlXyo2K0LOWkkxhv99vGlacadirwhRbJDbx0v3ULTZF0CFaKBp9HhBGJSX++DPa0lKd2XzZs3NvpJAZmvOIPYaiUoXVn0tH1Rp8Rmz09lLMc+t8ko3VRBLkJun+IkB954g9/p5yZpyJJGJdJrHi+BQzFpFhOkKYEPE/IxKS+aY/05mQSLaTG4SmTe/q8jNbNKVflPF4j2P4gsbTyGMYaErL25KsLfZPMxzpe81ZkXTNLsO0uM40+YKgtNDsd/F/K6quTErCVWOX2xCIqPhR1EGFNvWteMbvsgKNTVzvpyR7B0oThwahgSejqzU4mPfPnT1lLTobHk1Xqul+7+zk4Pqyzv0et5jm9nBtqXJzTsr23S3EMzJclV55vuH8FwIWttHcIzi6skii7ssy51+Wy13xbU75W8nJt5Tq7a8y7KmunIKcRhA5BnH01dBWemKtyy3AZwPiqtwRM8+BmQtfJFgMXhFkJtXBLkdD2HVlghycwDTHLWyy7rcWC4NWzewLp1lvxs6LrAdNK7TWC7dTJ+RULo5QLDdjofpqbRivAW5HQ/T6Sc0c3ijhua4ILfjqaHWs+x/EAQeNeWC99+bOm0s4oA9e7cnJMazn1/vnbBp81c2nEwg2ibbTdNI6KCyH5odS7SAYEx4xWFyp6ff/WbDF5evXITRuNjYpv37Dm3SpBmkp6beOXBw96U/zt+/nxUWGvHyy6//87U+
7CnwkA4fNiYjI33P3m/d3T3atX1x/LipSz6ce+rUr8HBoYMHjkxMfAUOmz13skQsCQ0N375jE0VREeFR06bOi4qKrnIB+fl5az//+Nr1K2q1unXrdkMHj4JMnnjZe/ftOHv2ZFLSNalMFte0xZtvjgsM4HAMyzFtR61WO3HyaJIkP/pw9crln4tJ8ew5k+Bnw67P1q48f/7MhHff+3DpKtD6P6s+Ovv7KfYsiUSyfcfGkJCwwz+eHvXmuB8PHZg0eXTXLj2OHD7buVO35SsXlZQycS8htz8uX4APh/53auOGPZ5e3nPmTa4ylx42J00ZAzd70sRZX3+1w8Pdc+y4YZlZGTVf9p9/Xl69ZnlsbNzChStmvPd+QUH+4iVzUO2xboitNeJhyB89PffupcG1vtF7QHT9mMjI+vPnffj++8vZCcFz5y5dvnxti+atmzdrBeW6QXTDc+dPm06sHxXz2j/ekEqlnTp2g014LEBosVjcuVMinJ6elsoeptVqhgweRRBEgH/giOFvPXhwH5QyvwDYhMdr1sxF8W3ae3p6vf3WRDel+54922q+7EaNmnyzfueggSPg2lq3atv3X4OhmBcVF6HaYXOPIE3bVlkGBYWANfhw2YJuCS83i2vZuHEc/ABTXnv3bv/93Cm4JWyCv//j2H9QtNkP7ETLsLCKGWgKBTMtraSkmN0MD48yLc8TFMiEgE1LT23WrKUpnz+vXYZnBW4quwk3Bi7jytVLqEbgcczKyoDnL+nmNVPo88KCfKVbrQLl1aCdVc/EpmnfMpnsP5/894f/7d+9Z9v6r9cGBAQNHzq6W7eXwdTOmDVBp9P+e9T4Zs1aubq4vjPhTfMTq0zNEYksP21y2eNw7HI587msrFKA4NLSEp1O17lrK/NEKAGoRqCSmDNvCpTuMaMnwEN54eLv098bj7jEYVUllFN4hOFJv3TpHFjhJR/OCw2LALlv3ry+Yvnali3asIeBLvW8fZCNmIvLVgkysxuAmPmC3gqFYvEHn5gnkqInLPD9/f/2QX0O1Ybp2hDHOEZusJvXb1zt2eM1KHrt278UH/9Cj5dfuHUrqV49ZvqWSd+7d1PgX3iYzXNW76TcLioqVCqZmG+QLWIib0eZHxAZGa1SqXx8/Ex+RVZ2prvyCaW7uLjIz/dxrMeTJ48ih2BHVUnYUlXCdS9bvvDzLz7NyLwHNnrrtm+gomscGweeH9jcHTs3F5cUwy0BNwBqpPsPspGNuLkpV61eBpnAv02b/+vr69e0SXPzA+DpadOm/YoVi6AWhRuz/7tdb7095NChAzVnGxUZff7CWXB74Gp37d7KJtpxeVUgaDuqStoGvaFunDxp1oaNX+7ctQU2W7WM/3jlF2FhEfB59qwPNm5a98/XuwQGBs+euSgvP3fuvKnDRvTZ+M3up88ffG2oRfv266nRaPz9Aj5Y+HH1SABLF3964OCehR/MvHHjT/C4ExJ69u7dv+ZsR44cW15eNmfuZHgyevfqD75gdnbmjJnvwjWjWkBbL9+W5whuXHSXQH9sgAAAEABJREFUpog3Jj65mcAD8xdMB6u6csXnCBMMWrRlcfL4T6Oq7xIa8bxiWW5m+vMzMXg2c/bEa5UbRCagiQuuFOIC62bYqt9dd3h/wTJkL1Mnz9HqtBZ3OSm4im1vbFVaVtAxVWWdBfxxxDvGVqVl9QTbzSuW5SZEmAzn1E1sbebULeONHdbFc0wXlUAlbB2rFKgV1scqrc4RJJ4Fx+Rvwla/m6IEY1ILbLXdAhwhyM0rluWWKkhaj9+ycXUGA2FlHMlyValwhjEqQW47SU0qIay0Zywnd+7rrSoV6ko7uXGmWOltOYqSZbmVXgq/cOnWpclIwEbO/ZRdnKcd9J7lkZma1jP5/XDOpV+K/COcAusrFE5SCycT1vxFS+NHzBooRLXj6SrTbY3rotAwVkqbpYiYhU6IygvB0MbVSczOYl8boCudRZs1H6ClLKp8CsGuB2Plis12
Vlx/pcZilR9P6/Pua1NvFGvL6NFLLYzjPMqjRgf77KGcpLOlmnKDXodswdJEIsKR7/tYuJ80quW0XWNxMNuufMGmVWwqNklEm9VuJEmQEtrdR9J3Uk0jjpgt22hi+PDhU6ZMadKkCcIKXP1uTGMoCnLziiA3r+AqN6YRQoXSzSuC3LwiyM0rgty8IsjNK4LcvIKr3JjGLsdSbija1afTYwGWcmPaxkH4lm4cLQkS5OYZQW5eEWw3rwilm1cEuXlFkJtXBLl5RagqeUUo3byC5UWzYSoRhuDaI5iWloYwRIgTzyuC3LwiyM0rgty8IsjNK4LcvCLIzSuC3LwiyM0rgty8IsjNK4LcvIJlZDN23WmKwi8+L66B5DAt4ILcvIJrfzemcmP21nBcXFyVua9w/V27dl2xYgXCAcyMSWRkpKgyvr6+I0eORJiAmdzdu3evEg4jNja2UaNGCBMwk3vIkCHBwcGmTaVSOXjwYIQPmMnt5OTUq1cvk/mOjo5u3rw5wgf8HMFBgwYFBAQgY2SjoUOHIqzgzxFMv1WmVxlogqTNA6kTNPxHPVr4hV0BhmBX0kGPF8hhTzEd07v72IMHDgQEBfo4N025Wlo9CJP5wi/GfZUCNVVZdsdgoMJiZVKpFHEPH47gd5/fy0zRgABUtcXbal5hx/riS2Yn1nDQ0+VLSpBBhxSuon7T/V1cFIhLOJf7wLrM+2mqtq96hjfyRHWYYzuz0pPKRy0Olys4fKeNW7m3LburURn6TLQ5DNTfgkql3bksffzHUYgzOKwqi/JVBQ/1uGiNmDCCUndf6bfLOJwOx6HcZ78vkMkxW5Y6tKGiKM+2ReVsgkPPRF1OWwuIWGdx85TRBg6LCIdy6zS0TovZqnmUgdAbOLxmYUHpKnBbPgS5K2FsZOFpTAiEMKsoGWoINukAOJSb5vrJ5ABaMCZ8IiKwNSbgBIpwCwdD0QhXY2IEM3Ni7HfEs3RDZwx+824QtqWbxrGuRFbjZzkEoaqsBBNaj8vqRvC7K0EQ2NpuAvwSEWaCM0F/ubSAHPbYURRNcR+Pe9/+nUs/mo8cBb6lmx/++usGciAcl+46JPe8+dPUatWyj9aYUmbOnlhUVLh2zQb4vGnzV4d/+j4396GPj1+zuJaTJs6EzvSJk0dfuXIJ9v700w9ffrElun7M9etXN25ad/PmdaW7R7u2Lw4bOtrZ2fnpr4HrViWHxgQeS5tMd7t2L168dK6srIzdVKvVFy6cTejSAz5/s+GL/d/tfHvMxN27Dr85cuzxX4/s2r0V0j/9eF3Dho0TE1859ssF0Doj897U6WPVGvWa1d8sen9FSsrtSZNH2zRRlutWJYdyM80cW64cCiNFUSd/O8pu/nbqOGx26tStpLTk2+0bhwwe1aFDJ1cX104dE3q93m/L1vU6XdVRrp9//lEiloDQISFhYWERU6fMvZ38F+SD6gxclm6SsKnPxN3dA6zEyd+OsZunTh1v2aKNp6fXvXtpoCyUYtOR0dENS0tLMzPvVcnh+vUrMTGxSqU7u+nn5x8QEHT1zz/QUwMGCttGvIGmbJxVAWV5zWcrwIyQJHnm7Ml335kOifn5ufBXLpObDlMonBAzT6G8yumlpSU3/7rRuWsr88SC/Dz01FAcxxCvW54JyL1q9bLTZ05IpVLGknTshpi5gC7wV6VWmQ4rL2fsu6end5XTPb28mzRpNmL4W+aJSjd39NQQVePKOZi6JbfSTQkG5Ny50xqN+oX2HZ2cmFIcGRkNhR0MRcOYWPawpKRrYMTr1fOpcnpkRP2fjvwQ17SFaQbA3bspQUE2rFpFI2yrSmRXI75jx4SrVy9dvPg7lHQ2xc3VrVvCy1u2fn369InikmLw+fbt39GnzyBW08DAYFD/0h/nCwryIRGeiTVrV4I5Aov/5bpVI0f1S0mtQ1E3uZXbjnICBuTBw/t6gx5Ktylx3NgpsLlo8aw3+iRu/fabgQNGDBwwnN31j1d6Q0tw2vRx
d1Juw41Z/9UOhVwx5u3BQ4e/cfnKxWlT54KD+NRfzvnQMIdzBHf/JyMvWztwZgTCh+TLxb/tf/jOJ1xNExQ6YHlFkJtXOJSbhGYOiVkHLNdD2RzKzfQ/4BZZl+ZYcC5bldBEw21smOB4NFuw3ZUgGM8Y08EzGztg6wIUxsML3Pb2YIkwJZNXBNtdCYznd+MIje+kNYHqCHLzCodySyW0RIaZJ0iS3L6byGHeCqWY0mPmmxTkqEkuA/JwKHeX/j4aNWat+LQbxR71uHyjA3EGDDD6BEt3LL+DMCHtZmFJAdVvShjiDM4X2Di688GtSyWN2yvjOtZDdZXc+6oLh3Nz0jVjV3C43APiZ/mYQ5sy7t5QG3TQR1jJp62yag5DteVkiGpNU8JSY/VpsrJ2LruwlbOSHDY3HHEMf8s2arXaohwDIkiz72b+Q2YqEHA/yEpL6VQRiKjYJpYs+aBv376RUfUrEqmKrEynsEfSptNo2vSNVfIEV8TLj4+VkRCffrdUKq0XiBxFTlGKmzdRL4AnmRwFrs0cTIP6CXLziiA3rwhy8wqucmMa/1Yo3bwiyM0rgty8grHcgu3mD6F08wdoXSXgBS7gKjeORRvhKzeOhhsJpZtnBLl5RZCbV7C8aOgwEeTmD6F084ogN68IcvMKpp3dSCjdPINrF1V4OOdTcLgAS7lpmr579y7CECFOPK8IcvOKIDevCHLziiA3rwhy84ogN68IcvOKIDevCHLziiA3rwhy8wpmwWlZSJKkKAq7ZdwQpnIjbAs4rnLDaE71YAB1H1yHFzAt3QReFjAxMVEkEoHhzs/Pl8lkBoNBq9U2bdp0w4YNCAcwK91QSebk5LCfNRoN/PXw8BgzZgzCBMxsd4cOHaBom6dERka2a9cOYQJmco8YMSIgIMC06ezsPHDgQIQPmMkNWickJJg2Q0NDO3XqhPABP0dw6NChISFMsBAnJ6cBAwYgrMBPbk9Pz+7du0OdGRQU1LNnT4QVHDqCpw8+TP9LXZirM+gq1nF5Ygg05mqeYr1yC2vzWIR+2sWhCXY9ZjGhcBF5+Uvbv+rt5S9DHMCJ3Js+SCkppGgDkjqLZW5SFw+FzEnCvCtW6cczi+aAuETFBbCr61SsplPlMPif2ZHMRoXkNcZhNi41b+2Aykv2ULRGo1UVa8oLtLpynV5HSeUotq3bC6/5IIfiYLl3fpL+MF1LSomAGG+lnwvClntXHpTklYvFRM/hvsENHPZDHCa3Tq3775x0kYSIeSkUPStkXs8pzC4NiJT3GhuEHIFj5C54qN32YbpnqJt/tBd65rj5a5qTi2jonDBUaxwgd26WevuKjMbdsJyS+pQkHUvzDZX1HlfbtctqK3d+rmrb4szGic+y1iy3Tt2VydCwuZGoFtTW7/52SaZfQ0/0HBD9QlhpIX1oYxaqBbWSe9MHqXI3sXewEj0fxCaEJ18uN+gMyF7slzv5clFJgSEyPhg9T8iVsk2L05G92C/3iX35CndOml51maj4gLIiQ/bdcmQXdspdVKApLzZEtApAdZXlqwfsObgMcQA0lX/e+hDZhZ1yH92RI5HhOqxcS+qFuRfl2jlMaqdkOekaudtzZ0lYPAJd4W/SuUJkO3aOVWpVtE8DZ8QNBoP+x5+/SLp1qrDwfnhoXPv4fzVq8AKkZz+4s3LNwHfHfH30xMZrSb8q3XyaNen2crdx7EJJ9x+mbN+z8EFOalREy4SOIxGXECS6fam0YRsbQqKz2FO6i/K08NfdxxVxw77vV5w8822H+H/NmrK/SWyXTdtnXL12FNLFxigUu75b2rxp9w/n/zawz/u/ntp65frPiHmxVffVponuSp/p7+54JXH88d+2lJTkIs6QyMRFefbMcrFH7uwUFeIMnU5z4fIPXV4c1q5Nb2cnZXzL10DcI8fXmw6Ii+0S17irWCyJDG/h5RGYkXkTEv+8cayw6MFrPSd5uPv5+UT0enWqSl2COEMsIzUqe1rj9sit1XA4
NeVeVpJer42OijelRIa1yH6QXFZexG4GBTQ07ZLLXVlZc/PuSSVyTw9/Nt3N1dtd6Ys4gyBFtF2h0eyx3STBYYhYtaoU/n721egq6SWleaSIuVqCsFBEylXFUpmTeYpELEecQVM1D2xYxR653f0k3BVvNzdv+NvnnzO9PSu1Vz2UfsXWzbGTwk2jqdT0UGvKEGdQOoNUgezAHrkDIxmfRFWqUbg43hes5xUikTDZgoPBppSU5kO3pQwKr3Vr7OHur9Opweb4+0bBZmb2reKSHMQZ0G3i4mPPCh92+t3gIxTcK0YcALImdv73kWPrU9Iu6/Ra8EnWbXhn7/dPaB/GNnxJLJbu2r9Uq1UXFeds2TnHyYnDjjODnoIhHmQ7dvrd7t6Sknyu/JPOLw4J8I8+dnLT7Tvn5XKXsOAm//rnrJpPUchd3hz88Q8/rZmzuAvUmeALXrp6mKP6RVuupQx0257eyHbsHF648Xvh8V25jbo++6MK1Um9mEVrdSMXRiDbsdOYNIp3J8VE5k0O7WOdpbxQE9veDdmF/ROO6zd3uXmxNDDGaiSzOYu7WkynKAM4c4QVX3LGxD0uzjY3jq2xfvPk1PQrFneBMwPuo8VdH8z+BVkhMykXugzie9hjSVAtxyq/mJHs4u0cFGt57kt+gT3jTJ4ejuzULS7O1Ru0FndpNCqZTGHrNVz/JbVtD4+WCXZOOKiV3IU52i1L05/tMXhzbp/JkMvpIbPCkL3Uqs/avZ40uoVL0rG76Dkg+1aOQaOvjdao9iPxiYP9/EJl146komea9Bv389NL3/qoVrMekKNmUZ3+Pu/yicJGncPQs0ja5fsluarxK6NQrXHYHMFDG7OTL5e5BzgFNeawK45/kn69S4qI0Uvs8bKr48gZsA8yynd9nAUdk96hbn71sZ8smHw2Q12sC4yS9hoXghyE4+d3H96cceeKmjIgsVyk9HfxiXDHKApI0YOSgqwydbFGr6XcvMT9JwRIXRwZgZSrtxcu/JL/5zAhWkwAAACqSURBVG+FZUXG6MIiJDK+H0CbTz96NE3+0Zx3s1cNiEdb9ONp7xXxbFHlRON8fMtX8PjER8eYpu2jqvlQIkRQzCsW0JculYv8wuT/+Dcnczr4eGv4r0vFBQ80OjXTK18Fk141vmlgHaNEBGG6Icg4yEJXe0sCmb2uYBTV/JUJ+J9M5OYq8g+T+IRy+woAZi9p4w6uSxBgiiA3rwhy84ogN68IcvOKIDev/B8AAP//XGrUXwAAAAZJREFUAwCzr5jr5cGrpQAAAABJRU5ErkJggg==",
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Image, display\n",
    "arch = SelfConsistency(llm=llm, n_samples=7, sample_temperature=0.9)\n",
    "graph = arch.build()\n",
    "try:\n",
    "    display(Image(graph.get_graph().draw_mermaid_png()))\n",
    "except Exception as e:\n",
    "    print(f\"(mermaid PNG render unavailable: {e}; see § 2 for the architecture diagram)\")\n",
    "    print(graph.get_graph().draw_mermaid())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e1424b4",
   "metadata": {
    "papermill": {
     "duration": 0.010069,
     "end_time": "2026-05-28T02:14:56.500645+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.490576+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 8 · Live run — a perspective-taking trick problem\n",
    "\n",
    "The Sally-siblings problem is famous for tripping LLMs in CoT: easy to misread \"each brother has 2 sisters\" as \"the answer is 2\", when actually Sally herself is one of those 2 sisters → she has only **1** sister.\n",
    "\n",
    "We sample 7 independent reasoning paths; some are expected to slip on the trap, but the modal answer should be the correct one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "da6461a6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:14:56.508432Z",
     "iopub.status.busy": "2026-05-28T02:14:56.508432Z",
     "iopub.status.idle": "2026-05-28T02:17:46.101869Z",
     "shell.execute_reply": "2026-05-28T02:17:46.101869Z"
    },
    "papermill": {
     "duration": 169.595446,
     "end_time": "2026-05-28T02:17:46.101869+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:14:56.506423+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "FINAL_ANSWER: '1'\n",
      "EXPECTED: '1'\n",
      "MATCH: True\n",
      "\n",
      "N_SAMPLES: 7\n",
      "UNIQUE_ANSWERS: 1\n",
      "WINNER_COUNT: 7/7\n",
      "AGREEMENT_FRACTION: 1.00\n",
      "TALLY: {'1': 7}\n"
     ]
    }
   ],
   "source": [
    "TASK = (\n",
    "    \"Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. \"\n",
    "    \"How many sisters does Sally have? Return only the integer answer.\"\n",
    ")\n",
    "EXPECTED = \"1\"  # Sally is one of the 2 sisters; she has 1 sister besides herself.\n",
    "\n",
    "r = arch.run(TASK)\n",
    "\n",
    "print(f\"FINAL_ANSWER: {r.output!r}\")\n",
    "print(f\"EXPECTED: {EXPECTED!r}\")\n",
    "print(f\"MATCH: {r.output.strip() == EXPECTED}\")\n",
    "print()\n",
    "print(f\"N_SAMPLES: {r.metadata['n_samples']}\")\n",
    "print(f\"UNIQUE_ANSWERS: {r.metadata['unique_answers']}\")\n",
    "print(f\"WINNER_COUNT: {r.metadata['winner_count']}/{r.metadata['n_samples']}\")\n",
    "print(f\"AGREEMENT_FRACTION: {r.metadata['agreement_fraction']:.2f}\")\n",
    "print(f\"TALLY: {r.metadata['tally']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4e1a16f",
   "metadata": {
    "papermill": {
     "duration": 0.002004,
     "end_time": "2026-05-28T02:17:46.109714+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:17:46.107710+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### 8.1 · Inspect every sample's reasoning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3606f5f7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:17:46.117668Z",
     "iopub.status.busy": "2026-05-28T02:17:46.117668Z",
     "iopub.status.idle": "2026-05-28T02:17:46.134721Z",
     "shell.execute_reply": "2026-05-28T02:17:46.133542Z"
    },
    "papermill": {
     "duration": 0.021559,
     "end_time": "2026-05-28T02:17:46.135292+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:17:46.113733+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- sample 0 -- answer='1' ---\n",
      "Let's break down the information given in the problem. Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. Since each brother has 2 sisters, and Sally is also a sister to her brothers, Sally must be one of the 2 sisters that each brother has. Therefore, there must be another sister besides Sally. So, Sally has 1 sister.\n",
      "\n",
      "--- sample 1 -- answer='1' ---\n",
      "Let's break down the information given in the problem. Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. Since each brother has the same sisters, and Sally is a sister to each of her brothers, Sally must be one of the two sisters that each brother has. Therefore, if Sally is one of the sisters, then there must be another sister besides Sally. Thus, Sally has 1 sister.\n",
      "\n",
      "--- sample 2 -- answer='1' ---\n",
      "Let's break down the information given in the problem. Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. To find out how many sisters Sally has, we need to consider the perspective of her brothers. Since each brother has 2 sisters, and one of those sisters is Sally herself, the other sister must be another sibling. Therefore, Sally has 1 sister.\n",
      "\n",
      "--- sample 3 -- answer='1' ---\n",
      "Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. Since each brother has 2 sisters, and one of them is Sally, the other sister must also be Sally’s sister. Therefore, Sally has 1 sister.\n",
      "\n",
      "--- sample 4 -- answer='1' ---\n",
      "Let's start by analyzing the information given. Sally is a girl with 3 brothers. Each of her brothers has 2 sisters. Since each brother has 2 sisters, and Sally is one of them, the other sister must be another girl. Therefore, Sally has 1 sister.\n",
      "\n",
      "--- sample 5 -- answer='1' ---\n",
      "Let's break down the information given in the problem. Sally is a girl with 3 brothers. Since each of her brothers has 2 sisters, and Sally is one of the sisters, the other sister must also be Sally's sister. Thus, Sally has 1 sister.\n",
      "\n",
      "--- sample 6 -- answer='1' ---\n",
      "Let's break down the information given in the problem step by step. First, we know Sally is a girl with 3 brothers. The problem also states that each of her brothers has 2 sisters. Since Sally is a sister to each of her brothers, she must be one of the two sisters each brother has. The question asks how many sisters Sally has. Given that each brother has 2 sisters and Sally is one of them, the oth\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for s in r.metadata['samples']:\n",
    "    print(f\"--- sample {s['sample_index']} -- answer={s['answer']!r} ---\")\n",
    "    print(s['reasoning'][:400])\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95109882",
   "metadata": {
    "papermill": {
     "duration": 0.003275,
     "end_time": "2026-05-28T02:17:46.141820+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:17:46.138545+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 9 · What we just observed\n",
    "\n",
    "The cells above ran Self-Consistency on a perspective-taking trick (the Sally-siblings problem) where some chain-of-thought paths are expected to slip but the modal answer should be correct.\n",
    "\n",
    "### 9.1 · Vote tally + winner\n",
    "\n",
    "- **Winner**: `1` — **✅ Correct** (matches expected `1`)\n",
    "- **Agreement**: 7/7 samples = 100%\n",
    "- **Unique answers across samples**: 1\n",
    "\n",
    "| Answer | Count | Share |\n",
    "|---|---|---|\n",
    "| `1` | 7 | 100% |\n",
    "\n",
    "### 9.2 · Per-sample answers\n",
    "\n",
    "| Sample | Answer |\n",
    "|---|---|\n",
    "| 0 | `1` |\n",
    "| 1 | `1` |\n",
    "| 2 | `1` |\n",
    "| 3 | `1` |\n",
    "| 4 | `1` |\n",
    "| 5 | `1` |\n",
    "| 6 | `1` |\n",
    "\n",
    "### 9.3 · Self-Consistency vs single-sample CoT\n",
    "\n",
    "| Strategy | Correct trials | Error rate |\n",
    "|---|---|---|\n",
    "| **Self-Consistency (modal of N=7)** | 1/1 | 0% |\n",
    "| **Single-sample baseline** | 6/7 | 14% |\n",
    "\n",
    "Single-sample tally over 7 independent runs: `1` ×6, `2` ×1.\n",
    "\n",
    "### 9.4 · Patterns surfaced in this run\n",
    "\n",
    "- **✅ Strong agreement on the right answer** (7/7). If you'd run a single-sample CoT you'd almost certainly have landed on the same answer. Self-Consistency added little lift here — it would matter more on harder tasks.\n",
    "\n",
    "- **🟰 All samples agreed**. Either the task is easy or temperature was too low. If correct, you saved 0 by using Self-Consistency. If wrong, you wasted N× cost on identical wrong answers.\n",
    "\n",
    "- **✅ Self-Consistency outperformed single-sample baseline** on this run: single-sample was right 6/7 = 86% of the time; modal vote got it right.\n",
    "\n",
    "### 9.5 · The takeaway\n",
    "\n",
    "Self-Consistency is the simplest deterministic-picker architecture in the catalogue: every sample is the same model, every vote is one ballot, the picker is `Counter.most_common(1)`. Its lift comes from one place — when *some* paths are wrong but the *modal* path is right. Read the § 9.1 tally:\n",
    "\n",
    "- **All votes identical** → architecture spent N× cost for nothing.\n",
    "- **Modal answer wins with a clear majority** → the lift you paid for.\n",
    "- **Modal answer wins with a thin plurality (<50%)** → treat as low confidence; the architecture is *honest* about uncertainty but the answer might still be wrong.\n",
    "\n",
    "The single-sample comparison in § 9.3 makes the lift concrete by re-running the same task one-shot N times — the gap between single-shot accuracy and Self-Consistency accuracy is the architecture's value on this task."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8cdb396",
   "metadata": {
    "papermill": {
     "duration": 0.000539,
     "end_time": "2026-05-28T02:17:46.145554+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:17:46.145015+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 10 · Contrast — single sample (no Self-Consistency)\n",
    "\n",
    "What would the answer have been if we'd only drawn **one** path? Run 7 independent single-sample queries and see how often a lone draw lands on the trap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e29bbc99",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-28T02:17:46.154494Z",
     "iopub.status.busy": "2026-05-28T02:17:46.154494Z",
     "iopub.status.idle": "2026-05-28T02:18:37.742808Z",
     "shell.execute_reply": "2026-05-28T02:18:37.742808Z"
    },
    "papermill": {
     "duration": 51.594285,
     "end_time": "2026-05-28T02:18:37.743852+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:17:46.149567+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SINGLE_SAMPLE_TALLY: {'1': 6, '2': 1}\n",
      "SINGLE_SAMPLE_CORRECT: 6/7 trials landed on '1'\n",
      "SINGLE_SAMPLE_ERROR_RATE: 0.14\n",
      "\n",
      "(Self-Consistency winner was '1' with 7/7 agreement.)\n"
     ]
    }
   ],
   "source": [
    "import collections\n",
    "single_sample_results = []\n",
    "single_arch = SelfConsistency(llm=llm, n_samples=1, sample_temperature=0.9)\n",
    "for trial in range(7):\n",
    "    r1 = single_arch.run(TASK)\n",
    "    single_sample_results.append(r1.output.strip())\n",
    "\n",
    "single_tally = collections.Counter(single_sample_results)\n",
    "n_trials = len(single_sample_results)\n",
    "correct = single_tally.get(EXPECTED, 0)\n",
    "print(f\"SINGLE_SAMPLE_TALLY: {dict(single_tally)}\")\n",
    "print(f\"SINGLE_SAMPLE_CORRECT: {correct}/{n_trials} trials landed on {EXPECTED!r}\")\n",
    "print(f\"SINGLE_SAMPLE_ERROR_RATE: {(n_trials - correct) / n_trials:.2f}\")\n",
    "print()\n",
    "print(f\"(Self-Consistency winner was {r.output!r} with \"\n",
    "      f\"{r.metadata['winner_count']}/{r.metadata['n_samples']} agreement.)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fb4670f",
   "metadata": {
    "papermill": {
     "duration": 0.003008,
     "end_time": "2026-05-28T02:18:37.749860+00:00",
     "exception": false,
     "start_time": "2026-05-28T02:18:37.746852+00:00",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## 11 · Failure modes, safety, extensions\n",
    "\n",
    "### 11.1 · Where this breaks\n",
    "\n",
    "| Failure | Mechanism | Mitigation |\n",
    "|---|---|---|\n",
    "| **All N samples agree on a wrong answer** | Systematic model bias | Self-Consistency can't fix this; pair with CoVe (nb 20) or RAG |\n",
    "| **High disagreement, no clear winner** | Task too hard or sampling temperature too high | Surface `agreement_fraction` to the caller; treat low-agreement as uncertain |\n",
    "| **Answers don't normalise cleanly** | \"yes\" vs \"Yes.\" vs \"Yes!\" all count separately | Robust normalisation (we lowercase, strip, drop trailing period); for numeric answers, parse as int/float and compare |\n",
    "| **Cost scales linearly with N** | N=10 means 10× single-call cost | Use small N (5-7) and prefer cheaper models; batch via `asyncio.gather` |\n",
    "| **Sample failure silently drops a vote** | Structured-output parse failure | Architecture logs sample errors; consider a re-sample policy in production |\n",
    "\n",
    "### 11.2 · Production safety\n",
    "\n",
    "- **Always surface `agreement_fraction`.** A 4/7 winner is much weaker signal than 7/7. Downstream code should branch on confidence.\n",
    "- **Cap N.** Diminishing returns past 7-10 samples; cost is real.\n",
    "- **Audit the tally, not just the winner.** Whether the runner-up has 1 vote or 3 votes is meaningful information about task difficulty.\n",
    "\n",
    "### 11.3 · Three extensions\n",
    "\n",
    "1. **Weighted majority via verifier.** Use CoVe (nb 20) to verify each candidate answer; weight by verification confidence. Combines Self-Consistency + CoVe.\n",
    "2. **Adaptive N.** Start with N=3; if all agree, stop. Else draw more until either confidence threshold met or budget exhausted. Cuts average cost.\n",
    "3. **Reasoning-quality reranking.** Score each sample's *reasoning* (separately from its answer); break ties by reasoning quality.\n",
    "\n",
    "### 11.4 · What to read next\n",
    "\n",
    "- [**13 · Ensemble**](./13_ensemble.ipynb) — N *different* specialists with majority vote (categorical answers) — the architectural sibling.\n",
    "- [**20 · CoVe**](./20_chain_of_verification.ipynb) — atomic verification questions (composable with Self-Consistency via § 11.3 #1).\n",
    "- [**22 · LATS**](./22_lats.ipynb) — MCTS-style search with explicit value estimates (the principled \"search over paths\" generalisation).\n",
    "\n",
    "### 11.5 · References\n",
    "\n",
    "1. Wang, X. et al. *Self-Consistency Improves Chain of Thought Reasoning in Language Models.* ICLR 2023. [arXiv:2203.11171](https://arxiv.org/abs/2203.11171)\n",
    "2. Madaan, A. et al. *Self-Refine.* NeurIPS 2023. [arXiv:2303.17651](https://arxiv.org/abs/2303.17651) — sister sample-and-iterate strategy.\n",
    "3. Wei, J. et al. *Chain-of-Thought Prompting.* NeurIPS 2022. [arXiv:2201.11903](https://arxiv.org/abs/2201.11903) — the baseline single-path CoT this architecture extends."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 227.138691,
   "end_time": "2026-05-28T02:18:38.428331+00:00",
   "environment_variables": {},
   "exception": null,
   "input_path": "all-agentic-architectures/notebooks/21_self_consistency.ipynb",
   "output_path": "all-agentic-architectures/notebooks/21_self_consistency.ipynb",
   "parameters": {},
   "start_time": "2026-05-28T02:14:51.289640+00:00",
   "version": "2.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}