Provider: nebius · Model: meta-llama/Llama-3.3-70B-Instruct ─────────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mProvider: nebius · Model: meta-llama/Llama-\u001b[0m\u001b[1;36m3.3\u001b[0m\u001b[1;36m-70B-Instruct\u001b[0m \u001b[92m─────────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from agentic_architectures import get_llm, enable_langsmith, settings\n", "from agentic_architectures.architectures import PEV\n", "from agentic_architectures.ui import print_md, print_header, print_step\n", "\n", "enable_langsmith()\n", "print_header(f\"Provider: {settings.llm_provider} · Model: {settings.llm_model}\")" ] }, { "cell_type": "markdown", "id": "a9ca6d05", "metadata": { "papermill": { "duration": 0.0, "end_time": "2026-05-27T07:36:28.318915+00:00", "exception": false, "start_time": "2026-05-27T07:36:28.318915+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 5 · Library walkthrough\n", "\n", "Source: [`src/agentic_architectures/architectures/pev.py`](../src/agentic_architectures/architectures/pev.py).\n", "\n", "Five key pieces:\n", "\n", "1. **`_plan`** — same as Planning: `with_structured_output(Plan)` produces 3-6 atomic steps. The prompt explicitly demands \"verifiable steps\" — each step must produce a concrete fact / value / artifact.\n", "2. **`_execute`** — pops the next step from the plan (or reuses pending_step on retry). On retry, includes the previous critique in the executor prompt.\n", "3. **`_verify`** — uses `LLMJudge[_StepVerification]` with a rubric: *\"contains the specific fact/value the step asks for AND is grounded (URL or computation shown)\"*. The verdict drives the router.\n", "4. **`_finalize`** — synthesises the final answer using the verified history; explicitly told to hedge on `fail-accepted` steps.\n", "5. 
**`_route_after_verify`** — pass + more steps → execute, fail + retries left → execute (retry), pass + done OR budget gone → finalize.\n", "\n", "The Verifier rubric & schema:" ] }, { "cell_type": "code", "execution_count": 2, "id": "9fa408e6", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T07:36:28.335687Z", "iopub.status.busy": "2026-05-27T07:36:28.335687Z", "iopub.status.idle": "2026-05-27T07:36:28.348279Z", "shell.execute_reply": "2026-05-27T07:36:28.347547Z" }, "papermill": { "duration": 0.018475, "end_time": "2026-05-27T07:36:28.348279+00:00", "exception": false, "start_time": "2026-05-27T07:36:28.329804+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"description\": \"Verifier's verdict on a single executed plan step.\",\n", " \"properties\": {\n", " \"is_satisfactory\": {\n", " \"description\": \"True iff the step's result fully addresses the step's intent \\u2014 concretely, contains the requested fact / computation / artifact and is grounded in evidence (cited URL or computation shown).\",\n", " \"title\": \"Is Satisfactory\",\n", " \"type\": \"boolean\"\n", " },\n", " \"issues\": {\n", " \"anyOf\": [\n", " {\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"type\": \"null\"\n", " }\n", " ],\n", " \"default\": null,\n", " \"description\": \"If is_satisfactory i...\n" ] } ], "source": [ "from agentic_architectures.architectures.pev import _StepVerification\n", "import json\n", "print(json.dumps(_StepVerification.model_json_schema(), indent=2)[:600] + '...')" ] }, { "cell_type": "markdown", "id": "94336b43", "metadata": { "papermill": { "duration": 0.003937, "end_time": "2026-05-27T07:36:28.360874+00:00", "exception": false, "start_time": "2026-05-27T07:36:28.356937+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 6 · State\n", "\n", "The state has a per-step *scratchpad* (the `pending_*` fields) that the Verifier writes to and `_execute` reads from 
on retry. When a step is accepted (pass or fail-accepted), the scratchpad is cleared and the result is appended to `past_steps`.\n", "\n", "| Field | Purpose | Reducer |\n", "|---|---|---|\n", "| `input` | original task | replace |\n", "| `plan` | remaining steps (popped on first try of each step) | replace |\n", "| `past_steps` | committed step records `{step, result, verdict, attempts, confidence}` | **append** |\n", "| `pending_step` / `pending_result` / `pending_critique` / `attempts` | per-step scratchpad | replace |\n", "| `response` | final synthesised answer | replace |" ] }, { "cell_type": "markdown", "id": "e98a4b37", "metadata": { "papermill": { "duration": 0.003267, "end_time": "2026-05-27T07:36:28.368153+00:00", "exception": false, "start_time": "2026-05-27T07:36:28.364886+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 7 · Build the graph\n", "\n", "Compared to Planning, PEV adds one node (`verify`) and one new conditional path (`verify` → `execute` on retry). The compiled-PNG render should show four nodes (`plan`, `execute`, `verify`, `finalize`) between `__start__` and `__end__`, with the cycle `execute → verify → execute`." 
] }, { "cell_type": "code", "execution_count": 3, "id": "79364dc2", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T07:36:28.384540Z", "iopub.status.busy": "2026-05-27T07:36:28.384540Z", "iopub.status.idle": "2026-05-27T07:36:31.393921Z", "shell.execute_reply": "2026-05-27T07:36:31.393258Z" }, "papermill": { "duration": 3.020783, "end_time": "2026-05-27T07:36:31.396157+00:00", "exception": false, "start_time": "2026-05-27T07:36:28.375374+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAGoAAAITCAIAAABg8R7gAAAQAElEQVR4nOydCVwU5f/Hn5k9uO9DEVAEvPAAFQ01JQXSvDUrzaxMy/v4e5RXFmqXpl1qZqmlpf7UNK3UMs2LMvG+D0AQOUTkZhf2mPl/dweWBZfZ2Z2FfWDn/fKFs888c+xnn+f73M9XTNM0EjAXMRLggSAfLwT5eCHIxwtBPl4I8vGCr3ypN8rvXigqfKxQKalyGYVq1IJIGlEEQSKaqhbMhFSFkwjpRyCQ5j7aayuOIUyEaHVlYPX7VFxeearqtpXXVh1okTqJRCTt4ilt3saxfQ8XxAPCvHrfhaOFV/8pKC1SwWtJ7EmJmJA6iNRqilZXv5sIITUyIJ+IgJgESdAUrYumd1rzbZk4iCCQ9g0rPooIpPeIijswl+t+g6oDAjH3ry6fnb24vFytUtEKuZpS0w6OkqD2Tn1f8kamY7J8F48Wnjv2WK1CvgH2kbFezdvZoYZMSR59an/Og2SZWkG17Ojcf1wTky43Tb6ty1NlJVT7KPfeIzxR4+Lm2dJ/fsuBxPjmimBEcL3KBPnWz0v2DbQfNcsfNV6O7869fqaw12DviL5uXOJzlW/tnKR+LzQN6+GMbID185LGLmzp5iUyGpOTfOvmJr35QajUHtkO3yxI6RrrGRnrzh6NRMbY8HZKv5f8bEo7YNLHwYl/5hU8NJK2jMj3w7I0n0C7dt2dkO3Rvb/Xzs/uscdhkw8qd/JS9fMzGnNZwULXGDd7R3L3Fw9Y4rDJd/bI4/ZPcSqAGisvzm6Rc7+MJUKt8l04VkSr6N4jvZAN4+hKOLmJf1mXWVuEWuW7mlDg28IB1S9xcXEZGRnIRJKTkwcPHozqhk69PLLS5LWdrVW+kkIltMlQPZKVlZWfn49M58aNG6jO6BLjBg329NuGFTTc43L3YilBEC3qpj0LNc0dO3b89ttvaWlpLVu2jIqKmjJlysWLFydPngxnhw0bFh0dvXr1akhTe/bsSUxMzMzMDA4OHj58+KhRo5g7xMTETJw48dixY3DVuHHjtm3bBoGRkZH/93//N3bsWGRp7J1EV04XBbYxkBcNy5d6vVRqx7nhZyI7d+7cvHnz7Nmze/Xqdfz48XXr1jk5OY0fP/7zzz+HwP379/v7a8p6UBCEW7x4MfyQqampn3zyiZ+fH1wCpyQSyb59+7p37w4idu3aFSL8+eef8HugusHZTZyfU27wlGH5Ch8r7RyN16jN48KFC2FhYYy1GjFiRLdu3WQy2ZPRPvroo9LS0mbNmiFtyjpw4MA///zDyAd6ubm5zZs3D9ULrt6SjCRTMq+iXC2R1pV84eHhX3311bJlyzp37tynT5+AgACD0SCPQzpNSEiAPM6EMKmSAX4AVF84uIhUSrXBU4blo2iapOsq87788suQW0+cOBEfHy8Wi6G0nTlzpo+P
T7UXoKhZs2YpFIrp06dD0nNxcZkwYYJ+BKlUiuoLUpveDZ4yLJ9UIiqXG9bbAm9DkiO0pKSknD17duPGjSUlJZ999pl+nFu3bl2/fn39+vVg4JiQ4uJiX19fZA1kxRRZi3yGcyiMAyjKKVQ3gI2HUhUOoDwdPXr0mDFjbt++XSNOQUEB/NXplaIFWYmix0qxg2GhDIcGtHYoK62r1Hf48OH58+efPHmysLDw9OnTUP8AawjhQUFB8PfIkSPXrl0DZSFfQ42kqKgIit1Vq1ZB/QYqhgZv2Lx589zcXCjEdVbSshTmKTw8DNsKw/J17OkC3YD52UpUByxZsgTUmTNnDlTfli9fDrU8qJ1AOJQhQ4YM2bBhAxQsTZs2XbFixdWrV/v16we1uWnTpkGlD2TVVf30efrppyMiIqAg/uOPP1AdUC5Tt+1ueECu1u7STUvv+fjbDZ3UDNk2N88WH9v1cNqnoQbP1lo7CY1weXC31rae7XDuSJ6Hb62tr1qHyaNHel9LKLj4d2HnWgZNsrOzwfAbPOXs7AyFqcFTkG2hyYHqhu+1GDwFNY/a8hnUjQzaBIaCXMWkD0NrO8s21nF056Pky8VvfRRs8KxKpcrJyTF4qqyszN7ecO8+FAh1V/8o1mLwFBRBrq6uBk9BOPzeBk9t//i+Wo3GLW6OasHIUNHGxSkt2jqZOnjcOHhwu+zAtxlTPw1hiWOkZfbWB8FJl0vKi2xxAu9vm7N6DfFhj2O8YRs3psmWD6xWZbUWW95PDWztGB7tyh6N0zhvXrZy+8q06WtCkW2wYUFKn+E+YVHGJ19xnWVw77rs902Z4X08eg9vzKMf92/Kf/8+s2WY04DXmnKJb9oUoY2L70mlZNwrTf1DGvbEKoNsX5lemKvoPcynQy9XjpeYPEHt4ObstJul9o6i0Ajn3iPMmROHG5dPFF9NyIeGrbef/UtzA0y61szpkSDigySZSkmLxYSzu8TOQTM9UjsBtOpuIjGhVlV81MyQRJWTFaH2JyVUispjCQH3qTwmVUqqMg6pUmiOSVIzx5TS3lkXQSQh1NqrJBJSqQ2RSAilkta/ISmCqxApRpRKc0PoAFZqbyiWiJTlVGmhSl6ihn45UkR4+dmNmuKPTO9CNFM+Bnke+vdI7uPM8qI8hUqlmR9L6U/9ZKbTMseEtvu4cuKcWAy17opTIjFS134M/abw9UhEVExDrYwgEtNqFaF/K/ghVdpfSyRBam1fh4hEaqpCRKSRj1YqNJeA0KSYsHcQeTSVdOrp4d/afEPES756oH///tu3b/fywrS8wn1mPTQNoZ2HcEWQjxeCfLzAXT6lUgmD4ghXsJYPil2kHZlDuIK1fJjnXCTIxxOsXw5zw4eE1McTQT5eCPLxQpCPF7jLJxQd5iOkPl4I8vFCkI8XUG0W5DMfIfXxQpCPF4J8vBDk44XQ48ILIfXxQiQSubjw2mOqrsF9qKiwsBBhDN5ZQyxW6SZzYIkgHy8E+XghyMcLQT5e4F5xEeQzHyH18UKQjxeCfLwQ5OOFIB8vBPl4IcjHC0E+Xgjy8QJ/+XBcVRQfH3/gwAHmxeAvoYUkycTERIQZOE5anzJlSlBQEKkFmr3wF+SrbaM164KjfL6+vnFxcfohIN+wYcMQfmC6ZGLs2LEtWrTQffT39x8+fDjCD0zlgwG2oUOH6hbEPPvss+7u7gg/8F2wM2bMGMbeNWvWbOTIkQhLLFzypt8qv3O+SCartvWaWEJSamaBVQWkiKDUNKQtqrpTIZ1bHsaLTkZGRlJykp+fX+tWrRGqWtgMZQmlXR5NiklKswy7mj8fJrDq5pXY24ubhDh26mlJnxmWlG/L+2nlckosJZRl1V6cFGk389QLY9TRfWeagPdgTtR0zkTRlCYLM4GVLo30fEQ94d5JE41CavJJB0l2DqRCQYtEaPgUf58Ay+zdaTH5Ni681yzEKfoF6+yPyZ3r/xZfOvZo1MwAb0soaBn5vluSGtzRrdsA
D9QQUJShXZ+mTFkVjHhjgaLj3OECMEQNRTtAao9cPKQ/f5WFeGMB+VJvlzq6GHfrgxXeAXYFj8oQbyzQZSAvUWkcyjUoCDFtka2BLSAfpXEVWFe7FNcRNFSkVBYw+oKLT15YQj7SFjdHZLCEfFR1D5q2hI1mXm0PLOKPJeQjUAMrd7Wd2BZpbVlAPoKgG1i5azksIB8t2D4B87BE5tU0/Bqa9QOLg0vR0RBzLox/YlJ0gO2jbdX2YTfWMXxk7NZt36EGglB08EKQjxcW6TKgTG13DB4a/fKY8bdv3zh56piTk1PHjp0XLVzu4lxz8eTeff87c+bUzZvXpHZ24Z26TJgwzb+ZZugyftkCaHPFxjz38cr35XJZWFjHyW/NateuA+fnW6zRZgnbR5GmNjtEIvHuPT8NHjzy2F+JKz9ee/9+6ldrV9WIc/XqJQhs3z582bJPF7wTn5+f98GHS5hTYrH4+o0rR/46uOHrbYd+P20ntfvok/eQKViq0Wa1oiM0pHW3yChIA5B2hg0ddfz4EaWy2ugwhG/ZtGvsy+M7R0RCzBdfeAWSYWFRxfpUuUw2f97SZn7+IGVMvwHp6WkGHV3WNVazfaGhbXTH/s0CQbvMzActWrTUBYpEIghZt371zVvXSktLmcCC/Dw3V43vpMDmQY6OjkygszbXFxcX6ULqDUukPpI2o9FhZ1flzMjeQeP6trS0mneohIQTi9+d06ZN2OdrvtXk8U/WVnsmHvvBWqLRZlaPlb5YZXKNRzN7+2r+g387uK9jx4iJE6YxH0tKipHl0BQcFkk5iD80RZne6rh8+bzu+G7SbTBh/v6B+hGKigp9vKvmLJw6dQxZDk3BYYleNgvIR9MkYfqrPMrNgcJXrVZDsfvb73v79n3Wzq6a5wIoWxLPnbl46ZxKpYKYTGD2QwuMbVsQqxUdgweNuH79yvqvNX5lu3TuNmP6/BoR3nhjqkxWuuTdOXK5fOSI0VB3ycrKWLBw5uJFKxA2WGCOyw/xadBl8PzsIO6XDBsR8/zIMa+Om4isRMKB7ORLJdNWhyJ+2G5/H7JEs8MSHVYNscMP8hyNySwDTblh2qvs33cUNQosM1DZ8DKvhbDIQCVqcNlX29+CR+alqYZn+7T1DTyKDlvGQrbPRk2fReRrkDUXy2CpkreBQSDL9NZbKPU1NGhkmd56oejghSAfLywgn9SRbHD5VyKV2DlYYC2KBbpLXd0l5RZYYVKvFOUqpQ4W+O4WuMWAMU1lJQrUoMjNlId24upBmwULyCdyRgHBTjtXpqIGwr616VJ7sucQCyzCs9iC1IvHCxP/yGvSwqF5a2d19WEY/dY5UVnLJqqfojW/ZM2JbprPRM2bsLf1icqbPxkHKnq598sykks9m9kNn+yHLIEll0NfPlF88URemYxSlqurPYOo2TVZpUXlKabTSxeN0ItX4/1q6EhUxmFuVTNQT02plJDYi1q0c44ZbTGP9Lg71x4wYMBPP/0kONc2E8G9MS8E+XiBubcnIfXxAmv5oFijKEokwnelv+AthheCfLwQXD3xQkh9vBDk44UgHy8E28cLIfXxQpCPF4J8vBDk44UgHy8E+XghyMcLQT5eCNVmXgipjxeCfLzA3VuMj48Pwhis5VOr1Tk5OQhjBF9FvBDk44UgHy8E+XghyMcLQT5e4C4f1F0QxgipjxeCfLzAXb4a26rhhpD6eCHIxwtBPl4I8vFCkI8Xgny8wHFV0YwZM06fPq3ba4AkSYqi4OP58+cRZuDoYHbWrFkBAQFkJUirYPPmzRF+4ChfaGjo008/rZ8tIOlFR0cj/MDUvfErr7wSGFi1lSkcjxo1CuEHpvL5+/vHxMQwx2D4IiMjGU/RuIGvc+3Ro0cz3t3h70svvYSwxISKy70rcrlMSTGrl/WWMVeYKKacrLl2WbseXLeEnES1Lm5mPum7dCbs4npMPF52vGObDvIcn2s5RYYfUYMnd0arsbS88iNJIMrQraRSSeuu
DogbnCou+9ZlZafJQTIVqFexlFu3Tp5ikjBJEJXnDL9wte9eeTmh9TFI0OjJ9847AAAAEABJREFUCyuP6SfWluvLU1MtQhtS4+nIwG1rXdUvsSMpNXL1lL6y0Li5MC7fvvVZhTmKvqP8PAMt488bf6CH+8j3WcV55W+sCGKPaUS+7SsfwL2GT8XRbNc1Z37Nu3+7aMLyIJY4bEVHXhYqfFRum9oBUUM8IW2d/DmPJQ6bfOf+fGTnaNObXLl529+/U8ISgU2+kuIGtjeQxSEldLmcbaiPLXGpFBT8QzaMWqlWsY612HTe5I8gHy8E+XghyMcLNvlIMUHiuwdIvWBsd102+Wg1hfkGYXWOsa/PKh9NaB1nC9SKYPvYIQjWBCjIxw7kQLb8x1p0iBAevvisBkEa2ZqdTT7oNaRsus0GXd9Gyk42+aDrnLDt1GcUNnk0bmAaWuqLX7bg4KH9qL5obKnr9u0bqB5hk0+TeU2s9qlUqm82fjl+wouDhvR5Z+HMM2dOM+FHjhyMieuelHSH+Xjj5rW+MZEntV47a7sEaZcE7vzf1ucGPQ3/5s6bcvXqJSYcPkK4LtrKVcsmTX4FDuCeWdmZqz5dPmTYM8ypw3/8OnX66xAf/u75ebuprQCjRQdr5jW9xfHlVyvhLUcMf2n7T79G94l5L/7tEyc1Ltni4gZ27dJ99ZoV2tvScBAbM6BP734slwAbv/1q//7dy+I/XbLoAx+fJu8snHH/firL0w8fTIC/8+e9++v+43Dw19HDn6yMb92q7fYfD0ycMA2esnb9amQKRgVgTX0m/lgKheKPP397eczrQ4c87+bqNvC5YTH9Bmzd9i1zdu6cJfdSk8Ew/bJ/d17e41kzF0BgeXl5bZcUFhXu2v3j6NGvdYuM6tUret7cJZFdox7n5XJ/n4MHf+nUqfPsWQs8PDy7dO42/rXJv/yyKz8/z4SvZKzkZU19zKgpZ5JT7oKC3SJ76EIiwrumpCQxDsWbNGn6xvgpkKA2b17/ztvvOzs7Q+CdOzdruyT1XjJ8bNu2PRMuFouXxa/qHBHJ8WUoirp2/bL+nTt37gaBV65eRJzReuSpr1YH48B5xqwJNcLz8x4zDsVHjhj9/Q/fiEXiTh07G72EOWWv54PbJOBXUSqVmzavh3/V7mxK6tN65DG31QFFh0mtDi9PzV7wc+csruHn2de3KXMA9t7Pzx++1cZvv4Q8pbnE26e2SwoK8uFAJis1+lw1ZWA0x97e3tHR8dm4QX36xOiHN/MzYdxVk/oIc1MfVPpManWANIyHZ10Wg58afj/GYX1qasoPWzd++cUmlVI5c/ZE+GJhYR0D/JvXdkloaBvIsJevXGjXrgPSJoSFi2f3jY7r33+wVGonl8t0z01PTzP4PiEhrYtLinV3hp8tKyvD17cJ4o4x95usqYvUer3njIODw+uvTQLDDzUMyDtQgM57e+rnX3yMtJZoxYeLY2Oea9e2fceOETH9+n/48VKosoBMtV0CxjEudiCUvIcOH7h46dxXa1edP/8fIyXoDjFLSjQjsNt+3JSbW7HfAfwSPj6+5ypdcr85YXpCwnEorODpcP9lyxfOmTcZnsL9GxlttLGVrrvWpBc8Uo1Z0BKZQuK5M3v37bxw4ayTk3P7sE7z5r0Lhg++5K5d23766YCri8ZHC2TMseOGjXr+5fGvT67tEqQtl0HKI38dhApgaEhrKHl69OgN4RmZD1avXgEaQfJ86cVxcBau/WbDj3Bq/4E9W77foFIpd2z/zcXZBRLmT9u3/HvmVFmZHO781lsz27YJ4/5dDm1Jz3+omvRRrQpYXr7GhFH5WDusJMjWxzqMtbtYO6yUiMJ6MXI9wGOoyJZdn1bAp7/Pll2fMhh1fG2su9S2U5826Znb6tB0l9p26uM1TC7YPkLjx9fsLgObt33aVoe5mVfAKOzjvCRp6/LysH2UmqKwXoxcD/CYIiRgFEE+XrDJJ7YjpPY2Pc1Aaiexk7L1
GLOp4+opxXv/sjqnXKaWOrFJxHYu9nlfhcymy46ix4pW4Ww+uFnzphQ1CXLcvToN2SQHN2dJ7EXd+ruxxDE+FH5yX+6tcyXturl17OOBsbdIS3LnfPG1hAKpHTHmbSPDcpxmEvy9Ozf5crGynFarKEM+vw32LNCG6py08VY0+3rn2m5QLVz/Q9Wx3pJsZoW7IR/pmsYCIZYQvoGOI6Y2RcYwbSKGWoGeLEsgRdYIZLoaqOohzFr8GsWYLkS34nzAgAH7fvkFRmn1L2GOZaWlY8aMmTR58qCBA6tOacXQjwaDgxSBalzOQFd/N92b638FqSmLvk2r94mkyGD2tVSevnv3rm8TD1dXwzMLEhMvKZSlmzZ93aVLB0vtiiN64sAk8KrWXb58OTw8vLaz//33X2FhYWZm5qJFixAe4CXf1atXO3bsWNvZs2fPMptZ3b59e9myZQgD8JLvypUrtaU+yNfFxcXMllZgr//+++99+/Yha4ORfHl5eaWlpbXtFgQJU9/5BEi5ZcuWlJQUZFUwko8l6QHHjx+nqk9YevDgwfz585FVwajHBeRjMXxMQgMFpVKpq6urRCL5/fffkbXBS77p06fXdhZqgsz2hw8fPszPz2/bti3CAIwyL3utZc+ePcxBUVFRfHw8wgNcUh+UDB06dCA4DMu3atUqODgY4QEu8rEnvRp88MEHCA9wybzsFeYagNbp6ekIA3CRj73WUoMbN27s2rULYQAW8kEzFioi3t7eHOP37NnTz88PYQAWtg+SXqdOnbjHb6EFYQAWqc8kw8ewd+9eaOEha4OFfCYZPoaEhAQcNhG3fuZVKBTJycmmtiJGjx5NYDB30/rymZH0gG7duiEMsH7mhUqcSeUGA3Q7b9++HVkb68tnRrkBuLm5ff3113K5HFkVLFKfGZkXWLJkCbOszYpY2fbdu3cPassuLi7IdPr374+sjZVTHwz69OvXD5nFtWvX/vrrL2RVrCxfaGjoiRMnkFn88ccfjx49QlbFypkX5IMhi7KyMuhMRibSo0cP6PtDVsX6RQf0kkI2RKYDHQc+Pj7IqlhfPqi1QN0FmQgMVK5atQpZm4aa+q5fv56WZv2Jhw1VvmbNms2ePRtZG+u3eb28vOzs7KDHFBThfhUmfsew6LAyIwGC4QPFkbVpqPLt3r27aVPjsz/rmgYpH/QUbNq0icRgY1UsfFTCO3Tv3j0xMRE1NLBIfdBvHBYWBnURjvFhoOPQoUMIA3AZ5zUp/548edK8ThqLg4t8JrU9pk+f/tRTTyEMaJCpDzoaYFgdYQAu8gUEBEDXcUFBgdGY0EX43nvvITzAaH6ffgIcPHhwbdFgZI7ZEBAHMHKu3bt3b6jQ0Vr8/f1//fVXg9Hy8/OlUqmTkxPCAOu3eQcOHJidna3d5lK7V6PGxQPNMoXFw8MDYYP1M++HH34IzS/9KQNwHBlZ6w65o0aNsvr4pA7ryxcRETFu3DhmH2IGT09PsIMGI0M3gUKhcHDg6j+3rsGi6BgzZkxsbKxYXGFJQB1ohBiMCb3zO3bsQNiAS8kLY95Qc2bKDajW1Va2QnUPk0KDAaOKy5o1a4KDg0UiEcv0H6jx/f333wgb+FZcjvyUk3qjVKmg1Kra7sNpAbj+G6End93isAqdLeYTgSIRIZKQvn52I2aa0MX9JLzk+3Pbo/Q7pa27uIVFeTBVoIpF4XSlk/DKmHoL4Sui6QJ1IcyB/sJuXSBZfTc3/XAK1fRojp7YOolZX64PPCXlSsn1hAKaVL+6xPx5vubL9781GWUl1MhZgaghc/SnnMdZMnYP2iyYafse3lPkZ5c3dO2AmLG+YCqO7XyMzMJM+f77I8/RpZHsf+Xt5/DgrpmzzM2UT1aiJCWNZGNOezeyvEyJzMLMFFQuVzcaF260glKWI/MQNqDjtb2tIB+zPS4yDzPlI8jGs6kun+2pzZRP4z2wschnhcxLkpq2FWoc8PgeZspHUXSDc19ZF5hr+zTmtpHU
+/jsLW+u7aNR4/H6zmNveaHioq211HPqa0zQ9Z/6tCVvY4GHFTezy0Bj+BqNfjysuNnyWaHoeO/9t+fOm8Icn044/uZbL/eNibx+/QriBx+fQg3J9vXpE6NUVrhI3LHzB/gF16ze0KIF3w2ZrNBoswox/apWoMpkpeGdunB318sGierb9kHRYdLE7BmzJrz9TrXN5RYunj11+uuI1Tf5sBExP/+8Y9b/vQmZtKi4iMm8EB8+pqam7D+wBw4+WRk/cHBvCNRdBZfE9Y9C3KGQFWyfSd2lfaPjzl84q9t5pays7Ny5M7H9BiBW3+QwKP7bwX2hoW1WrVzn6FAxcC4Wi/8+ei4oKHjY0FFwMOGNqXK5/NTpqsHfE6eO9n66L6oXeJS8phAdHUtR1KnTx5iPYPjh4zPPxLH4JkfaGoWrq9uMafMiuz6lm8JRA29vn26RUceO/cF8fPw49+rVS3GxAxFnNPUW0sz0Z6Z8hEjb5ccZLy/viPCuujSSkHC8a5funp5eLL7JmY9tWhv3Bztw4PAz/51mLjl+4i83N/fu3XsiE6AJ2kzrZ26bV41M7XGBtLZ23aeQbUUi0b9nTs2c8Tbi4M5cymEP9Kd7PePk5HzixF+QhE+eOvps3CCRKb4JtJUwZB7mtzqQiYB8YOb++fckKKLJudFxiNU3Oecba6zhcwOGHvnrIJjOK1cuzprxDjIRs4sOc1OfxgOcSVcgSE2QYc+e/ae8vKxXz2hmDhWLb3JkCoMGjdj5v627dv/YulXb4OBQZCL13mgzq7MeCpArVy6cP/8fpEQmhMU3uUkE+AeC0fx5747+zw5GplPfqc88IMOu+exDSG6Q+nSBo196NSSk9fad3+t8k8+duwSZTs+efa5dvxwTMwCZjtmpz8wpQj8sT4V636jZQQgboB7u4uK6aMEyZCKn9z28d61k6qchyHQafH9fSUnJ3aRbFy8mXr92efOm+t7Q1OyxDgIT37NpaSlz5k728fGNj1/l7V3f+5KYPdaBS39f+/adoOmGrITQWc+L+qs2YwufEVfzh8kbzwQ1Ht3mQublhbny4VLwWhlzbR9BUHh5iTIfTc9bPQ8VNSrbRwmTNKyEIB8vzJRPIiFUjcVxNCkhReYusjDT/js4SQnUSJz1KsuQRGrugC0yizbdXGXFCtQoyH0g9wk0c3m6mfKFRTnZOYj+2voQNXDuX1eUy9VDJjZBZsFrQeoPH9yXikXPvuYvxWVvAdM4sTsn/U7xlJXmdJQy8F0Ovf2TBwW55aSIUCmq34eoUZmitT64Cd0pqKzSOufOBDNJsdJFNKkN1F5OMP/rnay2Nlh3YeU9dPE1+5lo48GYlmYOFdJ+Jmlau7IXij6oujo4i15/LwjxwDLb4Fz/p1Qur7YujEDVVi5oOjXoihBmpxb9v0ynB12pINMgZCLv27evf/8Bjo4OzD2oipXm0OFDVd2LeQCtPa9RSKM9oRWyQkqS+aZUxfppGD6WiFuFuzq4IZ5gtIuQQWJjY/fs2ePu7o6wBHf5kpKSgoODcXaVNJoAABAASURBVNhn0yC4y4c5uHebvPbaa2o1vu0b3Nu8N27cMGm+Tz2De+a9e/eu1b1KsCDYPl5gbfvKysomTpyIMAZr21deXp6amoowBuvMC2UuyBcSYn6btK4RbB8vsLZ92dnZc+fORRiDte2Ty+X3799HGIN15oWiAxIgJo6gDSLYPl5gbfugyfH+++8jjMHa9hUVFWVlZSGMwTrzymSy/Px8f39/hCuC7eMF1rbvwoULq1evRhiDte0rKCjIyclBGIN15i0uLi4pKfHz80O4Itg+XmBt+xISErDa4P9JsLZ9UGu5ffs2whjcbV9paSkOvihrQ7B9vMDa9iUmJn7++ecIY7C2fTBUhIMHbRawzrxg+KDmLLR5Gy1Y275bt24tX74cYQzWtk+pVKakpCCMEcY6eCHYPl5gbfsyMjKEcV7zoShKsH3mo1KpIAEKtq/RgrXtKyws
nDRpEsIYrG0fSZJ37txBGINj5oUUx5QYUHRA1Q/eEA6gCn3unNV2C6oNHDPv6NGjodCArmbIvNDpAgqCdngOGOEoX9++fVu3bq2fLeAYz/n1mBYd48eP9/Ly0n308fEZM2YMwg9M5YuKigoLC2MSIBi+kJAQFp/HVgTfiosuAXp4eLzwwgsIS/CVLzw8PCIiAsoQaHU888wzCEtMq7goStDer9OL89VKJWSpJy5kVnOjJ4Npg5u8adeXI0NULhjXenrXrHmuuJzWLmeuZdOQJ5+uF6J9FMHyXImUdHQSRcZ5tnvKGXHGhGpzaaF66wdpHk2k7Z5yF0tJNaVi3qvSPTZRuQK8MrTyNStWyz/hHJsgiardnysiMzfRrkVn0D/WHFZ3r1ftKsZlmJ42lc/R3kV/bbuBXZckhPhBcunJXx6pKbpDDxfEDa6p73ai7Nju7FcW8/Utgj87Pkn1D3EYNIHT3hpcbd+Jfdmdn/FGNsCYd4LSbpZwXEPMSb475+Q0hdr3ckW2gYOz5PD3nOYVcrJ96cmlIrENbXdo70QW5JVxiclJPpVCqSi3oW7BMrmK446IwgZ0BiCYTXM4IMhnAE0NkVtm41R0ECKSaEROFY0CX5ajKyFOqY+GqiRtQ0WH1uWskHnNRbshv+WKDq0vaGQ7CEUHLwhmvzwOcJOPtq3hYJqz7eNUwNBVWw7aCFzrGULmNQRBc/RC0Vg2nrco2k5KyxUdpIgkG8kmzdzgXHHhlPooNU2ZuIeeTCb78OOlg4b0efud6SkpSX1jIq9cuYjMZfjI2K3bvoODn/fujInrjrCBo+0zudi9eu3SkSMHp02dExEe6e7u8eq4iSa5nayNsHYdxr1S55uCadxpWLfeJ5NpPBnHxjwH2sHB+NcnI0vQrl0H+IfqGIqmrdlhtf/AHsbP5Ijn47pFRk2eNHvCm6O/+OzbTp06xy9bAIUayPrxyvflcllYWMfJb81iFLl3L/nAr3suXEzMzs4MahE8cODwYUNH1bgzZN71X685euRsQsKJJUtrztvd9sPegIDmMLa5afP6M/+dzsnJ7tAhYsSwF6OinkamIBKRiODkjoRjo820rWvha7u6ui1bvnDfz0cg9YHtq3qeWHzl6kWohG/4epuvT5NFi2d/9Ml7W7//GU6tW78ahJszZzHoe/9+6hdfftKkiV/UU70MPqJDh/A1qzfoPsK1pSUlXl4aN3dffrXy0OEDM6bPj46OTUg4/l7824sWLo/uE4M4o1ZTlkx9FJQcaou1OuQy2fx5SxkXsjH9BkAyhHIGPr777keQ5f2aNkNaf7OHDx84m/hPbfK5ubnrfNJCYs/ISF/75RYHBwd9b91wauBzw65du7x127cmyUdwHoDklvr4OTOrQWDzIJ37XWdnzYhqcXGRJoSm9+7d+d/ZhPT0NOasn5/x1WxJSXfWrvt08aIVISGaKVgGvXVDYiwsKmTcTXOBRjSyZJvXohg0BBRFLVg0S6lUvDlxekREpIuzy5Met5+kqLhoydI5w4a+8Ex0LBNi1Fs3tzfkGBGbRtudu7du3br+6ar1XbtUVOtACB9vX/arVqxYBPZxyuTZuhCLeOvWJjxLtjqIum51FBYWwF+dXqmpKfCvZRDbtpvbd3yfci9p07c79Td2toi3bk3JYcmiQ61peKC6BGoqUCj/b9e2SZNmFeTnfbV2FdR4sh/Wuv/X5csXvv1uLVTLQUFdoH+zQF/fJoy37hbNW7ZpE/bvmVNwDOHL4lehOoB7q8NiRYdBmjRpCub/h60bhw3vB/lu8cLlj/Ny310677Xxo37YsufJ+FC8Ik19ZY1+4PRp854fOZq/t26t/1JOyYVTCf3HtqzkK/JxSxr//CCGPV+kgjLjl7YwGlPo7zOAhUveqvmJNgJd4YzGKJxTnw0NdWi7DCxYbaaFdYO1INg+Xgi2zwAE54oLt4FKW8u9BFfv19wabaRtzbCiKWTJcV7NGg5bmmGlMVXcfF9zKzps
qtLH+Cu06BwXJGAQoeLCC07yiaDsIG0oBYrFJFJbrtHm6CIRiW1oNgwMLEodONk+TqL0HOCpUnArihoFsmJVUBtO6yq5pSkpcvO2+/3bDGQDXDpWCIaq+0BO40omTBvdvjKdUJODp+K7FSZ/zhx4nHKtcNInXDuGTZt1++PHD0ryFBI7EmqVSlW17Kx1YA0VbL0QbctRV99mFtoyPqFrXAi1fN0yXEJE0Gq6wvW2tudS7546h9AVVVEC6a0Y1gJDWpo2A11tSTFJVCw1YJxra16MqHioTgGxPakuoyV2xBvLghBnTJ60XPiQTjyaKytWUiq6hgrwmrSqmkNtiqiaK1KhCGQMqrp88IXVevKJEa1COvlS01ICAlqImbE0klnwUxmTGU+kNeF05RpzUqy5W5V82lO0CJFaTSueRWq/t5ogtDdk7uboJGnRwalNpBMyBdznfD/77LM7d+709PREWIJ7tVmpVEokEoQruMunUqlg/BfhiiAfLwT5eIG1fGq1WiQS4TxQgLV8mCc9JMjHE0E+Xgjy8QL3Lf8F+cxHSH28EOTjhSAfLzDvL0BC6uOJIB8vBPl4IcjHC6Ho4IWQ+nghyMcLQT5eYP1yMIjq5sZ1Ea5VwN3RWElJCcIYvLOGWAz5F2GMIB8vBPl4IcjHC0E+Xgjy8UKQjxeCfLwQ5OOFIB8vBPl4IcjHC0E+Xgjy8UKQjxeCfLzAcVnMtGnTcnJySJKEkbb79+/7+/vDS8LxoUOHEGbguEo3NjY2IyMjOTkZtIOPcJyZmalQKBB+4CjfiBEjAgOrbZ0JvfahoaEIPzBdI/7qq6/qb5fp5OT04osvIvzAVL5BgwYFBQVR2rWo8Ldly5YxMSZsvFxv4LtDwWuvvebu7g4H9vb2gnNtk4Hk1rp1ayhzmzVrNmTIEIQllqm4JB7JT71eWlaqVpRpMpxaVfOehHY7RkLvY+WSZvrJ7Z10a8MpWq1UaCbXi0Riw5vPanxFE5Xr1Kv+Vj6FRnoew0hSs5m6xI50dpM0bW7Xe7A3kiKe8JLvzKH8a6cL5TIVQSKRmJTYS8QSEUESlOoJ5yiExo82YUA/A1sUVW2BxJzVRq5lXySicoE5Mo6IhHegVBT8U6vUaiVl7ygKbOPUf5wvMhcz5bt4tOjM4UeQoJzcHP3be0kcGqQvnowbj4tyShFNtQxzHvA6J2/aNTBHvi3xqbJitU+gh29rrCegcKQopyzzZg5kj0kfmbyzssnyrZ+fbOcoDYlqhhoXGddy8zKLh0/1D2zlwP0q0+RbPy/ZN8jTO6RxOilXq9GtY6mvLAhy8+Vqi0yQD7QL7NjUxdceNWpuHEvt/0rTkHBOO5Jwrfd9szDFI8Ct0WsHtO3R/NDWLI6ROcm3c/UDRIr82nggG4B0IF19nDcuvscpstEY+dmq3MzyNk8HIJuhebiPSkH/vfOR0ZjG5du7Pt3R1Q7ZGE1CPG+eKzIazYh85XJaVqwK7u6HsKSkNH/eu09duvoXsjReLVygMXNq/2P2aEbkO/hdptTeRvc3dXRzuHuhmD2OEflyHpQ7e5jg5acx0TTEA3IeexwjKUupopqE1lWBW1T8+NdDn6emX1Eoytq0ioqNfsPXpwWEZz1MXr325ZmTNh87+cO1myfcXH0jOsYNjJvGuHS6eOXPw0e/kcuLwtr2ju41FtUZ9u6aDplrCcUdernUFoct9SVflkEXj7huugPUavWGzVOTUy88P2TB3OnbnZ08v9z4Ru7jB3BKLNKsY9u9/6POnfp//N7pl0fFn0j46fJ1jYHLepi0fc/SyM4DF8z+OTJi0P7fV6O6RCwV3b/FtjKCTb7s+2WkqK42QLp3/1JObuqYUfFtW/dwdfEaMmCmk6P7qX936iKEt+8X3iFGLJaEtOzi5eH/IOMWBP7z38/ubk3jnpng6OgaGtz1qcjhqC4RiUWlxWx73rJlXnmJqu72j0pNuywSSVoFV3hTgweBTCmpVS6QA5q10x3b27vI
yzRWPDcvvWmTqn6RQP8wVKeQSFHG5tiZ3fbRFI/OVHbkZSXQYwnVDv1AZ6cqO6vdGrMmMlmRt1fVGKZUakLviBkQyMje9WzyOblICUKG6gYXZy/48m+MrWa8jPqxhTyrVJbpPpaXl6K6hKJoiZTN9LPJ5xfscPF4Hqob/P1aKxRyd/cm3p4VzcHHeRn6qc8gHu5+N26dgvEURugbt0+juoRSUi4ebL0kbL92UJg95F1FkYle3bnRKqRb21Y9dv/yQX5BdklpQcJ/e77Y8PrZC7+yXxXePhZaGr/8vhr62ZJSzv/z3x5Ul6hVVMsObD1XRup9MC71MDU/sJM3qgPeeGXNv4l7f9y1JC39qo93iy7hA3r3eIn9kjatnhrcf8a/Z/fOXxoFRfDYF+LXfTepjhxiyPMUYPfaRLLtvm6ku/TAN5mZ98rbRjdHtse9cw9hzPD199i+uxFTPXRSM7WyTjIv/sgK5GFRRsbCjHcHuHhKk89mhnQ3PDYEVnzpR3EGT6lUCqjZGaw5NvUJnv7Wt8hybNo25979ywZPKZXlEomBDjepxH7p27+jWsi+Uwij6t37G5GPw1iHGq17J6l9TMvazuflZxoMLysrsbc3bDhIUuzuZv7g9JMUFeWq1IYnAJbKipwcDY5tEZ4etXbE3Tya1jXWvfsAI7uVcxoqOvBNVlZaeZvegcg2SDufrVYq34gPMhqT01jH0El+IpJ+cDUH2QDFuWWywnIu2iHuI20TV7QszpVl3qyrWjQ+pF3KevW9uvHXsXHRPQc3p8BOXqgxUvxQnno5e+qnoSLOXXQmT9L4ZiGM4BFt+jQ2O3gvMau0sOytj0KlpsxaM2eK0I5VDx5nlbn6ODWPsGTpaS0eJhXmpRfYOZAc7Z0+Zk5Qe3Cn/NAPGTAOZ+8i9W7h7u7X8MZD5IXK7LuPy4rLoWOs09MePQebMybBa3pk0mX5v78/KsxVEKSmdiyWiEmJSDMpVM+NEa2d3KjzLMT8oTXzPCtjkMzMU701nhEIAAAAp0lEQVRpoRCF1F5V/dVoxu9r9ReuFqg/a5U5U+2e8JmEXgCkpjU+S2nawVnUPsr9qefckblYZnLugztlN88WFuSqlOWUUkmpFXr31M2qrXC7VNNXkUgMHRtVzokYoDuKplHNVyO1stfsPGek1z6B1Mxh1R4xjooqP6KKs2IpPI5w97YL6eRsqlsig+DuqwhzbHQI3FII8vFCkI8Xgny8EOTjhSAfL/4fAAD///YFhOgAAAAGSURBVAMAXtNZHRoAElIAAAAASUVORK5CYII=", "text/plain": [ "
__start__
]):::first\n", "\tplan(plan)\n", "\texecute(execute)\n", "\tverify(verify)\n", "\tfinalize(finalize)\n", "\t__end__([__end__
]):::last\n", "\t__start__ --> plan;\n", "\texecute --> verify;\n", "\tplan --> execute;\n", "\tverify -.-> execute;\n", "\tverify -.-> finalize;\n", "\tfinalize --> __end__;\n", "\tclassDef default fill:#f2f0ff,line-height:1.2\n", "\tclassDef first fill-opacity:0\n", "\tclassDef last fill:#bfb6fc\n", "\n" ] } ], "source": [ "from IPython.display import Image, display\n", "\n", "arch = PEV(max_retries_per_step=2, executor_rounds=4)\n", "graph = arch.build()\n", "display(Image(graph.get_graph().draw_mermaid_png()))\n", "print(arch.diagram())" ] }, { "cell_type": "markdown", "id": "f64827f0", "metadata": { "papermill": { "duration": 0.004398, "end_time": "2026-05-27T07:36:31.405195+00:00", "exception": false, "start_time": "2026-05-27T07:36:31.400797+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 8 · Live run\n", "\n", "Concrete task: a multi-step computation where each step has a *checkable* outcome. The Verifier should accept the lookup steps (they return concrete numbers with citations) and may flag the computation step if the executor doesn't actually show the arithmetic." 
] }, { "cell_type": "code", "execution_count": 4, "id": "09148f02", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T07:36:31.415262Z", "iopub.status.busy": "2026-05-27T07:36:31.415262Z", "iopub.status.idle": "2026-05-27T07:37:29.379984Z", "shell.execute_reply": "2026-05-27T07:37:29.379984Z" }, "papermill": { "duration": 57.972112, "end_time": "2026-05-27T07:37:29.383619+00:00", "exception": false, "start_time": "2026-05-27T07:36:31.411507+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "Final answer ──────────────────────────────────────────────────────────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mFinal answer\u001b[0m \u001b[92m──────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
The population density of Singapore is approximately 7,574 people per square kilometer (5,450,000 / 720.2 km²). \n",
"Note that the calculated density provided earlier (8,437 people per square kilometer) was not actually calculated \n",
"in the log, so the correct calculation is provided here. The population and land area values are based on data from\n",
"https://www.singstat.gov.sg and https://www.singstat.gov.sg/, with additional population information available at \n",
"https://www.worldometers.info/world-population/singapore-population/. \n",
"\n"
],
"text/plain": [
"The population density of Singapore is approximately 7,574 people per square kilometer (5,450,000 / 720.2 km²). \n",
"Note that the calculated density provided earlier (8,437 people per square kilometer) was not actually calculated \n",
"in the log, so the correct calculation is provided here. The population and land area values are based on data from\n",
"https://www.singstat.gov.sg and https://www.singstat.gov.sg/, with additional population information available at \n",
"https://www.worldometers.info/world-population/singapore-population/. \n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"steps: 3 · pass: 2 · fail-accepted: 1 · total attempts: 5 ───────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36msteps: \u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;36m · pass: \u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;36m · fail-accepted: \u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;36m · total attempts: \u001b[0m\u001b[1;36m5\u001b[0m \u001b[92m───────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "TASK = (\n", " \"Compute the population density (people per square kilometer) of Singapore. \"\n", " \"Required steps: (1) look up Singapore's population (latest available year), \"\n", " \"(2) look up Singapore's land area in km², (3) divide population by area to \"\n", " \"get density. Cite a source URL for the population and the land area.\"\n", ")\n", "\n", "result = arch.run(TASK)\n", "\n", "print_header(\"Final answer\")\n", "print_md(result.output)\n", "print()\n", "print_header(\n", " f\"steps: {result.metadata['steps_total']} · \"\n", " f\"pass: {result.metadata['steps_passed']} · \"\n", " f\"fail-accepted: {result.metadata['steps_fail_accepted']} · \"\n", " f\"total attempts: {result.metadata['total_attempts']}\"\n", ")" ] }, { "cell_type": "markdown", "id": "16ed7755", "metadata": { "papermill": { "duration": 0.0, "end_time": "2026-05-27T07:37:29.395836+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.395836+00:00", "status": "completed" }, "tags": [] }, "source": [ "### 8.0 · What just happened, briefly\n", "\n", "Three counts to inspect above:\n", "\n", "- **`steps_passed` / `steps_total`** — fraction of steps that satisfied the Verifier on at least one attempt. 100% means the Verifier never rejected anything (could be sycophancy or genuinely clean execution).\n", "- **`steps_fail_accepted`** — steps the Verifier kept rejecting until retries ran out. 
This is the *honest* signal that the agent couldn't fully complete the task.\n", "- **`total_attempts` − `steps_total`** = total retry rounds across all steps, since each step's first attempt is not a retry. For the run above that is 5 − 3 = 2, both spent on step 3. A large value means the Verifier is rejecting results rather than rubber-stamping them." ] }, { "cell_type": "markdown", "id": "4c5849a6", "metadata": { "papermill": { "duration": 0.019912, "end_time": "2026-05-27T07:37:29.415748+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.395836+00:00", "status": "completed" }, "tags": [] }, "source": [ "### 8.1 · Per-step verification trace" ] }, { "cell_type": "code", "execution_count": 5, "id": "939650bd", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T07:37:29.433191Z", "iopub.status.busy": "2026-05-27T07:37:29.430417Z", "iopub.status.idle": "2026-05-27T07:37:29.472099Z", "shell.execute_reply": "2026-05-27T07:37:29.472099Z" }, "papermill": { "duration": 0.056351, "end_time": "2026-05-27T07:37:29.472099+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.415748+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
› [1] ✓ pass (attempts=1, confidence=5/5)\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m]\u001b[0m\u001b[1m ✓ pass \u001b[0m\u001b[1m(\u001b[0m\u001b[1;33mattempts\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m, \u001b[0m\u001b[1;33mconfidence\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m/\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
step: Look up Singapore's population (latest available year) from a reliable source such as the World Bank or \n", "Singapore Department of Statistics, and record the value.\n", "\n" ], "text/plain": [ "step: Look up Singapore's population \u001b[1m(\u001b[0mlatest available year\u001b[1m)\u001b[0m from a reliable source such as the World Bank or \n", "Singapore Department of Statistics, and record the value.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
› result\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m result\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
5.45 million https://www.singstat.gov.sg\n", "\n" ], "text/plain": [ "\u001b[1;36m5.45\u001b[0m million \u001b[4;94mhttps://www.singstat.gov.sg\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
› [2] ✓ pass (attempts=1, confidence=5/5)\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m[\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1m]\u001b[0m\u001b[1m ✓ pass \u001b[0m\u001b[1m(\u001b[0m\u001b[1;33mattempts\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m, \u001b[0m\u001b[1;33mconfidence\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m/\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
step: Look up Singapore's land area in km² from a reliable source such as the World Bank or Singapore Department of\n",
"Statistics, and record the value.\n",
"\n"
],
"text/plain": [
"step: Look up Singapore's land area in km² from a reliable source such as the World Bank or Singapore Department of\n",
"Statistics, and record the value.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"› result\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m result\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
720.2 km² https://www.singstat.gov.sg/\n", "\n" ], "text/plain": [ "\u001b[1;36m720.2\u001b[0m km² \u001b[4;94mhttps://www.singstat.gov.sg/\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
› [3] ✗ fail-accepted (attempts=3, confidence=5/5)\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m[\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m]\u001b[0m\u001b[1m ✗ fail-accepted \u001b[0m\u001b[1m(\u001b[0m\u001b[1;33mattempts\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1m, \u001b[0m\u001b[1;33mconfidence\u001b[0m\u001b[1m=\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m/\u001b[0m\u001b[1;36m5\u001b[0m\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
step: Divide the population by the land area to get the population density in people per square kilometer.\n",
"\n"
],
"text/plain": [
"step: Divide the population by the land area to get the population density in people per square kilometer.\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"› result\n", "\n" ], "text/plain": [ "\u001b[1;35m›\u001b[0m \u001b[1m result\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
The population density of Singapore is 8,437 people per square kilometer. \n", "https://www.worldometers.info/world-population/singapore-population/\n", "\n" ], "text/plain": [ "The population density of Singapore is \u001b[1;36m8\u001b[0m,\u001b[1;36m437\u001b[0m people per square kilometer. \n", "\u001b[4;94mhttps://www.worldometers.info/world-population/singapore-population/\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "for i, t in enumerate(result.trace, 1):\n", "    badge = '✓' if t['verdict'] == 'pass' else '✗'\n", "    print_step(\n", "        f\"[{i}] {badge} {t['verdict']} (attempts={t['attempts']}, confidence={t.get('confidence', '?')}/5)\",\n", "        f\"step: {t['step']}\"\n", "    )\n", "    snippet = (t['result'] or '')[:300].replace('\\n', ' ')\n", "    print_step(\" result\", snippet + ('...' if t['result'] and len(t['result']) > 300 else ''))\n", "    if t.get('last_critique'):\n", "        print_step(\" final critique (rejected)\", t['last_critique'][:300])\n", "    print()" ] }, { "cell_type": "markdown", "id": "e84b453c", "metadata": { "papermill": { "duration": 0.00815, "end_time": "2026-05-27T07:37:29.488435+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.480285+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 9 · What we just observed\n", "\n", "The cells above are live. Below: a quantitative + qualitative breakdown of the **actual** Plan-Execute-Verify trace the Nebius-hosted Llama-3.3-70B produced on this run.\n", "\n", "### 9.1 · Quantitative summary\n", "\n", "| Metric | Value |\n", "|---|---|\n", "| Steps executed | **3** |\n", "| Steps passed | **2** / 3 |\n", "| Steps `fail-accepted` | **1** |\n", "| Total attempts (incl. 
retries) | **5** |\n", "| Retry rounds | 2 |\n", "| Pass rate | 67% |\n", "\n", "### 9.2 · Per-step verdicts\n", "\n", "| # | Verdict | Attempts | Confidence | Step |\n", "|---|---|---|---|---|\n", "| 1 | pass | 1 | 5/5 | Look up Singapore's population (latest available year) from a reliable source such as the World Bank… |\n", "| 2 | pass | 1 | 5/5 | Look up Singapore's land area in km² from a reliable source such as the World Bank or Singapore Depa… |\n", "| 3 | fail-accepted | 3 | 5/5 | Divide the population by the land area to get the population density in people per square kilometer. |\n", "\n", "### 9.3 · Patterns surfaced in this run\n", "\n", "- **Partial success: 2/3 steps passed, 1 fail-accepted.** This is the honest PEV signal: the Verifier rejected step 3 on every attempt until its retries ran out. Inspect the `last_critique` field on fail-accepted steps to see what the Verifier kept flagging.\n", "\n", "- **Retries did not help**: 0 of 1 retried step(s) recovered to `pass`. When retrying doesn't change the verdict, either the step cannot be satisfied as written or the Executor is not acting on the critique; in both cases, force-accept and synthesize honestly.\n", "\n", "### 9.4 · The final answer (verbatim)\n", "\n", "> The population density of Singapore is approximately 7,574 people per square kilometer (5,450,000 / 720.2 km²). \n", "> Note that the calculated density provided earlier (8,437 people per square kilometer) was not actually calculated \n", "> in the log, so the correct calculation is provided here. The population and land area values are based on data from\n", "> https://www.singstat.gov.sg and https://www.singstat.gov.sg/, with additional population information available at \n", "> https://www.worldometers.info/world-population/singapore-population/.\n", "\n", "Check the arithmetic before reusing this number: 5,450,000 / 720.2 ≈ 7,567, so the 7,574 in the synthesis is itself slightly off. The synthesis did flag the unverified 8,437 from step 3, which is the hedging behaviour a `fail-accepted` verdict is meant to trigger.\n", "\n", "### 9.5 · The takeaway\n", "\n", "The pass-rate metric is what makes PEV worth its extra cost over plain Planning: you have an **honest quality signal per task**. 
A run with a 100% pass rate and 0 retries means either that the task was easy or that the Verifier was lazy; check the per-step confidence to tell which. A run with `fail-accepted` steps is *useful information*: the agent reached the end of its plan but knows the answer is incomplete, and the final synthesis (if the prompt is doing its job) hedges accordingly." ] }, { "cell_type": "markdown", "id": "2654ccd6", "metadata": { "papermill": { "duration": 0.0, "end_time": "2026-05-27T07:37:29.488435+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.488435+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 10 · Try other providers / verifier-side reasoning model\n", "\n", "PEV needs **structured output** (Plan + Verifier schemas), so the cell below skips any provider that lacks an API key or structured-output support. The Verifier is the single biggest quality lever, so a reasoning model in the Verifier seat (with the rest staying on Llama 3.3) is the first swap to try. As written, the cell passes a single `llm` to `PEV`, so it appears to move the whole run to the other provider rather than the Verifier seat alone." ] }, { "cell_type": "code", "execution_count": 6, "id": "34056e31", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T07:37:29.504175Z", "iopub.status.busy": "2026-05-27T07:37:29.504175Z", "iopub.status.idle": "2026-05-27T07:37:29.530764Z", "shell.execute_reply": "2026-05-27T07:37:29.530764Z" }, "papermill": { "duration": 0.026589, "end_time": "2026-05-27T07:37:29.530764+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.504175+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[skip] openai: no API key\n", "[skip] anthropic: no API key\n" ] } ], "source": [ "from agentic_architectures.llm.factory import provider_supports_structured_output\n", "\n", "for p in [\"openai\", \"anthropic\"]:\n", "    key = settings.api_key_for(p)\n", "    if key is None or not key.get_secret_value():\n", "        print(f\"[skip] {p}: no API key\")\n", "        continue\n", "    if not provider_supports_structured_output(p):\n", "        print(f\"[skip] {p}: no structured output\")\n", "        continue\n", "    print_header(f\"Re-running PEV on {p}\")\n", "    r = PEV(llm=get_llm(provider=p), max_retries_per_step=1, executor_rounds=3).run(\n", "        \"What was the GDP of France in 2023? Provide a source.\"\n", "    )\n", "    print(r.output[:300])\n", "    print(f\" steps: {r.metadata['steps_total']}, pass: {r.metadata['steps_passed']}, fail-accepted: {r.metadata['steps_fail_accepted']}\")\n", "    print()" ] }, { "cell_type": "markdown", "id": "f2635137", "metadata": { "papermill": { "duration": 0.010212, "end_time": "2026-05-27T07:37:29.549066+00:00", "exception": false, "start_time": "2026-05-27T07:37:29.538854+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 11 · Failure modes, safety, extensions\n", "\n", "### 11.1 · Where this breaks\n", "\n", "| Failure | Mechanism | Mitigation |\n", "|---|---|---|\n", "| **Sycophantic verifier** | Verifier rubber-stamps weak results | Use a different model in the Verifier seat, or only accept a `pass` when `confidence` is at least 4 |\n", "| **Verifier infinite-loop bait** | Verifier sets an impossible bar, so every step exhausts `max_retries_per_step` | The cap bounds each step, so the run still terminates; the thrash shows up as many `fail-accepted` steps |\n", "| **Retry produces same result** | Executor doesn't actually use the critique | Tighten the retry prompt: \"MUST address each bullet of the critique\"; or use a stronger executor |\n", "| **Synthesis hides failed steps** | `_finalize` writes a confident answer despite `fail-accepted` steps | Inspect the per-step trace (§ 8.1); add a hard \"if any fail-accepted, prefix answer with 'PARTIAL:'\" rule |\n", "| **Cost explosion** | N steps × R attempts (1 + retries) × M executor rounds = O(N·R·M) calls | Set `max_retries_per_step=1` for low-stakes tasks |\n", "\n", "### 11.2 · Production safety\n", "\n", "- **Always inspect the trace** — `result.metadata['steps_fail_accepted']` is your *honest signal* that the agent's answer is partial. Surface this to the user.\n", "- **Different model for Verifier** — same-model Verifier suffers from blind spots that match the Executor's. 
Even a smaller, faster model catches more, as long as it is a *different* one.\n", "- **Per-step time budget** — add `timeout` per tool call so a hung tool doesn't block forever.\n", "\n", "### 11.3 · Three extensions\n", "\n", "1. **External verifier.** Replace the LLM Verifier with a *deterministic check* (regex for a number, JSON schema validator, code execution) when the step has a strict format. Bridge to **Dry-Run (nb 14)**.\n", "2. **Whole-plan replanning.** Combine PEV's per-step retry with Planning's whole-plan replan: after all steps execute, if too many are `fail-accepted`, regenerate the plan with the failures as context.\n", "3. **Process reward model.** Score each step on a continuous scale instead of pass/fail. Bridge to **Tree of Thoughts (nb 09)** and **LATS (nb 22)**.\n", "\n", "### 11.4 · What to read next\n", "\n", "- [**14 · Dry-Run**](./14_dry_run.ipynb) — simulate the *side effects* of each step before live execution.\n", "- [**17 · Reflexive Metacognitive**](./17_reflexive_metacognitive.ipynb) — agent decides for itself when to escalate to a human.\n", "- [**24 · Corrective RAG**](./24_corrective_rag.ipynb) — PEV-style verification specialised for RAG retrieval.\n", "\n", "### 11.5 · References\n", "\n", "1. Hu, M. et al. *Tree-Planner.* 2023. [arXiv:2310.08582](https://arxiv.org/abs/2310.08582)\n", "2. Wang, L. et al. *Plan-and-Solve Prompting.* ACL 2023. [arXiv:2305.04091](https://arxiv.org/abs/2305.04091)\n", "3. 
LangGraph plan-execute-verify pattern — [tutorial](https://langchain-ai.github.io/langgraph/)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "papermill": { "default_parameters": {}, "duration": 66.266371, "end_time": "2026-05-27T07:37:30.683921+00:00", "environment_variables": {}, "exception": null, "input_path": "all-agentic-architectures/notebooks/06_pev.ipynb", "output_path": "all-agentic-architectures/notebooks/06_pev.ipynb", "parameters": {}, "start_time": "2026-05-27T07:36:24.417550+00:00", "version": "2.7.0" } }, "nbformat": 4, "nbformat_minor": 5 }