Primary LLM: meta-llama/Llama-3.3-70B-Instruct ────────────────────────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mPrimary LLM: meta-llama/Llama-\u001b[0m\u001b[1;36m3.3\u001b[0m\u001b[1;36m-70B-Instruct\u001b[0m \u001b[92m────────────────────────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from agentic_architectures import get_llm, enable_langsmith, settings\n", "from agentic_architectures.architectures import TreeOfThoughts\n", "from agentic_architectures.ui import print_md, print_header, print_step\n", "\n", "enable_langsmith()\n", "primary_llm = get_llm(provider=\"nebius\", model=\"meta-llama/Llama-3.3-70B-Instruct\", temperature=0.5)\n", "print_header(f\"Primary LLM: {primary_llm.model}\")" ] }, { "cell_type": "markdown", "id": "825b711d", "metadata": { "papermill": { "duration": 0.002008, "end_time": "2026-05-27T10:41:34.333462+00:00", "exception": false, "start_time": "2026-05-27T10:41:34.331454+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 5 · Library walkthrough\n", "\n", "Source: [`src/agentic_architectures/architectures/tree_of_thoughts.py`](../src/agentic_architectures/architectures/tree_of_thoughts.py).\n", "\n", "Key pieces:\n", "\n", "1. **`_ThoughtCandidates`** schema — forces the generator to emit *substantively different* candidates, not paraphrases.\n", "2. **`_ThoughtScore`** schema — forces the evaluator to commit to a 1-5 score + a one-sentence rationale.\n", "3. **`_expand_and_score`** — for each frontier node, walks the path from root, asks the LLM for K alternatives, scores each via `LLMJudge`.\n", "4. **`_prune`** — keeps top `beam_width` thoughts at the current depth as the new frontier.\n", "5. **`_finalize`** — walks the best leaf back to root, synthesises the final answer along that path.\n", "6. **`_path_from_root(thoughts, id)`** — flat-tree helper that walks parent pointers." 
] }, { "cell_type": "code", "execution_count": 2, "id": "a472d06d", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T10:41:34.341485Z", "iopub.status.busy": "2026-05-27T10:41:34.341485Z", "iopub.status.idle": "2026-05-27T10:41:34.359177Z", "shell.execute_reply": "2026-05-27T10:41:34.357348Z" }, "papermill": { "duration": 0.023717, "end_time": "2026-05-27T10:41:34.361191+00:00", "exception": false, "start_time": "2026-05-27T10:41:34.337474+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- ThoughtCandidates schema ---\n", "{\n", " \"description\": \"K candidate next-thoughts at one tree node.\",\n", " \"properties\": {\n", " \"candidates\": {\n", " \"description\": \"Substantively DIFFERENT next reasoning steps or partial solutions. Each must explore a different angle / approach / framing. Avoid producing variants that are paraphrases of each other.\",\n", " \"items\": {\n", " \"type\": \"string\"\n", " },\n", " \"minItems\": 2,\n", " \"title\":...\n", "\n", "--- ThoughtScore schema ---\n", "{\n", " \"description\": \"Score for one candidate thought \\u2014 STRICT rubric to force discrimination.\",\n", " \"properties\": {\n", " \"score\": {\n", " \"description\": \"STRICT 1-5 scoring. 
Be discriminating \\u2014 if you score everything 5, beam search has no signal to prune on.\\n 1 = clearly off-track, contradicts the task, or contains a factual error.\\n 2 = on-topic but weak: overlapping with a sibling, vag...\n" ] } ], "source": [ "from agentic_architectures.architectures.tree_of_thoughts import _ThoughtCandidates, _ThoughtScore\n", "import json\n", "print('--- ThoughtCandidates schema ---')\n", "print(json.dumps(_ThoughtCandidates.model_json_schema(), indent=2)[:400] + '...')\n", "print()\n", "print('--- ThoughtScore schema ---')\n", "print(json.dumps(_ThoughtScore.model_json_schema(), indent=2)[:400] + '...')" ] }, { "cell_type": "markdown", "id": "d8be05a9", "metadata": { "papermill": { "duration": 0.005955, "end_time": "2026-05-27T10:41:34.369154+00:00", "exception": false, "start_time": "2026-05-27T10:41:34.363199+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 6 · State\n", "\n", "The tree is stored as a **flat list of nodes** with explicit `parent_id` pointers. Cleaner than a nested dict for `LangGraph` state (which prefers value types).\n", "\n", "| Field | Type | Purpose |\n", "|---|---|---|\n", "| `task` | `str` | root task |\n", "| `thoughts` | `list[{id, content, score, depth, parent_id, rationale}]` | full tree, **appended** to each round |\n", "| `frontier` | `list[int]` (ids) | which thoughts to expand next |\n", "| `depth` | `int` | current tree depth |\n", "| `final_answer` | `str` | set by `_finalize` |" ] }, { "cell_type": "markdown", "id": "319ad10d", "metadata": { "papermill": { "duration": 0.0, "end_time": "2026-05-27T10:41:34.377213+00:00", "exception": false, "start_time": "2026-05-27T10:41:34.377213+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 7 · Build the graph\n", "\n", "Four nodes: `root → expand → prune → (expand again | finalize) → END`. The `expand` node *does* both generate and score (combined to keep state mutations local)." 
] }, { "cell_type": "code", "execution_count": 3, "id": "3cd113e2", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T10:41:34.385285Z", "iopub.status.busy": "2026-05-27T10:41:34.385285Z", "iopub.status.idle": "2026-05-27T10:41:35.394305Z", "shell.execute_reply": "2026-05-27T10:41:35.394305Z" }, "papermill": { "duration": 1.00902, "end_time": "2026-05-27T10:41:35.394305+00:00", "exception": false, "start_time": "2026-05-27T10:41:34.385285+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "
Final answer ──────────────────────────────────────────────────────────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mFinal answer\u001b[0m \u001b[92m──────────────────────────────────────────────────────────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Step 1: Division \n", "\n", "Begin with division: 8 / 4 = 2, remaining numbers: [3, 6, 2] \n", "\n", "Step 2: Multiplication \n", "\n", "Try multiplication next: 6 * 3 = 18, remaining numbers: [2, 18] \n", "\n", "Step 3: Addition \n", "\n", "Add 2 and 18 to get 20. However, the goal is to reach 24, and the provided steps do not directly achieve this. The \n", "closest expression from the given steps is (8 / 4) * (6 * 3) + 2, but this does not equal 24 as per the initial \n", "task. \n", "\n", "The final answer is: $\\boxed{8 / 4 * 6 * 3 + 2 = 20}$ \n", "\n" ], "text/plain": [ "\u001b[4;35mStep 1: Division\u001b[0m \n", "\n", "Begin with division: 8 / 4 = 2, remaining numbers: [3, 6, 2] \n", "\n", "\u001b[4;35mStep 2: Multiplication\u001b[0m \n", "\n", "Try multiplication next: 6 * 3 = 18, remaining numbers: [2, 18] \n", "\n", "\u001b[4;35mStep 3: Addition\u001b[0m \n", "\n", "Add 2 and 18 to get 20. However, the goal is to reach 24, and the provided steps do not directly achieve this. The \n", "closest expression from the given steps is (8 / 4) * (6 * 3) + 2, but this does not equal 24 as per the initial \n", "task. \n", "\n", "The final answer is: $\\boxed{8 / 4 * 6 * 3 + 2 = 20}$ \n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
tree size: 16 thoughts · max depth reached: 3 · best leaf score: 2/5 ──────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mtree size: \u001b[0m\u001b[1;36m16\u001b[0m\u001b[1;36m thoughts · max depth reached: \u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;36m · best leaf score: \u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;36m/\u001b[0m\u001b[1;36m5\u001b[0m \u001b[92m──────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "TASK = (\n", " \"Game of 24. Numbers: [3, 4, 6, 8]. Find an arithmetic expression using \"\n", " \"each number EXACTLY ONCE, with operators +, -, *, /, that equals 24.\\n\\n\"\n", " \"At each reasoning step, perform ONE arithmetic operation combining two \"\n", " \"numbers from the current set, then report the result + the updated set \"\n", " \"of remaining numbers. Continue until one number remains. The final answer \"\n", " \"is the full expression and whether it equals 24.\"\n", ")\n", "\n", "result = arch.run(TASK)\n", "\n", "print_header(\"Final answer\")\n", "print_md(result.output)\n", "print()\n", "print_header(\n", " f\"tree size: {result.metadata['total_thoughts']} thoughts · \"\n", " f\"max depth reached: {result.metadata['max_depth_reached']} · \"\n", " f\"best leaf score: {result.metadata['best_leaf_score']}/5\"\n", ")" ] }, { "cell_type": "markdown", "id": "c0c61d7e", "metadata": { "papermill": { "duration": 0.016815, "end_time": "2026-05-27T10:42:53.395648+00:00", "exception": false, "start_time": "2026-05-27T10:42:53.378833+00:00", "status": "completed" }, "tags": [] }, "source": [ "### 8.0 · What just happened, briefly\n", "\n", "Three counts to inspect above:\n", "\n", "- **`tree size`** — should be at most 1 (root) + branching + (max_depth − 1) × beam_width × branching = 1 + 3 + 2·2·3 = 16: the root is expanded once, and each later round expands only the `beam_width` survivors. 
A smaller count would mean some node produced fewer candidates or the search stopped early. Pruning itself shows up as branches that are never expanded (a *good* sign on this task: bad arithmetic branches should get low scores and die).\n", "- **`best leaf score`** — Game-of-24 forces objective scoring. A healthy tree has scores ranging 1-5: wrong arithmetic gets 1-2; arithmetic that's right but heads away from 24 gets 2-3; arithmetic that's right and looks promising gets 4-5.\n", "- **`max depth reached`** — should equal `max_depth=3` (3 operations reduce 4 numbers to 1)." ] }, { "cell_type": "markdown", "id": "38069d88", "metadata": { "papermill": { "duration": 0.013763, "end_time": "2026-05-27T10:42:53.425676+00:00", "exception": false, "start_time": "2026-05-27T10:42:53.411913+00:00", "status": "completed" }, "tags": [] }, "source": [ "### 8.1 · Tree visualisation" ] }, { "cell_type": "code", "execution_count": 5, "id": "7e8f887a", "metadata": { "execution": { "iopub.execute_input": "2026-05-27T10:42:53.494015Z", "iopub.status.busy": "2026-05-27T10:42:53.489457Z", "iopub.status.idle": "2026-05-27T10:42:53.531267Z", "shell.execute_reply": "2026-05-27T10:42:53.526138Z" }, "papermill": { "duration": 0.098077, "end_time": "2026-05-27T10:42:53.534401+00:00", "exception": false, "start_time": "2026-05-27T10:42:53.436324+00:00", "status": "completed" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [d=0 s=0/5 id=0] [ROOT] Game of 24. Numbers: [3, 4, 6, 8]. 
Find an arithmetic expression using each number ...\n", " [d=1 s=4/5 id=1] Start with multiplication: 3 * 4 = 12, remaining numbers: [12, 6, 8]\n", " [d=2 s=4/5 id=4] 12 + 6 = 18, remaining numbers: [18, 8]\n", " [d=3 s=1/5 id=10] 18 * 8 = 144, remaining numbers: [144]\n", " [d=3 s=1/5 id=11] 18 + 8 = 26, remaining numbers: [26]\n", " [d=3 s=1/5 id=12] 8 - 18 = -10, remaining numbers: [-10]\n", " [d=2 s=2/5 id=5] 12 * 6 = 72, remaining numbers: [72, 8]\n", " [d=2 s=2/5 id=6] 8 / 6 = 1.33, remaining numbers: [12, 1.33]\n", " [d=1 s=4/5 id=2] Begin with division: 8 / 4 = 2, remaining numbers: [3, 6, 2]\n", " [d=2 s=4/5 id=7] Try multiplication next: 6 * 3 = 18, remaining numbers: [2, 18]\n", "⭐ [d=3 s=2/5 id=13] Add 2 and 18 to get 20, then find a way to reach 24 from 20\n", " [d=3 s=2/5 id=14] Subtract 2 from 18 to get 16, then find a way to reach 24 from 16\n", " [d=3 s=2/5 id=15] Divide 18 by 2 to get 9, then find a way to reach 24 from 9\n", " [d=2 s=2/5 id=8] Attempt addition: 3 + 6 = 9, remaining numbers: [2, 9]\n", " [d=2 s=2/5 id=9] Use subtraction: 6 - 3 = 3, remaining numbers: [2, 3]\n", " [d=1 s=4/5 id=3] Unconventional approach: 8 - 6 = 2, remaining numbers: [3, 4, 2]\n", "\n", "⭐ = best-scoring leaf used to synthesise the final answer\n" ] } ], "source": [ "from collections import defaultdict\n", "\n", "children = defaultdict(list)\n", "by_id = {}\n", "for t in result.trace:\n", " by_id[t['id']] = t\n", " children[t.get('parent_id', -1)].append(t['id'])\n", "\n", "def render_tree(node_id, indent=0):\n", " if node_id not in by_id:\n", " return\n", " t = by_id[node_id]\n", " marker = '⭐' if t['id'] == result.metadata['best_leaf_id'] else ' '\n", " content = t['content'][:90].replace('\\n', ' ')\n", " print(f\"{marker} {' ' * indent}[d={t['depth']} s={t['score']}/5 id={t['id']}] {content}{'...' 
if len(t['content']) > 90 else ''}\")\n", " for child_id in children.get(node_id, []):\n", " render_tree(child_id, indent + 4)\n", "\n", "# The root has parent_id=-1, so children[-1] holds the root id(s); render the tree from there\n", "for root_child in children.get(-1, []):\n", " render_tree(root_child)\n", "print()\n", "print('⭐ = best-scoring leaf used to synthesise the final answer')" ] }, { "cell_type": "markdown", "id": "c5f52762", "metadata": { "papermill": { "duration": 0.016595, "end_time": "2026-05-27T10:42:53.567841+00:00", "exception": false, "start_time": "2026-05-27T10:42:53.551246+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 9 · What we just observed\n", "\n", "The cells above ran a 3-deep, 3-wide beam search with `beam_width=2` against **Llama 3.3** on the **Game of 24** puzzle (objective scoring forces real discrimination).\n", "\n", "### 9.1 · Quantitative summary\n", "\n", "| Metric | Value |\n", "|---|---|\n", "| Tree size | **16** thoughts |\n", "| Max depth reached | **3** / 3 |\n", "| Best leaf score | **2**/5 |\n", "| Score distribution (non-root) | [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1] |\n", "| Distinct score values | 3 |\n", "\n", "### 9.2 · Score distribution table\n", "\n", "| Score | Count |\n", "|---|---|\n", "| 4/5 | 5 |\n", "| 2/5 | 7 |\n", "| 1/5 | 3 |\n", "\n", "### 9.3 · A sample of captured thoughts\n", "\n", "| Depth | Score | id | Content snippet |\n", "|---|---|---|---|\n", "| 0 | 0/5 | 0 | [ROOT] Game of 24. Numbers: [3, 4, 6, 8]. Find an arithmetic expression using each number ... 
|\n", "| 1 | 4/5 | 1 | Start with multiplication: 3 * 4 = 12, remaining numbers: [12, 6, 8] |\n", "| 2 | 4/5 | 4 | 12 + 6 = 18, remaining numbers: [18, 8] |\n", "| 3 | 1/5 | 10 | 18 * 8 = 144, remaining numbers: [144] |\n", "| 3 | 1/5 | 11 | 18 + 8 = 26, remaining numbers: [26] |\n", "| 3 | 1/5 | 12 | 8 - 18 = -10, remaining numbers: [-10] |\n", "| 2 | 2/5 | 5 | 12 * 6 = 72, remaining numbers: [72, 8] |\n", "| 2 | 2/5 | 6 | 8 / 6 = 1.33, remaining numbers: [12, 1.33] |\n", "| 1 | 4/5 | 2 | Begin with division: 8 / 4 = 2, remaining numbers: [3, 6, 2] |\n", "| 2 | 4/5 | 7 | Try multiplication next: 6 * 3 = 18, remaining numbers: [2, 18] |\n", "| 3 | 2/5 | 13 | Add 2 and 18 to get 20, then find a way to reach 24 from 20 |\n", "| 3 | 2/5 | 14 | Subtract 2 from 18 to get 16, then find a way to reach 24 from 16 |\n", "\n", "### 9.4 · Patterns surfaced in this run\n", "\n", "- **Healthy score spread** (1-4/5). The evaluator discriminates between branches, so pruning has a real signal to act on.\n", "- **The puzzle was not solved.** The synthesised expression evaluates to 20, not 24. A solution does exist (for example 3 * 4 * (8 - 6) = 24), but the depth-1 thought `8 - 6 = 2` (id 3) tied at 4/5 with its siblings and was never expanded under `beam_width=2`.\n", "\n", "### 9.5 · Final answer (verbatim)\n", "\n", "> Step 1: Division \n", "> \n", "> Begin with division: 8 / 4 = 2, remaining numbers: [3, 6, 2] \n", "> \n", "> Step 2: Multiplication \n", "> \n", "> Try multiplication next: 6 * 3 = 18, remaining numbers: [2, 18] \n", "> \n", "> Step 3: Addition \n", "> \n", "> Add 2 and 18 to…\n", "\n", "### 9.6 · The takeaway\n", "\n", "A *healthy* ToT run has:\n", "\n", "1. **A spread of scores** across thoughts (1-5 range, not all 5/5).\n", "2. **The tree actually pruned** — at least one low-scoring branch killed off, not just exhaustive expansion.\n", "3. **The best-leaf score visibly higher** than the average score.\n", "4. 
**A final answer that obviously synthesises the winning path**, rather than paraphrasing the task.\n", "\n", "Measured against this list, the run above passes criteria 1 and 2 and fails 3 and 4 (mean non-root score ≈ 2.5 versus a best leaf of 2/5).\n", "\n", "When the evaluator is lenient (everything 5/5), the search reduces to brute-force expansion at high cost — see § 11.1 for the mitigation. 
The reasoning-model default helps but doesn't solve this entirely." ] }, { "cell_type": "markdown", "id": "f536eb5b", "metadata": { "papermill": { "duration": 0.021909, "end_time": "2026-05-27T10:42:53.606282+00:00", "exception": false, "start_time": "2026-05-27T10:42:53.584373+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 10 · Compare with the Qwen3-Thinking reasoning model\n", "\n", "Same puzzle with a different number set ([4, 5, 6, 7] instead of [3, 4, 6, 8]) and a smaller tree, using **Qwen3-Thinking** instead of Llama. We shrink the search to `branching=2, beam_width=1, max_depth=2` because reasoning models are slower per call. The expected difference is **score-distribution quality**: Qwen3-Thinking can usually distinguish a winning arithmetic step from a dead-end one more sharply than Llama, because each evaluation gets an internal `
Re-running ToT on Qwen3-Thinking (smaller tree, slower per call) ──────────────────────────────────────────────────\n", "\n" ], "text/plain": [ "\u001b[1;36mRe-running ToT on Qwen3-Thinking \u001b[0m\u001b[1;36m(\u001b[0m\u001b[1;36msmaller tree, slower per call\u001b[0m\u001b[1;36m)\u001b[0m \u001b[92m──────────────────────────────────────────────────\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Using the numbers [4, 5, 6, 7], the solution follows the identified factor pair strategy: \n", "**Step 1:** Compute $5 + 7 = 12$ and $6 - 4 = 2$. \n", "**Step 2:** Multiply the results: $12 \\times 2 = 24$. \n", "\n", "**Final Answer:** $(5 + 7) \\times (6 - 4) = 24$\n", "\n", " Score distribution (Qwen3-Thinking, 4 non-root): [4, 4, 4, 2]\n", " Note: a healthy ToT run has a SPREAD of scores (1-5). All-5s = lenient evaluator.\n" ] } ], "source": [ "print_header(\"Re-running ToT on Qwen3-Thinking (smaller tree, slower per call)\")\n", "thinking_llm = get_llm(\n", " provider=\"nebius\",\n", " model=\"Qwen/Qwen3-235B-A22B-Thinking-2507-fast\",\n", " temperature=0.4,\n", ")\n", "thinking_arch = TreeOfThoughts(branching=2, beam_width=1, max_depth=2, llm=thinking_llm)\n", "thinking_result = thinking_arch.run(\n", " \"Game of 24. Numbers: [4, 5, 6, 7]. Find arithmetic to equal 24, step by step.\"\n", ")\n", "print(thinking_result.output[:400])\n", "print()\n", "score_dist = sorted([t['score'] for t in thinking_result.trace if t['depth'] > 0], reverse=True)\n", "print(f\" Score distribution (Qwen3-Thinking, {len(score_dist)} non-root): {score_dist}\")\n", "print(f\" Note: a healthy ToT run has a SPREAD of scores (1-5). 
All-5s = lenient evaluator.\")" ] }, { "cell_type": "markdown", "id": "b6849c17", "metadata": { "papermill": { "duration": 0.014536, "end_time": "2026-05-27T10:44:45.695342+00:00", "exception": false, "start_time": "2026-05-27T10:44:45.680806+00:00", "status": "completed" }, "tags": [] }, "source": [ "## 11 · Failure modes, safety, extensions\n", "\n", "### 11.1 · Where this breaks\n", "\n", "| Failure | Mechanism | Mitigation |\n", "|---|---|---|\n", "| **Lenient evaluator** | Every thought scores 5/5 → no pruning signal | Stricter rubric (\"reserve 5 for excellence\"); use a different model as the evaluator |\n", "| **Candidate mode collapse** | K candidates are near-paraphrases | Tighten `_ThoughtCandidates` description; raise temperature on generator |\n", "| **Premature commitment** | Depth-1 thoughts are scored before they've shown their work | Defer evaluation to depth 2 or use look-ahead scoring |\n", "| **Cost explosion** | `2 × N × K × D` calls, each at reasoning-model cost | Cap depth; use a smaller evaluator model |\n", "| **Best-leaf isn't best-path** | Highest-scoring leaf may have a weak ancestor | Score *paths*, not nodes (extension) |\n", "\n", "### 11.2 · Production safety\n", "\n", "- **Hard-cap depth and branching.** Runaway ToT can rack up huge bills. Always keep `max_depth` ≤ 4.\n", "- **Tracing matters.** With 19+ LLM calls per task, a LangSmith trace is essential for debugging.\n", "- **Evaluator is a single point of failure.** If it's biased toward a framing, ToT will find paths that match the bias even if they're wrong. Use diverse rubrics or multiple judges.\n", "\n", "### 11.3 · Three extensions\n", "\n", "1. **LATS (notebook 22)** — replace beam search with Monte Carlo Tree Search + a reward model. The natural successor.\n", "2. **Path-level scoring** — score whole root-to-leaf paths instead of individual nodes; eliminates the \"best leaf isn't best path\" failure.\n", "3. 
**Process Reward Model (PRM)** — train a small reward model on intermediate steps and use it as the evaluator.\n", "\n", "### 11.4 · What to read next\n", "\n", "- [**21 · Self-Consistency**](./21_self_consistency.ipynb) — simpler N-sample-and-vote alternative.\n", "- [**22 · LATS**](./22_lats.ipynb) — ToT + reward → MCTS-style tree search.\n", "- [**01 · Reflection**](./01_reflection.ipynb) — single-path refinement vs ToT's multi-path search.\n", "\n", "### 11.5 · References\n", "\n", "1. Yao, S. et al. *Tree of Thoughts: Deliberate Problem Solving with Large Language Models.* NeurIPS 2023. [arXiv:2305.10601](https://arxiv.org/abs/2305.10601)\n", "2. Long, J. *Large Language Model Guided Tree-of-Thought.* 2023. [arXiv:2305.08291](https://arxiv.org/abs/2305.08291)\n", "3. Zhou, A. et al. *Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models.* ICML 2024. [arXiv:2310.04406](https://arxiv.org/abs/2310.04406) — the LATS paper that extends ToT.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" }, "papermill": { "default_parameters": {}, "duration": 198.427667, "end_time": "2026-05-27T10:44:47.046067+00:00", "environment_variables": {}, "exception": null, "input_path": "all-agentic-architectures/notebooks/09_tree_of_thoughts.ipynb", "output_path": "all-agentic-architectures/notebooks/09_tree_of_thoughts.ipynb", "parameters": {}, "start_time": "2026-05-27T10:41:28.618400+00:00", "version": "2.7.0" } }, "nbformat": 4, "nbformat_minor": 5 }