{ "cells": [ { "cell_type": "markdown", "id": "83efb6df-7d99-4fee-99f3-f2f668292110", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ef2ac59f-0dc1-4c3e-bb8c-2ea79e0f6657", "metadata": {}, "source": [ "# Chapter 5: Exercise Solutions" ] }, { "cell_type": "markdown", "id": "4735f8bb-dd7f-4a4f-8761-269f26b38349", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "00e26411-6a34-4c89-bc24-2e36dd14c8eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.13\n", "torch version: 2.10.0\n", "tokenizers version: 0.22.2\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "8d101721-6848-4871-826a-eaf194ddb26a", "metadata": {}, "source": [ " \n", "## Exercise 5.1: Using the heuristic scorer as a tie-breaker in self-consistency" ] }, { "cell_type": "markdown", "id": "5d9257c6-384b-46a0-9767-c2f3db7dbcf0", "metadata": {}, "source": [ "- There are many ways to implement this\n", "- The perhaps easiest way is to handle it outside the self-consistency function and work with the returned dictionary (e.g., similar to what we have done in exercise 4.4, when we implemented the tie-breaking, which we added directly to the `evaluate_math500_stream` function\n", "- The relevant lines are shown below" ] }, { "cell_type": "markdown", "id": "733ded33-7ef8-4214-bb71-4b0c206d6867", "metadata": {}, "source": [ "```python\n", "# ...\n", "from pathlib import Path\n", "import time\n", "\n", "from reasoning_from_scratch.ch05 import heuristic_score\n", "\n", "\n", "def evaluate_math500_stream(\n", " model,\n", " tokenizer,\n", " device,\n", " math_data,\n", " out_path=None,\n", " max_new_tokens=2048,\n", " verbose=False,\n", " prompt_suffix=\"\",\n", " 
temperature=1.0,\n", " top_p=1.0,\n", " seed=None,\n", " num_samples=10,\n", "):\n", " if out_path is None:\n", " dev_name = str(device).replace(\":\", \"-\")\n", " out_path = Path(f\"math500-{dev_name}.jsonl\")\n", "\n", " num_examples = len(math_data)\n", " num_correct = 0\n", " start_time = time.time()\n", "\n", " with open(out_path, \"w\", encoding=\"utf-8\") as f:\n", " for i, row in enumerate(math_data, start=1):\n", " prompt = render_prompt(row[\"problem\"]) + prompt_suffix\n", "\n", " results = self_consistency_vote(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " num_samples=num_samples,\n", " temperature=temperature,\n", " top_p=top_p,\n", " max_new_tokens=max_new_tokens,\n", " show_progress=False,\n", " show_long_answer=False,\n", " seed=seed,\n", " )\n", "\n", " # Majority vote winner available\n", " if results[\"final_answer\"] is not None:\n", " extracted = results[\"final_answer\"]\n", "\n", " ### NEW: Break tie with heuristic_score\n", " else:\n", " best = None\n", " best_score = float(\"-inf\")\n", " \n", " for cand in results[\"majority_winners\"]:\n", " scores = [\n", " heuristic_score(results[\"full_answers\"][idx], prompt=prompt)\n", " for idx in results[\"groups\"][cand]\n", " ]\n", " \n", " score = max(scores)\n", " \n", " if score > best_score:\n", " best_score = score\n", " best = cand\n", " \n", " extracted = best\n", "\n", " # ...\n", "\n", " # ...\n", " return num_correct, num_examples, acc\n", "```" ] }, { "cell_type": "markdown", "id": "0c760503-b57a-4947-b7bc-63620ebe2af9", "metadata": {}, "source": [ "- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Chapter 4 baseline with CoT prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Self-consistency (n=3) + majority vote | Base | 43.2% | 328.2 min 
|\n", "| 3 | Self-consistency (n=3) + heuristic | Base | 43.4% | 326.5 min |\n", "| 4 | Self-consistency (n=3) + avg. logprob | Base | 44.8% | 327.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "8d706d20-5d5a-4fdc-9541-076c5403c9e7", "metadata": {}, "source": [ "- However, note that as discussed in [#159](https://github.com/rasbt/reasoning-from-scratch/issues/159), we decide the majority winner based on the heuristic score but only consider the first instance in each majority pair\n", "- For instance" ] }, { "cell_type": "markdown", "id": "2f71cbd0-c297-400e-b64a-c7b7dd46a19e", "metadata": {}, "source": [ " \n", "## Exercise 5.2: Using the heuristic scorer in a Best-of-N setup" ] }, { "cell_type": "markdown", "id": "9a3244da-f5b5-4919-bb44-0a1c2eefd208", "metadata": {}, "source": [ "- Best-of-N is similar to self-consistency in that we generate multiple answers\n", "- However, instead of selecting the final answer based on majority vote, we score all answers using a scoring function (like `heuristic_score`) and return the highest-scoring answer\n", "- There are several ways to implement this behavior, but the easiest one is arguably to use the existing self-consistency function from chapter 4 as a template and swap in the `heuristic_score` as shown below" ] }, { "cell_type": "markdown", "id": "9dea8804-77c7-48bd-84df-7be4700f0ef1", "metadata": {}, "source": [ "```python\n", "# ...\n", "\n", "from reasoning_from_scratch.ch05 import (\n", " heuristic_score\n", ")\n", "\n", "def self_consistency_vote(\n", " model,\n", " tokenizer,\n", " 
prompt,\n", " device,\n", " num_samples=10,\n", " temperature=0.8,\n", " top_p=0.9,\n", " max_new_tokens=2048,\n", " show_progress=True,\n", " show_long_answer=False,\n", " seed=None,\n", "):\n", " full_answers, short_answers = [], []\n", " counts = Counter()\n", " groups = {}\n", " majority_winners, final_answer = [], None\n", " best_score, best_idx = float(\"-inf\"), None\n", "\n", " for i in range(num_samples):\n", " if seed is not None:\n", " torch.manual_seed(seed + i + 1)\n", "\n", " answer = generate_text_stream_concat_flex(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " max_new_tokens=max_new_tokens,\n", " verbose=show_long_answer,\n", " generate_func=generate_text_top_p_stream_cache,\n", " temperature=temperature,\n", " top_p=top_p,\n", " )\n", "\n", " short = extract_final_candidate(answer, fallback=\"number_then_full\")\n", " full_answers.append(answer)\n", " short_answers.append(short)\n", " counts[short] += 1\n", "\n", " if short in groups:\n", " groups[short].append(i)\n", " else:\n", " groups[short] = [i]\n", "\n", " score = heuristic_score(answer, prompt=prompt)\n", "\n", " if score > best_score:\n", " best_score, best_idx = score, i\n", "\n", " if show_progress:\n", " print(f\"[Sample {i+1}/{num_samples}] → {short!r}\")\n", "\n", " if best_idx is not None:\n", " final_answer = short_answers[best_idx]\n", " majority_winners = [final_answer]\n", "\n", " return {\n", " \"full_answers\": full_answers,\n", " \"short_answers\": short_answers,\n", " \"counts\": dict(counts),\n", " \"groups\": groups,\n", " \"majority_winners\": majority_winners,\n", " \"final_answer\": final_answer,\n", " }\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "5b96af53-1582-48a6-802a-c9c4a9b94d16", "metadata": {}, "source": [ "- The results are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with 
chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Best-of-N (n=3) + heuristic | Base | 40.6% | 327.7 min |\n", "| 3 | Best-of-N (n=3) + avg. logprob | Base | 43.2% | 330.2 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "91d783e3-36be-427f-8663-ac8a25b946d0", "metadata": {}, "source": [ " \n", "## Exercise 5.3: Using the logprob scorer as a tie-breaker in self-consistency" ] }, { "cell_type": "markdown", "id": "fe89dc83-a406-4393-96ca-68168866ff9d", "metadata": {}, "source": [ "- The code is similar to exercise 5.1, except that we swap `heuristic_score` with `avg_logprob_answer`" ] }, { "cell_type": "markdown", "id": "ba844444-1f1a-4944-8811-8da1228f2a46", "metadata": {}, "source": [ "```python\n", "# ...\n", "# from reasoning_from_scratch.ch05 import heuristic_score\n", "from reasoning_from_scratch.ch05 import avg_logprob_answer\n", "\n", "\n", "def evaluate_math500_stream(\n", " model,\n", " tokenizer,\n", " device,\n", " math_data,\n", " out_path=None,\n", " max_new_tokens=2048,\n", " verbose=False,\n", " prompt_suffix=\"\",\n", " temperature=1.0,\n", " top_p=1.0,\n", " seed=None,\n", " num_samples=10,\n", "):\n", " if out_path is None:\n", " dev_name = str(device).replace(\":\", \"-\")\n", " out_path = Path(f\"math500-{dev_name}.jsonl\")\n", "\n", " num_examples = len(math_data)\n", " num_correct = 0\n", " start_time = time.time()\n", "\n", " with open(out_path, \"w\", encoding=\"utf-8\") as f:\n", " for i, row in enumerate(math_data, start=1):\n", " prompt = render_prompt(row[\"problem\"]) + prompt_suffix\n", "\n", " results = 
self_consistency_vote(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " num_samples=num_samples,\n", " temperature=temperature,\n", " top_p=top_p,\n", " max_new_tokens=max_new_tokens,\n", " show_progress=False,\n", " show_long_answer=False,\n", " seed=seed,\n", " )\n", "\n", " # Majority vote winner available\n", " if results[\"final_answer\"] is not None:\n", " extracted = results[\"final_answer\"]\n", "\n", " ### NEW: Break tie with avg_logprob_answer\n", " else:\n", " best = None\n", " best_score = float(\"-inf\")\n", " \n", " # Consider all members of each majority group\n", " for cand in results[\"majority_winners\"]:\n", " scores = []\n", " \n", " for idx in results[\"groups\"][cand]:\n", " candidate_full = results[\"full_answers\"][idx]\n", " \n", " score = avg_logprob_answer(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " answer=candidate_full,\n", " device=device,\n", " )\n", " scores.append(score)\n", " \n", " cand_score = max(scores)\n", " \n", " if cand_score > best_score:\n", " best_score = cand_score\n", " best = cand\n", " \n", " extracted = best\n", " # ...\n", "\n", " # ...\n", " return num_correct, num_examples, acc\n", "```" ] }, { "cell_type": "markdown", "id": "21be8952-099a-4ca6-b903-f46a39b95b1f", "metadata": {}, "source": [ "- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Self-consistency (n=3) + majority vote | Base | 43.2% | 328.2 min |\n", "| 3 | Self-consistency (n=3) + heuristic | Base | 43.4% | 326.5 min |\n", "| 4 | Self-consistency (n=3) + avg logprob | Base | 44.8% | 327.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the 
MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "5bc62f87-3d9d-47cd-9eed-6e15982e478c", "metadata": {}, "source": [ " \n", "## Exercise 5.4: Using the logprob scorer in a Best-of-N setup" ] }, { "cell_type": "markdown", "id": "3dc9d6e6-f8a8-438d-a1f7-4b3a1c50a251", "metadata": {}, "source": [ "- To implement Best-of-N with a logprob scorer, we can use the code from exercise 5.2 and swap the `heuristic_score` with `avg_logprob_answer`:" ] }, { "cell_type": "markdown", "id": "7f184362-4997-4eb4-afdb-51832602cdb7", "metadata": {}, "source": [ "```python\n", "\n", "from reasoning_from_scratch.ch05 import (\n", " avg_logprob_answer\n", ")\n", "\n", "\n", "def self_consistency_vote(\n", " model,\n", " tokenizer,\n", " prompt,\n", " device,\n", " num_samples=10,\n", " temperature=0.8,\n", " top_p=0.9,\n", " max_new_tokens=2048,\n", " show_progress=True,\n", " show_long_answer=False,\n", " seed=None,\n", "):\n", " full_answers, short_answers = [], []\n", " counts = Counter()\n", " groups = {}\n", " majority_winners, final_answer = [], None\n", " best_score, best_idx = float(\"-inf\"), None\n", "\n", " for i in range(num_samples):\n", " if seed is not None:\n", " torch.manual_seed(seed + i + 1)\n", "\n", " answer = generate_text_stream_concat_flex(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " max_new_tokens=max_new_tokens,\n", " verbose=show_long_answer,\n", " generate_func=generate_text_top_p_stream_cache,\n", " temperature=temperature,\n", " top_p=top_p,\n", " )\n", "\n", " short = extract_final_candidate(answer, fallback=\"number_then_full\")\n", " full_answers.append(answer)\n", " short_answers.append(short)\n", " 
counts[short] += 1\n", "\n", " if short in groups:\n", " groups[short].append(i)\n", " else:\n", " groups[short] = [i]\n", "\n", " score = avg_logprob_answer(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " answer=answer,\n", " device=device\n", " )\n", " if score > best_score:\n", " best_score, best_idx = score, i\n", "\n", " if show_progress:\n", " print(f\"[Sample {i+1}/{num_samples}] → {short!r}\")\n", "\n", " if best_idx is not None:\n", " final_answer = short_answers[best_idx]\n", " majority_winners = [final_answer]\n", "\n", " return {\n", " \"full_answers\": full_answers,\n", " \"short_answers\": short_answers,\n", " \"counts\": dict(counts),\n", " \"groups\": groups,\n", " \"majority_winners\": majority_winners,\n", " \"final_answer\": final_answer,\n", " }\n", "```" ] }, { "cell_type": "markdown", "id": "b9b68322-7cb9-441e-ab01-f979dd66f036", "metadata": {}, "source": [ "- The results are shown below\n", "\n", "| # | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Best-of-N (n=3) + heuristic | Base | TBD | TBD |\n", "| 3 | Best-of-N (n=3) + avg. 
logprob | Base | TBD | TBD |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "015002b0-b38a-44eb-993c-92f13c7ca008", "metadata": {}, "source": [ " \n", "## Exercise 5.5: Using the heuristic score for self-refinement" ] }, { "cell_type": "markdown", "id": "d6771719-1da3-43d4-be40-dedd6f25258c", "metadata": {}, "source": [ "- Using the `heuristic_score` is actually even simpler than using the logprob score; all we need to do is change the following code:" ] }, { "cell_type": "markdown", "id": "97f2f7b6-418e-4ee8-bd56-1b4e57bc7491", "metadata": {}, "source": [ "```python\n", "from functools import partial\n", "\n", "avg_logprob_score = partial(\n", " avg_logprob_answer,\n", " model=model,\n", " tokenizer=tokenizer,\n", " device=device\n", ")\n", "\n", "\n", "torch.manual_seed(0)\n", "\n", "results_logprob = self_refinement_loop(\n", " model=model,\n", " tokenizer=tokenizer,\n", " raw_prompt=raw_prompt,\n", " device=device,\n", " iterations=2,\n", " max_response_tokens=2048,\n", " max_critique_tokens=256,\n", " score_fn=avg_logprob_score,\n", " verbose=True,\n", " temperature=0.7,\n", " top_p=0.9,\n", ")\n", "```" ] }, { "cell_type": "markdown", "id": "a7d83dbd-a1cd-461d-960e-167a12b0ef4d", "metadata": {}, "source": [ "- The updated code is:" ] }, { "cell_type": "markdown", "id": "5711c8e7-845b-4d37-81c1-ac975121f648", "metadata": {}, "source": [ "```python\n", "torch.manual_seed(0)\n", "\n", "results_heuristic = self_refinement_loop(\n", " model=model,\n", " tokenizer=tokenizer,\n", " raw_prompt=raw_prompt,\n", " device=device,\n", " iterations=2,\n", " 
max_response_tokens=2048,\n", " max_critique_tokens=256,\n", " score_fn=heuristic_score, # NEW\n", " verbose=True,\n", " temperature=0.7,\n", " top_p=0.9,\n", ")\n", "```" ] }, { "cell_type": "markdown", "id": "5a8a64eb-9761-471b-b116-2700aa2285ea", "metadata": {}, "source": [ "- The results, using the heuristic scorer, are shown in rows 4, 5, and 10:\n", "\n", "| | Method | Scoring | Iterations | Model | Accuracy | Time |\n", "|----|------------------------|---------------|------------|------------|----------|-----------|\n", "| 1 | Baseline (chapter 3) | - | - | Base | 15.2% | 10.1 min |\n", "| 2 | Self-refinement | None | 1 | Base | 25.0% | 84.8 min |\n", "| 3 | Self-refinement | None | 2 | Base | 22.0% | 165.4 min |\n", "| 4 | Self-refinement | Heuristic | 1 | Base | 21.6% | 84.7 min |\n", "| 5 | Self-refinement | Heuristic | 2 | Base | 20.8% | 151.4 min |\n", "| 6 | Self-refinement | Avg. logprob | 1 | Base | 21.4% | 85.3 min |\n", "| 7 | Self-refinement | Avg. logprob | 2 | Base | 22.0% | 165.3 min |\n", "| | | | | | | |\n", "| 8 | Baseline (chapter 3) | - | - | Reasoning | 48.2% | 182.1 min |\n", "| 9 | Self-refinement | None | 1 | Reasoning | 56.6% | 498.8 min |\n", "| 10 | Self-refinement | Heuristic | 1 | Reasoning | 57.8% | 498.6 min |\n", "| 11 | Self-refinement | Avg. 
logprob | 1 | Reasoning | 48.4% | 499.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "- For your convenience, you can run the [self_refinement_math500.py](../02_math500-more-inference-scaling-scripts/self_refinement_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }