{ "cells": [ { "cell_type": "markdown", "id": "83efb6df-7d99-4fee-99f3-f2f668292110", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ef2ac59f-0dc1-4c3e-bb8c-2ea79e0f6657", "metadata": {}, "source": [ "# Chapter 5: Exercise Solutions" ] }, { "cell_type": "markdown", "id": "4735f8bb-dd7f-4a4f-8761-269f26b38349", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "00e26411-6a34-4c89-bc24-2e36dd14c8eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.13\n", "torch version: 2.10.0\n", "tokenizers version: 0.22.2\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "8d101721-6848-4871-826a-eaf194ddb26a", "metadata": {}, "source": [ " \n", "## Exercise 5.1: Using the heuristic scorer as a tie-breaker in self-consistency" ] }, { "cell_type": "markdown", "id": "5d9257c6-384b-46a0-9767-c2f3db7dbcf0", "metadata": {}, "source": [ "- There are many ways to implement this\n", "- The perhaps easiest way is to handle it outside the self-consistency function and work with the returned dictionary (e.g., similar to what we have done in exercise 4.4, when we implemented the tie-breaking, which we added directly to the `evaluate_math500_stream` function\n", "- The relevant lines are shown below" ] }, { "cell_type": "markdown", "id": "733ded33-7ef8-4214-bb71-4b0c206d6867", "metadata": {}, "source": [ "```python\n", "# ...\n", "from pathlib import Path\n", "import time\n", "\n", "from reasoning_from_scratch.ch05 import heuristic_score\n", "\n", "\n", "def evaluate_math500_stream(\n", " model,\n", " tokenizer,\n", " device,\n", " math_data,\n", " out_path=None,\n", " max_new_tokens=2048,\n", " verbose=False,\n", " prompt_suffix=\"\",\n", " 
temperature=1.0,\n", " top_p=1.0,\n", " seed=None,\n", " num_samples=10,\n", "):\n", " if out_path is None:\n", " dev_name = str(device).replace(\":\", \"-\")\n", " out_path = Path(f\"math500-{dev_name}.jsonl\")\n", "\n", " num_examples = len(math_data)\n", " num_correct = 0\n", " start_time = time.time()\n", "\n", " with open(out_path, \"w\", encoding=\"utf-8\") as f:\n", " for i, row in enumerate(math_data, start=1):\n", " prompt = render_prompt(row[\"problem\"]) + prompt_suffix\n", "\n", " results = self_consistency_vote(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " num_samples=num_samples,\n", " temperature=temperature,\n", " top_p=top_p,\n", " max_new_tokens=max_new_tokens,\n", " show_progress=False,\n", " show_long_answer=False,\n", " seed=seed,\n", " )\n", "\n", " # Majority vote winner available\n", " if results[\"final_answer\"] is not None:\n", " extracted = results[\"final_answer\"]\n", "\n", " ### NEW: Break tie with heuristic_score\n", " else:\n", " best = None\n", " best_score = float(\"-inf\")\n", " \n", " for cand in results[\"majority_winners\"]:\n", " scores = [\n", " heuristic_score(results[\"full_answers\"][idx], prompt=prompt)\n", " for idx in results[\"groups\"][cand]\n", " ]\n", " \n", " score = max(scores)\n", " \n", " if score > best_score:\n", " best_score = score\n", " best = cand\n", " \n", " extracted = best\n", "\n", " # ...\n", "\n", " # ...\n", " return num_correct, num_examples, acc\n", "```" ] }, { "cell_type": "markdown", "id": "0c760503-b57a-4947-b7bc-63620ebe2af9", "metadata": {}, "source": [ "- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Chapter 4 baseline with CoT prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Self-consistency (n=3) + majority vote | Base | 43.2% | 328.2 min 
|\n", "| 3 | Self-consistency (n=3) + heuristic | Base | 43.4% | 326.5 min |\n", "| 4 | Self-consistency (n=3) + avg. logprob | Base | 44.8% | 327.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "8d706d20-5d5a-4fdc-9541-076c5403c9e7", "metadata": {}, "source": [ "- However, note that as discussed in [#159](https://github.com/rasbt/reasoning-from-scratch/issues/159), we decide the majority winner based on the heuristic score but only consider the first instance in each majority pair\n", "- For instance" ] }, { "cell_type": "markdown", "id": "2f71cbd0-c297-400e-b64a-c7b7dd46a19e", "metadata": {}, "source": [ " \n", "## Exercise 5.2: Using the heuristic scorer in a Best-of-N setup" ] }, { "cell_type": "markdown", "id": "9a3244da-f5b5-4919-bb44-0a1c2eefd208", "metadata": {}, "source": [ "- Best-of-N is similar to self-consistency in that we generate multiple answers\n", "- However, instead of selecting the final answer based on majority vote, we score all answers using a scoring function (like `heuristic_score`) and return the highest-scoring answer\n", "- There are several ways to implement this behavior, but the easiest one is arguably to use the existing self-consistency function from chapter 4 as a template and swap in the `heuristic_score` as shown below" ] }, { "cell_type": "markdown", "id": "9dea8804-77c7-48bd-84df-7be4700f0ef1", "metadata": {}, "source": [ "```python\n", "# ...\n", "\n", "from reasoning_from_scratch.ch05 import (\n", " heuristic_score\n", ")\n", "\n", "def self_consistency_vote(\n", " model,\n", " tokenizer,\n", " 
prompt,\n", " device,\n", " num_samples=10,\n", " temperature=0.8,\n", " top_p=0.9,\n", " max_new_tokens=2048,\n", " show_progress=True,\n", " show_long_answer=False,\n", " seed=None,\n", "):\n", " full_answers, short_answers = [], []\n", " counts = Counter()\n", " groups = {}\n", " majority_winners, final_answer = [], None\n", " best_score, best_idx = float(\"-inf\"), None\n", "\n", " for i in range(num_samples):\n", " if seed is not None:\n", " torch.manual_seed(seed + i + 1)\n", "\n", " answer = generate_text_stream_concat_flex(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " max_new_tokens=max_new_tokens,\n", " verbose=show_long_answer,\n", " generate_func=generate_text_top_p_stream_cache,\n", " temperature=temperature,\n", " top_p=top_p,\n", " )\n", "\n", " short = extract_final_candidate(answer, fallback=\"number_then_full\")\n", " full_answers.append(answer)\n", " short_answers.append(short)\n", " counts[short] += 1\n", "\n", " if short in groups:\n", " groups[short].append(i)\n", " else:\n", " groups[short] = [i]\n", "\n", " score = heuristic_score(answer, prompt=prompt)\n", "\n", " if score > best_score:\n", " best_score, best_idx = score, i\n", "\n", " if show_progress:\n", " print(f\"[Sample {i+1}/{num_samples}] → {short!r}\")\n", "\n", " if best_idx is not None:\n", " final_answer = short_answers[best_idx]\n", " majority_winners = [final_answer]\n", "\n", " return {\n", " \"full_answers\": full_answers,\n", " \"short_answers\": short_answers,\n", " \"counts\": dict(counts),\n", " \"groups\": groups,\n", " \"majority_winners\": majority_winners,\n", " \"final_answer\": final_answer,\n", " }\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "5b96af53-1582-48a6-802a-c9c4a9b94d16", "metadata": {}, "source": [ "- The results are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with 
chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Best-of-N (n=3) + heuristic | Base | 40.6% | 327.7 min |\n", "| 3 | Best-of-N (n=3) + avg. logprob | Base | 43.2% | 330.2 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "91d783e3-36be-427f-8663-ac8a25b946d0", "metadata": {}, "source": [ " \n", "## Exercise 5.3: Using the logprob scorer as a tie-breaker in self-consistency" ] }, { "cell_type": "markdown", "id": "fe89dc83-a406-4393-96ca-68168866ff9d", "metadata": {}, "source": [ "- The code is similar to exercise 5.1, except that we swap `heuristic_score` with `avg_logprob_answer`" ] }, { "cell_type": "markdown", "id": "ba844444-1f1a-4944-8811-8da1228f2a46", "metadata": {}, "source": [ "```python\n", "# ...\n", "# from reasoning_from_scratch.ch05 import heuristic_score\n", "from reasoning_from_scratch.ch05 import avg_logprob_answer\n", "\n", "\n", "def evaluate_math500_stream(\n", " model,\n", " tokenizer,\n", " device,\n", " math_data,\n", " out_path=None,\n", " max_new_tokens=2048,\n", " verbose=False,\n", " prompt_suffix=\"\",\n", " temperature=1.0,\n", " top_p=1.0,\n", " seed=None,\n", " num_samples=10,\n", "):\n", " if out_path is None:\n", " dev_name = str(device).replace(\":\", \"-\")\n", " out_path = Path(f\"math500-{dev_name}.jsonl\")\n", "\n", " num_examples = len(math_data)\n", " num_correct = 0\n", " start_time = time.time()\n", "\n", " with open(out_path, \"w\", encoding=\"utf-8\") as f:\n", " for i, row in enumerate(math_data, start=1):\n", " prompt = render_prompt(row[\"problem\"]) + prompt_suffix\n", "\n", " results = 
self_consistency_vote(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " num_samples=num_samples,\n", " temperature=temperature,\n", " top_p=top_p,\n", " max_new_tokens=max_new_tokens,\n", " show_progress=False,\n", " show_long_answer=False,\n", " seed=seed,\n", " )\n", "\n", " # Majority vote winner available\n", " if results[\"final_answer\"] is not None:\n", " extracted = results[\"final_answer\"]\n", "\n", " ### NEW: Break tie with avg_logprob_answer\n", " else:\n", " best = None\n", " best_score = float(\"-inf\")\n", " \n", " # Consider all members of each majority group\n", " for cand in results[\"majority_winners\"]:\n", " scores = []\n", " \n", " for idx in results[\"groups\"][cand]:\n", " candidate_full = results[\"full_answers\"][idx]\n", " \n", " score = avg_logprob_answer(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " answer=candidate_full,\n", " device=device,\n", " )\n", " scores.append(score)\n", " \n", " cand_score = max(scores)\n", " \n", " if cand_score > best_score:\n", " best_score = cand_score\n", " best = cand\n", " \n", " extracted = best\n", " # ...\n", "\n", " # ...\n", " return num_correct, num_examples, acc\n", "```" ] }, { "cell_type": "markdown", "id": "21be8952-099a-4ca6-b903-f46a39b95b1f", "metadata": {}, "source": [ "- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below\n", "\n", "| | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Self-consistency (n=3) + majority vote | Base | 43.2% | 328.2 min |\n", "| 3 | Self-consistency (n=3) + heuristic | Base | 43.4% | 326.5 min |\n", "| 4 | Self-consistency (n=3) + avg logprob | Base | 44.8% | 327.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the 
MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "5bc62f87-3d9d-47cd-9eed-6e15982e478c", "metadata": {}, "source": [ " \n", "## Exercise 5.4: Using the logprob scorer in a Best-of-N setup" ] }, { "cell_type": "markdown", "id": "3dc9d6e6-f8a8-438d-a1f7-4b3a1c50a251", "metadata": {}, "source": [ "- To implement Best-of-N with a logprob scorer, we can use the code from exercise 5.2 and swap the `heuristic_score` with `avg_logprob_answer`:" ] }, { "cell_type": "markdown", "id": "7f184362-4997-4eb4-afdb-51832602cdb7", "metadata": {}, "source": [ "```python\n", "\n", "from reasoning_from_scratch.ch05 import (\n", " avg_logprob_answer\n", ")\n", "\n", "\n", "def self_consistency_vote(\n", " model,\n", " tokenizer,\n", " prompt,\n", " device,\n", " num_samples=10,\n", " temperature=0.8,\n", " top_p=0.9,\n", " max_new_tokens=2048,\n", " show_progress=True,\n", " show_long_answer=False,\n", " seed=None,\n", "):\n", " full_answers, short_answers = [], []\n", " counts = Counter()\n", " groups = {}\n", " majority_winners, final_answer = [], None\n", " best_score, best_idx = float(\"-inf\"), None\n", "\n", " for i in range(num_samples):\n", " if seed is not None:\n", " torch.manual_seed(seed + i + 1)\n", "\n", " answer = generate_text_stream_concat_flex(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " device=device,\n", " max_new_tokens=max_new_tokens,\n", " verbose=show_long_answer,\n", " generate_func=generate_text_top_p_stream_cache,\n", " temperature=temperature,\n", " top_p=top_p,\n", " )\n", "\n", " short = extract_final_candidate(answer, fallback=\"number_then_full\")\n", " full_answers.append(answer)\n", " short_answers.append(short)\n", " 
counts[short] += 1\n", "\n", " if short in groups:\n", " groups[short].append(i)\n", " else:\n", " groups[short] = [i]\n", "\n", " score = avg_logprob_answer(\n", " model=model,\n", " tokenizer=tokenizer,\n", " prompt=prompt,\n", " answer=answer,\n", " device=device\n", " )\n", " if score > best_score:\n", " best_score, best_idx = score, i\n", "\n", " if show_progress:\n", " print(f\"[Sample {i+1}/{num_samples}] → {short!r}\")\n", "\n", " if best_idx is not None:\n", " final_answer = short_answers[best_idx]\n", " majority_winners = [final_answer]\n", "\n", " return {\n", " \"full_answers\": full_answers,\n", " \"short_answers\": short_answers,\n", " \"counts\": dict(counts),\n", " \"groups\": groups,\n", " \"majority_winners\": majority_winners,\n", " \"final_answer\": final_answer,\n", " }\n", "```" ] }, { "cell_type": "markdown", "id": "b9b68322-7cb9-441e-ab01-f979dd66f036", "metadata": {}, "source": [ "- The results are shown below\n", "\n", "| # | Method | Model | Accuracy | Time |\n", "|---|------------------------------------------|-------|----------|-----------|\n", "| 1 | Baseline with chain-of-thought prompting | Base | 33.4% | 129.2 min |\n", "| 2 | Best-of-N (n=3) + heuristic | Base | TBD | TBD |\n", "| 3 | Best-of-N (n=3) + avg. 
logprob | Base | TBD | TBD |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "\n", "- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] }, { "cell_type": "markdown", "id": "015002b0-b38a-44eb-993c-92f13c7ca008", "metadata": {}, "source": [ " \n", "## Exercise 5.5: Using the heuristic score for self-refinement" ] }, { "cell_type": "markdown", "id": "d6771719-1da3-43d4-be40-dedd6f25258c", "metadata": {}, "source": [ "- Using the `heuristic_score` is actually even simpler than using the logprob score; all we need to do is change the following code:" ] }, { "cell_type": "markdown", "id": "97f2f7b6-418e-4ee8-bd56-1b4e57bc7491", "metadata": {}, "source": [ "```python\n", "from functools import partial\n", "\n", "avg_logprob_score = partial(\n", " avg_logprob_answer,\n", " model=model,\n", " tokenizer=tokenizer,\n", " device=device\n", ")\n", "\n", "\n", "torch.manual_seed(0)\n", "\n", "results_logprob = self_refinement_loop(\n", " model=model,\n", " tokenizer=tokenizer,\n", " raw_prompt=raw_prompt,\n", " device=device,\n", " iterations=2,\n", " max_response_tokens=2048,\n", " max_critique_tokens=256,\n", " score_fn=avg_logprob_score,\n", " verbose=True,\n", " temperature=0.7,\n", " top_p=0.9,\n", ")\n", "```" ] }, { "cell_type": "markdown", "id": "a7d83dbd-a1cd-461d-960e-167a12b0ef4d", "metadata": {}, "source": [ "- The updated code is:" ] }, { "cell_type": "markdown", "id": "5711c8e7-845b-4d37-81c1-ac975121f648", "metadata": {}, "source": [ "```python\n", "torch.manual_seed(0)\n", "\n", "results_heuristic = self_refinement_loop(\n", " model=model,\n", " tokenizer=tokenizer,\n", " raw_prompt=raw_prompt,\n", " device=device,\n", " iterations=2,\n", " 
max_response_tokens=2048,\n", " max_critique_tokens=256,\n", " score_fn=heuristic_score, # NEW\n", " verbose=True,\n", " temperature=0.7,\n", " top_p=0.9,\n", ")\n", "```" ] }, { "cell_type": "markdown", "id": "5a8a64eb-9761-471b-b116-2700aa2285ea", "metadata": {}, "source": [ "- The results, using the heuristic scorer, are shown in rows 4, 5, and 10:\n", "\n", "| | Method | Scoring | Iterations | Model | Accuracy | Time |\n", "|----|------------------------|---------------|------------|------------|----------|-----------|\n", "| 1 | Baseline (chapter 3) | - | - | Base | 15.2% | 10.1 min |\n", "| 2 | Self-refinement | None | 1 | Base | 25.0% | 84.8 min |\n", "| 3 | Self-refinement | None | 2 | Base | 22.0% | 165.4 min |\n", "| 4 | Self-refinement | Heuristic | 1 | Base | 21.6% | 84.7 min |\n", "| 5 | Self-refinement | Heuristic | 2 | Base | 20.8% | 151.4 min |\n", "| 6 | Self-refinement | Avg. logprob | 1 | Base | 21.4% | 85.3 min |\n", "| 7 | Self-refinement | Avg. logprob | 2 | Base | 22.0% | 165.3 min |\n", "| | | | | | | |\n", "| 8 | Baseline (chapter 3) | - | - | Reasoning | 48.2% | 182.1 min |\n", "| 9 | Self-refinement | None | 1 | Reasoning | 56.6% | 498.8 min |\n", "| 10 | Self-refinement | Heuristic | 1 | Reasoning | 57.8% | 498.6 min |\n", "| 11 | Self-refinement | Avg. 
logprob | 1 | Reasoning | 48.4% | 499.7 min |\n", "\n", "- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a \"cuda\" GPU (DGX Spark)\n", "- For your convenience, you can run the [self_refinement_math500.py](../02_math500-more-inference-scaling-scripts/self_refinement_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }