{ "cells": [ { "cell_type": "markdown", "id": "83efb6df-7d99-4fee-99f3-f2f668292110", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ef2ac59f-0dc1-4c3e-bb8c-2ea79e0f6657", "metadata": {}, "source": [ "# Chapter 3: Exercise Solutions" ] }, { "cell_type": "markdown", "id": "4735f8bb-dd7f-4a4f-8761-269f26b38349", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "00e26411-6a34-4c89-bc24-2e36dd14c8eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.4\n", "torch version: 2.7.1\n", "tokenizers version: 0.21.4\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "8d101721-6848-4871-826a-eaf194ddb26a", "metadata": {}, "source": [ " \n", "## Exercise 3.1: Adding more test cases" ] }, { "cell_type": "markdown", "id": "45297333-a7f9-4867-b983-20c973287363", "metadata": {}, "source": [ "- There is an endless number of different test cases we may add\n", "- Below is a selection of some interesting ones" ] }, { "cell_type": "code", "execution_count": null, "id": "d95a115c-54fc-4060-8e3b-95946c0dad27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test | Expect | Got | Status\n", "check_17 | True | True | PASS \n", "check_18 | True | True | PASS \n", "check_19 | True | True | PASS \n", "check_20 | True | False | FAIL \n", "\n", "Passed 3/4\n" ] } ], "source": [ "from reasoning_from_scratch.ch03 import (\n", " run_demos_table\n", ")\n", "\n", "more_tests = [\n", " # Different bracket types\n", " (\"check_17\", \"[1, 2]\", \"(1, 2)\", True),\n", "\n", " # Scientific notation\n", " (\"check_18\", \"1e-3\", \"0.001\", True),\n", "\n", " # Algebraic simplification with caret exponent\n", " (\"check_19\", 
\"(-3)^2\", \"9\", True),\n", "\n", " # Unicode minus (U+2212) vs ASCII hyphen-minus\n", " (\"check_20\", \"−1\", \"-1\", True),\n", "\n", "]\n", "\n", "run_demos_table(more_tests)" ] }, { "cell_type": "markdown", "id": "ec64c54a-b323-46a8-a6d3-4d5de9122b0a", "metadata": {}, "source": [ "- As we can see, the tests pass in all cases except for `check_20`, which replaces the ASCII hyphen-minus with the Unicode minus sign (U+2212); the two characters look nearly identical to the human eye\n", "- We could fix this test case by adding either of the following two equivalent lines to the `normalize_text` function:\n", "\n", "```python\n", "text = text.replace(\"−\", \"-\")\n", "# or\n", "text = text.replace(\"\\u2212\", \"-\")\n", "```\n", "\n", "- At first glance, another interesting test is the following one:" ] }, { "cell_type": "code", "execution_count": 2, "id": "ba984636-a415-4a87-b945-adce82b20ed1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test | Expect | Got | Status\n", "check_21 | True | False | FAIL \n", "\n", "Passed 0/1\n" ] } ], "source": [ "extra_tests_1 = [\n", " (\"check_21\", \"Text around answer 3.\", \"3\", True)\n", "]\n", "\n", "run_demos_table(extra_tests_1)" ] }, { "cell_type": "markdown", "id": "c96f00f7-2831-442d-a4ca-63ce6805fef3", "metadata": {}, "source": [ "- While it may seem that our code cannot handle answers surrounded by text, this is actually a poorly designed test\n", "- The `run_demos_table` function is intended to test only the `grade_answer` function\n", "- The `grade_answer` function would never receive the full response in this form, since the answer would have been extracted from the text before being passed to it" ] }, { "cell_type": "markdown", "id": "f5855ab1-c7cf-4d03-a442-2d0742dc8de9", "metadata": {}, "source": [ "In other words, if we want to test an answer that is embedded in text, we need to extract the answer first and then pass it to the test:" ] }, { "cell_type": "code", "execution_count": 3, "id": 
"a5f2533c-40b1-4560-b3f8-36804a02b789", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test | Expect | Got | Status\n", "check_21 | True | True | PASS \n", "\n", "Passed 1/1\n" ] } ], "source": [ "from reasoning_from_scratch.ch03 import (\n", " extract_final_candidate\n", ")\n", "\n", "\n", "extra_tests_2 = [\n", " (\"check_21\",\n", " extract_final_candidate(\"Text around answer 3.\"),\n", " \"3\", True)\n", "]\n", "run_demos_table(extra_tests_2)" ] }, { "cell_type": "markdown", "id": "43174d8d-b0a1-4fa6-b02f-8c91bfd0fdf1", "metadata": {}, "source": [ " \n", "## Exercise 3.2: Calculating the average response length" ] }, { "cell_type": "markdown", "id": "a9a4acba-c550-4034-96c2-75a550303e9d", "metadata": {}, "source": [ "- Option A: We could modify the `evaluate_math500_stream` function by adding the following lines:\n", "\n", "```python\n", "# ...\n", "# below `num_correct = 0`\n", "total_len = 0\n", "\n", "# ...\n", "# inside for i, row in enumerate(math_data, start=1):\n", "# anywhere below `gen_text = ...`\n", "total_len += len(tokenizer.encode(gen_text))\n", "\n", "# ...\n", "# anywhere at the bottom before the return statement\n", "avg_len = total_len / num_examples\n", "print(f\"Average length: {avg_len:.2f} tokens\")\n", "```" ] }, { "cell_type": "markdown", "id": "03718c8e-c926-4285-a5b8-eeb4dabd8652", "metadata": {}, "source": [ "- Alternatively, we can also calculate the response lengths from the `.jsonl` files that were created when we ran the `evaluate_math500_stream` function in the main chapter\n", "- First, we load the `.jsonl` file as follows:" ] }, { "cell_type": "code", "execution_count": 5, "id": "5d160527-d358-49fc-ac78-50b16038bdd8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of entries: 10\n" ] } ], "source": [ "import json\n", "from pathlib import Path\n", "\n", "WHICH_MODEL = \"base\"\n", "\n", "dev_name = \"mps\" # e.g., \"cuda\", \"cpu\"\n", "\n", "# 
You may need to adjust this path:\n", "local_path = Path(f\"math500-{dev_name}.jsonl\")\n", "if not local_path.exists():\n", " raise FileNotFoundError(\n", " f\"{local_path} not found. Run ch03_main.ipynb to create it.\"\n", " )\n", "\n", "results = []\n", "with open(local_path, \"r\") as f:\n", " for line in f:\n", " if line.strip():\n", " results.append(json.loads(line))\n", "\n", "print(\"Number of entries:\", len(results))\n" ] }, { "cell_type": "markdown", "id": "ff5a0eba-1a4d-4dc5-9be1-b89a973ef894", "metadata": {}, "source": [ "- Note that each entry has multiple keys; however, we are only interested in the `\"generated_text\"` key, which contains the model's full answer:" ] }, { "cell_type": "code", "execution_count": 6, "id": "0bcfe049-12a9-4156-973e-888680aff717", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['index', 'problem', 'gtruth_answer', 'generated_text', 'extracted', 'correct'])\n" ] } ], "source": [ "print(results[0].keys())" ] }, { "cell_type": "markdown", "id": "84783139-fa9f-404c-ace6-eb61a625eac9", "metadata": {}, "source": [ "- Next, we load the tokenizer that matches the model selected via `WHICH_MODEL`; we need it to count the number of tokens in each generated text:" ] }, { "cell_type": "code", "execution_count": 7, "id": "ace9b5b1-8e2f-4765-a62e-6ff67649a4b7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✓ qwen3/tokenizer-base.json already up-to-date\n" ] } ], "source": [ "from reasoning_from_scratch.qwen3 import (\n", " download_qwen3_small,\n", " Qwen3Tokenizer\n", ")\n", "\n", "if WHICH_MODEL == \"base\":\n", "\n", " download_qwen3_small(\n", " kind=\"base\", tokenizer_only=True, out_dir=\"qwen3\"\n", " )\n", " tokenizer_path = Path(\"qwen3\") / \"tokenizer-base.json\"\n", " tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)\n", "\n", "elif WHICH_MODEL == \"reasoning\":\n", "\n", " download_qwen3_small(\n", " kind=\"reasoning\", 
tokenizer_only=True, out_dir=\"qwen3\"\n", " )\n", " tokenizer_path = Path(\"qwen3\") / \"tokenizer-reasoning.json\"\n", " tokenizer = Qwen3Tokenizer(\n", " tokenizer_file_path=tokenizer_path,\n", " apply_chat_template=True,\n", " add_generation_prompt=True,\n", " add_thinking=True,\n", " )" ] }, { "cell_type": "markdown", "id": "f88842d5-7c6d-404f-9b9f-bc359ff5f19f", "metadata": {}, "source": [ "- Then, we can calculate the average length as follows, which is similar to how we could have modified the `evaluate_math500_stream` function:" ] }, { "cell_type": "code", "execution_count": 8, "id": "29ac2c32-2c1b-403e-8f09-78609378659a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average length: 98.00 tokens\n" ] } ], "source": [ "total_len = 0\n", "\n", "for item in results:\n", " num_tokens = len(tokenizer.encode(item[\"generated_text\"]))\n", " total_len += num_tokens\n", "\n", "avg_len = total_len / len(results)\n", "print(f\"Average length: {avg_len:.2f} tokens\")" ] }, { "cell_type": "markdown", "id": "ba80c4c4-9fb4-476c-9711-9bcc922755ac", "metadata": {}, "source": [ "| Mode | Device | Average length (tokens) | MATH-500 size |\n", "|-----------|---------|----------------|----------------|\n", "| Base | CPU | 97.30 | 10 |\n", "| Base | MPS | 98.00 | 10 |\n", "| Reasoning | CPU | 891.80 | 10 |\n", "| Reasoning | MPS | 1159.30 | 10 |\n", "| | | | |\n", "| Base | CUDA | 96.74 | 500 |\n", "| Reasoning | CUDA | 1361.21 | 500 |\n" ] }, { "cell_type": "markdown", "id": "20204302-92e3-4ee7-9572-83247efc271f", "metadata": {}, "source": [ "- As we can see, and as expected, the reasoning model writes much longer responses: roughly 9 to 14 times as many tokens as the base model, depending on the device and dataset size" ] }, { "cell_type": "markdown", "id": "8121971d-48a0-4fa1-bfde-5ff2df9b3c41", "metadata": {}, "source": [ " \n", "## Exercise 3.3: Extending or changing the evaluation dataset" ] }, { "cell_type": "markdown", "id": "7cbf0102-a84e-4236-baca-dc75b27780f2", "metadata": {}, "source": [ "- To evaluate the model on a larger dataset, 
we can simply change `math_data[:10]` to a different or larger slice (up to all 500 examples)\n", "\n", "```python\n", "num_correct, num_examples, acc = evaluate_math500_stream(\n", " model, tokenizer, device,\n", " math_data=math_data[:10],\n", " max_new_tokens=2048,\n", " verbose=False\n", ")\n", "```\n", "\n", "- The table below shows the accuracy values for different dataset sizes (since the MATH-500 test set is already shuffled, no additional shuffling was applied)" ] }, { "cell_type": "markdown", "id": "74f49767-da9b-4832-a6aa-983746671b9c", "metadata": {}, "source": [ "| Mode | Device | Accuracy | MATH-500 size |\n", "|-----------|---------|----------|----------------|\n", "| Base | CUDA | 30.0% | 10 |\n", "| Base | CUDA | 34.0% | 50 |\n", "| Base | CUDA | 27.0% | 100 |\n", "| Base | CUDA | 31.0% | 200 |\n", "| Base | CUDA | 15.3% | 500 |\n", "| | | | |\n", "| Reasoning | CUDA | 90.0% | 10 |\n", "| Reasoning | CUDA | 58.0% | 50 |\n", "| Reasoning | CUDA | 58.0% | 100 |\n", "| Reasoning | CUDA | 56.0% | 200 |\n", "| Reasoning | CUDA | 48.2% | 500 |" ] }, { "cell_type": "markdown", "id": "393bd5fa-a950-482d-8c7c-248cc4abc7bb", "metadata": {}, "source": [ "- As the results above show, the first 10 examples are not very representative of the performance on the full set of 500 examples" ] }, { "cell_type": "markdown", "id": "2c88f81f-c9b4-45d1-8a94-a1c440817c67", "metadata": {}, "source": [ "- In addition, we can create an entirely new dataset in a similar style to MATH-500\n", "- For example, this book's GitHub repository includes a 50-example dataset in MATH-500 style (https://github.com/rasbt/reasoning-from-scratch/tree/main/ch03/01_main-chapter-code); we can use it in the main chapter by changing the filename from `math500_test.json` to `math_new50_exercise.json`\n", "- The performance of the base and reasoning models is as follows:\n", " - base: 36.0% (18/50)\n", " - reasoning: 80.0% 
(40/50)\n", "- From this, we can conclude that, while the original MATH-500 test dataset may have been included in Qwen3's training data, the models perform at least as well on new math questions, which suggests that they are not suffering from extensive overfitting to the original MATH-500 data" ] }, { "cell_type": "markdown", "id": "8d8dae53-5984-4f9d-af27-97b24bfe92e3", "metadata": {}, "source": [ " \n", "## Exercise 3.4: Experimenting with different prompt templates" ] }, { "cell_type": "markdown", "id": "95288d91-3fbe-4305-822b-de0aa04af513", "metadata": {}, "source": [ "- We could use an alternative prompt, similar to the one suggested in the chapter, that uses \"Problem\" instead of \"Question\":" ] }, { "cell_type": "markdown", "id": "ce473e57-7ace-4780-9837-e26b01863997", "metadata": {}, "source": [ "```python\n", "def render_prompt(prompt):\n", " template = (\n", " \"You are a helpful math assistant.\\n\"\n", " \"Solve the problem and write the final result on a new line as:\\n\"\n", " \"\\\\boxed{ANSWER}\\n\\n\"\n", " f\"Problem:\\n{prompt}\\n\\nAnswer:\"\n", " )\n", " return template\n", "```" ] }, { "cell_type": "markdown", "id": "57447dd1-87da-4978-8422-63c7302039ec", "metadata": {}, "source": [ "- Using this prompt improves the accuracy of the base model on the 500 examples from 15.3% to 31.2%\n", "- Conversely, it slightly reduces the accuracy of the reasoning model, from 50.8% to 50.0%\n", "- From these observations, we may conclude that the base model is much more sensitive to the prompt format than the reasoning model (likely because it memorized some prompt-formatted MATH-500 examples from the training set); the reasoning model seems largely unaffected" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }