{ "cells": [ { "cell_type": "markdown", "id": "83efb6df-7d99-4fee-99f3-f2f668292110", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ef2ac59f-0dc1-4c3e-bb8c-2ea79e0f6657", "metadata": {}, "source": [ "# Chapter 6: Exercise Solutions" ] }, { "cell_type": "markdown", "id": "4735f8bb-dd7f-4a4f-8761-269f26b38349", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": null, "id": "00e26411-6a34-4c89-bc24-2e36dd14c8eb", "metadata": {}, "outputs": [], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "8d101721-6848-4871-826a-eaf194ddb26a", "metadata": {}, "source": [ " \n", "## Exercise 6.1: Adding format-aware reward shaping" ] }, { "cell_type": "markdown", "id": "0d91cb55-cbd9-4758-8c45-c3dc0a4a5d20", "metadata": {}, "source": [ "- We can assign a partial reward (score 0.5) if no \"\\boxed{}\" answer is found as follows, using the `fallback=\"number_then_full\"` fallback we coded in chapter 3:" ] }, { "cell_type": "code", "execution_count": null, "id": "29c0be3e-f253-4cb6-8c14-231a650fa196", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.ch03 import (\n", " extract_final_candidate, grade_answer\n", ")\n", "\n", "def partial_reward_rlvr(answer_text, ground_truth):\n", " \n", " # 1) Try to extract a boxed answer\n", " boxed = extract_final_candidate(\n", " answer_text, fallback=None\n", " )\n", " if boxed:\n", " correct = grade_answer(boxed, ground_truth)\n", " return 1.0 if correct else 0.0\n", "\n", " # 2) If no boxed answer is found, look for number\n", " unboxed = extract_final_candidate(\n", " answer_text, fallback=\"number_then_full\"\n", " )\n", " if unboxed:\n", " correct = grade_answer(unboxed, ground_truth)\n", " return 0.5 if correct else 0.0\n", "\n", " return 0.0" ] }, { "cell_type": 
"markdown", "id": "9d599d14-218c-40b8-af62-eb7489d29f5e", "metadata": {}, "source": [ "- When plugged into the chapter 6 code and trained under the same settings, the partial-reward variant achieves lower accuracy (37.8%) than the standard GRPO setup (47.4%), despite using a similar number of tokens on average" ] }, { "cell_type": "markdown", "id": "31b15189-adfe-4e03-907d-5afb5fc167b1", "metadata": {}, "source": [ "| # | Method | Step | Max tokens | Num rollouts | Accuracy | Average tokens |\n", "|---|------------------------------------------|------|------------|--------------|----------|----------------|\n", "| 1 | GRPO (chapter 6) | 50 | 512 | 8 | 47.4% | 586.11 |\n", "| 2 | GRPO partial rewards (exercise 6.1) | 50 | 512 | 8 | 37.8% | 550.33 |" ] }, { "cell_type": "markdown", "id": "1862561c-aff9-4713-a280-69d5a81ea691", "metadata": {}, "source": [ " \n", "## Exercise 6.2: Zero-advantage cases" ] }, { "cell_type": "markdown", "id": "4e16a293-67d0-4ce7-b9c8-de88043d5f26", "metadata": {}, "source": [ "- If the rewards are all equal (for instance, they are all 0 or all 1), the advantages will all be 0, because subtracting the mean removes the shared reward value and leaves only zeros, which we can demonstrate below" ] }, { "cell_type": "code", "execution_count": 3, "id": "9d469faa-bffb-4a2a-a58d-4ba212c74347", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([0., 0., 0., 0.])\n" ] } ], "source": [ "import torch\n", "\n", "rollout_rewards = [0., 0., 0., 0.]\n", "rewards = torch.tensor(rollout_rewards)\n", "advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)\n", "\n", "print(advantages)" ] }, { "cell_type": "code", "execution_count": 4, "id": "97132523-84b7-4e6f-bb73-243593dda3b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([0., 0., 0., 0.])\n" ] } ], "source": [ "rollout_rewards = [1., 1., 1., 1.]\n", "rewards = torch.tensor(rollout_rewards)\n", "advantages = 
(rewards - rewards.mean()) / (rewards.std() + 1e-4)\n", "\n", "print(advantages)" ] }, { "cell_type": "markdown", "id": "7b5cea5b-1db2-40d0-9bea-bd7d79647121", "metadata": {}, "source": [ "- Now, if all advantages are 0, the loss will be zero as well, because the loss multiplies the advantages by the log probabilities, and multiplying by zero eliminates the contribution" ] }, { "cell_type": "markdown", "id": "c42d17b5-8218-4261-8f7c-cdbcc00eed12", "metadata": {}, "source": [ "```python\n", "pg_loss = -(advantages.detach() * logps).mean()\n", "```" ] }, { "cell_type": "markdown", "id": "cea793ef-994a-4b03-bdf9-80a636e771b3", "metadata": {}, "source": [ "- As a result, the policy gradient is zero and the model parameters are not updated for that prompt" ] }, { "cell_type": "markdown", "id": "fa290e0a-e617-4800-97a5-16937bced1fe", "metadata": {}, "source": [ "- This behavior is intentional; if all rollouts are equally bad or equally good, there is no relative signal to tell the model which behavior to reinforce or suppress\n", "- Intuitively, if the model answers all the questions correctly, there is no need to update it\n", "- Conversely, if the model answers all questions incorrectly, no rollout stands out as better than the others, so an update would have nothing meaningful to reinforce" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }