{ "cells": [ { "cell_type": "markdown", "id": "83efb6df-7d99-4fee-99f3-f2f668292110", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "ef2ac59f-0dc1-4c3e-bb8c-2ea79e0f6657", "metadata": {}, "source": [ "# Chapter 6: Exercise Solutions" ] }, { "cell_type": "markdown", "id": "4735f8bb-dd7f-4a4f-8761-269f26b38349", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": null, "id": "00e26411-6a34-4c89-bc24-2e36dd14c8eb", "metadata": {}, "outputs": [], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "8d101721-6848-4871-826a-eaf194ddb26a", "metadata": {}, "source": [ " \n", "## Exercise 6.1: Adding format-aware reward shaping" ] }, { "cell_type": "markdown", "id": "0d91cb55-cbd9-4758-8c45-c3dc0a4a5d20", "metadata": {}, "source": [ "- We can assign a partial reward (score 0.5) if no \"\\boxed{}\" answer is found as follows, using the `fallback=\"number_then_full\"` fallback we coded in chapter 3:" ] }, { "cell_type": "code", "execution_count": null, "id": "29c0be3e-f253-4cb6-8c14-231a650fa196", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.ch03 import (\n", " extract_final_candidate, grade_answer\n", ")\n", "\n", "def partial_reward_rlvr(answer_text, ground_truth):\n", " \n", " # 1) Try to extract a boxed answer\n", " boxed = extract_final_candidate(\n", " answer_text, fallback=None\n", " )\n", " if boxed:\n", " correct = grade_answer(boxed, ground_truth)\n", " return 1.0 if correct else 0.0\n", "\n", " # 2) If no boxed answer is found, look for number\n", " unboxed = extract_final_candidate(\n", " answer_text, fallback=\"number_then_full\"\n", " )\n", " if unboxed:\n", " correct = grade_answer(unboxed, ground_truth)\n", " return 0.5 if correct else 0.0\n", "\n", " return 0.0" ] }, { "cell_type": 
"markdown", "id": "9d599d14-218c-40b8-af62-eb7489d29f5e", "metadata": {}, "source": [ "- When plugged into the chapter 6 code and trained under the same settings, the partial-reward variant achieves lower accuracy (37.8%) than the standard GRPO setup (47.4%), despite using a similar number of tokens on average" ] }, { "cell_type": "markdown", "id": "31b15189-adfe-4e03-907d-5afb5fc167b1", "metadata": {}, "source": [ "| # | Method | Step | Max tokens | Num rollouts | Accuracy | Average tokens |\n", "|---|------------------------------------------|------|------------|--------------|----------|----------------|\n", "| 1 | GRPO (chapter 6) | 50 | 512 | 8 | 47.4% | 586.11 |\n", "| 2 | GRPO partial rewards (exercise 6.1) | 50 | 512 | 8 | 37.8% | 550.33 |" ] }, { "cell_type": "markdown", "id": "1862561c-aff9-4713-a280-69d5a81ea691", "metadata": {}, "source": [ " \n", "## Exercise 6.2: Zero-advantage cases" ] }, { "cell_type": "markdown", "id": "4e16a293-67d0-4ce7-b9c8-de88043d5f26", "metadata": {}, "source": [ "- If the rewards are all equal (for instance, they are all 0 or all 1), the advantages will all be 0, because subtracting the mean removes the shared reward value and leaves only zeros, which we can demonstrate below" ] }, { "cell_type": "code", "execution_count": 3, "id": "9d469faa-bffb-4a2a-a58d-4ba212c74347", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([0., 0., 0., 0.])\n" ] } ], "source": [ "import torch\n", "\n", "rollout_rewards = [0., 0., 0., 0.]\n", "rewards = torch.tensor(rollout_rewards)\n", "advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)\n", "\n", "print(advantages)" ] }, { "cell_type": "code", "execution_count": 4, "id": "97132523-84b7-4e6f-bb73-243593dda3b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([0., 0., 0., 0.])\n" ] } ], "source": [ "rollout_rewards = [1., 1., 1., 1.]\n", "rewards = torch.tensor(rollout_rewards)\n", "advantages = 
(rewards - rewards.mean()) / (rewards.std() + 1e-4)\n", "\n", "print(advantages)" ] }, { "cell_type": "markdown", "id": "7b5cea5b-1db2-40d0-9bea-bd7d79647121", "metadata": {}, "source": [ "- Now, if all advantages are 0, the loss will be zero as well, because the loss multiplies the advantages by the log probabilities, and multiplying by zero eliminates the contribution" ] }, { "cell_type": "markdown", "id": "c42d17b5-8218-4261-8f7c-cdbcc00eed12", "metadata": {}, "source": [ "```python\n", "pg_loss = -(advantages.detach() * logps).mean()\n", "```" ] }, { "cell_type": "markdown", "id": "cea793ef-994a-4b03-bdf9-80a636e771b3", "metadata": {}, "source": [ "- As a result, the policy gradient is zero and the model parameters are not updated for that prompt" ] }, { "cell_type": "markdown", "id": "fa290e0a-e617-4800-97a5-16937bced1fe", "metadata": {}, "source": [ "- This behavior is intentional; if all rollouts are equally bad or equally good, there is no relative signal to tell the model which behavior to reinforce or suppress\n", "- Intuitively, if the model answers all the questions correctly, there is no need to update it\n", "- Conversely, if the model answers all questions incorrectly, no rollout stands out as better than the others, so an update would have nothing meaningful to reinforce" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }