{ "cells": [ { "cell_type": "markdown", "id": "c109c0e7-1aad-42ab-88d8-0990559b59e5", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "88c613ef-f4e5-49c3-b19d-3cf36dce0bf1", "metadata": {}, "source": [ "# Appendix E: Batching and throughput-oriented execution" ] }, { "cell_type": "markdown", "id": "9c1cd731-7e23-4430-8ec6-c4a86a177f81", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b6882804-a2c4-4c98-ad42-1b108cbffa5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.17\n", "torch version: 2.10.0\n", "tokenizers version: 0.21.4\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\", # for download functions\n", " \"torch\",\n", " \"tokenizers\"\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "9d13703b-f75b-43fe-9c8a-e7459a884f36", "metadata": {}, "source": [ "- Throughout the main chapters, we usually process one example at a time\n", "- This keeps the code compact and easier to understand\n", "- But also, the code is already very expensive to run, so adding batching support would add little benefit due to hardware and resource limitations\n", "- However, in certain contexts, having the ability to run the code in batched mode is still useful\n", "- This appendix explains the broad idea behind batched execution and shows how to use it for the different chapters using code from the supplementary materials" ] }, { "cell_type": "markdown", "id": "11887fc6-cccd-4dd1-90a7-109c9e846135", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "8700acaf-2b57-453f-8f65-1eed2e67147d", "metadata": {}, "source": [ " \n", "## E.1 Why batching helps" ] }, { "cell_type": "markdown", "id": "7e2c1e91-d9bb-46fe-912c-1486bcf333cc", "metadata": {}, "source": [ "- There are two different performance goals:\n", " - latency: how quickly we get the answer for a single prompt;\n", " - throughput: how many prompts we can process in a given amount of time.\n", "- Single-example generation is often best for minimizing latency and for debugging code\n", "- Batching targets throughput primarily\n", "- If we want to evaluate hundreds of problems on MATH-500, generate many self-consistency samples, or train on many supervised examples, batching can reduce the total runtime substantially on suitable hardware\n", " - That said, batching is not guaranteed to be faster on every device\n", " - Small models on CPUs or some less optimized GPUs may not benefit from batching; we may even get slowdowns, because the additional padding and batching overhead can offset the gains from parallelism" ] }, { "cell_type": "markdown", "id": "9cd42b54-dcfa-43f0-a9a3-b7c78745d6a2", "metadata": {}, "source": [ " \n", "## E.2 Running batched generation" ] }, { "cell_type": "markdown", "id": "c8da03f3-e953-4a7b-b30a-5e9541e3af22", "metadata": {}, "source": [ "- The main technical obstacle in batching is that prompts usually have different lengths\n", "- For example, one math problem may tokenize to 40 tokens while another may tokenize to 120 tokens\n", "- Since tensors in PyTorch must have rectangular shapes, we pad the shorter sequences so they all fit into a single batch tensor" ] }, { "cell_type": "markdown", "id": "2755ee99-a7a5-4bd5-9fb5-e12a0a13a732", "metadata": {}, "source": [ "- Conceptually, this makes batched generation much more difficult to implement than single-prompt generation\n", "- In the main 
chapter, we used the `Qwen3Model` class from `reasoning_from_scratch.qwen3` (which uses the Qwen3 implementation explained in appendix C)\n", "- For batched generation, since we have to keep track of padding tokens, etc., there is a separate `Qwen3Model` class in `reasoning_from_scratch.qwen3_batched` (the source code can be viewed in the supplementary materials at https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/qwen3_batched.py)" ] }, { "cell_type": "markdown", "id": "13de4a75-87de-4cbe-8299-dc1371c32975", "metadata": {}, "source": [ "- To illustrate the usage of the batched generation utilities, let's take a look at a concrete example\n", "- We start with a single-sequence text generation example similar to what we have used in the main chapters\n", "- Here, we apply it to two prompts (`[\"2+2?\", \"3+3=6?\"]`) sequentially:" ] }, { "cell_type": "code", "execution_count": 2, "id": "852cccaa-0f00-4846-a034-ff125ccbca05", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using Apple Silicon GPU (MPS)\n", "✓ qwen3/qwen3-0.6B-base.pth already up-to-date\n", " \\boxed{4}\n", " \\boxed{6}\n" ] } ], "source": [ "import torch\n", "\n", "from reasoning_from_scratch.ch02 import (\n", " get_device,\n", " generate_text_basic_stream_cache,\n", ")\n", "from reasoning_from_scratch.ch03 import (\n", " load_model_and_tokenizer,\n", " render_prompt,\n", ")\n", "\n", "device = get_device()\n", "model, tokenizer = load_model_and_tokenizer(\n", " which_model=\"base\",\n", " device=device,\n", " use_compile=False,\n", ")\n", "\n", "for problem in [\"2+2?\", \"3+3=6?\"]:\n", " prompt = render_prompt(problem)\n", " input_ids = torch.tensor(\n", " tokenizer.encode(prompt),\n", " dtype=torch.long,\n", " device=device,\n", " ).unsqueeze(0)\n", "\n", " for token in generate_text_basic_stream_cache(\n", " model=model,\n", " token_ids=input_ids,\n", " max_new_tokens=32,\n", " eos_token_id=tokenizer.eos_token_id,\n", " ):\n", " next_token_id = token.squeeze(0)\n", " print(tokenizer.decode(next_token_id.tolist()), end=\"\", flush=True)\n", "\n", " print()" ] }, { "cell_type": "markdown", "id": "97e45dcc-b6e0-45a3-b952-ce77d5cdcad3", "metadata": {}, "source": [ "- Below, we will use similar code from `reasoning_from_scratch.qwen3_batched` that supports batching\n", "- Note that the batched version does not support streaming, meaning we have to wait until all results are generated before they are decoded and printed\n", "- Here, the batched generation uses left padding, which will be explained in the next section\n", "- For now, let's start with a usage example (before we get into how it works internally)" ] }, { "cell_type": "code", "execution_count": 3, "id": "ac94b7bf-50c7-41a0-ad16-1cd829eb2220", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✓ qwen3/qwen3-0.6B-base.pth already up-to-date\n", " \\boxed{4}\n", " \\boxed{6}\n" ] } ], "source": [ "from reasoning_from_scratch.qwen3_batched import (\n", " generate_text_basic_batched_cache,\n", " load_model_and_tokenizer,\n", ")\n", "\n", "model, tokenizer = load_model_and_tokenizer(\n", " which_model=\"base\",\n", " device=device,\n", " use_compile=False,\n", ")\n", "\n", "problems = [\"2+2?\", \"3+3=6?\"]\n", "prompts = [render_prompt(problem) for problem in problems]\n", "tokenized = [tokenizer.encode(p) for p in prompts]\n", "pad_id = tokenizer.pad_token_id\n", "max_len = max(len(t) for t in tokenized)\n", "\n", "left_padded = 
[\n", " [pad_id] * (max_len - len(t)) + t\n", " for t in tokenized\n", "]\n", "input_ids = torch.tensor(left_padded, dtype=torch.long, device=device)\n", "\n", "generated = generate_text_basic_batched_cache(\n", " model=model,\n", " token_ids=input_ids,\n", " max_new_tokens=32,\n", " eos_token_id=tokenizer.eos_token_id,\n", " pad_id=pad_id,\n", ")\n", "\n", "for row in generated:\n", " eos_pos = (row == tokenizer.eos_token_id).nonzero(as_tuple=True)[0]\n", " if len(eos_pos) > 0:\n", " row = row[:eos_pos[0]]\n", " print(tokenizer.decode(row.tolist()))" ] }, { "cell_type": "markdown", "id": "a31ae932-3f9b-4072-8e3a-03edea0146ef", "metadata": {}, "source": [ "- As we can see, the results are exactly the same as before\n", "- The difference is that these results were generated in parallel via `generate_text_basic_batched_cache`\n", "- The next section briefly explains how this works under the hood" ] }, { "cell_type": "markdown", "id": "d1ec9fda-54cd-472f-96e2-06242ebaf794", "metadata": {}, "source": [ "- An even more optimized code implementation replaces `generate_text_basic_batched_cache` with `generate_text_basic_batched_cache_stop`\n", "- `generate_text_basic_batched_cache` keeps every row in the active batch for every decode step \n", "- `generate_text_basic_batched_cache_stop`removes finished rows from the active compute batch (it's more complicated to implement internally, but can optimize performance\n", "- This is illustrated in the figure below" ] }, { "cell_type": "markdown", "id": "37729e70-96c4-4643-9ed8-0623653fadad", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "c1454be8-8e00-49ed-b7c3-12453f71e618", "metadata": {}, "source": [ "- Side note: in Qwen3, the The `` tokens are `<|endoftext|>`, but the figure uses `` for visual compactness" ] }, { "cell_type": "markdown", "id": "40d5ba01-99a7-48fe-b180-b1c22df33f5b", "metadata": {}, "source": [ " \n", "## E.3 Padding and attention masks" ] }, { "cell_type": "markdown", "id": "f39de403-5383-41e8-82fc-5dbc19966334", "metadata": {}, "source": [ "- In single-example mode, if we tokenize a short prompt such as `\"2+2?\"`, we can pass it to the model as a simple tensor of shape `(1, 4)`:\n", " - `input_ids = torch.tensor([[17, 10, 17, 30]])`" ] }, { "cell_type": "markdown", "id": "56057d0f-e3b3-461e-8b42-43b45acb85cc", "metadata": {}, "source": [ "- Internally, the model builds a standard causal attention mask internally so that each position can only attend to itself and earlier tokens\n", "- If you are unfamiliar with self-attention, I have an article that provides more background information: https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention\n", "- Conceptually, that mask looks like this:" ] }, { "cell_type": "markdown", "id": "fa512a3c-d039-4a3f-af4a-9a0c594222a5", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "fa7f6b8a-8b6d-4b60-a7a2-0e8dab93eb75", "metadata": {}, "source": [ "- `1` means \"masked out\" and `0` means \"allowed\"\n", "- So the first token cannot look ahead to later positions, the second token can only look at the first two positions, and so on\n", "- This is the standard autoregressive masking pattern" ] }, { "cell_type": "markdown", "id": "91f0e8a8-7635-4ec7-b378-15d1cd04e464", "metadata": {}, "source": [ "- Batching changes the situation because different prompts usually have different lengths\n", "- Suppose we process `\"2+2?\"` together with the slightly longer prompt `\"3+3=6?\"`\n", "- Since PyTorch tensors must be rectangular, the 
shorter row has to be padded to match the longer one\n", "- Here, this is done with left padding:" ] }, { "cell_type": "markdown", "id": "791aab1e-c066-4420-9c30-370c5f6e5283", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "e044e27f-fea0-4f58-83c1-5bec57b137c7", "metadata": {}, "source": [ "- Note that we keep an additional `attn_mask` internally; this is just to keep track of the padded positions\n", "- In this `attn_mask`, `True` means padded and `False` means not padded\n", "- We use this additional `attn_mask` to identify the positions in the causal mask that correspond to the pad token IDs\n", "- Masking padded keys and zeroing padded queries are important steps to make batching behave similarly to the single-example execution" ] }, { "cell_type": "markdown", "id": "64316817-70bf-43d0-8238-d74aa5dbad37", "metadata": {}, "source": [ "- By the way, we use the `<|endoftext|>` token as the padding token, but the choice does not really matter because the corresponding token positions are ignored anyway" ] }, { "cell_type": "code", "execution_count": 4, "id": "80a0dbad-4f43-4d15-a2bb-471e2a713754", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151643\n" ] } ], "source": [ "print(tokenizer.pad_token_id)" ] }, { "cell_type": "code", "execution_count": 5, "id": "26872d96-1b98-4969-ab05-3038c32fd9a7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<|endoftext|>\n" ] } ], "source": [ "print(tokenizer.decode([151643]))" ] },
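{ "cell_type": "markdown", "id": "3a7d5b21-1111-4c8e-9f0a-1b2c3d4e5f6a", "metadata": {}, "source": [ "- The actual padding-aware masking lives in `reasoning_from_scratch/qwen3_batched.py`; the cell below is only a rough, self-contained sketch (added here for illustration, with made-up toy token IDs) of how a padding mask can be combined with the causal mask" ] }, { "cell_type": "code", "execution_count": null, "id": "3a7d5b21-2222-4c8e-9f0a-1b2c3d4e5f6a", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "pad_id = 151643  # <|endoftext|>, used as the padding token\n", "\n", "# Two left-padded rows with toy token IDs (the shorter row is padded on the left)\n", "input_ids = torch.tensor([\n", "    [pad_id, pad_id, 17, 10, 17, 30],\n", "    [18, 10, 18, 28, 21, 30],\n", "])\n", "\n", "# True marks padded positions, False marks real tokens\n", "attn_mask = input_ids == pad_id\n", "\n", "# Standard causal mask: True marks positions that may not be attended to\n", "num_tokens = input_ids.shape[1]\n", "causal = torch.triu(\n", "    torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1\n", ")\n", "\n", "# Per-row combined mask: a position is blocked if it lies in the future\n", "# or if the attended-to (key) position is padded; zeroing the padded query\n", "# rows is the remaining step handled in the library code\n", "combined = causal.unsqueeze(0) | attn_mask[:, None, :]\n", "print(combined.int())" ] }, { "cell_type": "markdown", "id": "86edccc8-944f-4c9e-acd3-cf35e9ebbffa", "metadata": {}, "source": [ " \n", "## E.4 Chapter 3: batched MATH-500 evaluation" ] }, { "cell_type": "markdown", "id": "e4ca16c9-a00d-493f-8ebe-48e3022544f3", "metadata": {}, "source": [ "- The supplementary materials include a script for the evaluation method implemented in chapter 3 that we can download and use similarly to how we did it in chapter 6:" ] }, { "cell_type": "code", "execution_count": 7, "id": "57db75e8-3019-4b72-b765-81e0c437d6b9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "evaluate_math500.py: 3.5 KB\n", "math500_test.json: 462.1 KB\n" ] } ], "source": [ "from reasoning_from_scratch.ch07 import download_from_github\n", "\n", "download_from_github(\n", " \"ch03/02_math500-verifier-scripts/evaluate_math500.py\"\n", ")\n", "download_from_github(\n", " \"ch03/01_main-chapter-code/math500_test.json\",\n", " out=\"math500_test.json\",\n", ")" ] }, { "cell_type": "markdown", "id": "d784cdb0-b391-44b7-9d4c-eb4cb86bb904", "metadata": {}, "source": [ "- Then, to run it, we can execute the following command in a terminal (replace `uv run` with `python` if you are not a uv user):" ] }, { "cell_type": "markdown", "id": "31cc2a1a-c8ab-4f78-b9f4-43d53451d28e", "metadata": {}, "source": [ "```bash\n", "uv run evaluate_math500.py \\\n", " --dataset_size 500 \\\n", " --which_model \"reasoning\"\n", "```" ] }, { "cell_type": "markdown", "id": "5d712291-5186-408e-8511-aef4fe3d2aa8", "metadata": {}, "source": [ "- The bonus material also includes a version of this for batched generation that applies the batching method we discussed previously\n", "- The download is similar to before, except that we replace `evaluate_math500.py` with `evaluate_math500_batched.py`" ] }, { "cell_type": "code", "execution_count": 8, "id": "38c530b5-0958-41f8-93cc-7edc959a249d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "evaluate_math500_batched.py: 8.3 KB\n" ] } ], 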
"source": [ "download_from_github(\n", " \"ch03/02_math500-verifier-scripts/evaluate_math500_batched.py\"\n", ")" ] }, { "cell_type": "markdown", "id": "224f801d-49c9-4115-8ab1-740a85ad2b07", "metadata": {}, "source": [ "- The usage is also similar to the non-batched version, except that we now provide an additional `--batch_size` argument to specify how many prompts and answers the LLM should process in parallel" ] }, { "cell_type": "markdown", "id": "a0efd481-42da-45d4-8c0c-e17509da33e7", "metadata": {}, "source": [ "```bash\n", "uv run evaluate_math500_batched.py \\\n", " --dataset_size 500 \\\n", " --which_model \"reasoning\" \\\n", " --batch_size 64\n", "```" ] }, { "cell_type": "markdown", "id": "7a207780-630d-40b8-b48b-6efbc046c817", "metadata": {}, "source": [ "- The ideal batch size depends on what your hardware can handle; a batch size of 64 uses approximately 23.39 GB RAM (the non-batched script uses approximately 1.84 GB RAM)\n", "- We will compare and discuss the performance difference towards the end of the appendix" ] }, { "cell_type": "markdown", "id": "d444b175-d5a0-4666-b31d-95ac8edf79c5", "metadata": {}, "source": [ " \n", "## E.5 Chapter 4: batched self-consistency sampling" ] }, { "cell_type": "markdown", "id": "16fd29af-7200-4560-b5e3-2b055bf9f02c", "metadata": {}, "source": [ "- The optional `self_consistency_math500_batched.py` script that implements self-consistency sampling in chapter 4 does not mix different prompts into one padded tensor\n", "- Instead, it repeats the same prompt `num_samples` times and samples several continuations in parallel for self-consistency voting\n", "- Because every row starts from the same prompt length, this script uses the regular `Qwen3Model` from reasoning_from_scratch.qwen3 instead of reasoning_from_scratch.qwen3_batched, since padding is not needed for equal prompt lengths" ] }, { "cell_type": "markdown", "id": "21f66b82-5107-4e7c-b9b2-0b4662001247", "metadata": {}, "source": [ "- We can download the script as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "0f94f180-5047-40b1-9180-1eafbaed9af2", "metadata": {}, "outputs": [], "source": [ "download_from_github(\n", " \"ch04/02_math500-inference-scaling-scripts/self_consistency_math500_batched.py\"\n", ")" ] }, { "cell_type": "markdown", "id": "01ea01a9-918c-449c-b693-b1bef4524347", "metadata": {}, "source": [ "- To download the non-batched version, simply drop the `\"_batched\"` in the file name above\n", "- We can run the script as follows (the syntax for the non-batched script is identical)" ] }, { "cell_type": "markdown", "id": "65fb0072-ef63-4b01-a881-1b318f1bf668", "metadata": {}, "source": [ "```bash\n", "uv run self_consistency_math500_batched.py \\\n", " --which_model base \\\n", " --temperature 0.9 \\\n", " --top_p 0.9 \\\n", " --num_samples 3 \\\n", " --dataset_size 500 \\\n", " --prompt_suffix \"\\n\\nExplain step by step.\"\n", "```" ] }, { "cell_type": "markdown", "id": "c9353377-e426-4a89-893d-c7203ee9491b", "metadata": {}, "source": [ "- More about the performance at the end of this appendix" ] }, { "cell_type": "markdown", "id": "f1dfcfbe-33c5-440b-9d14-a089fd64d493", "metadata": {}, "source": [ " \n", "## E.6 Chapter 6: batched GRPO rollouts" ] }, { "cell_type": "markdown", "id": "9a129086-1f2c-4183-be7d-4d964eaaacd9", "metadata": {}, "source": [ "- Self-refinement in chapter 5 is a sequential technique that itself does not benefit from batching\n", "- One could run self-refinement loops for multiple inputs in parallel, but this is 
non-trivial to implement and thus not part of the supplementary material\n", "- Instead, we continue with a batched version of RLVR in chapter 6\n", "- In chapter 6, we use the same prompt for the different rollouts, so no padding is required here; similar to section E.5, the code therefore uses the regular `Qwen3Model` class from `reasoning_from_scratch.qwen3`\n", "- The relevant scripts can be fetched with:" ] }, { "cell_type": "code", "execution_count": null, "id": "6446395e-3879-4bb9-a209-04aa84c95ae1", "metadata": {}, "outputs": [], "source": [ "# Non-batched version\n", "download_from_github(\n", " \"ch06/02_rlvr_grpo_scripts_intro/rlvr_grpo_original_no_kl.py\"\n", ")\n", "\n", "# Batched version\n", "download_from_github(\n", " \"ch06/02_rlvr_grpo_scripts_intro/rlvr_grpo_original_no_kl_batched.py\"\n", ")\n", "\n", "# Batched version with multi-GPU (FSDP) support\n", "download_from_github(\n", " \"ch06/02_rlvr_grpo_scripts_intro/rlvr_grpo_original_no_kl_batched_fsdp.py\"\n", ")" ] }, { "cell_type": "markdown", "id": "472ed887-7888-4738-b435-6e3247d75e52", "metadata": {}, "source": [ "```bash\n", "uv run rlvr_grpo_original_no_kl_batched.py \\\n", " --num_rollouts 8 \\\n", " --steps 100 \\\n", " --batch_size 4 \\\n", " --max_new_tokens 1024\n", "```" ] }, { "cell_type": "markdown", "id": "5286486f-96b3-4b39-80e5-3fff2675aa67", "metadata": {}, "source": [ "- In the current script, `--batch_size` controls how many rollouts are generated in parallel within a step (see the sketch below)\n", "- This increases throughput, but it also increases memory pressure, so in practice you may need to reduce `--num_rollouts` or `--max_new_tokens`\n", "- If you have multiple GPUs, the FSDP variant follows the same pattern and adds `--num_gpus`\n", "- Again, we will return to the performance discussion at the end of this appendix\n", "- As of this writing, batched versions of the chapter 7 scripts are not available in the supplementary materials yet, but will be added over time; conceptually, they will work similarly to the chapter 6 scripts" ] },
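{ "cell_type": "markdown", "id": "5c9f7d43-1111-4e0a-b1c2-3d4e5f6a7b8c", "metadata": {}, "source": [ "- The following sketch (illustrative only, not the script's actual code) shows the structural idea: the `--num_rollouts` rollouts of one step are generated in chunks of `--batch_size`, and every chunk repeats the same prompt, so no padding is needed" ] }, { "cell_type": "code", "execution_count": null, "id": "5c9f7d43-2222-4e0a-b1c2-3d4e5f6a7b8c", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "num_rollouts, batch_size = 8, 4\n", "prompt_ids = torch.tensor([[17, 10, 17, 30]])  # toy prompt token IDs\n", "\n", "for start in range(0, num_rollouts, batch_size):\n", "    n = min(batch_size, num_rollouts - start)\n", "    chunk_prompt_ids = prompt_ids.repeat(n, 1)  # shape: (n, prompt_length)\n", "    # ... sample n continuations in parallel from chunk_prompt_ids ...\n", "    print(f\"rollouts {start}-{start + n - 1}: input shape {tuple(chunk_prompt_ids.shape)}\")" ] }, { "cell_type": "markdown", "id": "e656bca7-ab7e-4c7a-a1a0-96fa4b5b4e2f", "metadata": {}, "source": [ " \n", "## E.7 Chapter 8: batched distillation" ] }, { "cell_type": "markdown", "id": "65aff052-175f-4f09-9bd8-7e8191d1c431", "metadata": {}, "source": [ "- Chapter 8 returns to the padding-aware style from chapter 3, since distillation examples have different prompt and answer lengths\n", "- You can download the script and sample training dataset as follows:" ] }, { "cell_type": "code", "execution_count": 9, "id": "b2b13023-584a-4826-a3da-fc3b71722f2d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "distill_batched.py: 17.9 KB\n", "deepseek-r1-math-train.json: 107538.0 KB\n" ] } ], "source": [ "from reasoning_from_scratch.ch08 import load_distill_data\n", "\n", "download_from_github(\n", " \"ch08/04_train_with_distillation/distill_batched.py\"\n", ")\n", "_ = load_distill_data(\n", " partition=\"deepseek-r1-math-train\",\n", " local_path=\"deepseek-r1-math-train.json\",\n", ")" ] }, { "cell_type": "markdown", "id": "20639239-16e6-4b4f-94b3-09920a306562", "metadata": {}, "source": [ "- For the non-batched version, drop the `\"_batched\"` in the file name\n", "- We can run the script as follows:" ] }, { "cell_type": "markdown", "id": "3f04e5b6-ecc2-4724-ad17-625a396847dd", "metadata": {}, "source": [ "```bash\n", "uv run distill_batched.py \\\n", " --data_path deepseek-r1-math-train.json \\\n", " --dataset_size 12000 \\\n", " --validation_size 10 \\\n", " 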
--epochs 2 \\\n", " --use_think_tokens \\\n", " --max_seq_len 1024 \\\n", " --batch_size 4\n", "```" ] }, { "cell_type": "markdown", "id": "1b79895b-40fa-4a34-8a83-30cb8f185b26", "metadata": {}, "source": [ " \n", "## E.8 Single-sequence versus batch generation" ] }, { "cell_type": "markdown", "id": "3c9994ca-d128-4c0a-a014-3e2f51410d23", "metadata": {}, "source": [ "- The table below summarizes the runtime and RAM usage numbers for the scripts above" ] }, { "cell_type": "markdown", "id": "d36c900d-0d57-4697-9d8a-32b665aa2212", "metadata": {}, "source": [ "| Row | Script | Batch size | RAM | H100 Total time (min) | DGX Spark Total time (min) |\n", "|-----|------------------------------------------|------------|----------|------------------------|-----------------------------|\n", "| 1 | evaluate_math500.py | - | 1.8 GB | 90.0 | 174.7 |\n", "| 2 | evaluate_math500_batched.py | 64 | 23.39 GB | 16.0 | 108.4 |\n", "| | | | | | |\n", "| 3 | self_consistency_math500.py | - | 1.79 GB | 252.0 | 340.8 |\n", "| 4 | self_consistency_math500_batched.py | 3 | 2.45 GB | 129.0 | 243.3 |\n", "| | | | | | |\n", "| 5 | rlvr_grpo_original_no_kl.py | - | 43.35 GB | 68.0 | 63.7 |\n", "| 6 | rlvr_grpo_original_no_kl_batched.py | 4 | 44.91 GB | 19.0 | 23.1 |\n", "| | | | | | |\n", "| 7 | distill.py | - | 8.29 GB | 10.9 | 32.8 |\n", "| 8 | distill_batched.py | 4 | 8.34 GB | 9.1 | 28.2 |" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }