{ "cells": [ { "cell_type": "markdown", "id": "8e231c33-44b8-4da2-8d17-90aea2f49ae6", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "7bc506a9-fa2f-475e-9981-0c4f36109808", "metadata": {}, "source": [ "# Chapter 2: Generating Text with a Pre-trained LLM" ] }, { "cell_type": "markdown", "id": "78ae290f-bceb-4791-8c38-ef8f7756d1e6", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "f6986129-4aac-496e-a35a-e1dc4c4df4ed", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.13\n", "torch version: 2.10.0\n", "tokenizers version: 0.21.4\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "a7fcbc9a-603f-4193-85aa-38dd2d8776bf", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "378654ef-4323-42f9-8fd4-00f51636e0e1", "metadata": {}, "source": [ " \n", "## 2.1 Introduction to LLMs for text generation" ] }, { "cell_type": "markdown", "id": "caba81dd-5a98-4e29-b00d-43b84bd71899", "metadata": {}, "source": [ "- No code in this section\n", "- How do LLMs generate text?\n", "- This chapter is a setup chapter: setting up the coding environment and LLM we will be using throughout the book\n", "- We also code text generation functions that we will use and extend in upcoming chapters" ] }, { "cell_type": "markdown", "id": "8dbef742-7e61-403a-af0a-b95b890c239a", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "0362617f-7976-4507-9308-c2969eab3bba", "metadata": {}, "source": [ "- LLM (and neural network) flowcharts are traditionally read and drawn from top to bottom" ] }, { "cell_type": "markdown", "id": "186d56ba-6c74-4d69-a14f-d5bbae7852fe", "metadata": {}, "source": [ " \n", "## 2.2 Setting up the coding 
environment" ] }, { "cell_type": "markdown", "id": "dcc92122-0ce4-4426-8e85-4a48e1ebe593", "metadata": {}, "source": [ "- If you are reading this book, you have likely coded in Python before\n", "- The simplest way to install dependencies, if you already have a Python environment set up (with Python 3.10 or newer), is to use `pip`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "eaf19556-05a5-47c3-8031-cb2f4a2fac2c", "metadata": {}, "outputs": [], "source": [ "#!pip install -r https://raw.githubusercontent.com/rasbt/reasoning-from-scratch/refs/heads/main/requirements.txt" ] }, { "cell_type": "markdown", "id": "b0f86ccc-395c-415e-a631-71e9e6ec59a4", "metadata": {}, "source": [ "- For this chapter, dependencies can also be installed manually (the quotes prevent the shell from interpreting `>=` as a redirection):" ] }, { "cell_type": "code", "execution_count": 3, "id": "cabb6da5-891f-4aac-b478-c2e8ea28b0d6", "metadata": {}, "outputs": [], "source": [ "#!pip install \"torch>=2.10.0\" \"tokenizers>=0.22.2\" reasoning-from-scratch" ] }, { "cell_type": "markdown", "id": "0236c753-d983-49e4-b8cb-3870ecdae943", "metadata": {}, "source": [ "- My preferred way is to use the widely recommended [uv](https://docs.astral.sh/uv/) Python package and project manager\n", "- To install `uv`, run the installation command for your OS from the official website: https://docs.astral.sh/uv/getting-started/installation/\n", "- Next, clone the GitHub repo:" ] }, { "cell_type": "code", "execution_count": 4, "id": "efa5b432-536c-45f4-b73d-568d9788b934", "metadata": {}, "outputs": [], "source": [ "#!git clone --depth 1 https://github.com/rasbt/reasoning-from-scratch.git" ] }, { "cell_type": "markdown", "id": "670db5b1-66be-4382-ae8c-2f7e0189ac2e", "metadata": {}, "source": [ "- If you don't have `git` installed, you can also manually download the source code repository from the Manning website or by clicking this link: https://github.com/rasbt/reasoning-from-scratch/archive/refs/heads/main.zip (unzip it after downloading)" ] }, { "cell_type": "markdown", "id": 
"8f3ea394-bf43-4a0a-ad04-df3a82b9874b", "metadata": {}, "source": [ "- In the terminal, navigate to the `reasoning-from-scratch` folder\n", "- Run `uv run jupyter lab` to launch JupyterLab and open a blank notebook or the notebook for this chapter\n", "- This command also sets up a local virtual environment (usually in `.venv/`) and installs all dependencies from the `pyproject.toml` file inside the `reasoning-from-scratch` folder automatically" ] }, { "cell_type": "markdown", "id": "a4373b49-ebf3-4282-9f21-7ac32ed5607b", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "a0c28e8a-d328-4ff2-9a55-38af8b56ffae", "metadata": {}, "source": [ "- See [../02_setup-tips/python-instructions.md](../02_setup-tips/python-instructions.md) for additional installation details and options if needed" ] }, { "cell_type": "markdown", "id": "1b16265a-7cd2-406b-b954-b5aaebf1e09b", "metadata": {}, "source": [ " \n", "## 2.3 Understanding hardware needs and recommendations" ] }, { "cell_type": "markdown", "id": "331d182d-1747-48ae-a502-004bf8730bf9", "metadata": {}, "source": [ "- If you are new to PyTorch, I recommend reading through my [PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs](https://sebastianraschka.com/teaching/pytorch-1h/) tutorial\n", "- If you followed the previous section, you should have PyTorch installed\n", "- Manually check whether your PyTorch installation supports a GPU and see what's available on your machine: " ] }, { "cell_type": "code", "execution_count": 5, "id": "8d205144-19bf-481f-be5b-c1e77847c025", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PyTorch version 2.10.0\n", "Apple Silicon GPU\n" ] } ], "source": [ "import torch\n", "\n", "\n", "print(f\"PyTorch version {torch.__version__}\")\n", "\n", "if torch.cuda.is_available():\n", " print(f\"CUDA/ROCm GPU: {torch.cuda.get_device_name(0)}\")\n", "\n", "elif torch.xpu.is_available():\n", " print(f\"Intel GPU: 
{torch.xpu.get_device_name(0)}\")\n", "\n", "elif torch.backends.mps.is_available():\n", " print(\"Apple Silicon GPU\")\n", "\n", "else:\n", " print(\"Only CPU\")" ] }, { "cell_type": "markdown", "id": "7a08b8fc-3fc6-4440-9cd5-be4c9ed7d43b", "metadata": {}, "source": [ "- Depending on the chapter, code will automatically use an NVIDIA (CUDA) GPU if available, otherwise run on CPU (or Apple Silicon GPU if recommended for a particular section or chapter)\n", "- Chapters 2-4 can be executed in a reasonable time on a CPU\n", "- Code in chapters 5-7 will be very slow when executed on a CPU, and a GPU with CUDA support is recommended for these chapters (more on the exact resource needs in those upcoming chapters)\n", "- My personal preference is [Lightning AI Studio](https://lightning.ai/), which offers users free compute credits after the sign-up and verification process; alternatively, [Google Colab](https://colab.research.google.com/) is another good choice" ] }, { "cell_type": "markdown", "id": "7fd7329f-5b6b-4001-afcc-ddde1778689f", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "1a7b5c47-d22b-4cce-9f82-64be1ca2ce57", "metadata": {}, "source": [ "- See [../02_setup-tips/gpu-instructions.md](../02_setup-tips/gpu-instructions.md) for cloud compute recommendations if needed\n", "- But for now, there is no need to use GPUs yet; the first chapters run fine on non-GPU hardware" ] }, { "cell_type": "markdown", "id": "2f846ae6-c6ca-44a5-8c19-c5b79728e8c3", "metadata": {}, "source": [ " \n", "## 2.4 Preparing input texts for LLMs" ] }, { "cell_type": "markdown", "id": "2dfd4adc-9774-4813-873d-a456ff129f5c", "metadata": {}, "source": [ "- In this section, we learn how to use a tokenizer; we use it to convert (encode) input text into a token ID representation as input to the LLM\n", "- We also use the tokenizer to convert (decode) the LLM output back into a human-readable text representation" ] }, { "cell_type": "markdown", "id": 
"56135ed3-a423-43b3-ab00-75fc3d19e2d6", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "2123a05e-91b3-42ad-a468-8d6694a49c45", "metadata": {}, "source": [ "- As mentioned earlier, implementing the LLM and tokenizer from scratch is outside the scope of this book, which is focused on implementing reasoning methods from scratch on top of an existing LLM and tokenizer\n", "- In this book, we will work with a pre-trained LLM that we will load in the next section; here, we load the tokenizer that goes with it\n", "- I prepared a `reasoning_from_scratch` Python package that provides the base LLM and the corresponding tokenizer, which I coded with the help of the [`tokenizers`](https://github.com/huggingface/tokenizers) Python library package\n", "- The `reasoning_from_scratch` package code is part of this book's supplementary code, and it should already be installed based on the instructions in section 2.2" ] }, { "cell_type": "markdown", "id": "31f5ae78-5aed-477d-8c1e-4fc1df857834", "metadata": {}, "source": [ "- Next, we download the tokenizer files (this is a tokenizer for the Qwen3 base LLM, but more on that in the next section):" ] }, { "cell_type": "code", "execution_count": 6, "id": "6a9aba9e-0fa8-4710-804c-a26299beb2f5", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.qwen3 import download_qwen3_small\n", "\n", "download_qwen3_small(kind=\"base\", tokenizer_only=True, out_dir=\"qwen3\")" ] }, { "cell_type": "markdown", "id": "079987a0-b0dd-4dc2-8432-10be34029990", "metadata": {}, "source": [ "- Now, we can load the tokenizer settings from the tokenizer file into the `Qwen3Tokenizer`:" ] }, { "cell_type": "code", "execution_count": 7, "id": "564206d6-8180-44de-813f-3bf7290d59f5", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from reasoning_from_scratch.qwen3 import Qwen3Tokenizer\n", "\n", "tokenizer_path = Path(\"qwen3\") / \"tokenizer-base.json\"\n", "tokenizer = 
Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)" ] }, { "cell_type": "markdown", "id": "b4e7efec-1fda-430b-b06e-4f856ec3be13", "metadata": {}, "source": [ "- Since we haven't loaded the LLM itself yet, we will do a simpler round-trip: we encode the text into token IDs and then decode it back into its string representation:" ] }, { "cell_type": "markdown", "id": "2435704e-765a-4cb0-8960-25314cb96e4d", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 8, "id": "6c6c4fb1-2c54-45e2-b30d-22717527fffc", "metadata": {}, "outputs": [], "source": [ "prompt = \"Explain large language models.\"\n", "input_token_ids_list = tokenizer.encode(prompt)" ] }, { "cell_type": "code", "execution_count": 9, "id": "c914b8cd-5905-4fae-92f6-df9aecf3811b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "840 --> Ex\n", "20772 --> plain\n", "3460 --> large\n", "4128 --> language\n", "4119 --> models\n", "13 --> .\n" ] } ], "source": [ "for i in input_token_ids_list:\n", " print(f\"{i} --> {tokenizer.decode([i])}\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "94d58966-cee7-4ed4-ba41-80d064dd4fd8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Explain large language models.\n" ] } ], "source": [ "text = tokenizer.decode(input_token_ids_list)\n", "print(text)" ] }, { "cell_type": "markdown", "id": "33683b20-c8ad-434d-8eeb-7fa779378238", "metadata": {}, "source": [ "- In the case of the `Qwen3Tokenizer`, there are about 151 thousand unique tokens (the vocabulary size)" ] }, { "cell_type": "markdown", "id": "5b48b9bb-99d7-4688-ad37-79596ca89701", "metadata": {}, "source": [ "- Additional resources on tokenization:\n", " - [Build a Large Language Model (from Scratch)](https://mng.bz/M96o) chapter 2\n", " - [Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch](https://sebastianraschka.com/blog/2025/bpe-from-scratch.html)" ] }, { "cell_type": "markdown", "id": 
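- To make the encode/decode round trip concrete, here is a tiny, purely hypothetical word-level tokenizer sketch (the real `Qwen3Tokenizer` uses byte-pair encoding over a ~151k-token vocabulary, so the actual IDs and splitting rules differ):

```python
# Toy word-level tokenizer illustrating the encode/decode round trip.
# This vocabulary is purely hypothetical; the real Qwen3Tokenizer uses
# byte-pair encoding with a ~151k-token vocabulary and different IDs.
vocab = {"Explain": 0, "large": 1, "language": 2, "models": 3, ".": 4}
inverse_vocab = {token_id: token for token, token_id in vocab.items()}

def toy_encode(text):
    # Separate the trailing period, then split on whitespace
    return [vocab[tok] for tok in text.replace(".", " .").split()]

def toy_decode(token_ids):
    return " ".join(inverse_vocab[i] for i in token_ids).replace(" .", ".")

ids = toy_encode("Explain large language models.")
print(ids)              # [0, 1, 2, 3, 4]
print(toy_decode(ids))  # Explain large language models.
```

- The main practical difference is that a BPE tokenizer splits text into subword units learned from data (as in the `840 --> Ex`, `20772 --> plain` example above), so it can encode words that never appeared verbatim in its training text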
"93295da8-8c52-49fe-93c8-5daa8e8c679c", "metadata": {}, "source": [ " \n", "## 2.5 Loading pre-trained models" ] }, { "cell_type": "markdown", "id": "fe267761-51f1-49eb-b36e-f415f367d831", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "a31c2558-45d7-4d62-af94-1d91f869f8a2", "metadata": {}, "source": [ "- As hinted at in the previous section, when loading the tokenizer, this book uses Qwen3 0.6B; after thinking long and hard about which open-weight base model to use, I opted for Qwen3 because\n", " - Qwen3 is the leading open-weight model in terms of modeling performance as of this writing\n", " - Qwen3 0.6B is more memory efficient than Llama 3 1B\n", " - There's both a base model (which we focus on for reasoning model development) and an official reasoning variant that we can use as a reference model\n", "- (Note that the canonical spelling does not include a whitespace in \"Qwen3\" whereas it includes one in \"Llama 3\")\n", "- In the spirit of \"from-scratch,\" we are using a reimplementation of Qwen3 that I wrote in pure PyTorch without any external LLM library dependencies; this from-scratch implementation is compatible with the original Qwen3 model weights\n", "- However, we will not go over the Qwen3 code implementation in this book, as this would be a whole book by itself (similar to my [Build A Large Language Model (From Scratch)](https://github.com/rasbt/LLMs-from-scratch) book); instead, this book (Build A Reasoning Model From Scratch) focuses on implementing reasoning methods from scratch on top of a base model (here, Qwen3)\n", "- See appendix C for the Qwen3 model code\n", "- See appendix D for loading the reasoning variant and larger Qwen3 models\n", "- See the Qwen3 [GitHub repository](https://github.com/QwenLM/Qwen3) and [technical report](https://arxiv.org/abs/2505.09388) for (even) more details" ] }, { "cell_type": "markdown", "id": "3afd498b-38f4-4430-9a52-0eec8d288c88", "metadata": {}, "source": [ "- The model is purposefully 
small (but still very capable) to run on consumer hardware\n", "- It runs fine on CPU, NVIDIA GPUs (`\"cuda\"`), Apple Silicon GPUs (`\"mps\"`), and Intel GPUs (`\"xpu\"`); more about the performance trade-offs later in this chapter" ] }, { "cell_type": "code", "execution_count": 11, "id": "afbc8f36-b76f-481c-b291-83abd6fb20ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using Apple Silicon GPU (MPS)\n" ] } ], "source": [ "def get_device(enable_tensor_cores=True):\n", " if torch.cuda.is_available():\n", " device = torch.device(\"cuda\")\n", " print(\"Using NVIDIA CUDA GPU\")\n", " \n", " if enable_tensor_cores:\n", " major, minor = map(int, torch.__version__.split(\".\")[:2])\n", " if (major, minor) >= (2, 9):\n", " torch.backends.cuda.matmul.fp32_precision = \"tf32\"\n", " torch.backends.cudnn.conv.fp32_precision = \"tf32\"\n", " else:\n", " torch.backends.cuda.matmul.allow_tf32 = True\n", " torch.backends.cudnn.allow_tf32 = True\n", "\n", " elif torch.backends.mps.is_available():\n", " device = torch.device(\"mps\")\n", " print(\"Using Apple Silicon GPU (MPS)\")\n", "\n", " elif torch.xpu.is_available():\n", " device = torch.device(\"xpu\")\n", " print(\"Using Intel GPU\")\n", "\n", " else:\n", " device = torch.device(\"cpu\")\n", " print(\"Using CPU\")\n", "\n", " return device\n", "\n", "device = get_device()" ] }, { "cell_type": "markdown", "id": "6068984f-bff3-4189-b843-25002573a2e6", "metadata": {}, "source": [ "- I recommend running the code on `\"cpu\"` on the first run-through, so we hardcode the device below: " ] }, { "cell_type": "code", "execution_count": 12, "id": "542875a3-cd13-4b66-a5a8-510817fe4d3b", "metadata": {}, "outputs": [], "source": [ "# Recommended: Use CPU on the first run-through\n", "device = torch.device(\"cpu\")" ] }, { "cell_type": "markdown", "id": "a76c6f6a-f215-4248-ac5f-8bc4212e288e", "metadata": {}, "source": [ "- Then, we download the file containing the pre-trained model weights, which 
is approximately 1.5 GB in size:" ] }, { "cell_type": "code", "execution_count": 13, "id": "52145acb-24b6-44fb-963b-33d9fe9398e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✓ qwen3/qwen3-0.6B-base.pth already up-to-date\n" ] } ], "source": [ "download_qwen3_small(kind=\"base\", tokenizer_only=False, out_dir=\"qwen3\")" ] }, { "cell_type": "markdown", "id": "ef90039d-2a46-48c7-9d7f-51822d1ad18f", "metadata": {}, "source": [ "- The architectural structure of the Qwen3 0.6B model we are loading is shown below for readers who are familiar with LLM architectures, but note that for this book, it's **not** essential or important to understand this architecture as we are not modifying but rather adding reasoning techniques on top in later chapters" ] }, { "cell_type": "markdown", "id": "4f02be0d-e6b0-4cb3-8533-8111dd9313b5", "metadata": {}, "source": [ "- I coded the Qwen3 model architecture from scratch for the [reasoning-from-scratch](https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/qwen3.py) Python package contained in this code repository; the source code is also shown in appendix C; but again, this is only as a bonus for those who are curious, and it's not necessary to look at or understand these internals to follow the rest of the book" ] }, { "cell_type": "code", "execution_count": 14, "id": "501d6703-5fbf-4254-8b86-010bb0244b75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Qwen3Model(\n", " (tok_emb): Embedding(151936, 1024)\n", " (trf_blocks): ModuleList(\n", " (0-27): 28 x TransformerBlock(\n", " (att): GroupedQueryAttention(\n", " (W_query): Linear(in_features=1024, out_features=2048, bias=False)\n", " (W_key): Linear(in_features=1024, out_features=1024, bias=False)\n", " (W_value): Linear(in_features=1024, out_features=1024, bias=False)\n", " (out_proj): Linear(in_features=2048, out_features=1024, bias=False)\n", " (q_norm): RMSNorm()\n", " (k_norm): RMSNorm()\n", " )\n", " 
(ff): FeedForward(\n", " (fc1): Linear(in_features=1024, out_features=3072, bias=False)\n", " (fc2): Linear(in_features=1024, out_features=3072, bias=False)\n", " (fc3): Linear(in_features=3072, out_features=1024, bias=False)\n", " )\n", " (norm1): RMSNorm()\n", " (norm2): RMSNorm()\n", " )\n", " )\n", " (final_norm): RMSNorm()\n", " (out_head): Linear(in_features=1024, out_features=151936, bias=False)\n", ")" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from reasoning_from_scratch.qwen3 import Qwen3Model, QWEN_CONFIG_06_B\n", "\n", "model_path = Path(\"qwen3\") / \"qwen3-0.6B-base.pth\"\n", "\n", "model = Qwen3Model(QWEN_CONFIG_06_B)\n", "model.load_state_dict(torch.load(model_path))\n", "\n", "model.to(device)" ] }, { "cell_type": "markdown", "id": "da0583d6-74ef-4629-b436-6f64f8c0841b", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "6d2c7723-2e57-4d1a-8a10-3b4daafe4d50", "metadata": {}, "source": [ " \n", "## 2.6 Understanding the sequential LLM text generation process" ] }, { "cell_type": "markdown", "id": "b38845f2-55e9-4e4a-a8a5-d402d7450501", "metadata": {}, "source": [ "- In this section, we code a simple wrapper function so we can use the LLM to generate text (we will extend this function with extra functionality in chapter 4)" ] }, { "cell_type": "markdown", "id": "4dd89a18-de70-4958-8d24-d0aaddc519f3", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "446fdb75-6971-4bad-bfde-4f8dab1e5315", "metadata": {}, "source": [ "- LLMs generate one word at a time:" ] }, { "cell_type": "markdown", "id": "b9cee173-54bc-4275-8fcc-320c1a1a38fb", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "61bb43dc-9007-48b3-bd7b-432c7b616aba", "metadata": {}, "source": [ "- The figure above is a simplification, only showing the newly generated word; the figure below zooms in on the first iteration:" ] }, { "cell_type": "markdown", "id": 
"e78cff76-1215-4d10-9c3c-3a5f2afbb987", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 15, "id": "94ed4960-8e72-48ae-90ab-662b468f77a2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([1, 2, 3])\n", "tensor([[1, 2, 3]])\n" ] } ], "source": [ "example = torch.tensor([1, 2, 3]) \n", "print(example)\n", "print(example.unsqueeze(0))" ] }, { "cell_type": "code", "execution_count": 16, "id": "1d65f427-6cb2-4533-85e6-0b09f9226e0c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[1, 2, 3]])\n", "tensor([1, 2, 3])\n" ] } ], "source": [ "example = torch.tensor([[1, 2, 3]]) \n", "print(example)\n", "print(example.squeeze(0))" ] }, { "cell_type": "markdown", "id": "4cbc1181-eca7-4a25-8586-d6a9f08fa3f0", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 17, "id": "f2c44846-dd0a-4877-bc23-7baa96ff50c8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of input tokens: 6\n", "Formatted Output tensor shape: torch.Size([6, 151936])\n" ] } ], "source": [ "prompt = \"Explain large language models.\"\n", "input_token_ids_list = tokenizer.encode(prompt)\n", "print(f\"Number of input tokens: {len(input_token_ids_list)}\")\n", "\n", "input_tensor = torch.tensor(input_token_ids_list)\n", "input_tensor_fmt = input_tensor.unsqueeze(0).to(device)\n", "\n", "with torch.inference_mode():\n", " output_tensor = model(input_tensor_fmt)\n", "\n", "output_tensor_fmt = output_tensor.squeeze(0)\n", "print(f\"Formatted Output tensor shape: {output_tensor_fmt.shape}\")" ] }, { "cell_type": "markdown", "id": "7de206fa-5f1c-4356-8c77-3c0365a7f1c6", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 18, "id": "72c10d27-f29e-4fa6-a748-7e3096b25979", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([ 7.3750, 2.0312, 8.0000, ..., -2.5469, -2.5469, 
-2.5469],\n", " dtype=torch.bfloat16)\n" ] } ], "source": [ "last_token = output_tensor_fmt[-1]\n", "print(last_token)" ] }, { "cell_type": "code", "execution_count": 19, "id": "7fa3b6ab-daee-46ed-b476-20feaa284a4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([20286])\n" ] } ], "source": [ "print(torch.argmax(last_token, dim=-1, keepdim=True))" ] }, { "cell_type": "code", "execution_count": 20, "id": "a705ea38-2bc7-430a-8fe7-e8b58e994dd4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large\n" ] } ], "source": [ "print(tokenizer.decode([20286]))" ] }, { "cell_type": "code", "execution_count": 21, "id": "008ad321-9e3b-428f-9dba-89aa86738870", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor(3)\n", "tensor(2)\n" ] } ], "source": [ "example = torch.tensor([-2, 1, 3, 1])\n", "print(torch.max(example))\n", "print(torch.argmax(example))" ] }, { "cell_type": "markdown", "id": "05c87e6d-d77c-4f4a-b265-0d0a95d684c1", "metadata": {}, "source": [ " \n", "## 2.7 Coding a minimal text generation function\n" ] }, { "cell_type": "markdown", "id": "b4e89807-92c0-4462-a447-d9c45bbf02ca", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "6ec19437-a3f0-45b7-b572-d7a520165ff1", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "4dea54fe-7292-4901-bc99-62064bba48ee", "metadata": {}, "source": [ "- The `generate_text_basic_stream` function implements this sequential text generation process:" ] }, { "cell_type": "code", "execution_count": 22, "id": "a0c1b331-79f6-4fad-a56f-8aaf8b2a1239", "metadata": {}, "outputs": [], "source": [ "@torch.inference_mode()\n", "def generate_text_basic_stream(\n", " model,\n", " token_ids,\n", " max_new_tokens, \n", " eos_token_id=None\n", "):\n", " model.eval()\n", "\n", " for _ in range(max_new_tokens):\n", " out = model(token_ids)[:, -1]\n", " next_token = torch.argmax(out, dim=-1, 
keepdim=True)\n", "\n", " # Stop if we encounter an end-of-sequence token\n", " if (eos_token_id is not None\n", " and torch.all(next_token == eos_token_id)):\n", " break\n", "\n", " yield next_token # Yield each token as it's generated\n", " \n", " token_ids = torch.cat([token_ids, next_token], dim=1)" ] }, { "cell_type": "markdown", "id": "74f14a54-fd55-4bc5-a834-7c2ccd3f30ae", "metadata": {}, "source": [ "- Let's use it to generate a 100-token response to a simple `\"Explain large language models in a single sentence.\"` prompt to see how it works (we get to the reasoning parts in later chapters)\n", "- The following code will be slow and can take 1-3 minutes to complete, depending on your computer (we will speed it up in later sections) " ] }, { "cell_type": "code", "execution_count": 23, "id": "33ff989f-bb51-47d0-bcc7-bd37d1c38b8f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.<|endoftext|>Human language is a complex and dynamic system that has evolved over millions of years to enable effective communication and social interaction. It is composed of a vast array of symbols, including letters, numbers, and words, which are used to convey meaning and express thoughts and ideas. 
The evolution of language has" ] } ], "source": [ "prompt = \"Explain large language models in a single sentence.\"\n", "input_token_ids_tensor = torch.tensor(\n", " tokenizer.encode(prompt),\n", " device=device\n", " ).unsqueeze(0)\n", "max_new_tokens = 100\n", "\n", "\n", "for token in generate_text_basic_stream(\n", " model=model,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", "):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True # Deactivates buffering so tokens are printed live\n", " )" ] }, { "cell_type": "markdown", "id": "3489d09f-1178-4870-a7ee-4faabf507717", "metadata": {}, "source": [ "- Notice that the LLM follows the instruction quite well, but the response becomes nonsensical/off-topic after `<|endoftext|>`, which is a token used as a delimiter between different documents during training\n", "- When using the LLM, we want it to stop generating after encountering this token" ] }, { "cell_type": "code", "execution_count": 24, "id": "38e676de-a07e-4eb6-90d0-9f8f10da9caf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[151643]\n" ] } ], "source": [ "print(tokenizer.encode(\"<|endoftext|>\"))" ] }, { "cell_type": "markdown", "id": "4f146f8b-4ed1-4939-9d3a-3de81e8039e0", "metadata": {}, "source": [ "- For convenience, this token ID is stored as a tokenizer attribute (eos = end of sequence):" ] }, { "cell_type": "code", "execution_count": 25, "id": "8a547b6e-9154-49be-9b0b-c4fcc3da462d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151643\n" ] } ], "source": [ "print(tokenizer.eos_token_id)" ] }, { "cell_type": "markdown", "id": "676290fe-1d2e-41db-8ca2-717fe8cb2561", "metadata": {}, "source": [ "- We can use it to tell the LLM (or rather the `generate_text_basic_stream` function) when to stop generating text" ] }, { "cell_type": "code", "execution_count": 26, "id": 
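- To isolate the stopping logic from the model itself, here is a pure-Python sketch with a hypothetical stub model that simply replays a fixed list of "predicted" token IDs (the IDs below are taken from the examples in this chapter; the stub is not the actual Qwen3 model):

```python
EOS_TOKEN_ID = 151643  # <|endoftext|> in the Qwen3 vocabulary

def make_stub_model(scripted_ids):
    # Hypothetical stand-in for the LLM: ignores its input and
    # replays a fixed sequence of "predicted" next-token IDs
    step = iter(scripted_ids)
    def stub_model(token_ids):
        return next(step)
    return stub_model

def generate_stream_sketch(model, token_ids, max_new_tokens,
                           eos_token_id=None):
    # Same control flow as generate_text_basic_stream, minus tensors
    for _ in range(max_new_tokens):
        next_token = model(token_ids)
        if eos_token_id is not None and next_token == eos_token_id:
            break  # stop without yielding the EOS token
        yield next_token
        token_ids = token_ids + [next_token]  # grow the context

# 999 is never reached because EOS_TOKEN_ID comes first
stub = make_stub_model([20286, 4128, 4119, 13, EOS_TOKEN_ID, 999])
generated = list(generate_stream_sketch(stub, [840, 20772], 100,
                                        eos_token_id=EOS_TOKEN_ID))
print(generated)  # [20286, 4128, 4119, 13]
```

- As in `generate_text_basic_stream`, the EOS token itself is never yielded; generation simply ends once it appears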
"4c97d10c-e3de-4fcd-a669-3c4198be4018", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content." ] } ], "source": [ "for token in generate_text_basic_stream(\n", " model=model,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id # Use EOS token\n", "):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )" ] }, { "cell_type": "markdown", "id": "0150d6f6-e6cc-4576-b77a-7f12b53d3aa6", "metadata": {}, "source": [ "- The response above is what you get when running the code on CPU; the generated text may differ slightly depending on the device" ] }, { "cell_type": "markdown", "id": "d5669f8e-b6e9-490f-90cb-52735b3aab1a", "metadata": {}, "source": [ "- Before we wrap up this section and see how we can speed up the code, let's implement a simple benchmarking function to track the computational performance" ] }, { "cell_type": "code", "execution_count": 27, "id": "9efe08e3-6819-490f-bbfa-88b4e61b4468", "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "def generate_stats(output_token_ids, tokenizer, start_time,\n", " end_time):\n", " total_time = end_time - start_time\n", " print(f\"\\n\\nTime: {total_time:.2f} sec\")\n", " print(f\"{int(output_token_ids.numel() / total_time)} tokens/sec\")\n", "\n", " for name, backend in ((\"CUDA\", getattr(torch, \"cuda\", None)),\n", " (\"XPU\", getattr(torch, \"xpu\", None))):\n", " if backend is not None and backend.is_available():\n", "\n", " # Check whether we are actually using this backend\n", " device_type = output_token_ids.device.type\n", " if device_type != name.lower():\n", " 
warnings.warn(\n", " f\"{name} is available but tensors are on \"\n", " f\"{device_type}. Memory stats may be 0.\"\n", " )\n", " \n", " # Synchronize if supported (important for async backends)\n", " if hasattr(backend, \"synchronize\"):\n", " backend.synchronize()\n", " \n", " max_mem_bytes = backend.max_memory_allocated()\n", " max_mem_gb = max_mem_bytes / (1024 ** 3)\n", " print(f\"Max {name} memory allocated: {max_mem_gb:.2f} GB\")\n", " backend.reset_peak_memory_stats()" ] }, { "cell_type": "code", "execution_count": 28, "id": "385b0cb7-6ba6-4b60-8bcb-7030feb1c194", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.\n", "\n", "Time: 1.39 sec\n", "29 tokens/sec\n" ] } ], "source": [ "import time\n", "\n", "start_time = time.time()\n", "generated_ids = []\n", "\n", "for token in generate_text_basic_stream(\n", " model=model,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id\n", "):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )\n", "\n", " next_token_id = token.squeeze(0)\n", " generated_ids.append(next_token_id) # Collect generated tokens\n", "\n", "end_time = time.time()\n", "\n", "output_token_ids_tensor = torch.cat(generated_ids, dim=0)\n", "generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)" ] }, { "cell_type": "markdown", "id": "ce1c6163-e78d-4f03-8ec3-b2d8af767564", "metadata": {}, "source": [ " \n", "## 2.8 Faster inference via KV caching" ] }, { "cell_type": "markdown", "id": "5ba8fe63-1dc8-4992-b4d1-ab994160abb3", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": 
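- To build intuition for the speed-up discussed in this section, here is a toy cost model (a purely hypothetical counter, not actual attention computation): without a cache, each generation step re-encodes the entire growing sequence, so total work grows quadratically with the number of generated tokens; with a cache, only the new token is processed at each step after the initial prompt:

```python
# Toy cost model (hypothetical): count how many token positions are
# processed to generate `new_tokens` tokens, with and without a KV cache.
def tokens_processed(prompt_len, new_tokens, use_cache):
    processed = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        if use_cache:
            # First step encodes the prompt once ("prefill"); afterwards
            # only the newest token is fed through the model, because
            # keys/values for earlier tokens are read from the cache
            processed += 1 if seq_len > prompt_len else prompt_len
        else:
            # The whole growing sequence is re-encoded every step
            processed += seq_len
        seq_len += 1
    return processed

print(tokens_processed(prompt_len=6, new_tokens=100, use_cache=False))  # 5550
print(tokens_processed(prompt_len=6, new_tokens=100, use_cache=True))   # 105
```

- With a 6-token prompt and 100 generated tokens, the uncached variant processes 5,550 token positions versus 105 with the cache, which is why the KV-cached generation function in this section is so much faster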
"56dbd7d4-f6ed-43ab-9751-8534e06dfcc7", "metadata": {}, "source": [ "- Note that the code in this book emphasizes code readability; a whole separate book could be written about optimizations\n", "- Here, we look at an engineering trick called \"KV caching\" (KV refers to the keys and values inside the attention mechanism of the LLM)\n", "- If you are unfamiliar with these terms, don't worry: all you need to know is that there is a way to store (cache) intermediate values that are reused in each iteration" ] }, { "cell_type": "markdown", "id": "2253ebb3-3125-4cde-b562-a0d8bf662888", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "02471c83-8237-48f6-adff-2811152ca203", "metadata": {}, "source": [ "- For more details on the mechanics of KV caching, see my [Understanding and Coding the KV Cache in LLMs from Scratch](https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms) article\n", "- Below is a modified version of the `generate_text_basic_stream` function that uses a KV cache" ] }, { "cell_type": "code", "execution_count": 29, "id": "d4095088-9dd5-4ffd-b2f5-24d848ae27f0", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.qwen3 import KVCache\n", "\n", "@torch.inference_mode()\n", "def generate_text_basic_stream_cache(\n", " model,\n", " token_ids,\n", " max_new_tokens,\n", " eos_token_id=None\n", "):\n", " model.eval()\n", " cache = KVCache(n_layers=model.cfg[\"n_layers\"]) # New\n", " model.reset_kv_cache() # New\n", "\n", " out = model(token_ids, cache=cache)[:, -1]\n", " for _ in range(max_new_tokens):\n", " next_token = torch.argmax(out, dim=-1, keepdim=True)\n", "\n", " if (eos_token_id is not None\n", " and torch.all(next_token == eos_token_id)):\n", " break\n", "\n", " yield next_token\n", " out = model(next_token, cache=cache)[:, -1]" ] }, { "cell_type": "markdown", "id": "ad5e6c37-fafe-4143-a88d-959916e4f301", "metadata": {}, "source": [ "- The usage is similar to before:" ] }, { "cell_type": 
"code", "execution_count": 30, "id": "d8027190-6de5-4d6f-ae3a-2baf2027b2f0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.\n", "\n", "Time: 0.84 sec\n", "49 tokens/sec\n" ] } ], "source": [ "start_time = time.time()\n", "generated_ids = []\n", "\n", "for token in generate_text_basic_stream_cache(\n", " model=model,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id\n", "):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )\n", "\n", " next_token_id = token.squeeze(0)\n", " generated_ids.append(next_token_id) # Collect generated tokens\n", "\n", "end_time = time.time()\n", "\n", "output_token_ids_tensor = torch.cat(generated_ids, dim=0)\n", "generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)" ] }, { "cell_type": "markdown", "id": "4235f50d-c8cc-4e53-bcf1-abd8dffac617", "metadata": {}, "source": [ "- As we can see, the KV cache version is substantially faster than before (49 tokens/sec instead of 29 tokens/sec in the runs above; on a Mac Mini M4 CPU, the gap is even larger, 29 versus 5 tokens/sec, as shown in the results table in section 2.9)" ] }, { "cell_type": "markdown", "id": "8f4e58ad-df61-4f5f-8582-9378e8ba7ed0", "metadata": {}, "source": [ " \n", "## 2.9 Faster inference via PyTorch model compilation" ] }, { "cell_type": "markdown", "id": "fb5e1ded-e4a9-47a3-bf68-7f7241d878e5", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "12863930-7bb9-4d0d-8a79-5d5582a5a38e", "metadata": {}, "source": [ "- Another technique to speed up the model inference (text generation) by a lot is using `torch.compile`\n", "- The usage is simple: we just call `torch.compile` on the model (see [the 
documentation](https://docs.pytorch.org/docs/stable/torch.compiler_api.html) for additional options)" ] }, { "cell_type": "code", "execution_count": 31, "id": "a23ac9c1-4e83-4b05-b5b2-093178450931", "metadata": {}, "outputs": [], "source": [ "major, minor = map(int, torch.__version__.split(\".\")[:2])\n", "if (major, minor) >= (2, 8):\n", " # This avoids retriggering model recompilations \n", " # in PyTorch 2.8 and newer\n", " # if the model contains code like self.pos = self.pos + 1\n", " torch._dynamo.config.allow_unspec_int_on_nn_module = True\n", "\n", "model_compiled = torch.compile(model)\n", "\n", "# If you have issues with torch.compile on \"mps\" devices and get an InductorError,\n", "# make sure you are using PyTorch 2.9 or newer" ] }, { "cell_type": "markdown", "id": "a6a7c684-0026-4971-8909-54004e47e0c0", "metadata": {}, "source": [ "---\n", "\n", "**Windows note 1**\n", "\n", "- Compilation can be tricky on Windows\n", "- `torch.compile()` uses Inductor, which JIT-compiles kernels and needs a working C/C++ toolchain\n", "- For CUDA, Inductor also depends on Triton, available via the community package `triton-windows`\n", " - If you see `cl not found`, [install Visual Studio Build Tools with the \"C++ workload\"](https://learn.microsoft.com/en-us/cpp/build/vscpp-step-0-installation?view=msvc-170) and run Python from the \"x64 Native Tools\" prompt\n", " - If you see `triton not found` with CUDA, install `triton-windows` (for example, `uv pip install \"triton-windows<3.4\"`).\n", "- For CPU, a reader further recommended following this [PyTorch Inductor guide for Windows](https://docs.pytorch.org/tutorials/unstable/inductor_windows.html)\n", " - Here, it is important to install the English language package when installing Visual Studio 2022 to avoid a UTF-8 error\n", " - Also, please note that the code needs to be run via the \"Visual Studio 2022 Developer Command Prompt\" rather than a notebook\n", "- If this setup proves tricky, you can skip 
compilation; **compilation is optional, and all code examples work fine without it**\n", "\n", "**Windows note 2**\n", "\n", "- Readers reported that there is no speed-up when running `torch.compile` with default settings on Windows; however, running `torch.compile` with the `\"max-autotune\"` mode resulted in a 2x speed-up: `torch.compile(model, mode=\"max-autotune\")`\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "d179612a-2f46-44d5-b180-385eb3bfadc9", "metadata": {}, "source": [ "- The first iteration can be a bit slow as it does the initial compilation and optimization; hence, we repeat the text generation multiple times\n", "- First, let's start with the non-cached version (this can be a bit slow, and the warm-up compilation may take a few minutes)" ] }, { "cell_type": "code", "execution_count": 32, "id": "43044619-36a7-4036-a7a6-c0172ea7630e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "W0213 17:02:09.090000 73246 torch/_inductor/utils.py:1679] [0/0] Not enough SMs to use max_autotune_gemm mode\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.\n", "\n", "Warm-up run\n", "\n", "\n", "Time: 27.15 sec\n", "1 tokens/sec\n", "\n", "------------------------------\n", "\n", " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.\n", "\n", "Timed run 1:\n", "\n", "\n", "Time: 0.82 sec\n", "42 tokens/sec\n", "\n", "------------------------------\n", "\n", " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.\n", "\n", 
"Timed run 2:\n", "\n", "\n", "Time: 0.82 sec\n", "42 tokens/sec\n", "\n", "------------------------------\n", "\n" ] } ], "source": [ "for i in range(3):\n", "\n", " start_time = time.time()\n", " generated_ids = []\n", " \n", " for token in generate_text_basic_stream(\n", " model=model_compiled,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id\n", " ):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )\n", " \n", " next_token_id = token.squeeze(0)\n", " generated_ids.append(next_token_id) # Collect generated tokens\n", " \n", " end_time = time.time()\n", " \n", "\n", " if i == 0:\n", " print(\"\\n\\nWarm-up run\")\n", " else:\n", " print(f\"\\n\\nTimed run {i}:\")\n", "\n", " output_token_ids_tensor = torch.cat(generated_ids, dim=0)\n", " generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)\n", "\n", " print(f\"\\n{30*'-'}\\n\")" ] }, { "cell_type": "markdown", "id": "d86b44a9-0512-4c74-b699-aa7e4b68cc0c", "metadata": {}, "source": [ "- As we can see above, at 42 tokens/sec, the compiled model is noticeably faster than the uncompiled version (29 tokens/sec), though it still trails the uncompiled KV cache version (49 tokens/sec)\n", "- Let's now see how well the KV cache version does" ] }, { "cell_type": "code", "execution_count": 33, "id": "397bd803-a554-46a1-a2fd-eb8d5e65b134", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.\n", "\n", "Warm-up run\n", "\n", "\n", "Time: 45.89 sec\n", "0 tokens/sec\n", "\n", "------------------------------\n", "\n", " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a 
wide range of tasks, from answering questions to writing articles, and even creating creative content.\n", "\n", "Timed run 1:\n", "\n", "\n", "Time: 0.48 sec\n", "84 tokens/sec\n", "\n", "------------------------------\n", "\n", " Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing articles, and even creating creative content.\n", "\n", "Timed run 2:\n", "\n", "\n", "Time: 0.45 sec\n", "90 tokens/sec\n", "\n", "------------------------------\n", "\n" ] } ], "source": [ "for i in range(3):\n", " \n", " start_time = time.time()\n", " generated_ids = []\n", " \n", " for token in generate_text_basic_stream_cache(\n", " model=model_compiled,\n", " token_ids=input_token_ids_tensor,\n", " max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id\n", " ):\n", " token_id = token.squeeze(0).tolist()\n", " print(\n", " tokenizer.decode(token_id),\n", " end=\"\",\n", " flush=True\n", " )\n", " \n", " next_token_id = token.squeeze(0)\n", " generated_ids.append(next_token_id) # Collect generated tokens\n", " \n", " end_time = time.time()\n", "\n", " if i == 0:\n", " print(\"\\n\\nWarm-up run\")\n", " else:\n", " print(f\"\\n\\nTimed run {i}:\")\n", "\n", " output_token_ids_tensor = torch.cat(generated_ids, dim=0)\n", " generate_stats(\n", " output_token_ids_tensor, tokenizer, start_time, end_time\n", " )\n", "\n", " print(f\"\\n{30*'-'}\\n\")" ] }, { "cell_type": "markdown", "id": "f4c98ca9-a9fd-4f82-957f-eaae1e6907a7", "metadata": {}, "source": [ "- As we can see, compiling the KV cache version resulted in a substantial speed-up of almost 2x (84 to 90 tokens/sec versus 49 tokens/sec)\n", "- Below is a table with additional results" ] }, { "cell_type": "markdown", "id": "e403cdac-c633-49a6-a713-7735efc46a60", "metadata": {}, "source": [ "| Model | Mode | Hardware | Tokens/sec | GPU Memory (VRAM) |\n", 
"|------------|-------------------|----------------------|---------------|-------------------|\n", "| Qwen3Model | Regular | Mac Mini M4 CPU | 5 | - |\n", "| Qwen3Model | Regular compiled | Mac Mini M4 CPU | 5 | - |\n", "| Qwen3Model | KV cache | Mac Mini M4 CPU | 29 | - |\n", "| Qwen3Model | KV cache compiled | Mac Mini M4 CPU | 68 | - |\n", "| | | | | |\n", "| Qwen3Model | Regular | Mac Mini M4 GPU | 27 | - |\n", "| Qwen3Model | Regular compiled | Mac Mini M4 GPU | 43 | - |\n", "| Qwen3Model | KV cache | Mac Mini M4 GPU | 41 | - |\n", "| Qwen3Model | KV cache compiled | Mac Mini M4 GPU | 71 | - |\n", "| | | | | |\n", "| Qwen3Model | Regular | NVIDIA H100 GPU | 51 | 1.55 GB |\n", "| Qwen3Model | Regular compiled | NVIDIA H100 GPU | 164 | 1.81 GB |\n", "| Qwen3Model | KV cache | NVIDIA H100 GPU | 48 | 1.52 GB |\n", "| Qwen3Model | KV cache compiled | NVIDIA H100 GPU | 141 | 1.81 GB |\n", "| | | | | |\n", "| Qwen3Model | Regular | NVIDIA DGX Spark GPU | 74 | 1.53 GB |\n", "| Qwen3Model | Regular compiled | NVIDIA DGX Spark GPU | 103 | 1.49 GB |\n", "| Qwen3Model | KV cache | NVIDIA DGX Spark GPU | 68 | 1.47 GB |\n", "| Qwen3Model | KV cache compiled | NVIDIA DGX Spark GPU | 98 | 1.47 GB |" ] }, { "cell_type": "markdown", "id": "4d05c6a6-3170-4251-9bdf-07489d1104c6", "metadata": {}, "source": [ "- The NVIDIA DGX Spark above uses a GB10 (Blackwell) GPU\n", "- Note that we ran all the examples with a single prompt (i.e., a batch size of 1); if you are curious about batched inference, see appendix E" ] }, { "cell_type": "markdown", "id": "0db3a7b3-9fb2-47de-9f42-4ad1f1f25de8", "metadata": {}, "source": [ " \n", "## Summary" ] }, { "cell_type": "markdown", "id": "40a8d136-06df-46d4-b616-202dd9a8cdeb", "metadata": {}, "source": [ "- No code in this section" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": 
".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }