{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cbbc1fe3-bff1-4631-bf35-342e19c54cc0",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b022374-e3f6-4437-b86f-e6f8f94cbebc",
   "metadata": {},
   "source": [
    "# **扩展 Tiktoken BPE 分词器，添加新 Token**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcd624b1-2060-49af-bbf6-40517a58c128",
   "metadata": {},
   "source": [
    "- 本笔记本介绍 **如何扩展现有的 BPE 分词器**，并重点讲解 **如何在 OpenAI 的 [Tiktoken](https://github.com/openai/tiktoken) 实现中添加新 Token**。  \n",
    "- 如果需要 **分词的基础知识**，请参考 [第 2 章](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb) 和 **BPE from Scratch** [教程](link)。  \n",
    "- 例如，假设我们有一个 **GPT-2 分词器**，并希望对以下文本进行编码："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "798d4355-a146-48a8-a1a5-c5cec91edf2c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[15496, 11, 2011, 3791, 30642, 62, 16, 318, 257, 649, 11241, 13, 220, 50256]\n"
     ]
    }
   ],
   "source": [
    "import tiktoken\n",
    "\n",
    "base_tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
    "sample_text = \"Hello, MyNewToken_1 is a new token. <|endoftext|>\"\n",
    "\n",
    "token_ids = base_tokenizer.encode(sample_text, allowed_special={\"<|endoftext|>\"})\n",
    "print(token_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b09b19b-772d-4449-971b-8ab052ee726d",
   "metadata": {},
   "source": [
    "- **遍历每个 Token ID**，可以帮助我们更好地理解 **如何通过词汇表解码 Token ID**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "21fd634b-bb4c-4ba3-8b69-9322b727bf58",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "15496 -> Hello\n",
      "11 -> ,\n",
      "2011 ->  My\n",
      "3791 -> New\n",
      "30642 -> Token\n",
      "62 -> _\n",
      "16 -> 1\n",
      "318 ->  is\n",
      "257 ->  a\n",
      "649 ->  new\n",
      "11241 ->  token\n",
      "13 -> .\n",
      "220 ->  \n",
      "50256 -> <|endoftext|>\n"
     ]
    }
   ],
   "source": [
    "for token_id in token_ids:\n",
    "    print(f\"{token_id} -> {base_tokenizer.decode([token_id])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd5b1b9b-b1a9-489e-9711-c15a8e081813",
   "metadata": {},
   "source": [
    "- 如上所示，**\"MyNewToken_1\" 被拆分为 5 个子词 Token**，这对于 BPE 处理 **未知词汇** 时是正常行为。  \n",
    "- 但如果 **\"MyNewToken_1\" 是一个特殊 Token**，我们希望它像 **`\"<|endoftext|>\"`** 一样 **作为单个 Token 进行编码**，本笔记本将讲解如何实现该功能。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65f62ab6-df96-4f88-ab9a-37702cd30f5f",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "## 1. 添加特殊的token"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4379fdb-57ba-4a75-9183-0aee0836c391",
   "metadata": {},
   "source": [
    "- 需要注意，我们必须 **将新 Token 作为特殊 Token 添加**。原因在于：  \n",
    "  - **新 Token 在原始分词器训练过程中并未出现**，因此 **没有对应的“合并规则”（merges）**。  \n",
    "  - 即使我们手动创建这些合并规则，**也很难在不破坏现有分词体系的情况下，将其正确整合**（详情请参考 **BPE from Scratch** 笔记本 [链接] 了解“合并规则”）。  \n",
    "\n",
    "- 例如，假设我们希望 **添加 2 个新 Token**："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "265f1bba-c478-497d-b7fc-f4bd191b7d55",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define custom tokens and their token IDs\n",
    "custom_tokens = [\"MyNewToken_1\", \"MyNewToken_2\"]\n",
    "custom_token_ids = {\n",
    "    token: base_tokenizer.n_vocab + i for i, token in enumerate(custom_tokens)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c6f3d98-1ab6-43cf-9ae2-2bf53860f99e",
   "metadata": {},
   "source": [
    "- 接下来，我们创建一个自定义的 **`Encoding`** 对象，用于存储 **特殊 Token**，具体如下：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f519852-59ea-4069-a8c7-0f647bfaea09",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new Encoding object with extended tokens\n",
    "extended_tokenizer = tiktoken.Encoding(\n",
    "    name=\"gpt2_custom\",\n",
    "    pat_str=base_tokenizer._pat_str,\n",
    "    mergeable_ranks=base_tokenizer._mergeable_ranks,\n",
    "    special_tokens={**base_tokenizer._special_tokens, **custom_token_ids},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90af6cfa-e0cc-4c80-89dc-3a824e7bdeb2",
   "metadata": {},
   "source": [
    "- 就这样！现在我们可以验证 **分词器是否能够正确编码示例文本**：  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "153e8e1d-c4cb-41ff-9c55-1701e9bcae1c",
   "metadata": {},
   "source": [
    "- 如我们所见，**新添加的 Token**（`50257` 和 `50258`）**已成功编码到输出中**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "eccc78a4-1fd4-47ba-a114-83ee0a3aec31",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[36674, 2420, 351, 220, 50257, 290, 220, 50258, 13, 220, 50256]\n"
     ]
    }
   ],
   "source": [
    "special_tokens_set = set(custom_tokens) | {\"<|endoftext|>\"}\n",
    "\n",
    "token_ids = extended_tokenizer.encode(\n",
    "    \"Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>\",\n",
    "    allowed_special=special_tokens_set\n",
    ")\n",
    "print(token_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc0547c1-bbb5-4915-8cf4-caaebcf922eb",
   "metadata": {},
   "source": [
    "- 同样，我们还可以 **逐个 Token 检查编码结果**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7583eff9-b10d-4e3d-802c-f0464e1ef030",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "36674 -> Sample\n",
      "2420 ->  text\n",
      "351 ->  with\n",
      "220 ->  \n",
      "50257 -> MyNewToken_1\n",
      "290 ->  and\n",
      "220 ->  \n",
      "50258 -> MyNewToken_2\n",
      "13 -> .\n",
      "220 ->  \n",
      "50256 -> <|endoftext|>\n"
     ]
    }
   ],
   "source": [
    "for token_id in token_ids:\n",
    "    print(f\"{token_id} -> {extended_tokenizer.decode([token_id])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17f0764e-e5a9-4226-a384-18c11bd5fec3",
   "metadata": {},
   "source": [
    "- 如上所示，我们已成功 **更新分词器**。  \n",
    "- 但如果要将其用于 **预训练的 LLM**，还需要 **更新 LLM 的嵌入层（embedding layer）和输出层（output layer）**，具体方法将在 **下一节** 进行讲解。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ec7f98d-8f09-4386-83f0-9bec68ef7f66",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "## 2. 更新预训练的LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8a4f68b-04e9-4524-8df4-8718c7b566f2",
   "metadata": {},
   "source": [
    "- 本节将讲解 **如何在更新分词器后，对现有的预训练 LLM 进行相应调整**。  \n",
    "- 我们将使用 **书中主章节所采用的原始预训练 GPT-2 模型** 进行演示。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a9b252e-1d1d-4ddf-b9f3-95bd6ba505a9",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "### 2.1 加载预训练的GPT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ded29b4e-9b39-4191-b61c-29d6b2360bae",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 34.4kiB/s]\n",
      "encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 4.78MiB/s]\n",
      "hparams.json: 100%|█████████████████████████| 90.0/90.0 [00:00<00:00, 24.7kiB/s]\n",
      "model.ckpt.data-00000-of-00001: 100%|███████| 498M/498M [00:33<00:00, 14.7MiB/s]\n",
      "model.ckpt.index: 100%|███████████████████| 5.21k/5.21k [00:00<00:00, 1.05MiB/s]\n",
      "model.ckpt.meta: 100%|██████████████████████| 471k/471k [00:00<00:00, 2.33MiB/s]\n",
      "vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 2.45MiB/s]\n"
     ]
    }
   ],
   "source": [
    "# Relative import from the gpt_download.py contained in this folder\n",
    "from gpt_download import download_and_load_gpt2\n",
    "\n",
    "settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "93dc0d8e-b549-415b-840e-a00023bddcf9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Relative import from the gpt_download.py contained in this folder\n",
    "from previous_chapters import GPTModel\n",
    "\n",
    "GPT_CONFIG_124M = {\n",
    "    \"vocab_size\": 50257,   # Vocabulary size\n",
    "    \"context_length\": 256, # Shortened context length (orig: 1024)\n",
    "    \"emb_dim\": 768,        # Embedding dimension\n",
    "    \"n_heads\": 12,         # Number of attention heads\n",
    "    \"n_layers\": 12,        # Number of layers\n",
    "    \"drop_rate\": 0.1,      # Dropout rate\n",
    "    \"qkv_bias\": False      # Query-key-value bias\n",
    "}\n",
    "\n",
    "# Define model configurations in a dictionary for compactness\n",
    "model_configs = {\n",
    "    \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n",
    "    \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n",
    "    \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n",
    "    \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n",
    "}\n",
    "\n",
    "# Copy the base configuration and update with specific model settings\n",
    "model_name = \"gpt2-small (124M)\"  # Example model name\n",
    "NEW_CONFIG = GPT_CONFIG_124M.copy()\n",
    "NEW_CONFIG.update(model_configs[model_name])\n",
    "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n",
    "\n",
    "gpt = GPTModel(NEW_CONFIG)\n",
    "gpt.eval();"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83f898c0-18f4-49ce-9b1f-3203a277b29e",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "### 2.2 使用预训练过的GPT"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5a1f5e1-e806-4c60-abaa-42ae8564908c",
   "metadata": {},
   "source": [
    "- 接下来，我们使用 **原始分词器** 和 **更新后的分词器** 对以下示例文本进行编码，并进行对比：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9a88017d-cc8f-4ba1-bba9-38161a30f673",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sample_text = \"Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>\"\n",
    "\n",
    "original_token_ids = base_tokenizer.encode(\n",
    "    sample_text, allowed_special={\"<|endoftext|>\"}\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1ee01bc3-ca24-497b-b540-3d13c52c29ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_token_ids = extended_tokenizer.encode(\n",
    "    \"Sample text with MyNewToken_1 and MyNewToken_2. <|endoftext|>\",\n",
    "    allowed_special=special_tokens_set\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1143106b-68fe-4234-98ad-eaff420a4d08",
   "metadata": {},
   "source": [
    "- 现在，我们将 **原始 Token ID 输入到 GPT 模型中**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6b06827f-b411-42cc-b978-5c1d568a3200",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[ 0.2204,  0.8901,  1.0138,  ...,  0.2585, -0.9192, -0.2298],\n",
      "         [ 0.6745, -0.0726,  0.8218,  ..., -0.1768, -0.4217,  0.0703],\n",
      "         [-0.2009,  0.0814,  0.2417,  ...,  0.3166,  0.3629,  1.3400],\n",
      "         ...,\n",
      "         [ 0.1137, -0.1258,  2.0193,  ..., -0.0314, -0.4288, -0.1487],\n",
      "         [-1.1983, -0.2050, -0.1337,  ..., -0.0849, -0.4863, -0.1076],\n",
      "         [-1.0675, -0.5905,  0.2873,  ..., -0.0979, -0.8713,  0.8415]]])\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "with torch.no_grad():\n",
    "    out = gpt(torch.tensor([original_token_ids]))\n",
    "\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "082c7a78-35a8-473e-a08d-b099a6348a74",
   "metadata": {},
   "source": [
    "- 如上所示，模型能够正常运行 **（为简洁起见，代码仅显示原始输出，未将其转换回文本）**。  \n",
    "- 若需了解 **如何将模型输出转换回文本**，请参考 **第 5 章 [链接] 的 `generate` 函数（5.3.3 节）**。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "628265b5-3dde-44e7-bde2-8fc594a2547d",
   "metadata": {},
   "source": [
    "- 如果我们 **使用更新后的分词器生成的 Token ID** 再次输入模型，会发生什么情况？  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9796ad09-787c-4c25-a7f5-6d1dfe048ac3",
   "metadata": {},
   "source": [
    "```python\n",
    "with torch.no_grad():\n",
    "    gpt(torch.tensor([new_token_ids]))\n",
    "\n",
    "print(out)\n",
    "\n",
    "...\n",
    "# IndexError: index out of range in self\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77d00244-7e40-4de0-942e-e15cdd8e3b18",
   "metadata": {},
   "source": [
    "- 如我们所见，这会导致 **索引错误（Index Error）**。  \n",
    "- 这是因为 **GPT 模型的输入嵌入层（Embedding Layer）和输出层（Output Layer）** 预设了固定的 **词汇表大小（Vocabulary Size）**，而更新后的分词器可能已超出该范围：\n",
    "\n",
    "<img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/extend-tiktoken/gpt-updates.webp\" width=\"400px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dec38b24-c845-4090-96a4-0d3c4ec241d6",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "### **2.3 更新嵌入层（Updating the Embedding Layer）**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1328726-8297-4162-878b-a5daff7de742",
   "metadata": {},
   "source": [
    "- 我们首先 **更新模型的嵌入层（Embedding Layer）**。  \n",
    "- 首先，需要注意 **嵌入层包含 50,257 个条目**，这正好对应于 **原始词汇表的大小**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "23ecab6e-1232-47c7-a318-042f90e1dff3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Embedding(50257, 768)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt.tok_emb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d760c683-d082-470a-bff8-5a08b30d3b61",
   "metadata": {},
   "source": [
    "- 我们希望 **扩展嵌入层**，**增加 2 个新 Token**。  \n",
    "- 简而言之，我们 **创建一个更大的嵌入层**，然后 **将原始嵌入层的权重复制到新嵌入层中**。  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "4ec5c48e-c6fe-4e84-b290-04bd4da9483f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Embedding(50259, 768)\n"
     ]
    }
   ],
   "source": [
    "num_tokens, emb_size = gpt.tok_emb.weight.shape\n",
    "new_num_tokens = num_tokens + 2\n",
    "\n",
    "# Create a new embedding layer\n",
    "new_embedding = torch.nn.Embedding(new_num_tokens, emb_size)\n",
    "\n",
    "# Copy weights from the old embedding layer\n",
    "new_embedding.weight.data[:num_tokens] = gpt.tok_emb.weight.data\n",
    "\n",
    "# Replace the old embedding layer with the new one in the model\n",
    "gpt.tok_emb = new_embedding\n",
    "\n",
    "print(gpt.tok_emb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63954928-31a5-4e7e-9688-2e0c156b7302",
   "metadata": {},
   "source": [
    "- 如上所示，我们的 **嵌入层（Embedding Layer）已成功扩展**。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e68bea5-255b-47bb-b352-09ea9539bc25",
   "metadata": {},
   "source": [
    "&nbsp;\n",
    "### **2.4 更新输出层（Updating the Output Layer）**  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90a4a519-bf0f-4502-912d-ef0ac7a9deab",
   "metadata": {},
   "source": [
    "- 接下来，我们需要 **扩展输出层（Output Layer）**，该层当前包含 **50,257 个输出特征**，其大小与嵌入层的词汇表大小相同。  \n",
    "- **（顺带一提，你可能会对额外的学习资料感兴趣，其中探讨了 PyTorch 中 `Linear` 层与 `Embedding` 层的相似性。）**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6105922f-d889-423e-bbcc-bc49156d78df",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Linear(in_features=768, out_features=50257, bias=False)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt.out_head"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29f1ff24-9c00-40f6-a94f-82d03aaf0890",
   "metadata": {},
   "source": [
    "- **扩展输出层（Output Layer）的过程** 与 **扩展嵌入层（Embedding Layer）** 类似：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "354589db-b148-4dae-8068-62132e3fb38e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Linear(in_features=768, out_features=50259, bias=True)\n"
     ]
    }
   ],
   "source": [
    "original_out_features, original_in_features = gpt.out_head.weight.shape\n",
    "\n",
    "# Define the new number of output features (e.g., adding 2 new tokens)\n",
    "new_out_features = original_out_features + 2\n",
    "\n",
    "# Create a new linear layer with the extended output size\n",
    "new_linear = torch.nn.Linear(original_in_features, new_out_features)\n",
    "\n",
    "# Copy the weights and biases from the original linear layer\n",
    "with torch.no_grad():\n",
    "    new_linear.weight[:original_out_features] = gpt.out_head.weight\n",
    "    if gpt.out_head.bias is not None:\n",
    "        new_linear.bias[:original_out_features] = gpt.out_head.bias\n",
    "\n",
    "# Replace the original linear layer with the new one\n",
    "gpt.out_head = new_linear\n",
    "\n",
    "print(gpt.out_head)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df5d2205-1fae-4a4f-a7bd-fa8fc37eeec2",
   "metadata": {},
   "source": [
    "- 首先，我们先 **使用原始 Token ID 测试模型**，观察其是否仍能正常运行： "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "df604bbc-6c13-4792-8ba8-ecb692117c25",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],\n",
      "         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],\n",
      "         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],\n",
      "         ...,\n",
      "         [ 0.1200, -0.1027,  2.0549,  ..., -0.1519, -0.2096,  0.5651],\n",
      "         [-1.1920, -0.1819, -0.0981,  ..., -0.1108,  0.8435, -0.3771],\n",
      "         [-1.0612, -0.5674,  0.3229,  ...,  0.8383, -0.7121, -0.4850]]])\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    output = gpt(torch.tensor([original_token_ids]))\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d80717e-50e6-4927-8129-0aadfa2628f5",
   "metadata": {},
   "source": [
    "- 接下来，让我们 **测试更新后的模型在新增 Token 上的表现**。   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "75f11ec9-bdd2-440f-b8c8-6646b75891c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[[ 0.2267,  0.9132,  1.0494,  ..., -0.2330, -0.3008, -1.1458],\n",
      "         [ 0.6808, -0.0495,  0.8574,  ...,  0.0671,  0.5572, -0.7873],\n",
      "         [-0.1947,  0.1045,  0.2773,  ...,  1.3368,  0.8479, -0.9660],\n",
      "         ...,\n",
      "         [-0.0656, -1.2451,  0.7957,  ..., -1.2124,  0.1044,  0.5088],\n",
      "         [-1.1561, -0.7380, -0.0645,  ..., -0.4373,  1.1401, -0.3903],\n",
      "         [-0.8961, -0.6437, -0.1667,  ...,  0.5663, -0.5862, -0.4020]]])\n"
     ]
    }
   ],
   "source": [
    "with torch.no_grad():\n",
    "    output = gpt(torch.tensor([new_token_ids]))\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d88a1bba-db01-4090-97e4-25dfc23ed54c",
   "metadata": {},
   "source": [
    "- 如我们所见，**模型已成功支持扩展后的 Token 集**。  \n",
    "- 实际应用中，我们通常需要 **对模型进行微调（Fine-tuning）或持续预训练（Continual Pretraining）**，特别是 **新扩展的嵌入层（Embedding Layer）和输出层（Output Layer）**，以确保模型能够有效学习新 Token 的表示。  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6de573ad-0338-40d9-9dad-de60ae349c4f",
   "metadata": {},
   "source": [
    "### **关于权重共享（Weight Tying）**  \n",
    "\n",
    "- **如果模型使用了权重共享（Weight Tying）**，即 **嵌入层（Embedding Layer）与输出层（Output Layer）共享相同的权重**（类似于 **Llama 3** [链接]），那么 **扩展输出层的过程将更为简单**。  \n",
    "- 在这种情况下，我们 **只需直接将嵌入层的权重复制到输出层**：  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4cbc5f51-c7a8-49d0-b87f-d3d87510953b",
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt.out_head.weight = gpt.tok_emb.weight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d0d553a8-edff-40f0-bdc4-dff900e16caf",
   "metadata": {},
   "outputs": [],
   "source": [
    "with torch.no_grad():\n",
    "    output = gpt(torch.tensor([new_token_ids]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}