{ "cells": [ { "cell_type": "markdown", "id": "9dec0dfb-3d60-41d0-a63a-b010dce67e32", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", "
汉化的库: https://github.com/GoatCsu/CN-LLMs-from-scratch.git\n", "
\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "5e475425-8300-43f2-a5e8-6b5d2de59925", "metadata": {}, "source": [ "# 从零开始构建字节对编码(BPE)分词器" ] }, { "cell_type": "markdown", "id": "a1bfc3f3-8ec1-4fd3-b378-d9a3d7807a54", "metadata": {}, "source": [ "- 这是一个独立的notebook,旨在从零开始实现字节对编码(BPE)分词算法,该算法广泛应用于 GPT-2 至 GPT-4、Llama 3 等大模型。\n", "- 分词的具体目的请参见[第二章](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb);这里的代码为额外材料,用于解释 BPE 算法。\n", "- OpenAI 为训练原始 GPT 模型所实现的 BPE 分词器可以在[此处](https://github.com/openai/gpt-2/blob/master/src/encoder.py)。\n", "- BPE 算法最早由 Philip Gage 于 1994 年提出,详见:“[一种新的数据压缩算法](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)”。\n", "- 目前,大多数项目,包括 Llama 3,使用 OpenAI 开源的 [tiktoken 库](https://github.com/openai/tiktoken),因为其卓越的计算性能;它支持加载预训练的 GPT-2 和 GPT-4 分词器(Llama 3 模型同样是使用 GPT-4 分词器训练的)。\n", "- 上述案例与我在此notebook中提供有所区别,其一,我们的分词器主要是用于教学目的;其二,我们构建了一个用于训练分词器的函数。\n", "- 还有一个[minBPE](https://github.com/karpathy/minbpe),可能有更好的性能;与 `minbpe` 不同,书中的代码还支持加载 OpenAI 原始分词器的词汇表与自己的单词表合并。\n" ] }, { "cell_type": "markdown", "id": "f62336db-f45c-4894-9167-7583095dbdf1", "metadata": {}, "source": [ " \n", "# 1. 字节对编码(BPE)的主要思想" ] }, { "cell_type": "markdown", "id": "cd3f1231-bd42-41b5-a017-974b8c660a44", "metadata": {}, "source": [ "- BPE 的主要思想是将文本转换为整数表示(token ID)用于大规模语言模型(LLM)的训练(详见[第二章](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb))。\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "760c625d-26a1-4896-98a2-0fdcd1591256", "metadata": {}, "source": [ " \n", "## 1.1 Bits 和 bytes" ] }, { "cell_type": "markdown", "id": "d4ddaa35-0ed7-4012-827e-911de11c266c", "metadata": {}, "source": [ "- 在介绍 BPE 算法之前,我们先来了解一下字节的概念。\n", "- 我们将文本转换为字节数组(毕竟 BPE 代表的是“字节”对编码):" ] }, { "cell_type": "code", "execution_count": 1, "id": "8c9bc9e4-120f-4bac-8fa6-6523c568d12e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bytearray(b'This is some text')\n" ] } ], "source": [ "text = \"This is some text\"\n", "byte_ary = bytearray(text, \"utf-8\")\n", "print(byte_ary)" ] }, { "cell_type": "markdown", "id": "dbd92a2a-9d74-4dc7-bb53-ac33d6cf2fab", "metadata": {}, "source": [ "- 当我们对 `bytearray` 对象调用 `list()` 时,每个字节会被视为一个独立的元素,结果是一个包含字节值对应整数的列表:" ] }, { "cell_type": "code", "execution_count": 2, "id": "6c586945-d459-4f9a-855d-bf73438ef0e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[84, 104, 105, 115, 32, 105, 115, 32, 115, 111, 109, 101, 32, 116, 101, 120, 116]\n" ] } ], "source": [ "ids = list(byte_ary)\n", "print(ids)" ] }, { "cell_type": "markdown", "id": "71efea37-f4c3-4cb8-bfa5-9299175faf9a", "metadata": {}, "source": [ "- 这种方式是将文本转换为 LLM 嵌入层中所需的token ID的一种有效方法。\n", "- 然而,这种方法的缺点是,它为每个字符分配一个 ID(对于短文本来说,这将产生大量的 ID!)。\n", "- 也就是说,对于一个包含 17 个字符的输入文本,我们需要使用 17 个token ID 作为 LLM 的输入:" ] }, { "cell_type": "code", "execution_count": 3, "id": "0d5b61d9-79a0-48b4-9b3e-64ab595c5b01", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of characters: 17\n", "Number of token IDs: 17\n" ] } ], "source": [ "print(\"Number of characters:\", len(text))\n", "print(\"Number of token IDs:\", len(ids))" ] }, { "cell_type": "markdown", "id": "68cc833a-c0d4-4d46-9180-c0042fd6addc", "metadata": {}, "source": [ "- 如果你之前使用过 LLM,你可能知道 BPE 分词器有一个词汇表,其中我们按照整个词或子词分配ID,按照单词。\n", "- 例如,GPT-2 分词器将相同的文本(\"This is some text\")分词为 4 个令牌,而不是 17 个:`1212, 318, 617, 2420`\n", "- 你可以使用互动式的 [tiktoken 应用](https://tiktokenizer.vercel.app/?model=gpt2) 或 [tiktoken 库](https://github.com/openai/tiktoken)进行验证:\n", "\n", "\n", "\n", "\n", "```python\n", "import tiktoken\n", "\n", "gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "gpt2_tokenizer.encode(\"This is some text\")\n", "# prints [1212, 318, 617, 2420]\n", "```" ] }, { "cell_type": "markdown", "id": "425b99de-cbfc-441c-8b3e-296a5dd7bb27", "metadata": {}, "source": [ "- 由于一个字节由 8 位组成,因此单个字节可以表示 28 = 256 个可能的值,范围从 0 到 255。\n", "- 你可以通过执行代码 `bytearray(range(0, 257))` 来确认这一点,代码会提醒你 `ValueError: byte must be in range(0, 256)`。\n", "- 一个 BPE 分词器通常将这 256 个值作为其前 256 个单字符令牌;你可以通过运行以下代码来直观地查看这一点:\n", "\n", "\n", "```python\n", "import tiktoken\n", "gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "\n", "for i in range(300):\n", " decoded = gpt2_tokenizer.decode([i])\n", " print(f\"{i}: {decoded}\")\n", "\"\"\"\n", "prints:\n", "0: !\n", "1: \"\n", "2: #\n", "...\n", "255: � # <---- single character tokens up to here\n", "256: t\n", "257: a\n", "...\n", "298: ent\n", "299: n\n", "\"\"\"\n", "```" ] }, { "cell_type": "markdown", "id": "97ff0207-7f8e-44fa-9381-2a4bd83daab3", "metadata": {}, "source": [ "- 上述代码中,注意到条目 256 和 257 不是单字符值,而是双字符值(一个空格 + 一个字母),这是原始 GPT-2 BPE 分词器的一个小缺点(在 GPT-4 分词器中已有改进)。" ] }, { "cell_type": "markdown", "id": "8241c23a-d487-488d-bded-cdf054e24920", "metadata": {}, "source": [ " \n", "## 1.2 创建词汇表" ] }, { "cell_type": "markdown", "id": "d7c2ceb7-0b3f-4a62-8dcc-07810cd8886e", "metadata": {}, "source": [ "- BPE 分词算法的目标是构建一个包含常见子词的词汇表,例如 `298: ent`(这个子词可以在 *entangle, entertain, enter, entrance, entity,* 等词中找到),甚至是完整的单词。\n", "\n", "\n", "```\n", "318: is\n", "617: some\n", "1212: This\n", "2420: text\n", "```" ] }, { "cell_type": "markdown", "id": "8c0d4420-a4c7-4813-916a-06f4f46bc3f0", "metadata": {}, "source": [ "- BPE 算法最早在 1994 年由 Philip Gage 提出,详见:“[一种新的数据压缩算法](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)”。\n", "- 在我们进入实际代码实现之前,今天用于 LLM 分词器的形式可以总结如下:" ] }, { "cell_type": "markdown", "id": "ebc71db9-b070-48c4-8412-81f45b308ab3", "metadata": {}, "source": [ " \n", "## 1.3 BPE算法大览\n", "\n", "**1. 识别出现次数最多的字节对**\n", "- 在每次迭代中,扫描文本,找到最常出现的字节对(或字符对)。\n", "\n", "**2. 替换并记录**\n", "\n", "- 用一个新的占位符 ID 替换该字节对(该 ID 尚未被使用,例如,如果我们从 0...255 开始,第一个占位符将是 256)。\n", "- 将该映射记录在查找表中。\n", "- 查找表的大小是一个超参数,称为“词汇表大小”(对于 GPT-2,这个值为 50,257)。\n", "\n", "**3. 重复直到没有进一步优化**\n", "\n", "- 持续执行步骤 1 和 2,反复合并最频繁的字节对。\n", "- 当无法再进行压缩时停止(例如,没有字节对出现超过一次)。\n", "\n", "**解压缩(解码)**\n", "\n", "- 为恢复原始文本,通过查找表将每个 ID 替换为其对应的字节对,逆向执行此过程。\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "e9f5ac9a-3528-4186-9468-8420c7b2ac00", "metadata": {}, "source": [ " \n", "## 1.4 举个例子\n", "### 1.4.1 Concrete example of the encoding part (steps 1 & 2)\n", "\n", "- 假设训练数据集`the cat in the hat`,我们希望为 BPE 分词器构建词汇表。\n", "\n", "**第一次迭代**\n", "\n", "1. 识别最多出现字节对\n", " - 在这段文本中,\"th\" 出现了两次(分别位于开头和第二个 \"e\" 前)。\n", "\n", "2. 替换并记录\n", " - 将 \"th\" 替换为一个尚未使用的新的令牌 ID,例如 256。\n", " - 新的文本变为:`<256>e cat in <256>e hat`。\n", " - 新的词汇表如下:\n", "\n", "\n", "```\n", " 0: ...\n", " ...\n", " 256: \"th\"\n", "```\n", "**第二次迭代**\n", "\n", "1. **识别最多出现字节对** \n", " - 在文本 `<256>e cat in <256>e hat` 中,字节对 `<256>e` 出现了两次。\n", "\n", "2. **替换并记录** \n", " - 将 `<256>e` 替换为一个尚未使用的新的token ID,例如 257。 \n", " - 更新后的文本为:`<257> cat in <257> hat`。\n", "\n", " ```\n", " <257> cat in <257> hat\n", " ```\n", " - 更新后的将会是\n", " ```\n", " 0: ...\n", " ...\n", " 256: \"th\"\n", " 257: \"<256>e\"\n", " ```\n", "**第三次迭代**\n", "\n", "1. **识别最多出现字节对** \n", " - 在文本 `<257> cat in <257> hat` 中,字节对 `<257> ` 出现了两次(一次在开头,一次在“hat”之前)。\n", "\n", "2. **替换并记录** \n", " - 将 `<257> ` 替换为一个尚未使用的新的token ID,例如 258。 \n", " - 更新后的文本为:`<258>cat in <258>hat`。\n", "\n", " ```\n", " <258>cat in <258>hat\n", " ```\n", " - The updated vocabulary is:\n", " ```\n", " 0: ...\n", " ...\n", " 256: \"th\"\n", " 257: \"<256>e\"\n", " 258: \"<257> \"\n", " ```\n", " \n", "- 如此循环\n", "\n", "### 1.4.2 解码部分的具体示例(步骤 3)\n", "\n", "- 为了恢复原始文本,我们通过逆向执行过程,将每个token ID 替换为它对应的字节对,按它们引入的逆序进行替换。\n", "- 从最终压缩后的文本开始:`<258>cat in <258>hat`\n", "- 替换 `<258>` → `<257> `:`<257> cat in <257> hat`\n", "- 替换 `<257>` → `<256>e`:`<256>e cat in <256>e hat`\n", "- 替换 `<256>` → \"th\":`the cat in the hat`\n" ] }, { "cell_type": "markdown", "id": "a2324948-ddd0-45d1-8ba8-e8eda9fc6677", "metadata": {}, "source": [ " \n", "## 2. BPE的实现" ] }, { "cell_type": "markdown", "id": "429ca709-40d7-4e3d-bf3e-4f5687a2e19b", "metadata": {}, "source": [ "- 以下是上述算法的具体代码,作为一个 Python 类,其模仿了 `tiktoken` Python 接口。\n", "- 注意,上面的编码部分描述了通过 `train()` 进行的原始训练步骤;然而,`encode()` 方法的工作原理相似:\n", "\n", "1. 将输入文本拆分为单个字节\n", "2. 反复查找并替换(合并)相邻的令牌(字节对),当它们与学习到的 BPE 合并中的任何对匹配时(从最高到最低的“排名”,即按它们被学习的顺序)\n", "3. 继续合并,直到不能再应用任何合并\n", "4. 最终的token ID 列表就是编码后的输出" ] }, { "cell_type": "code", "execution_count": 77, "id": "3e4a15ec-2667-4f56-b7c1-34e8071b621d", "metadata": {}, "outputs": [], "source": [ "from collections import Counter, deque\n", "from functools import lru_cache\n", "import json\n", "\n", "\n", "class BPETokenizerSimple:\n", " def __init__(self):\n", " # 映射 token_id 到 token_str(例如:{11246: \"some\"})\n", " self.vocab = {}\n", " # 映射 token_str 到 token_id(例如:{\"some\": 11246})\n", " self.inverse_vocab = {}\n", " # BPE 合并字典:{(token_id1, token_id2): merged_token_id}\n", " self.bpe_merges = {}\n", "\n", " def train(self, text, vocab_size, allowed_special={\"<|endoftext|>\"}):\n", " \"\"\"\n", " 从头开始训练 BPE 分词器。\n", "\n", " 参数:\n", " text (str): 训练文本。\n", " vocab_size (int): 目标词汇表大小。\n", " allowed_special (set): 要包含的特殊令牌集。\n", " \"\"\"\n", "\n", " # 预处理:将空格替换为 'Ġ'\n", " # 注意,Ġ 是 GPT-2 BPE 实现的一个特性\n", " # 例如,\"Hello world\" 可能被标记为 [\"Hello\", \"Ġworld\"]\n", " # (GPT-4 BPE 会将其标记为 [\"Hello\", \" world\"])\n", " processed_text = []\n", " for i, char in enumerate(text):\n", " if char == \" \" and i != 0:\n", " processed_text.append(\"Ġ\")\n", " if char != \" \":\n", " processed_text.append(char)\n", " processed_text = \"\".join(processed_text)\n", "\n", " # 初始化词汇表,包含唯一字符,包括 'Ġ'(如果存在)\n", " # 从前 256 个 ASCII 字符开始\n", " unique_chars = [chr(i) for i in range(256)]\n", "\n", " # 扩展 unique_chars,包含处理后的文本中未包含的字符\n", " unique_chars.extend(char for char in sorted(set(processed_text)) if char not in unique_chars)\n", "\n", " # 可选:确保 'Ġ' 包含在内(如果它对文本处理相关)\n", " if 'Ġ' not in unique_chars:\n", " unique_chars.append('Ġ')\n", "\n", " # 创建词汇表和逆词汇表\n", " self.vocab = {i: char for i, char in enumerate(unique_chars)}\n", " self.inverse_vocab = {char: i for i, char in self.vocab.items()}\n", "\n", " # 添加允许的特殊令牌\n", " if allowed_special:\n", " for token in allowed_special:\n", " if token not in self.inverse_vocab:\n", " new_id = len(self.vocab)\n", " self.vocab[new_id] = token\n", " self.inverse_vocab[token] = new_id\n", "\n", " # 将处理后的文本标记化为令牌 ID\n", " token_ids = [self.inverse_vocab[char] for char in processed_text]\n", "\n", " # BPE 步骤 1-3:反复查找并替换频繁的字节对\n", " for new_id in range(len(self.vocab), vocab_size):\n", " pair_id = self.find_freq_pair(token_ids, mode=\"most\")\n", " if pair_id is None: # 无更多字节对可合并,停止训练\n", " break\n", " token_ids = self.replace_pair(token_ids, pair_id, new_id)\n", " self.bpe_merges[pair_id] = new_id\n", "\n", " # 使用合并后的令牌构建词汇表\n", " for (p0, p1), new_id in self.bpe_merges.items():\n", " merged_token = self.vocab[p0] + self.vocab[p1]\n", " self.vocab[new_id] = merged_token\n", " self.inverse_vocab[merged_token] = new_id\n", "\n", " def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):\n", " \"\"\"\n", " 从 OpenAI 的 GPT-2 文件加载预训练的词汇表和 BPE 合并。\n", "\n", " 参数:\n", " vocab_path (str): 词汇文件路径(GPT-2 称其为 'encoder.json')。\n", " bpe_merges_path (str): BPE 合并文件路径(GPT-2 称其为 'vocab.bpe')。\n", " \"\"\"\n", " # 加载词汇表\n", " with open(vocab_path, \"r\", encoding=\"utf-8\") as file:\n", " loaded_vocab = json.load(file)\n", " # loaded_vocab 将 token_str 映射到 token_id\n", " self.vocab = {int(v): k for k, v in loaded_vocab.items()} # token_id: token_str\n", " self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()} # token_str: token_id\n", "\n", " # 加载 BPE 合并\n", " with open(bpe_merges_path, \"r\", encoding=\"utf-8\") as file:\n", " lines = file.readlines()\n", " # 如果有头部注释行,跳过它\n", " if lines and lines[0].startswith(\"#\"):\n", " lines = lines[1:]\n", "\n", " for rank, line in enumerate(lines):\n", " pair = tuple(line.strip().split())\n", " if len(pair) != 2:\n", " print(f\"第 {rank+1} 行有超过 2 个条目:{line.strip()}\")\n", " continue\n", " token1, token2 = pair\n", " if token1 in self.inverse_vocab and token2 in self.inverse_vocab:\n", " token_id1 = self.inverse_vocab[token1]\n", " token_id2 = self.inverse_vocab[token2]\n", " merged_token = token1 + token2\n", " if merged_token in self.inverse_vocab:\n", " merged_token_id = self.inverse_vocab[merged_token]\n", " self.bpe_merges[(token_id1, token_id2)] = merged_token_id\n", " else:\n", " print(f\"合并的令牌 '{merged_token}' 未在词汇表中找到,跳过。\")\n", " else:\n", " print(f\"跳过配对 {pair},因为其中一个令牌不在词汇表中。\")\n", "\n", " def encode(self, text):\n", " \"\"\"\n", " 将输入文本编码为令牌 ID 列表。\n", "\n", " 参数:\n", " text (str): 要编码的文本。\n", "\n", " 返回:\n", " List[int]: 令牌 ID 列表。\n", " \"\"\"\n", " tokens = []\n", " # 将文本拆分为令牌,确保换行符 intact\n", " words = text.replace(\"\\n\", \" \\n \").split() # 确保 '\\n' 被视为独立的令牌\n", "\n", " for i, word in enumerate(words):\n", " if i > 0 and not word.startswith(\"\\n\"):\n", " tokens.append(\"Ġ\" + word) # 如果单词前有空格或换行符,添加 'Ġ'\n", " else:\n", " tokens.append(word) # 处理第一个单词或独立的 '\\n'\n", "\n", " token_ids = []\n", " for token in tokens:\n", " if token in self.inverse_vocab:\n", " # token 已经存在于词汇表中\n", " token_id = self.inverse_vocab[token]\n", " token_ids.append(token_id)\n", " else:\n", " # 尝试通过 BPE 处理子词分词\n", " sub_token_ids = self.tokenize_with_bpe(token)\n", " token_ids.extend(sub_token_ids)\n", "\n", " return token_ids\n", "\n", " def tokenize_with_bpe(self, token):\n", " \"\"\"\n", " 使用 BPE 合并对单个令牌进行分词。\n", "\n", " 参数:\n", " token (str): 要分词的令牌。\n", "\n", " 返回:\n", " List[int]: 应用 BPE 后的令牌 ID 列表。\n", " \"\"\"\n", " # 将令牌拆分为单个字符(作为初始令牌 ID)\n", " token_ids = [self.inverse_vocab.get(char, None) for char in token]\n", " if None in token_ids:\n", " missing_chars = [char for char, tid in zip(token, token_ids) if tid is None]\n", " raise ValueError(f\"未在词汇表中找到的字符:{missing_chars}\")\n", "\n", " can_merge = True\n", " while can_merge and len(token_ids) > 1:\n", " can_merge = False\n", " new_tokens = []\n", " i = 0\n", " while i < len(token_ids) - 1:\n", " pair = (token_ids[i], token_ids[i + 1])\n", " if pair in self.bpe_merges:\n", " merged_token_id = self.bpe_merges[pair]\n", " new_tokens.append(merged_token_id)\n", " i += 2 # 跳过下一个令牌,因为它已被合并\n", " can_merge = True\n", " else:\n", " new_tokens.append(token_ids[i])\n", " i += 1\n", " if i < len(token_ids):\n", " new_tokens.append(token_ids[i])\n", " token_ids = new_tokens\n", "\n", " return token_ids\n", "\n", " def decode(self, token_ids):\n", " \"\"\"\n", " 将令牌 ID 列表解码回字符串。\n", "\n", " 参数:\n", " token_ids (List[int]): 要解码的令牌 ID 列表。\n", "\n", " 返回:\n", " str: 解码后的字符串。\n", " \"\"\"\n", " decoded_string = \"\"\n", " for token_id in token_ids:\n", " if token_id not in self.vocab:\n", " raise ValueError(f\"未在词汇表中找到令牌 ID {token_id}。\")\n", " token = self.vocab[token_id]\n", " if token.startswith(\"Ġ\"):\n", " # 用空格替换 'Ġ'\n", " decoded_string += \" \" + token[1:]\n", " else:\n", " decoded_string += token\n", " return decoded_string\n", "\n", " def save_vocab_and_merges(self, vocab_path, bpe_merges_path):\n", " \"\"\"\n", " 将词汇表和 BPE 合并保存到 JSON 文件。\n", "\n", " 参数:\n", " vocab_path (str): 保存词汇表的路径。\n", " bpe_merges_path (str): 保存 BPE 合并的路径。\n", " \"\"\"\n", " # 保存词汇表\n", " with open(vocab_path, \"w\", encoding=\"utf-8\") as file:\n", " json.dump({k: v for k, v in self.vocab.items()}, file, ensure_ascii=False, indent=2)\n", "\n", " # 保存 BPE 合并作为字典列表\n", " with open(bpe_merges_path, \"w\", encoding=\"utf-8\") as file:\n", " merges_list = [{\"pair\": list(pair), \"new_id\": new_id}\n", " for pair, new_id in self.bpe_merges.items()]\n", " json.dump(merges_list, file, ensure_ascii=False, indent=2)\n", "\n", " def load_vocab_and_merges(self, vocab_path, bpe_merges_path):\n", " \"\"\"\n", " 从 JSON 文件加载词汇表和 BPE 合并。\n", "\n", " 参数:\n", " vocab_path (str): 词汇表文件路径。\n", " bpe_merges_path (str): BPE 合并文件路径。\n", " \"\"\"\n", " # 加载词汇表\n", " with open(vocab_path, \"r\", encoding=\"utf-8\") as file:\n", " loaded_vocab = json.load(file)\n", " self.vocab = {int(k): v for k, v in loaded_vocab.items()}\n", " self.inverse_vocab = {v: int(k) for k, v in loaded_vocab.items()}\n", "\n", " # 加载 BPE 合并\n", " with open(bpe_merges_path, \"r\", encoding=\"utf-8\") as file:\n", " merges_list = json.load(file)\n", " for merge in merges_list:\n", " pair = tuple(merge['pair'])\n", " new_id = merge['new_id']\n", " self.bpe_merges[pair] = new_id\n", "\n", " @lru_cache(maxsize=None)\n", " def get_special_token_id(self, token):\n", " return self.inverse_vocab.get(token, None)\n", "\n", " @staticmethod\n", " def find_freq_pair(token_ids, mode=\"most\"):\n", " pairs = Counter(zip(token_ids, token_ids[1:]))\n", "\n", " if mode == \"most\":\n", " return max(pairs.items(), key=lambda x: x[1])[0]\n", " elif mode == \"least\":\n", " return min(pairs.items(), key=lambda x: x[1])[0]\n", " else:\n", " raise ValueError(\"无效模式。选择 'most' 或 'least'。\")\n", "\n", " @staticmethod\n", " def replace_pair(token_ids, pair_id, new_id):\n", " dq = deque(token_ids)\n", " replaced = []\n", "\n", " while dq:\n", " current = dq.popleft()\n", " if dq and (current, dq[0]) == pair_id:\n", " replaced.append(new_id)\n", " dq.popleft()\n", " else:\n", " replaced.append(current)\n", "\n", " return replaced\n" ] }, { "cell_type": "markdown", "id": "46db7310-79c7-4ee0-b5fa-d760c6e1aa67", "metadata": {}, "source": [ "- 上面 `BPETokenizerSimple` 类的代码比较多,详细的内容超出了本笔记本的范围,但下一节提供了一个简短的使用概述,用来帮助您更好地理解该类的方法。\n" ] }, { "cell_type": "markdown", "id": "8ffe1836-eed4-40dc-860b-2d23074d067e", "metadata": {}, "source": [ "## 3. BPE演示" ] }, { "cell_type": "markdown", "id": "3c7c996c-fd34-484f-a877-13d977214cf7", "metadata": {}, "source": [ "- 实际上,我强烈推荐使用 [tiktoken](https://github.com/openai/tiktoken),因为我上面的实现更注重可读性和教育目的,无法保证性能。\n", "- 然而,使用方式与 tiktoken 基本相似,不同之处在于 tiktoken 没有训练方法。\n", "- 让我们通过以下几个示例来看看我上面的 `BPETokenizerSimple` Python 代码是如何工作的。\n" ] }, { "cell_type": "markdown", "id": "e82acaf6-7ed5-4d3b-81c0-ae4d3559d2c7", "metadata": {}, "source": [ "### 3.1 训练、编码与解码" ] }, { "cell_type": "code", "execution_count": 78, "id": "4d197cad-ed10-4a42-b01c-a763859781fb", "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "\n", "if not os.path.exists(\"the-verdict.txt\"):\n", " url = (\"https://raw.githubusercontent.com/rasbt/\"\n", " \"LLMs-from-scratch/main/ch02/01_main-chapter-code/\"\n", " \"the-verdict.txt\")\n", " file_path = \"the-verdict.txt\"\n", " urllib.request.urlretrieve(url, file_path)\n", "\n", "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n", " text = f.read()" ] }, { "cell_type": "markdown", "id": "04d1b6ac-71d3-4817-956a-9bc7e463a84a", "metadata": {}, "source": [ "- 接下来,我们将训练一个 BPE 分词器,词汇表大小为 1,000。\n", "- 词汇表大小默认已经为 255,因此我们实际上只“学习”了 745 个词汇条目。\n", "- 作为对比,GPT-2 的词汇表包含 50,257 个token,GPT-4 的词汇表包含 100,256 个令牌(在 tiktoken 中为 `cl100k_base`),GPT-4o 使用 199,997 个token(在 tiktoken 中为 `o200k_base`);与我们简单的示例文本相比,它们拥有更大的训练集。\n" ] }, { "cell_type": "code", "execution_count": 79, "id": "027348fd-d52f-4396-93dd-38eed142df9b", "metadata": {}, "outputs": [], "source": [ "tokenizer = BPETokenizerSimple()\n", "tokenizer.train(text, vocab_size=1000, allowed_special={\"<|endoftext|>\"})" ] }, { "cell_type": "markdown", "id": "2474ff05-5629-4f13-9e03-a47b1e713850", "metadata": {}, "source": [ "- 查看下vocab里的东西" ] }, { "cell_type": "code", "execution_count": 80, "id": "f705a283-355e-4460-b940-06bbc2ae4e61", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000\n" ] } ], "source": [ "# print(tokenizer.vocab)\n", "print(len(tokenizer.vocab))" ] }, { "cell_type": "markdown", "id": "36c9da0f-8a18-41cd-91ea-9ccc2bb5febb", "metadata": {}, "source": [ "- 合并了 742 次 (~ `1000 - len(range(0, 256))`)" ] }, { "cell_type": "code", "execution_count": 81, "id": "3da42d1c-f75c-4ba7-a6c5-4cb8543d4a44", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "742\n" ] } ], "source": [ "print(len(tokenizer.bpe_merges))" ] }, { "cell_type": "markdown", "id": "5dac69c9-8413-482a-8148-6b2afbf1fb89", "metadata": {}, "source": [ "- 这代表前256是单token单次" ] }, { "cell_type": "markdown", "id": "451a4108-7c8b-4b98-9c67-d622e9cdf250", "metadata": {}, "source": [ "- 接下来我们要`编码`点内容" ] }, { "cell_type": "code", "execution_count": 82, "id": "e1db5cce-e015-412b-ad56-060b8b638078", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n" ] } ], "source": [ "input_text = \"Jack embraced beauty through art and life.\"\n", "token_ids = tokenizer.encode(input_text)\n", "print(token_ids)" ] }, { "cell_type": "code", "execution_count": 83, "id": "1ed1b344-f7d4-4e9e-ac34-2a04b5c5b7a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of characters: 42\n", "Number of token IDs: 20\n" ] } ], "source": [ "print(\"Number of characters:\", len(input_text))\n", "print(\"Number of token IDs:\", len(token_ids))" ] }, { "cell_type": "markdown", "id": "50c1cfb9-402a-4e1e-9678-0b7547406248", "metadata": {}, "source": [ "- 从上述长度可以看出,一个 42 字符的句子被编码为 20 个令牌 ID,相比于基于字符字节的编码,这有效地将输入长度缩短了大约一半。" ] }, { "cell_type": "markdown", "id": "252693ee-e806-4dac-ab76-2c69086360f4", "metadata": {}, "source": [ "- 请注意,词汇表本身在 `decoder()` 方法中被使用,该方法允许我们将令牌 ID 映射回文本:" ] }, { "cell_type": "code", "execution_count": 84, "id": "da0e1faf-1933-43d9-b681-916c282a8f86", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46]\n" ] } ], "source": [ "print(token_ids)" ] }, { "cell_type": "code", "execution_count": 85, "id": "8b690e83-5d6b-409a-804e-321c287c24a4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jack embraced beauty through art and life.\n" ] } ], "source": [ "print(tokenizer.decode(token_ids))" ] }, { "cell_type": "markdown", "id": "adea5d09-e5ef-4721-994b-b9b25662fa0a", "metadata": {}, "source": [ "- 通过遍历每个令牌 ID,我们可以更好地理解令牌 ID 是如何通过词汇表解码的:" ] }, { "cell_type": "code", "execution_count": 86, "id": "2b9e6289-92cb-4d88-b3c8-e836d7c8095f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "424 -> Jack\n", "256 -> \n", "654 -> em\n", "531 -> br\n", "302 -> ac\n", "311 -> ed\n", "256 -> \n", "296 -> be\n", "97 -> a\n", "465 -> ut\n", "121 -> y\n", "595 -> through\n", "841 -> ar\n", "116 -> t\n", "287 -> a\n", "466 -> nd\n", "256 -> \n", "326 -> li\n", "972 -> fe\n", "46 -> .\n" ] } ], "source": [ "for token_id in token_ids:\n", " print(f\"{token_id} -> {tokenizer.decode([token_id])}\")" ] }, { "cell_type": "markdown", "id": "5ea41c6c-5538-4fd5-8b5f-195960853b71", "metadata": {}, "source": [ "- 如我们所见,大多数token ID 表示 2 字符的子词;这是因为训练数据文本非常简短,重复的单词并不多,而且我们使用了相对较小的词汇表大小。" ] }, { "cell_type": "markdown", "id": "600055a3-7ec8-4abf-b88a-c4186fb71463", "metadata": {}, "source": [ "- 总结来说,调用 `decode(encode())` 应该能够重现任意输入文本:" ] }, { "cell_type": "code", "execution_count": 87, "id": "c7056cb1-a9a3-4cf6-8364-29fb493ae240", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is some text.'" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer.decode(tokenizer.encode(\"This is some text.\"))" ] }, { "cell_type": "markdown", "id": "a63b42bb-55bc-4c9d-b859-457a28b76302", "metadata": {}, "source": [ "### 3.2 储存tokenizer" ] }, { "cell_type": "markdown", "id": "86210925-06dc-4e8c-87bd-821569cd7142", "metadata": {}, "source": [ "- 接下来,让我们看看如何保存训练好的分词器,以便稍后重新使用:" ] }, { "cell_type": "code", "execution_count": 88, "id": "955181cb-0910-4c6a-9c22-d8292a3ec1fc", "metadata": {}, "outputs": [], "source": [ "# Save trained tokenizer\n", "tokenizer.save_vocab_and_merges(vocab_path=\"vocab.json\", bpe_merges_path=\"bpe_merges.txt\")" ] }, { "cell_type": "code", "execution_count": 89, "id": "6e5ccfe7-ac67-42f3-b727-87886a8867f1", "metadata": {}, "outputs": [], "source": [ "# Load tokenizer\n", "tokenizer2 = BPETokenizerSimple()\n", "tokenizer2.load_vocab_and_merges(vocab_path=\"vocab.json\", bpe_merges_path=\"bpe_merges.txt\")" ] }, { "cell_type": "markdown", "id": "e7f9bcc2-3b27-4473-b75e-4f289d52a7cc", "metadata": {}, "source": [ "- 加载的分词器应该能够产生与之前相同的结果。" ] }, { "cell_type": "code", "execution_count": 90, "id": "00d9bf8f-756f-48bf-81b8-b890e2c2ef13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jack embraced beauty through art and life.\n" ] } ], "source": [ "print(tokenizer2.decode(token_ids))" ] }, { "cell_type": "markdown", "id": "b24d10b2-1ab8-44ee-b51a-14248e30d662", "metadata": {}, "source": [ " \n", "### 3.3 加载来自 OpenAI 的原始 GPT-2 BPE 分词器" ] }, { "cell_type": "markdown", "id": "df07e031-9495-4af1-929f-3f16cbde82a5", "metadata": {}, "source": [ "- 最后,让我们加载 OpenAI 的 GPT-2 分词器文件。" ] }, { "cell_type": "code", "execution_count": 91, "id": "b45b4366-2c2b-4309-9a14-febf3add8512", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vocab.bpe already exists\n", "encoder.json already exists\n" ] } ], "source": [ "import os\n", "import urllib.request\n", "\n", "def download_file_if_absent(url, filename):\n", " if not os.path.exists(filename):\n", " try:\n", " with urllib.request.urlopen(url) as response, open(filename, 'wb') as out_file:\n", " out_file.write(response.read())\n", " print(f\"Downloaded {filename}\")\n", " except Exception as e:\n", " print(f\"Failed to download {filename}. Error: {e}\")\n", " else:\n", " print(f\"{filename} already exists\")\n", "\n", "files_to_download = {\n", " \"https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe\": \"vocab.bpe\",\n", " \"https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json\": \"encoder.json\"\n", "}\n", "\n", "for url, filename in files_to_download.items():\n", " download_file_if_absent(url, filename)" ] }, { "cell_type": "markdown", "id": "3fe260a0-1d5f-4bbd-9934-5117052764d1", "metadata": {}, "source": [ "- 接下来,我们通过 `load_vocab_and_merges_from_openai` 方法加载文件:" ] }, { "cell_type": "code", "execution_count": 92, "id": "74306e6c-47d3-45a3-9e0f-93f7303ef601", "metadata": {}, "outputs": [], "source": [ "tokenizer_gpt2 = BPETokenizerSimple()\n", "tokenizer_gpt2.load_vocab_and_merges_from_openai(\n", " vocab_path=\"encoder.json\", bpe_merges_path=\"vocab.bpe\"\n", ")" ] }, { "cell_type": "markdown", "id": "e1d012ce-9e87-47d7-8a1b-b6d6294d76c0", "metadata": {}, "source": [ "- 词汇表大小应为 `50257`,我们可以通过下面的代码进行确认:" ] }, { "cell_type": "code", "execution_count": 93, "id": "2bb722b4-dbf5-4a0c-9120-efda3293f132", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50257" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(tokenizer_gpt2.vocab)" ] }, { "cell_type": "markdown", "id": "7ea44b45-f524-44b5-a53a-f6d7f483fc19", "metadata": {}, "source": [ "- 现在我们可以通过我们的 `BPETokenizerSimple` 对象使用 GPT-2 分词器:" ] }, { "cell_type": "code", "execution_count": 97, "id": "e4866de7-fb32-4dd6-a878-469ec734641c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1212, 318, 617, 2420]\n" ] } ], "source": [ "input_text = \"This is some text\"\n", "token_ids = tokenizer_gpt2.encode(input_text)\n", "print(token_ids)" ] }, { "cell_type": "code", "execution_count": 98, "id": "3da8d9b2-af55-4b09-95d7-fabd983e919e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is some text\n" ] } ], "source": [ "print(tokenizer_gpt2.decode(token_ids))" ] }, { "cell_type": "code", "execution_count": 99, "id": "460deb85-8de7-40c7-ba18-3c17831fa8ab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1212, 318, 617, 2420]" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tiktoken\n", "\n", "tik_tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "tik_tokenizer.encode(input_text)" ] }, { "cell_type": "markdown", "id": "b3b1e2dc-f69b-4533-87ef-549e6fb9b5a0", "metadata": {}, "source": [ "- 你可以通过互动式的 [tiktoken 应用](https://tiktokenizer.vercel.app/?model=gpt2) 或 [tiktoken 库](https://github.com/openai/tiktoken)来确认这是否生成了正确的token:\n", "\n", "\n", "\n", "\n", "```python\n", "import tiktoken\n", "\n", "gpt2_tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "gpt2_tokenizer.encode(\"This is some text\")\n", "# prints [1212, 318, 617, 2420]\n", "```\n" ] }, { "cell_type": "markdown", "id": "3558af04-483c-4f6b-88f5-a534f37316cd", "metadata": {}, "source": [ " \n", "# 4. 结论" ] }, { "cell_type": "markdown", "id": "410ed0e6-ad06-4bb3-bb39-6b8110c1caa4", "metadata": {}, "source": [ "- 这就是 BPE 的基本原理,包括用于创建新分词器的训练方法,或者从原始 OpenAI GPT-2 模型加载 GPT-2 分词器的词汇表和合并。\n", "- 希望这个简短的教程对教学目的有所帮助;如果你有任何问题,请随时在 [这里](https://github.com/rasbt/LLMs-from-scratch/discussions/categories/q-a) 开启一个新的讨论。\n", "- 关于与其他分词器实现的性能对比,请参见 [这个笔记本](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)。\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 5 }