{ "cells": [ { "cell_type": "markdown", "id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", "
Chinese translation repo: https://github.com/GoatCsu/CN-LLMs-from-scratch.git\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14", "metadata": {}, "source": [ "# Chapter 5 Exercises" ] }, { "cell_type": "code", "execution_count": 1, "id": "37aa4692-2357-4d88-b072-6d2d988d7f4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "numpy version: 1.26.4\n", "tiktoken version: 0.7.0\n", "torch version: 2.4.0\n", "tensorflow version: 2.16.1\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "pkgs = [\"numpy\", \n", " \"tiktoken\", \n", " \"torch\",\n", " \"tensorflow\" # For OpenAI's pretrained weights\n", " ]\n", "for p in pkgs:\n", " print(f\"{p} version: {version(p)}\")" ] }, { "cell_type": "markdown", "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e", "metadata": {}, "source": [ "# Exercise 5.1: Temperature-Scaled Softmax Scores and Sampling Probabilities" ] }, { "cell_type": "markdown", "id": "5860ba9f-2db3-4480-b96b-4be1c68981eb", "metadata": {}, "source": [ "- We can use the `print_sampled_tokens` function defined in this section to print how often the word \"pizza\" is sampled.\n", "- We start with the code defined in section 5.3.1.\n", "\n", "- The word is sampled 0x when the temperature is 1 or 0.1, and it is sampled 32x when the temperature is raised to 5. The estimated probability is 32/1000 * 100% = 3.2%.\n", "\n", "- The actual probability is 4.3% and is contained in the rescaled softmax probability tensor (`scaled_probas[2][6]`)." ] }, { "cell_type": "markdown", "id": "9cba59c2-a8a3-4af3-add4-70230795225e", "metadata": {}, "source": [ "- Below is a self-contained example using the code from chapter 5:" ] }, { "cell_type": "code", "execution_count": 2, "id": "42dda298-3014-4c36-8d63-97c210bcf4e8", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "vocab = { \n", " \"closer\": 0,\n", " \"every\": 1, \n", " \"effort\": 2, \n", " \"forward\": 3,\n", " \"inches\": 4,\n", " \"moves\": 5, \n", " \"pizza\": 6,\n", " \"toward\": 7,\n", " \"you\": 8,\n", "} \n", "inverse_vocab = {v: k for k, v in vocab.items()}\n", "\n", "next_token_logits = torch.tensor(\n", " [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]\n", ")\n", "\n", "def print_sampled_tokens(probas):\n", " torch.manual_seed(123)\n", " sample = [torch.multinomial(probas, 
num_samples=1).item() for i in range(1_000)]\n", " sampled_ids = torch.bincount(torch.tensor(sample))\n", " for i, freq in enumerate(sampled_ids):\n", " print(f\"{freq} x {inverse_vocab[i]}\")\n", "\n", "\n", "def softmax_with_temperature(logits, temperature):\n", " scaled_logits = logits / temperature\n", " return torch.softmax(scaled_logits, dim=0)\n", "\n", "\n", "temperatures = [1, 0.1, 5] # Original, lower, and higher temperature\n", "scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]" ] }, { "cell_type": "markdown", "id": "1ee0f9f3-4132-42c7-8324-252fd8f59145", "metadata": {}, "source": [ "- Now, we can iterate over `scaled_probas` and print the sampling frequencies in each case:" ] }, { "cell_type": "code", "execution_count": 3, "id": "b5605236-e300-4844-aea7-509d868efbdd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Temperature: 1\n", "73 x closer\n", "0 x every\n", "0 x effort\n", "582 x forward\n", "2 x inches\n", "0 x moves\n", "0 x pizza\n", "343 x toward\n", "\n", "\n", "Temperature: 0.1\n", "0 x closer\n", "0 x every\n", "0 x effort\n", "985 x forward\n", "0 x inches\n", "0 x moves\n", "0 x pizza\n", "15 x toward\n", "\n", "\n", "Temperature: 5\n", "165 x closer\n", "75 x every\n", "42 x effort\n", "239 x forward\n", "71 x inches\n", "46 x moves\n", "32 x pizza\n", "227 x toward\n", "103 x you\n" ] } ], "source": [ "for i, probas in enumerate(scaled_probas):\n", " print(\"\\n\\nTemperature:\", temperatures[i])\n", " print_sampled_tokens(probas)" ] }, { "cell_type": "markdown", "id": "fbf88c97-19c4-462c-924a-411c8c765d2c", "metadata": {}, "source": [ "- Note that sampling only approximates the actual probability of the word \"pizza\" being sampled.\n", "- E.g., if it is sampled 32/1000 times, the estimated probability is 3.2%.\n", "- To obtain the actual probability, we can look it up directly via the corresponding entry in `scaled_probas`.\n", "\n", "- Since \"pizza\" is the 7th entry in the vocabulary, we can retrieve it for the temperature of 5 as follows:" ] }, { "cell_type": "code", "execution_count": 4, "id": "1d4163c0-22ad-4f5b-8e20-b7420e9dbfc6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(0.0430)" ] }, 
"execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp5_idx = 2\n", "pizza_idx = 6\n", "\n", "scaled_probas[temp5_idx][pizza_idx]" ] }, { "cell_type": "markdown", "id": "d3dcb438-5f18-4332-9627-66009f30a1a4", "metadata": {}, "source": [ "If the temperature is set to 5, there is a 4.3% probability that the word \"pizza\" is sampled." ] }, { "cell_type": "markdown", "id": "b510ffb0-adca-4d64-8a12-38c4646fd736", "metadata": {}, "source": [ "# Exercise 5.2: Different Temperature and Top-k Settings" ] }, { "cell_type": "markdown", "id": "10263328", "metadata": {}, "source": [ "- Translator's note:\n", " - The temperature controls how diverse (random) the sampled outputs are.\n", " - Top-k restricts sampling to the k most probable tokens; it guarantees determinism only in the special case k = 1." ] }, { "cell_type": "markdown", "id": "884990db-d1a6-4c4e-8e36-2c1e4c1e67c7", "metadata": {}, "source": [ "- Both temperature and top-k have to be tuned for the specific LLM (a trial-and-error process until it generates desirable outputs).\n", "- The desirable outcomes are also application-specific, though:\n", " - Lower top-k and temperatures result in less random outputs, which is desirable for tasks such as creating educational content, technical writing, question answering, data analysis, and code generation.\n", " - Higher top-k and temperatures result in more diverse and random outputs, which is more desirable for tasks such as brainstorming and creative writing." ] }, { "cell_type": "markdown", "id": "3f35425d-529d-4179-a1c4-63cb8b25b156", "metadata": {}, "source": [ "# Exercise 5.3: Deterministic Behavior in the Decoding Functions" ] }, { "cell_type": "markdown", "id": "d12229a2-1d52-46ff-b1e8-198f2e58a7d2", "metadata": {}, "source": [ "There are multiple ways to force deterministic behavior with the `generate` function:\n", "\n", "1. Setting `top_k=None` and applying no temperature scaling;\n", "2. 
Setting `top_k=1`." ] }, { "cell_type": "markdown", "id": "391c5dc8-8dd7-4a0a-90bd-519b72f528c7", "metadata": {}, "source": [ "- Below is a self-contained example using the code from chapter 5:" ] }, { "cell_type": "code", "execution_count": 5, "id": "a61a4034-797a-4635-bf42-ddfff1b07125", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "import torch\n", "from previous_chapters import GPTModel\n", "\n", "\n", "GPT_CONFIG_124M = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 256, # Shortened context length (orig: 1024)\n", " \"emb_dim\": 768, # Embedding dimension\n", " \"n_heads\": 12, # Number of attention heads\n", " \"n_layers\": 12, # Number of layers\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}\n", "\n", "\n", "torch.manual_seed(123)\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "model = GPTModel(GPT_CONFIG_124M)\n", "model.load_state_dict(torch.load(\"model.pth\", weights_only=True))\n", "model.eval();" ] }, { "cell_type": "code", "execution_count": 6, "id": "ee95a272-b852-43b4-9827-ea7e1dbd5724", "metadata": {}, "outputs": [], "source": [ "from gpt_generate import generate, text_to_token_ids, token_ids_to_text\n", "from previous_chapters import generate_text_simple" ] }, { "cell_type": "code", "execution_count": 8, "id": "ebb22d06-393a-42d3-ab64-66646d33b39b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Output text:\n", " Every effort moves you know,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun\n" ] } ], "source": [ "# Deterministic function that uses torch.argmax\n", "\n", "start_context = \"Every effort moves you\"\n", "\n", "token_ids = generate_text_simple(\n", " model=model,\n", " idx=text_to_token_ids(start_context, tokenizer),\n", " max_new_tokens=25,\n", " context_size=GPT_CONFIG_124M[\"context_length\"]\n", ")\n", "\n", "print(\"Output text:\\n\", token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "c85b1f11-37a5-477d-9c2d-170a6865e669", "metadata": {}, "source": [ "- Re-executing the previous code cell will produce the exact same generated text." ] 
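}, { "cell_type": "markdown", "id": "det-topk1-note", "metadata": {}, "source": [ "- A minimal sketch (an illustrative addition, not code from the book) of why `top_k=1` is also deterministic: masking everything except the single largest logit before the softmax puts all probability mass on one token, so `torch.multinomial` always returns the same token as greedy `torch.argmax` decoding:" ] }, { "cell_type": "code", "execution_count": null, "id": "det-topk1-demo", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "logits = torch.tensor([4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79])\n", "\n", "# Keep only the single largest logit; set the rest to -inf\n", "top_logit, _ = torch.topk(logits, k=1)\n", "masked = torch.where(logits < top_logit, torch.tensor(float(\"-inf\")), logits)\n", "probas = torch.softmax(masked, dim=0)\n", "\n", "print(torch.argmax(logits).item())                      # greedy choice\n", "print(torch.multinomial(probas, num_samples=1).item())  # always the same token" ]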
}, { "cell_type": "code", "execution_count": 9, "id": "75469f24-47cc-458d-a200-fe64c648131d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Output text:\n", " Every effort moves you know,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun\n" ] } ], "source": [ "# Deterministic behavior: no top_k, no temperature scaling\n", "\n", "token_ids = generate(\n", " model=model,\n", " idx=text_to_token_ids(\"Every effort moves you\", tokenizer),\n", " max_new_tokens=25,\n", " context_size=GPT_CONFIG_124M[\"context_length\"],\n", " top_k=None,\n", " temperature=0.0\n", ")\n", "\n", "print(\"Output text:\\n\", token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "6d0480e5-fb4e-41f8-a161-7ac980d71d47", "metadata": {}, "source": [ "# Exercise 5.4: Continued Pretraining" ] }, { "cell_type": "markdown", "id": "f40044e8-a0f5-476c-99fd-489b999fd80a", "metadata": {}, "source": [ "- If we are still in the Python session in which we first trained the model in chapter 5, continuing the pretraining for one more epoch only requires loading the model and optimizer saved in the main chapter and calling the `train_model_simple` function again.\n", "\n", "- It takes a couple more steps to make the results reproducible in this new code environment.\n", "- First, we load the tokenizer, model, and optimizer:" ] }, { "cell_type": "code", "execution_count": 10, "id": "94eae6ba-d9fd-417a-8e31-fc39e9299870", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "import torch\n", "from previous_chapters import GPTModel\n", "\n", "\n", "GPT_CONFIG_124M = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 256, # Shortened context length (orig: 1024)\n", " \"emb_dim\": 768, # Embedding dimension\n", " \"n_heads\": 12, # Number of attention heads\n", " \"n_layers\": 12, # Number of layers\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}\n", "\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "\n", "checkpoint = torch.load(\"model_and_optimizer.pth\", weights_only=True)\n", "model = GPTModel(GPT_CONFIG_124M)\n", "model.load_state_dict(checkpoint[\"model_state_dict\"])\n", "model.to(device)\n", "\n", "optimizer = torch.optim.AdamW(model.parameters(), 
lr=0.0004, weight_decay=0.1)\n", "optimizer.load_state_dict(checkpoint[\"optimizer_state_dict\"])\n", "model.train();" ] }, { "cell_type": "markdown", "id": "688fce4a-9ab2-4d97-a95c-fef02c32b4f3", "metadata": {}, "source": [ "- Next, we initialize the data loaders:" ] }, { "cell_type": "code", "execution_count": 11, "id": "b5a78470-0652-4abd-875a-664e23c07c36", "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "from previous_chapters import create_dataloader_v1\n", "\n", "\n", "file_path = \"the-verdict.txt\"\n", "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n", "\n", "if not os.path.exists(file_path):\n", " with urllib.request.urlopen(url) as response:\n", " text_data = response.read().decode('utf-8')\n", " with open(file_path, \"w\", encoding=\"utf-8\") as file:\n", " file.write(text_data)\n", "else:\n", " with open(file_path, \"r\", encoding=\"utf-8\") as file:\n", " text_data = file.read()\n", "\n", "\n", "# Train/validation set ratio\n", "train_ratio = 0.90\n", "split_idx = int(train_ratio * len(text_data))\n", "train_data = text_data[:split_idx]\n", "val_data = text_data[split_idx:]\n", "\n", "\n", "torch.manual_seed(123)\n", "\n", "train_loader = create_dataloader_v1(\n", " train_data,\n", " batch_size=2,\n", " max_length=GPT_CONFIG_124M[\"context_length\"],\n", " stride=GPT_CONFIG_124M[\"context_length\"],\n", " drop_last=True,\n", " shuffle=True,\n", " num_workers=0\n", ")\n", "\n", "val_loader = create_dataloader_v1(\n", " val_data,\n", " batch_size=2,\n", " max_length=GPT_CONFIG_124M[\"context_length\"],\n", " stride=GPT_CONFIG_124M[\"context_length\"],\n", " drop_last=False,\n", " shuffle=False,\n", " num_workers=0\n", ")" ] }, { "cell_type": "markdown", "id": "76598ef8-165c-4bcc-af5e-b6fe72398365", "metadata": {}, "source": [ " Lastly, we use the `train_model_simple` function to train the model:" ] }, { "cell_type": "code", "execution_count": 12, "id": "ab4693dc-1359-47a7-8110-1e90f514a49e", "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "Ep 1 (Step 000000): Train loss 0.271, Val loss 6.545\n", "Ep 1 (Step 000005): Train loss 0.244, Val loss 6.614\n", "Every effort moves you?\" \"Yes--quite insensible to the irony. She wanted him vindicated--and by me!\" He laughed again, and threw back his head to look up at the sketch of the donkey. \"There were days when I\n" ] } ], "source": [ "from gpt_train import train_model_simple\n", "\n", "num_epochs = 1\n", "train_losses, val_losses, tokens_seen = train_model_simple(\n", " model, train_loader, val_loader, optimizer, device,\n", " num_epochs=num_epochs, eval_freq=5, eval_iter=5,\n", " start_context=\"Every effort moves you\", tokenizer=tokenizer\n", ")" ] }, { "cell_type": "markdown", "id": "3384e788-f5a1-407c-8dd1-87959b75026d", "metadata": {}, "source": [ "# Exercise 5.5: Training and Validation Set Losses of the Pretrained Model" ] }, { "cell_type": "markdown", "id": "7cb1140b-2027-4156-8d19-600ac849edbe", "metadata": {}, "source": [ "We can compute the training and validation set losses of the GPT model with the following code:\n", "\n", "```python\n", "train_loss = calc_loss_loader(train_loader, gpt, device)\n", "val_loss = calc_loss_loader(val_loader, gpt, device)\n", "```\n", "\n", "- The losses for the 124M parameter model are as follows:\n", "\n", "```\n", "Training loss: 3.754748503367106\n", "Validation loss: 3.559617757797241\n", "```\n", "- The main observation is that the training and validation set performances are in the same ballpark.\n", "- This can have multiple explanations:\n", "\n", "1. The Verdict was not part of the pretraining dataset when OpenAI trained GPT-2. Hence, the model is not explicitly overfitting the training set and performs similarly well on The Verdict's training and validation set portions. (The validation set loss being slightly lower than the training set loss is unusual in deep learning; since the dataset is relatively small, it is likely due to random noise. In practice, without overfitting, the training and validation set performances are expected to be roughly identical.)\n", "\n", "2. 
The Verdict was part of GPT-2's training dataset. In this case, we can't tell whether the model is overfitting the training data, because the validation set would have been used for training as well. To evaluate the degree of overfitting, we would need a new dataset generated after OpenAI finished training GPT-2, to make sure that it could not have been part of the pretraining data." ] }, { "cell_type": "markdown", "id": "66bb4316-a57c-437f-9a01-fe99b1678524", "metadata": {}, "source": [ "The code below is a reproducible, standalone example." ] }, { "cell_type": "code", "execution_count": 13, "id": "68d162d6-bbb9-4d6d-82ee-1c410694f872", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "import torch\n", "from previous_chapters import GPTModel\n", "\n", "\n", "GPT_CONFIG_124M = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 256, # Shortened context length (orig: 1024)\n", " \"emb_dim\": 768, # Embedding dimension\n", " \"n_heads\": 12, # Number of attention heads\n", " \"n_layers\": 12, # Number of layers\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}\n", "\n", "\n", "torch.manual_seed(123)\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "d8373461-7dad-47da-a489-3e23f0799b23", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File already exists and is up-to-date: gpt2/124M/checkpoint\n", "File already exists and is up-to-date: gpt2/124M/encoder.json\n", "File already exists and is up-to-date: gpt2/124M/hparams.json\n", "File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001\n", "File already exists and is up-to-date: gpt2/124M/model.ckpt.index\n", "File already exists and is up-to-date: gpt2/124M/model.ckpt.meta\n", "File already exists and is up-to-date: gpt2/124M/vocab.bpe\n" ] } ], "source": [ "from gpt_download import download_and_load_gpt2\n", "\n", "settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "cdd44873-d6c2-4471-a20f-f639b09fdcd3", "metadata": {}, "outputs": [], "source": [ "# Define model configurations in a dictionary for compactness\n", "model_configs = {\n", " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, 
\"n_heads\": 12},\n", " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", "}\n", "\n", "# Copy the base configuration and update it with the specific model settings\n", "model_name = \"gpt2-small (124M)\" # Example model name\n", "NEW_CONFIG = GPT_CONFIG_124M.copy()\n", "NEW_CONFIG.update(model_configs[model_name])\n", "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n", "\n", "gpt = GPTModel(NEW_CONFIG)\n", "gpt.eval();" ] }, { "cell_type": "code", "execution_count": 16, "id": "c7d562e4-33f6-4611-9b75-6ad1cb441d3b", "metadata": {}, "outputs": [], "source": [ "from gpt_generate import load_weights_into_gpt\n", "\n", "\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "load_weights_into_gpt(gpt, params)\n", "gpt.to(device);" ] }, { "cell_type": "code", "execution_count": 17, "id": "46eda9ea-ccb0-46ee-931b-3c07502b2544", "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "from previous_chapters import create_dataloader_v1\n", "\n", "\n", "file_path = \"the-verdict.txt\"\n", "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt\"\n", "\n", "if not os.path.exists(file_path):\n", " with urllib.request.urlopen(url) as response:\n", " text_data = response.read().decode('utf-8')\n", " with open(file_path, \"w\", encoding=\"utf-8\") as file:\n", " file.write(text_data)\n", "else:\n", " with open(file_path, \"r\", encoding=\"utf-8\") as file:\n", " text_data = file.read()\n", "\n", "\n", "# Train/validation set ratio\n", "train_ratio = 0.90\n", "split_idx = int(train_ratio * len(text_data))\n", "train_data = text_data[:split_idx]\n", "val_data = text_data[split_idx:]\n", "\n", "\n", "torch.manual_seed(123)\n", "\n", "train_loader = create_dataloader_v1(\n", " train_data,\n", " batch_size=2,\n", " 
max_length=GPT_CONFIG_124M[\"context_length\"],\n", " stride=GPT_CONFIG_124M[\"context_length\"],\n", " drop_last=True,\n", " shuffle=True,\n", " num_workers=0\n", ")\n", "\n", "val_loader = create_dataloader_v1(\n", " val_data,\n", " batch_size=2,\n", " max_length=GPT_CONFIG_124M[\"context_length\"],\n", " stride=GPT_CONFIG_124M[\"context_length\"],\n", " drop_last=False,\n", " shuffle=False,\n", " num_workers=0\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "id": "4e3574a2-687d-47a2-a2f6-457fe9d595f1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training loss: 3.7547486888037787\n", "Validation loss: 3.5596182346343994\n" ] } ], "source": [ "from gpt_train import calc_loss_loader\n", "\n", "torch.manual_seed(123) # For reproducibility\n", "train_loss = calc_loss_loader(train_loader, gpt, device)\n", "val_loss = calc_loss_loader(val_loader, gpt, device)\n", "\n", "print(\"Training loss:\", train_loss)\n", "print(\"Validation loss:\", val_loss)" ] }, { "cell_type": "markdown", "id": "96485d6b-bf1f-4bc0-a53f-73b08d85726e", "metadata": {}, "source": [ "We can also repeat this for the largest GPT-2 model, but don't forget to update the context length:" ] }, { "cell_type": "code", "execution_count": 19, "id": "1a79a4b6-fe8f-40c2-a018-e731dcf391b3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 43.5kiB/s]\n", "encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 2.75MiB/s]\n", "hparams.json: 100%|█████████████████████████| 91.0/91.0 [00:00<00:00, 60.2kiB/s]\n", "model.ckpt.data-00000-of-00001: 100%|█████| 6.23G/6.23G [06:02<00:00, 17.2MiB/s]\n", "model.ckpt.index: 100%|████████████████████| 20.7k/20.7k [00:00<00:00, 171kiB/s]\n", "model.ckpt.meta: 100%|████████████████████| 1.84M/1.84M [00:00<00:00, 4.27MiB/s]\n", "vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 1.73MiB/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Training 
loss: 3.3046312861972384\n", "Validation loss: 3.1195147037506104\n" ] } ], "source": [ "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n", "\n", "model_name = \"gpt2-xl (1558M)\"\n", "NEW_CONFIG = GPT_CONFIG_124M.copy()\n", "NEW_CONFIG.update(model_configs[model_name])\n", "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n", "\n", "gpt = GPTModel(NEW_CONFIG)\n", "gpt.eval()\n", "\n", "load_weights_into_gpt(gpt, params)\n", "gpt.to(device)\n", "\n", "torch.manual_seed(123)\n", "train_loss = calc_loss_loader(train_loader, gpt, device)\n", "val_loss = calc_loss_loader(val_loader, gpt, device)\n", "\n", "print(\"Training loss:\", train_loss)\n", "print(\"Validation loss:\", val_loss)" ] }, { "cell_type": "markdown", "id": "3a76a1e0-9635-480a-9391-3bda7aea402d", "metadata": {}, "source": [ "# Exercise 5.6: Trying Larger Models" ] }, { "cell_type": "markdown", "id": "b3d313f4-0038-4bc9-a340-84b3b55dc0e3", "metadata": {}, "source": [ "- In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters.\n", "- The reason was to keep the resource requirements as low as possible.\n", "- However, you can easily experiment with larger models with minimal code changes.\n", "- For example, to load the 1558M instead of the 124M model in chapter 5, the only two lines of code we have to change are:\n", "\n", "```python\n", "settings, params = download_and_load_gpt2(model_size=\"124M\", models_dir=\"gpt2\")\n", "model_name = \"gpt2-small (124M)\"\n", "```\n", "\n", "- The updated code is:\n", "\n", "\n", "```python\n", "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n", "model_name = \"gpt2-xl (1558M)\"\n", "```" ] }, { "cell_type": "code", "execution_count": 20, "id": "31e0972b-e85e-4904-a0f5-24c3eacd5fa2", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "import torch\n", "from previous_chapters import GPTModel\n", "\n", "\n", "GPT_CONFIG_124M = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 256, # Shortened context length (orig: 1024)\n", " \"emb_dim\": 768, # Embedding dimension\n", " \"n_heads\": 12, # Number of attention heads\n", " \"n_layers\": 12, # Number of layers\n", " \"drop_rate\": 0.1, # Dropout rate\n", " \"qkv_bias\": False # Query-Key-Value bias\n", "}\n", "\n", "\n", 
"tokenizer = tiktoken.get_encoding(\"gpt2\")" ] }, { "cell_type": "code", "execution_count": 21, "id": "b641ee88-f9d4-43ec-a787-e34199eed356", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File already exists and is up-to-date: gpt2/1558M/checkpoint\n", "File already exists and is up-to-date: gpt2/1558M/encoder.json\n", "File already exists and is up-to-date: gpt2/1558M/hparams.json\n", "File already exists and is up-to-date: gpt2/1558M/model.ckpt.data-00000-of-00001\n", "File already exists and is up-to-date: gpt2/1558M/model.ckpt.index\n", "File already exists and is up-to-date: gpt2/1558M/model.ckpt.meta\n", "File already exists and is up-to-date: gpt2/1558M/vocab.bpe\n" ] } ], "source": [ "from gpt_download import download_and_load_gpt2\n", "from gpt_generate import load_weights_into_gpt\n", "\n", "\n", "model_configs = {\n", " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", "}\n", "\n", "model_name = \"gpt2-xl (1558M)\"\n", "NEW_CONFIG = GPT_CONFIG_124M.copy()\n", "NEW_CONFIG.update(model_configs[model_name])\n", "NEW_CONFIG.update({\"context_length\": 1024, \"qkv_bias\": True})\n", "\n", "gpt = GPTModel(NEW_CONFIG)\n", "gpt.eval()\n", "\n", "settings, params = download_and_load_gpt2(model_size=\"1558M\", models_dir=\"gpt2\")\n", "load_weights_into_gpt(gpt, params)" ] }, { "cell_type": "code", "execution_count": 22, "id": "c98f56f4-98fc-43b4-9ee5-726e9d17c73f", "metadata": {}, "outputs": [], "source": [ "from gpt_generate import generate, text_to_token_ids, token_ids_to_text" ] }, { "cell_type": "code", "execution_count": 23, "id": "b1f7853c-6e81-4f1f-a1d0-61e2c7d33a20", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Output text:\n", " Every effort moves you toward finding an ideal life. You don't have to accept your current one at once, because if you do you'll never\n" ] } ], "source": [ "torch.manual_seed(123)\n", "\n", "token_ids = generate(\n", " model=gpt,\n", " idx=text_to_token_ids(\"Every effort moves you\", tokenizer),\n", " max_new_tokens=25,\n", " context_size=NEW_CONFIG[\"context_length\"],\n", " top_k=50,\n", " temperature=1.5\n", ")\n", "\n", "print(\"Output text:\\n\", token_ids_to_text(token_ids, tokenizer))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 5 }