{ "cells": [ { "cell_type": "markdown", "id": "ba450fb1-8a26-4894-ab7a-5d7bfefe90ce", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Large Language Model From Scratch book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/LLMs-from-scratch\n", "
Chinese translation repo: https://github.com/GoatCsu/CN-LLMs-from-scratch.git\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "51c9672d-8d0c-470d-ac2d-1271f8ec3f14", "metadata": {}, "source": [ "# Chapter 7 Exercise Solutions" ] }, { "cell_type": "markdown", "id": "2625ddc4-9cce-42bd-947d-4e2203fdc55c", "metadata": {}, "source": [ "## Exercise 7.1: Changing Prompt Styles" ] }, { "cell_type": "markdown", "id": "6be25a95-2a33-433b-a698-2365b5fc9357", "metadata": {}, "source": [ "Suppose we have the following JSON entry:\n", "\n", "```json\n", "{\n", " \"instruction\": \"Identify the correct spelling of the following word.\",\n", " \"input\": \"Ocassion\",\n", " \"output\": \"The correct spelling is 'Occasion.'\"\n", "}\n", "```\n", "\n", "In the main chapter, we formatted it according to the Alpaca-style prompt template:\n", "\n", "```\n", "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n", "\n", "### Instruction:\n", "Identify the correct spelling of the following word.\n", "\n", "### Input:\n", "Ocassion\n", "\n", "### Response:\n", "The correct spelling is 'Occasion.'\n", "```\n", "\n", "In this exercise, we instead use the Phi-3 prompt template, which formats the data entry as follows:\n", "\n", "```\n", "<|user|>\n", "Identify the correct spelling of the following word.\n", "Ocassion\n", "<|assistant|>:\n", "The correct spelling is 'Occasion.'\n", "```\n", "\n", "Note that this prompt template is substantially shorter, which reduces the runtime and hardware requirements for finetuning the LLM and generating text, since the input prompts are shorter.\n", "To make this change, we update the `format_input` function as follows:" ] }, { "cell_type": "code", "execution_count": 1, "id": "f99baa1e-c24c-417f-89d0-13e6d061ea6a", "metadata": {}, "outputs": [], "source": [ "def format_input(entry):\n", " instruction_text = (\n", " f\"<|user|>\\n{entry['instruction']}\"\n", " )\n", "\n", " input_text = f\"\\n{entry['input']}\" if entry[\"input\"] else \"\"\n", "\n", " return instruction_text + input_text" ] }, { "cell_type": "markdown", "id": "e4ba538f-64b9-495d-847b-d9f1d324bc50", "metadata": {}, "source": [ "Let's make sure that it works as intended by applying it to two input samples, one with content in the `'input'` field and one without:" ] }, { "cell_type": "code", "execution_count": 2, "id": "877a57e2-535f-4363-b32a-a093edd951b8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<|user|>\n", "Identify the correct spelling of the following word.\n", "Ocassion\n", "\n", "<|user|>\n", "What is an antonym of 'complicated'?\n" ] } ], "source": [ "sample_data = [\n", " {'instruction': 
'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': \"The correct spelling is 'Occasion.'\"}, \n", " {'instruction': \"What is an antonym of 'complicated'?\", 'input': '', 'output': \"An antonym of 'complicated' is 'simple'.\"}\n", "]\n", "\n", "print(format_input(sample_data[0]))\n", "print()\n", "print(format_input(sample_data[1]))" ] }, { "cell_type": "markdown", "id": "fa2a6704-6c61-4a09-b8f5-ffc5a77d6aa3", "metadata": {}, "source": [ "Next, we update the `InstructionDataset` class to use the `<|assistant|>` prompt template for the response:" ] }, { "cell_type": "code", "execution_count": 3, "id": "17f1a42c-7cc0-4746-8a6d-3a4cb37e2ca1", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "from torch.utils.data import Dataset\n", "\n", "class InstructionDataset(Dataset):\n", " def __init__(self, data, tokenizer):\n", " self.data = data\n", "\n", " # Pre-tokenize texts\n", " self.encoded_texts = []\n", " for entry in data:\n", "\n", " ###################################################################\n", " # New: use the Phi-3 `format_input` template and adjust the response text\n", " instruction_plus_input = format_input(entry)\n", " response_text = f\"\\n<|assistant|>:\\n{entry['output']}\"\n", " ###################################################################\n", " full_text = instruction_plus_input + response_text\n", " self.encoded_texts.append(\n", " tokenizer.encode(full_text)\n", " )\n", "\n", " def __getitem__(self, index):\n", " return self.encoded_texts[index]\n", "\n", " def __len__(self):\n", " return len(self.data)\n", "\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")" ] }, { "cell_type": "markdown", "id": "e0650926-c39f-4442-8116-cb7494416f28", "metadata": {}, "source": [ "Lastly, we also need to update the way we extract the generated responses when we collect the test set responses:" ] }, { "cell_type": "markdown", "id": "a9253041-812f-4a5f-9ab1-d7e4cb1407fb", "metadata": {}, "source": [ "```python\n", "for i, entry in tqdm(enumerate(test_data), total=len(test_data)):\n", "\n", " input_text = format_input(entry)\n", "\n", " token_ids 
= generate(\n", " model=model,\n", " idx=text_to_token_ids(input_text, tokenizer).to(device),\n", " max_new_tokens=256,\n", " context_size=BASE_CONFIG[\"context_length\"],\n", " eos_id=50256\n", " )\n", " generated_text = token_ids_to_text(token_ids, tokenizer)\n", "\n", " # New: adjust the extraction keyword from \"### Response\" to \"<|assistant|>\"\n", " response_text = generated_text[len(input_text):].replace(\"<|assistant|>:\", \"\").strip()\n", "\n", " test_data[i][\"model_response\"] = response_text\n", "```" ] }, { "cell_type": "markdown", "id": "29cd557c-3838-45e4-a26a-baed4b11175a", "metadata": {}, "source": [ "For your convenience, the exercise solution is implemented in the [exercise_experiments.py](exercise_experiments.py) script, which you can run as follows:" ] }, { "cell_type": "markdown", "id": "dd8158e9-cc70-4e0f-88b0-73c3e1d8c030", "metadata": {}, "source": [ "```bash\n", "python exercise_experiments.py --exercise_solution phi3_prompt\n", "```\n", "\n", "Output:\n", "\n", "```\n", "matplotlib version: 3.7.1\n", "tiktoken version: 0.7.0\n", "torch version: 2.3.0+cu121\n", "tqdm version: 4.66.4\n", "tensorflow version: 2.15.0\n", "--------------------------------------------------\n", "Training set length: 935\n", "Validation set length: 55\n", "Test set length: 110\n", "--------------------------------------------------\n", "Device: cuda\n", "--------------------------------------------------\n", "...\n", "Loaded model: gpt2-medium (355M)\n", "--------------------------------------------------\n", "Initial losses\n", " Training loss: 3.71630220413208\n", " Validation loss: 3.6440994262695314\n", "Ep 1 (Step 000000): Train loss 2.633, Val loss 2.622\n", "...\n", "Ep 2 (Step 000230): Train loss 0.424, Val loss 0.928\n", "<|user|> Convert the active sentence to passive: 'The chef cooks the meal every day.' 
<|assistant|>: The meal is prepared every day by the chef....\n", "Training completed in 1.50 minutes.\n", "Plot saved as loss-plot-phi3-prompt.pdf\n", "--------------------------------------------------\n", "Generating responses\n", "100% 110/110 [00:11<00:00, 9.27it/s]\n", "Responses saved as instruction-data-with-response-phi3-prompt.json\n", "Model saved as gpt2-medium355M-sft-phi3-prompt.pth\n", "```\n", "\n", "For comparison, you can run the original chapter 7 finetuning code via `python exercise_experiments.py --exercise_solution baseline`.\n", "\n", "Note that on an Nvidia L4 GPU, the code above, which uses the Phi-3 prompt template, takes 1.5 minutes to run. In comparison, the Alpaca-style template takes 1.80 minutes to run. So, the Phi-3 template is approximately 17% faster since it results in shorter model inputs.\n", "\n", "Let's look at some of the responses to make sure they have been formatted correctly:\n", "\n", "```json\n", " {\n", " \"instruction\": \"Rewrite the sentence using a simile.\",\n", " \"input\": \"The car is very fast.\",\n", " \"output\": \"The car is as fast as lightning.\",\n", " \"model_response\": \"The car is as fast as a cheetah.\"\n", " },\n", " {\n", " \"instruction\": \"What type of cloud is typically associated with thunderstorms?\",\n", " \"input\": \"\",\n", " \"output\": \"The type of cloud typically associated with thunderstorms is cumulonimbus.\",\n", " \"model_response\": \"The type of cloud associated with thunderstorms is a cumulus cloud.\"\n", " },\n", " {\n", " \"instruction\": \"Name the author of 'Pride and Prejudice'.\",\n", " \"input\": \"\",\n", " \"output\": \"Jane Austen.\",\n", " \"model_response\": \"The author of 'Pride and Prejudice' is Jane Austen.\"\n", " },\n", "```\n", "\n", "We can evaluate the performance using the Ollama Llama 3 method, which, for your convenience, is implemented in the `ollama_evaluate.py` script that you can run as follows:\n", "\n", "```bash\n", "python ollama_evaluate.py --file_path instruction-data-with-response-phi3-prompt.json\n", "```\n", "\n", "Output:\n", "\n", "```\n", "Ollama running: True\n", "Scoring entries: 100%|████████████████████████| 110/110 [01:08<00:00, 1.60it/s]\n", "Number of scores: 110 of 110\n", "Average score: 48.87\n", "```\n", "\n", "The score is close to 50, which is in the same ballpark as the score we previously achieved with the Alpaca-style prompts.\n", "\n", "There is no inherent advantage or reason why the Phi prompt style should be better, but it can be more concise and efficient, except for the caveat mentioned in the tip below.\n", "\n", "**Tip: Considering special tokens**\n", "\n", "- Note that the Phi-3 prompt template contains special tokens such as `<|user|>` and `<|assistant|>`, which can be suboptimal for the GPT-2 tokenizer\n", "- While the GPT-2 tokenizer recognizes `<|endoftext|>` as a special token (encoded into token ID 50256), it is inefficient at handling other special tokens, such as the ones above\n", "- For instance, `<|user|>` is encoded into 5 individual token IDs (27, 91, 7220, 91, 29), which is very inefficient\n", "- We could add `<|user|>` as a new special token in tiktoken via the `allowed_special` argument, but please keep in mind that the GPT-2 vocabulary would not be able to handle it without additional modifications\n", "- If you are curious about how a tokenizer and LLM can be extended to handle special tokens, please see the [extend-tiktoken.ipynb](https://github.com/MLNLP-World/LLMs-from-scratch-CN/blob/main/ch05/09_extending-tokenizers) bonus materials (note that this is not required here but is just an interesting/bonus consideration for curious readers)\n", "- In addition, we can hypothesize that models whose vocabulary supports the special tokens of a prompt template may perform more efficiently and better overall" ] }, { "cell_type": "markdown", "id": "5fea8be3-30a1-4623-a6d7-b095c6c1092e", "metadata": {}, "source": [ " \n", "## Exercise 7.2: Instruction and Input Masking\n", "\n", "To mask out the instructions as shown in the figure below, we need to make some minor modifications to the `InstructionDataset` class and `custom_collate_fn`.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 4, "id": "4405196a-db81-470b-be39-167a059587b6", "metadata": {}, "outputs": [], "source": [ "# This `format_input` function is copied from the original chapter 7 code\n", "\n", "def format_input(entry):\n", " instruction_text = (\n", " f\"Below is an instruction that describes a task. 
\"\n", " f\"Write a response that appropriately completes the request.\"\n", " f\"\\n\\n### Instruction:\\n{entry['instruction']}\"\n", " )\n", "\n", " input_text = f\"\\n\\n### Input:\\n{entry['input']}\" if entry[\"input\"] else \"\"\n", "\n", " return instruction_text + input_text" ] }, { "cell_type": "markdown", "id": "83658c09-af8a-425a-b940-eb1f06e43c0b", "metadata": {}, "source": [ "We can modify the `InstructionDataset` class to collect the lengths of the instructions, which we will use later, when coding the collate function, to locate the instruction content positions in the targets, as follows:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e5e6188a-f182-4f26-b9e5-ccae3ecadae0", "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import Dataset\n", "\n", "\n", "class InstructionDataset(Dataset):\n", " def __init__(self, data, tokenizer):\n", " self.data = data\n", "\n", " ##########################################################################################\n", " # New: a list to store the instruction lengths\n", " self.instruction_lengths = []\n", " ##########################################################################################\n", " \n", " self.encoded_texts = []\n", " \n", " for entry in data:\n", " instruction_plus_input = format_input(entry)\n", " response_text = f\"\\n\\n### Response:\\n{entry['output']}\"\n", " full_text = instruction_plus_input + response_text\n", " \n", " self.encoded_texts.append(\n", " tokenizer.encode(full_text)\n", " )\n", "\n", " ##########################################################################################\n", " # New: collect instruction lengths\n", " instruction_length = len(tokenizer.encode(instruction_plus_input))\n", " self.instruction_lengths.append(instruction_length)\n", " ##########################################################################################\n", " \n", " def __getitem__(self, index):\n", " # New: return both the instruction length and the encoded text\n", " return self.instruction_lengths[index], self.encoded_texts[index]\n", "\n", " def __len__(self):\n", " return len(self.data)" ] }, { "cell_type": "code", "execution_count": 6, "id": 
"0163b7d1-acb8-456c-8efe-86307b58f4bb", "metadata": {}, "outputs": [], "source": [ "import tiktoken\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")" ] }, { "cell_type": "markdown", "id": "3a186394-4960-424d-bb6a-f58459dd5994", "metadata": {}, "source": [ "Next, we update the `custom_collate_fn`: due to the changes in the `InstructionDataset` class, each `batch` is now a tuple containing `(instruction_length, item)` instead of just `item`. In addition, we now mask the corresponding instruction tokens in the target ID list." ] }, { "cell_type": "code", "execution_count": 7, "id": "f815e6fc-8e54-4105-aecd-d4c6e890ff9d", "metadata": {}, "outputs": [], "source": [ "def custom_collate_fn(\n", " batch,\n", " pad_token_id=50256,\n", " ignore_index=-100,\n", " allowed_max_length=None,\n", " device=\"cpu\"\n", "):\n", " # Find the longest sequence in the batch\n", " batch_max_length = max(len(item)+1 for instruction_length, item in batch) # New: batch is now a tuple\n", "\n", " # Pad and prepare inputs and targets\n", " inputs_lst, targets_lst = [], []\n", "\n", " for instruction_length, item in batch: # New: batch is now a tuple\n", " new_item = item.copy()\n", " # Add an <|endoftext|> token\n", " new_item += [pad_token_id]\n", " # Pad sequences to batch_max_length\n", " padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))\n", " inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs\n", " targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets\n", "\n", " # Replace all but the first padding token in targets with ignore_index\n", " mask = targets == pad_token_id\n", " indices = torch.nonzero(mask).squeeze()\n", " if indices.numel() > 1:\n", " targets[indices[1:]] = ignore_index\n", "\n", " ##########################################################################################\n", " # New: mask all instruction tokens in the targets\n", " targets[:instruction_length-1] = -100\n", " ##########################################################################################\n", " \n", " # Optionally truncate to the maximum sequence length\n", " if allowed_max_length is not None:\n", " inputs = inputs[:allowed_max_length]\n", " targets = targets[:allowed_max_length]\n", " \n", " inputs_lst.append(inputs)\n", " targets_lst.append(targets)\n", "\n", " # Convert the lists of inputs and targets to tensors and transfer them to the target device\n", " 
inputs_tensor = torch.stack(inputs_lst).to(device)\n", " targets_tensor = torch.stack(targets_lst).to(device)\n", "\n", " return inputs_tensor, targets_tensor" ] }, { "cell_type": "markdown", "id": "0a4a4815-850e-42c4-b70d-67e8ce5ebd57", "metadata": {}, "source": [ "Let's try it out on the sample data shown below:" ] }, { "cell_type": "code", "execution_count": 8, "id": "8da8a5b1-a8e2-4389-b21c-25b67be6dd1c", "metadata": {}, "outputs": [], "source": [ "sample_data = [\n", " {'instruction': \"What is an antonym of 'complicated'?\", 'input': '', 'output': \"An antonym of 'complicated' is 'simple'.\"},\n", " {'instruction': 'Sort the following list in alphabetical order.', 'input': 'Zebra, Elephant, Crocodile', 'output': 'Crocodile, Elephant, Zebra'},\n", " {'instruction': 'Arrange the given numbers in descending order.', 'input': '5, 12, 8, 3, 15', 'output': '15, 12, 8, 5, 3.'}\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "id": "435b0816-0fc8-4650-a84a-eceffa4d85e4", "metadata": {}, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "\n", "train_dataset = InstructionDataset(sample_data, tokenizer)\n", "train_loader = DataLoader(\n", " train_dataset,\n", " batch_size=len(sample_data),\n", " collate_fn=custom_collate_fn,\n", " num_workers=0\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "106bbbd7-7286-4eb6-b343-43419332a80f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train loader:\n", "torch.Size([3, 64]) torch.Size([3, 64])\n" ] } ], "source": [ "print(\"Train loader:\")\n", "for inputs, targets in train_loader:\n", " print(inputs.shape, targets.shape)" ] }, { "cell_type": "code", "execution_count": 11, "id": "9bb3288b-84a9-4962-ae59-a7a29fd34bce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Inputs:\n", " tensor([21106, 318, 281, 12064, 326, 8477, 257, 4876, 13, 19430,\n", " 257, 2882, 326, 20431, 32543, 262, 2581, 13, 198, 198,\n", " 21017, 46486, 25, 198, 42758, 262, 
1708, 1351, 287, 24830,\n", " 605, 1502, 13, 198, 198, 21017, 23412, 25, 198, 57,\n", " 37052, 11, 42651, 11, 9325, 19815, 576, 198, 198, 21017,\n", " 18261, 25, 198, 34, 12204, 375, 576, 11, 42651, 11,\n", " 1168, 37052, 50256, 50256])\n", "\n", "\n", "Targets:\n", " tensor([ -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", " -100, -100, -100, -100, -100, -100, 198, 198, 21017, 18261,\n", " 25, 198, 34, 12204, 375, 576, 11, 42651, 11, 1168,\n", " 37052, 50256, -100, -100])\n" ] } ], "source": [ "print(\"Inputs:\\n\", inputs[1])\n", "print(\"\\n\\nTargets:\\n\", targets[1])" ] }, { "cell_type": "markdown", "id": "cc40347b-2ca7-44e1-862d-0fd0c92f0628", "metadata": {}, "source": [ "As we can see from the `targets` tensor, both the instruction and padding tokens are now masked using the -100 placeholder token.\n", "Let's decode the inputs to make sure that they look correct:" ] }, { "cell_type": "code", "execution_count": 12, "id": "76a9e6fa-3d75-4e39-b139-c3e05048f42b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Below is an instruction that describes a task. 
Write a response that appropriately completes the request.\n", "\n", "### Instruction:\n", "Sort the following list in alphabetical order.\n", "\n", "### Input:\n", "Zebra, Elephant, Crocodile\n", "\n", "### Response:\n", "Crocodile, Elephant, Zebra<|endoftext|><|endoftext|>\n" ] } ], "source": [ "print(tokenizer.decode(list(inputs[1])))" ] }, { "cell_type": "markdown", "id": "845ebd36-f63f-4b58-a76e-7767e4d2ccbd", "metadata": {}, "source": [ "Next, let's decode the non-masked target positions (those not set to -100):" ] }, { "cell_type": "code", "execution_count": 13, "id": "4d54a152-b778-455a-8941-e375e2a17e8f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "### Response:\n", "Crocodile, Elephant, Zebra<|endoftext|>\n" ] } ], "source": [ "non_masked_targets = targets[1][targets[1] != -100]\n", "\n", "print(tokenizer.decode(list(non_masked_targets)))" ] }, { "cell_type": "markdown", "id": "3912bbf5-e9e2-474b-9552-d522e7510aa6", "metadata": {}, "source": [ "As shown above, the non-masked target tokens exclude the `\"Instruction\"` and `\"Input\"` fields, as intended. Now, we can run the modified code to see how well the LLM performs when finetuned using this masking strategy.\n", "\n", "For your convenience, you can use the `exercise_experiments.py` code to run a comparison as follows:" ] }, { "cell_type": "markdown", "id": "56a76097-9114-479d-8803-443b0ff48581", "metadata": {}, "source": [ "```bash\n", "python exercise_experiments.py --exercise_solution mask_instructions\n", "```\n", "\n", "Output:\n", "\n", "```\n", "matplotlib version: 3.7.1\n", "tiktoken version: 0.7.0\n", "torch version: 2.3.0+cu121\n", "tqdm version: 4.66.4\n", "tensorflow version: 2.15.0\n", "--------------------------------------------------\n", "Training set length: 935\n", "Validation set length: 55\n", "Test set length: 110\n", "--------------------------------------------------\n", "Device: cuda\n", "--------------------------------------------------\n", "...\n", "Loaded model: gpt2-medium (355M)\n", "--------------------------------------------------\n", "Initial losses\n", " Training loss: 2.280539035797119\n", " Validation loss: 2.262560224533081\n", "Ep 
1 (Step 000000): Train loss 1.636, Val loss 1.620\n", "...\n", "Ep 2 (Step 000230): Train loss 0.143, Val loss 0.727\n", "...\n", "Training completed in 1.77 minutes.\n", "Plot saved as loss-plot-mask-instructions.pdf\n", "--------------------------------------------------\n", "Generating responses\n", "100% 110/110 [02:10<00:00, 1.19s/it]\n", "Responses saved as instruction-data-with-response-mask-instructions.json\n", "Model saved as gpt2-medium355M-sft-mask-instructions.pth\n", "```\n", "\n", "Next, let's evaluate the performance of the resulting LLM:\n", "\n", "```bash\n", "python ollama_evaluate.py --file_path instruction-data-with-response-mask-instructions.json\n", "```\n", "\n", "```\n", "Ollama running: True\n", "Scoring entries: 100%|██████████████████████████████████████████████████████████████████████████████████████| 110/110 [01:23<00:00, 1.31it/s]\n", "Number of scores: 110 of 110\n", "Average score: 47.73\n", "```\n", "\n", "As we can see based on the scores, instruction masking performs slightly worse, which is consistent with the observation in the \"Instruction Tuning With Loss Over Instructions\" paper (https://arxiv.org/abs/2405.14394)." ] }, { "cell_type": "markdown", "id": "94a0f758-29da-44ee-b7af-32473b3c086e", "metadata": {}, "source": [ " \n", "## Exercise 7.3: Finetuning on the Original Alpaca Dataset" ] }, { "cell_type": "markdown", "id": "68df7616-679f-4e53-954d-6e7cf2e2ef55", "metadata": {}, "source": [ "To finetune the model on the original Stanford Alpaca dataset ([https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca)), you just have to change the file URL from:\n", "\n", "```python\n", "url = \"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json\"\n", "```\n", "\n", "to\n", "\n", "```python\n", "url = \"https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json\"\n", "```\n", "\n", "Note that the dataset contains 52k entries (50x more than in chapter 7), and the entries are longer than the ones we worked with in chapter 7. Thus, it's highly recommended to run the training on a GPU.\n", "\n", "If you encounter out-of-memory errors, consider reducing the batch size from 8 to 4, 2, or 1. In addition to lowering the batch size, you may also want to consider lowering the `allowed_max_length` from 1024 to 512 or 256." ] }, { "cell_type": "markdown", "id": 
"d94c9621-2c3f-4551-b5b8-87cd96e38c9c", "metadata": {}, "source": [ "For your convenience, you can use the `exercise_experiments.py` code to finetune the model on the 52k Alpaca dataset with a batch size of 4 and an `allowed_max_length` of 512, by running it as follows:" ] }, { "cell_type": "markdown", "id": "40a76486-73e6-4415-94dc-bfe2aa36ea52", "metadata": {}, "source": [ "```bash\n", "python exercise_experiments.py --exercise_solution alpaca_52k\n", "```\n", "\n", "```\n", "matplotlib version: 3.7.1\n", "tiktoken version: 0.7.0\n", "torch version: 2.3.0+cu121\n", "tqdm version: 4.66.4\n", "tensorflow version: 2.15.0\n", "--------------------------------------------------\n", "Training set length: 44201\n", "Validation set length: 2601\n", "Test set length: 5200\n", "--------------------------------------------------\n", "Device: cuda\n", "--------------------------------------------------\n", "...\n", "Loaded model: gpt2-medium (355M)\n", "--------------------------------------------------\n", "Initial losses\n", " Training loss: 3.3681655883789063\n", " Validation loss: 3.4122894287109373\n", "Ep 1 (Step 000000): Train loss 2.477, Val loss 2.750\n", "...\n", "Ep 2 (Step 022095): Train loss 0.761, Val loss 1.557\n", "...\n", "Training completed in 196.38 minutes.\n", "Plot saved as loss-plot-alpaca52k.pdf\n", "--------------------------------------------------\n", "Generating responses\n", "100% 5200/5200 [2:56:33<00:00, 2.04s/it]\n", "Responses saved as instruction-data-with-response-alpaca52k.json\n", "Model saved as gpt2-medium355M-sft-alpaca52k.pth\n", "```" ] }, { "cell_type": "markdown", "id": "cc8a1bd1-15f4-4a35-87b5-369038db0aa7", "metadata": {}, "source": [ "Below are a few example responses:" ] }, { "cell_type": "markdown", "id": "be9ab66f-5819-4b01-9a03-c45aa3b7c5b8", "metadata": {}, "source": [ "```json\n", "[\n", " {\n", " \"instruction\": \"Edit the following sentence to increase readability: \\\"He made a huge effort and was so successful.\\\"\",\n", " \"input\": \"\",\n", " \"output\": \"He exerted a tremendous effort, and thus enjoyed great success.\",\n", " 
\"model_response\": \"He put in an immense effort and was rewarded with success.\"\n", " },\n", " {\n", " \"instruction\": \"Rewrite the following sentence to make it more concise: \\\"I was displeased with the result of the experiment that I conducted.\\\"\",\n", " \"input\": \"\",\n", " \"output\": \"I was unhappy with my experiment's outcome.\",\n", " \"model_response\": \"I was displeased with the results of the experiment.\"\n", " },\n", " {\n", " \"instruction\": \"How can we build a more efficient GPT model?\",\n", " \"input\": \"\",\n", " \"output\": \"We can build a more efficient GPT model by optimizing the architecture of the model, using smaller model sizes and training with fewer parameters. We can also leverage techniques such as knowledge distillation, transfer learning, dynamic sparsity and hybrid computing to further improve the efficiency of the model.\",\n", " \"model_response\": \"Building a more efficient GPT model requires careful planning and optimization. First, it is important to identify the target language and the context in which the model is used. Then, it is important to select the appropriate model architecture, such as backpropagation, hyperparameters, and hyperparameters. 
Finally, it is important to select the appropriate model weights and optimizers, such as backpropagation, hyperparameters, and hyperparameters.\"\n", " },\n", "]" ] }, { "cell_type": "markdown", "id": "840e2076-f2e6-44a2-86fd-f191f9674267", "metadata": {}, "source": [ "We can evaluate the model responses using [ollama_evaluate.py](ollama_evaluate.py):\n", "\n", "```bash\n", "python ollama_evaluate.py --file_path instruction-data-with-response-alpaca52k.json\n", "```\n", "\n", "```\n", "Scoring entries: 100%|████████████████████| 5200/5200 [1:07:52<00:00, 1.28it/s]\n", "Number of scores: 5188 of 5200\n", "Average score: 48.16\n", "```" ] }, { "cell_type": "markdown", "id": "d14b3c60-00a1-43a9-9fcd-592aaadf1ef4", "metadata": {}, "source": [ "The score is slightly lower than the score we obtained on the dataset we used in this chapter. However, note that the Alpaca test set contains more diverse and, in part, more challenging instructions than the dataset we used in this chapter." ] }, { "cell_type": "markdown", "id": "ca61fa6c-4e1d-4618-9e5e-d091f8303e30", "metadata": {}, "source": [ "## Exercise 7.4: Parameter-Efficient Finetuning with LoRA" ] }, { "cell_type": "markdown", "id": "01742cec-1f41-4415-8788-009d31b1ad38", "metadata": {}, "source": [ "To instruction-finetune the model using LoRA, use the relevant classes and functions from appendix E:\n", "\n", "```python\n", "from appendix_E import LoRALayer, LinearWithLoRA, replace_linear_with_lora\n", "```" ] }, { "cell_type": "markdown", "id": "871dca8f-3411-4735-b7b0-9d0e6e0599ac", "metadata": {}, "source": [ "Next, add the following code below the model loading code from section 7.5:\n", "\n", "\n", "```python\n", "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "print(f\"Total trainable parameters before: {total_params:,}\")\n", "\n", "for param in model.parameters():\n", " param.requires_grad = False\n", "\n", "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "print(f\"Total trainable parameters after: {total_params:,}\")\n", "replace_linear_with_lora(model, rank=16, alpha=16)\n", "\n", "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "print(f\"Total trainable LoRA parameters: {total_params:,}\")\n", "model.to(device)\n", "```" ] }, { "cell_type": "markdown", "id": 
"1b26b925-dc95-4b91-b050-9676dd9608a4", "metadata": {}, "source": [ "For your convenience, you can use the `exercise_experiments.py` code to finetune the model using LoRA with rank 16 and alpha 16, by running it as follows:" ] }, { "cell_type": "markdown", "id": "01f02c7e-3b15-44b8-bf41-7892cd755766", "metadata": {}, "source": [ "```bash\n", "python exercise_experiments.py --exercise_solution lora\n", "```\n", "\n", "Output:\n", "\n", "```\n", "matplotlib version: 3.7.1\n", "tiktoken version: 0.7.0\n", "torch version: 2.3.0+cu121\n", "tqdm version: 4.66.4\n", "tensorflow version: 2.15.0\n", "--------------------------------------------------\n", "Training set length: 935\n", "Validation set length: 55\n", "Test set length: 110\n", "--------------------------------------------------\n", "Device: cuda\n", "--------------------------------------------------\n", "File already exists and is up-to-date: gpt2/355M/checkpoint\n", "File already exists and is up-to-date: gpt2/355M/encoder.json\n", "File already exists and is up-to-date: gpt2/355M/hparams.json\n", "File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001\n", "File already exists and is up-to-date: gpt2/355M/model.ckpt.index\n", "File already exists and is up-to-date: gpt2/355M/model.ckpt.meta\n", "File already exists and is up-to-date: gpt2/355M/vocab.bpe\n", "Loaded model: gpt2-medium (355M)\n", "--------------------------------------------------\n", "Total trainable parameters before: 406,286,336\n", "Total trainable parameters after: 0\n", "Total trainable LoRA parameters: 7,898,384\n", "Initial losses\n", " Training loss: 3.7684114456176756\n", " Validation loss: 3.7619335651397705\n", "Ep 1 (Step 000000): Train loss 2.509, Val loss 2.519\n", "...\n", "Ep 2 (Step 000230): Train loss 0.308, Val loss 0.652\n", "...\n", "--------------------------------------------------\n", "Generating responses\n", "100% 110/110 [01:52<00:00, 1.03s/it]\n", "Responses saved as instruction-data-with-response-lora.json\n", "Model saved as gpt2-medium355M-sft-lora.pth\n", 
"```\n", "\n", "For comparison, you can run the original chapter 7 finetuning code via `python exercise_experiments.py --exercise_solution baseline`.\n", "\n", "Note that on an Nvidia L4 GPU, the code using LoRA takes 1.30 minutes to run, compared to 1.80 minutes for the baseline code. So, LoRA is approximately 28% faster.\n", "\n", "We can evaluate the performance using the Ollama Llama 3 method, which, for your convenience, is implemented in the `ollama_evaluate.py` script that you can run as follows:\n", "\n", "```bash\n", "python ollama_evaluate.py --file_path instruction-data-with-response-lora.json\n", "```\n", "\n", "Output:\n", "\n", "```\n", "Ollama running: True\n", "Scoring entries: 100%|████████████████████████| 110/110 [01:13<00:00, 1.50it/s]\n", "Number of scores: 110 of 110\n", "Average score: 50.23\n", "```\n", "\n", "The score is around 50, which is in the same ballpark as the score of the original model." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" } }, "nbformat": 4, "nbformat_minor": 5 }