{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Best Practices for the End-to-End LLM Training Pipeline\n", "\n", "With the rapid progress of AI, large language models (LLMs) have become a core driver of natural language processing. This document outlines end-to-end best practices for training LLMs with the modelscope ecosystem, covering the full workflow: data download, data preprocessing, model training, and model evaluation.\n", "\n", "Overview\n", "\n", "Using a Zhihu comment dataset as an example, this tutorial fine-tunes a model with LoRA so that the generated text has a less noticeable \"AI flavor\".\n", "\n", "This tutorial covers installing and using the following frameworks:\n", "1. modelscope: model and dataset downloads\n", "2. data-juicer: dataset processing\n", "3. ms-swift: model training and inference\n", "4. evalscope: model evaluation\n", "\n", "## Environment setup\n", "\n", "Install modelscope, data-juicer, ms-swift, and evalscope." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecutionIndicator": { "show": false }, "execution": { "iopub.execute_input": "2024-12-23T11:49:57.724413Z", "iopub.status.busy": "2024-12-23T11:49:57.723990Z", "iopub.status.idle": "2024-12-23T11:49:59.154300Z", "shell.execute_reply": "2024-12-23T11:49:59.153737Z", "shell.execute_reply.started": "2024-12-23T11:49:57.724383Z" }, "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found existing installation: tensorflow 2.18.0\n", "Uninstalling tensorflow-2.18.0:\n", " Successfully uninstalled tensorflow-2.18.0\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# %pip install modelscope[framework] # model hub library; preinstalled in the notebook\n", "%pip install ms-swift[llm] -U # training library\n", "%pip install evalscope -U # evaluation library\n", "%pip install py-data-juicer[sci] # data processing library\n", "%pip install datasets==3.0.1 pydantic==2.0 tf-keras\n", "%pip uninstall tensorflow -y # not needed; conflicts with the environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# !! Restart the notebook kernel !!\n", "------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset preparation\n", "\n", "Download the dataset with modelscope, do some initial processing, extract the required fields, and convert them into the format data-juicer expects." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-12-23T11:50:09.836788Z", "iopub.status.busy": "2024-12-23T11:50:09.836608Z", "iopub.status.idle": "2024-12-23T11:50:59.750489Z", "shell.execute_reply": "2024-12-23T11:50:59.749933Z", "shell.execute_reply.started": "2024-12-23T11:50:09.836767Z" }, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Dataset({\n", " features: ['INSTRUCTION', 'RESPONSE', 'SOURCE', 'METADATA'],\n", " num_rows: 1006218\n", "})\n", "{'INSTRUCTION': '怎么说服男朋友买烤箱?',\n", " 'METADATA': '{\"question_id\": 357137111.0, \"answer_id\": 914332816.0, \"url\": '\n", " '\"https://www.zhihu.com/question/357137111/answer/914332816\", '\n", " '\"upvotes\": \"赞同 15\", \"answer_creation_time\": '\n", " '\"2019-11-28T12:01:22.000Z\"}',\n", " 'RESPONSE': 'emmmmm,首先想说的是,我买厨房用品一般是不用「说服」的,只是在厨房堆的满满当当的情况下会象征性的问一下我老公,他就会回答我说:你看看你还有地方放吗。然后我会思考一下,如果是特别想买的,就不会问他了。自己决定就好。 '\n", " '比如,前几天我又买了两个盘子~~~~他还不知道。 可以给题主看看我有多少的锅具:自家炒菜用什么锅好?各有什么优缺点? '\n", " '说回烤箱的问题,买的时候处于热恋期,我告诉他我有一个买烤箱的计划。虽然他基本不吃点心,也不喜欢烘焙,但那个时期的他欣然同意并热情洋溢的给我选烤箱。可能是他有憧憬我会给他做什么好吃的吧。又因为我是一个不怎么吃甜食的湖南人,烤箱在我家烘焙的使用率很低。 '\n", " '但是!!你还是可以告诉他烤箱的作用是可以烤制各种肉类!!!我不相信有不喜欢吃肉的男生!!烤箱真的是可以烤一切的肉类,熟悉之后会觉得非常简单。 '\n", " '我很久以前用烤箱做的最多的就是烤羊排和烤鸡翅,我老公不怎么吃羊肉和鸡翅。这个烤箱因为厨房放不下,被放在了餐厅,也就闲置了下来…… '\n", " '要说的事是,烤箱真的能给你做出很多不一样的美食,尤其是来了客人,在你两个灶台忙不过来的时候,烤箱特别适合准备一个荤素搭配的豪华大菜。在烹饪其他需要爆炒的菜肴的空档去处理一下就可以了。 '\n", " '总结来说理由如下: 1、如果你家是你做饭多,那么为什么有这么多话说, 也不是他用,等着吃就好了。 '\n", " '2、工欲善其事,必先利其器。没有好的工具怎么能吃到更好的美食。 3、我要我喜欢,不要你喜欢。我还不能有个爱好吗?',\n", " 'SOURCE': 'Zhihu'}\n" ] } ], "source": [ "from modelscope import MsDataset\n", "from pprint import pprint\n", "\n", "ds = MsDataset.load('OmniData/Zhihu-KOL', cache_dir=\"data\", split='train')\n", "print(ds)\n", "pprint(ds[0])" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-12-23T11:54:05.942611Z", "iopub.status.busy": "2024-12-23T11:54:05.942272Z", "iopub.status.idle": "2024-12-23T11:54:10.671123Z", "shell.execute_reply": "2024-12-23T11:54:10.670576Z", "shell.execute_reply.started": "2024-12-23T11:54:05.942592Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done\n" ] } ], "source": [ 
"# Parse the METADATA column\n", "import json\n", "\n", "# Each METADATA entry is a JSON string\n", "metadata = [json.loads(x) for x in ds['METADATA']]\n", "\n", "# Convert the upvote label to an integer\n", "vote_list = []\n", "for item in metadata:\n", "    upvotes = None  # reset so the except branch never prints a stale value\n", "    try:\n", "        upvotes = item['upvotes'][3:]  # drop the 3-character prefix '赞同 '\n", "        if not upvotes:\n", "            votes = 0\n", "        elif '万' in upvotes:\n", "            # strip the trailing ' 万' (万 = 10,000) and scale\n", "            votes = int(float(upvotes[:-2]) * 10000)\n", "        else:\n", "            votes = int(upvotes)\n", "    except Exception:\n", "        print(upvotes)  # show the label that could not be parsed\n", "        votes = 0\n", "    vote_list.append(votes)\n", "print(\"Done\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-11-11T07:00:57.127113Z", "iopub.status.busy": "2024-11-11T07:00:57.126851Z", "iopub.status.idle": "2024-11-11T07:01:33.374894Z", "shell.execute_reply": "2024-11-11T07:01:33.374455Z", "shell.execute_reply.started": "2024-11-11T07:00:57.127073Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1006218\n" ] }, { "data": { "text/html": [ "
|   | query | response | upvotes |
|---|---|---|---|
| 0 | 怎么说服男朋友买烤箱? | emmmmm,首先想说的是,我买厨房用品一般是不用「说服」的,只是在厨房堆的满满当当的情况下... | 15 |
| 1 | 航天从业者是如何看待电视剧《你是我的荣耀》的? | 难得有个关于航天的剧,职场情节悬不悬浮,航天设定和细节走不走心?带着放大镜看了前18集,... | 4432 |
| 2 | 如何看待PayPal正式进入中国? | PayPal不仅是美国支付巨头,也是国际支付巨头,目前已开拓全球200多个市场,美国以外的市... | 127 |
| 3 | 中金公司交易员月薪八万五是如何做到的? | 1、首先,考虑到这位交易员的工作经验,月薪八万五的表述是不正确的:其实是一年的全部薪酬除以1... | 450 |
| 4 | 摇滚乐(金属)给你们带来了什么? | ㄟ( ▔, ▔ )ㄏ哪里带来了什么东西啊,除了找到热爱的东西,也失去了很多。听重型现场像疯子... | 5 |
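
The `upvotes` column in the table above is derived from labels such as `赞同 15` stored in each record's `METADATA` JSON string. Below is a minimal sketch of that conversion as a standalone function. The name `parse_upvotes` is hypothetical, and the `赞同 1.2 万` form is an assumption inferred from the parsing logic (the sample record only shows the plain `赞同 15` form). Unlike the loop in the notebook, this sketch rounds instead of truncating when scaling by 10,000.

```python
import json


def parse_upvotes(raw):
    """Convert an upvote label like '赞同 15' or '赞同 1.2 万' to an int.

    Hypothetical helper: returns 0 for empty or unparseable labels.
    """
    text = str(raw).strip()
    # Drop the leading '赞同' ("upvote") marker if present.
    if text.startswith("赞同"):
        text = text[len("赞同"):].strip()
    if not text:
        return 0
    try:
        # '万' means 10,000, so '1.2 万' is 12000 (assumed label format).
        if text.endswith("万"):
            return int(round(float(text[:-1].strip()) * 10000))
        return int(text)
    except ValueError:
        return 0


def upvotes_from_metadata(metadata_json):
    """Read the 'upvotes' field from one METADATA JSON string."""
    record = json.loads(metadata_json)
    return parse_upvotes(record.get("upvotes", ""))


# METADATA string shaped like the sample record printed earlier.
sample = '{"question_id": 357137111.0, "upvotes": "赞同 15"}'
print(upvotes_from_metadata(sample))
```

Keeping the parsing in a function makes it easy to check the edge cases (empty label, `万` suffix, malformed text) before running it over all rows.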