{ "cells": [ { "cell_type": "markdown", "id": "118cdf43-537f-4bdf-9a3f-eeed05c49dea", "metadata": {}, "source": [ "# What Is a Large Language Model Actually Doing?" ] }, { "cell_type": "markdown", "id": "e45d8405-45e9-4571-9f94-d7e4ed814799", "metadata": {}, "source": [ "Before we build Agents, we need to understand their single most important component: the Large Language Model (LLM).\n", "Early models were little more than “parrots”. After successive leaps, from statistical probability to deep learning and on to prototypes of general-purpose intelligence, today's LLMs carry the first outlines of a world model.\n", "\n", "The main stages of this evolution:\n", "\n", "#### **1. Early Days: Statistics and Neural Networks (2013 - 2017)**\n", "\n", "* **Key technologies**: Word2Vec, RNN, LSTM.\n", "* **Characteristics**: Models of this era were mainly used for translation and sentiment analysis. They could process sequences, but their “memory” was poor: long texts were hard to follow, and complex logical reasoning was out of reach.\n", "* **Limitations**: Models were small and had to be trained separately for each specific task (e.g. game playing, translation).\n", "\n", "#### **2. The Turning Point: The Birth of the Transformer (2017 - 2019)**\n", "\n", "* **Milestone**: Google published the paper *Attention is All You Need*, introducing the **Transformer** architecture.\n", "* **Characteristics**: Its “self-attention” mechanism lets models process data in parallel and capture long-range semantic relationships.\n", "* **Representative models**: BERT (strong comprehension) and GPT-1/2 (the first signs of generation ability).\n", "\n", "#### **3. The Breakout: Large-Scale Pre-training and Emergent Abilities (2020 - 2022)**\n", "\n", "* **Milestone**: The release of GPT-3.\n", "* **Characteristics**: Once parameter counts reached the hundred-billion scale, **“emergent abilities”** appeared: models gained zero-shot and in-context learning, and no longer needed fine-tuning for every small task.\n", "* **Going mainstream**: The arrival of ChatGPT marked the point where LLMs truly acquired strong instruction-following and multi-turn dialogue abilities.\n", "\n", "#### **4. Today: From ChatBot to Agent (2023 - present)**\n", "\n", "* **Core shift**: The model is no longer just a “chat tool”; it serves as a **reasoning engine**.\n", "* **Characteristics**: With breakthroughs in reasoning (e.g. OpenAI o1, DeepSeek-R1) and multimodal (vision) capabilities, models can plan steps on their own, call tools, observe feedback, and correct their behavior.\n", "* **Trends**: Long-context support, low-latency streaming output, and deep integration with real-world interfaces." ] }, { "cell_type": "markdown", "id": "7f8cc54a-8c93-4ffe-9926-808931c8a421", "metadata": {}, "source": [ "The rest of this notebook approaches LLMs from a developer's point of view." ] }, { "cell_type": "markdown", "id": "c0b59f67-5319-4dc3-9bda-f3f147ec7532", "metadata": {}, "source": [ "## 1. What Is an LLM (Large Language Model)?" ] }, { "cell_type": "markdown", "id": "a3374e28-5743-4f72-bdcc-b6a4221598f1", "metadata": {}, "source": [ "Put simply, an LLM is a deep learning model that, by training on massive amounts of text, has learned to “predict the next token”.\n", "\n", "\n", "### 1. Understanding “LLM” Along Three Dimensions\n", "\n", "#### **“Large”**\n", "\n", "“Large” usually refers to two things:\n", "\n", "* **Parameter count**: the model contains billions or even trillions of tunable parameters (e.g. GPT-4, Llama 3).\n", "* **Data volume**: the training data covers books, code, papers, and conversations from across the internet, close to the whole of humanity's public knowledge.\n", "\n", "\n", "#### **“Language”**\n", "\n", "For an LLM, “language” is not only human conversation. It also includes:\n", "\n", "* **Program code** (an embodiment of logic)\n", "* **Mathematical formulas** (an embodiment of rigor)\n", "* **Structured data** (such as JSON and XML, the foundation of Agent tool calls)\n", "\n", "#### **“Model”**\n", "\n", "At its core, an LLM is an extremely complex probability model. When you give it an opening (a Prompt), it computes: **“based on everything I have read, which token is most plausible next?”**\n", "\n", "\n", "### 2. The Core Technology of LLMs: The Transformer Architecture\n", "\n", "Almost all mainstream LLMs today (GPT, Claude, Gemini, DeepSeek) are built on the **Transformer** architecture. Its central innovation is **self-attention**.\n", "\n", "* **Parallel processing**: Unlike earlier models that had to read word by word, a Transformer can “look at” the whole sentence at once.\n", "* **Contextual linking**: It can capture deep relationships between words in a long sentence. In “Apple released a new phone, and it is very popular”, the model can tell that “it” refers to the phone rather than to Apple." ] }, { "cell_type": "markdown", "id": "c251d0b2-ce18-4919-8f6e-aca7c4951e97", "metadata": {}, "source": [ "### 3. The Essence of a Large Language Model\n", "In essence, **an LLM is a function that, conditioned on the given context, predicts a probability distribution over the next token.**\n", "When you feed it a piece of text, for example:\n", "\n", "> “The weather in Beijing today is quite …”\n", "\n", "the model is not “understanding the weather”. Internally it performs one operation:\n", "\n", "```\n", "P(next token | all tokens so far)\n", "```\n", "\n", "It computes a probability for every token in the vocabulary, for example:\n", "\n", "| token | probability |\n", "| ----- | ------- |\n", "| cold | 0.31 |\n", "| nice | 0.27 |\n", "| hot | 0.18 |\n", "| not bad | 0.12 |\n", "| 🍎 | 0.00001 |\n", "\n", "Then, following the sampling strategy (temperature, top-p, etc.), it **picks one token**, appends it to the existing text, and goes on to predict the next one." ] }, { "cell_type": "code", "execution_count": 3, "id": "50c40c00-56aa-4847-afa8-ac20897bbfdb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", "
\n", "        (Source: 3Blue1Brown)\n", "
\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML, display\n", "\n", "display(HTML(\"\"\"\n", "
\n", " \n", "
\n", "        (Source: 3Blue1Brown)\n", "
\n", "
\n", "\"\"\"))\n" ] }, { "cell_type": "markdown", "id": "c5e5af53-8d44-4fb0-8330-a2458de942ec", "metadata": {}, "source": [ "### Hands-on: Observing Token Granularity" ] }, { "cell_type": "code", "execution_count": 4, "id": "535459b8-998f-45bb-89cc-41e3346cb842", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Model: gpt-4o ---\n", "Text | Chars | Tokens | Split result\n", "--------------------------------------------------------------------------------\n", "你好,今天天气不错。 | 10 | 7 | ['你好', ',', '今', '天天', '气', '不错', '。']\n", "深度学习卷积神经网络 | 10 | 8 | ['深', '度', '学习', '卷', '积', '神', '经', '网络']\n", "Using LangChain 部署 Agent | 24 | 6 | ['Using', ' Lang', 'Chain', ' 部', '署', ' Agent']\n", "龘龘靐齉爩 | 5 | 10 | ['�', '�', '�', '�', '�', '�', '�', '�', '�', '�']\n" ] } ], "source": [ "import tiktoken\n", "\n", "def analyze_tokens(model_name, text_list):\n", "    # Load the tokenizer that matches the model\n", "    enc = tiktoken.encoding_for_model(model_name)\n", "    \n", "    print(f\"--- Model: {model_name} ---\")\n", "    print(f\"{'Text':<30} | {'Chars':<8} | {'Tokens':<8} | {'Split result'}\")\n", "    print(\"-\" * 80)\n", "    \n", "    for text in text_list:\n", "        tokens = enc.encode(text)\n", "        # Decode each token id back to text to inspect the split\n", "        token_strings = [enc.decode([t]) for t in tokens]\n", "        \n", "        print(f\"{text:<30} | {len(text):<8} | {len(tokens):<8} | {token_strings}\")\n", "\n", "# Test samples\n", "samples = [\n", "    \"你好,今天天气不错。\",        # ordinary Chinese sentence\n", "    \"深度学习卷积神经网络\",        # technical terms\n", "    \"Using LangChain 部署 Agent\",  # mixed Chinese and English\n", "    \"龘龘靐齉爩\"                   # rare characters (extreme case)\n", "]\n", "\n", "analyze_tokens(\"gpt-4o\", samples)" ] }, { "cell_type": "markdown", "id": "d4629a87-b365-4e03-95f5-6bb60627b5d4", "metadata": {}, "source": [ "Running the code above, you will notice:\n", "\n", "* **Chinese sentences**: high-frequency words such as “你好” and “不错” merge into a single token, while most other characters map to one token each; punctuation marks also cost one token apiece.\n", "* **Technical terms**: how a term is split depends on how often it appeared in the training corpus. Here the frequent word “学习” stays whole, while the characters of “卷积神经” are split apart.\n", "* **Mixed Chinese and English**: English is split by spaces or into subwords. `Using` is a single token, but longer or rarer English words may be broken into several subword pieces.\n", "* **Rare characters**: characters like “龘” are absent from the vocabulary, so each falls back to multiple byte-level tokens (which is why the individually decoded pieces show up as “�”)." ] }, { "cell_type": "markdown", "id": "4d768b52-69d3-43bf-8df9-af274d2c4f00", "metadata": {}, "source": [ "## 2. What “Predicting the Next Token” Really Means\n", "\n", "### What is a token?\n", "\n", "First of all: a token is not a character, a word, or a sentence.\n", "\n", "It is the **smallest computational unit** in the model's vocabulary.\n", "\n", "For example, “人工智能” may be tokenized as 人 / 工 / 智能, and “unbelievable” may stay as a single token or be split into subwords such as un / believ / able, depending on the tokenizer.\n", "\n", "The model never “sees a sentence”; it only sees a **token sequence**.\n", "\n", "\n", "### What is predicted is not “content” but a probability\n", "\n", "What the model actually outputs is:\n", "\n", "```text\n", "{\n", "  token_A: 0.42,\n", "  token_B: 0.31,\n", "  token_C: 0.09,\n", "  ...\n", "}\n", "```\n", "\n", "The final “answer” is simply the result of **sampling** from this distribution.\n" ] }, { "cell_type": "markdown", "id": "51c8f864-d5e7-4e9b-8e3d-b93d4f80b08a", "metadata": {}, "source": [ "### Hands-on: Simulating a “Fake Model”" ] }, { "cell_type": "code", "execution_count": 6, "id": "d16d5ad9-6afd-445b-a5f4-c705802d38b7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'capital': 0.55, 'city': 0.15, 'center': 0.1, 'economy': 0.05, 'apple': 0.01}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "context = \"Beijing is China's\"\n", "\n", "fake_next_token_probs = {\n", "    \"capital\": 0.55,\n", "    \"city\": 0.15,\n", "    \"center\": 0.10,\n", "    \"economy\": 0.05,\n", "    \"apple\": 0.01\n", "}\n", "\n", "fake_next_token_probs" ] }, { "cell_type": "code", "execution_count": 7, "id": "934f56f8-176f-4b04-9662-0279a8a413c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['capital', 'capital', 'capital', 'capital', 'economy', 'city', 'economy', 'center', 'city', 'capital']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## A simple sampling experiment\n", "\n", "import random\n", "\n", "tokens = list(fake_next_token_probs.keys())\n", "probs = list(fake_next_token_probs.values())\n", "\n", "def sample_once():\n", "    # random.choices draws one token according to the given weights\n", "    return random.choices(tokens, probs)[0]\n", "\n", "[sample_once() for _ in range(10)]" ] }, { "cell_type": "markdown", "id": "339fdc3a-c603-4fc0-94e0-81f4fa54dbb6", "metadata": {}, "source": [ "This is why the same question gets a slightly different answer each time." ] }, { "cell_type": "markdown", "id": "e865952d-a20f-427a-9d91-d1706aecbe2d", "metadata": {}, "source": [ "## 3. Why a “Predictor” Can Produce Behavior That Looks Intelligent\n", "\n", "**The key reason: language itself is a highly compressed record of human knowledge, logic, and behavior patterns.** What an LLM learns is **what humans tend to say in a given situation**.\n",
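"\n", "The claim above can be made concrete with a tiny sketch. Below is a hypothetical “pattern predictor”: a bigram counter built from a three-line toy corpus (every sequence and count here is invented for illustration). Greedy next-token prediction alone is enough to reproduce the problem → steps → conclusion shape:\n", "\n", "```python\n", "from collections import Counter\n", "\n", "# Toy corpus in which the problem -> steps -> conclusion pattern recurs,\n", "# mimicking what an LLM absorbs at vastly larger scale.\n", "corpus = [\n", "    [\"Question\", \"Step1\", \"Step2\", \"Conclusion\"],\n", "    [\"Question\", \"Step1\", \"Conclusion\"],\n", "    [\"Question\", \"Step1\", \"Step2\", \"Step2\", \"Conclusion\"],\n", "]\n", "\n", "# Count next-token frequencies (a bigram model).\n", "next_counts = {}\n", "for seq in corpus:\n", "    for a, b in zip(seq, seq[1:]):\n", "        next_counts.setdefault(a, Counter())[b] += 1\n", "\n", "def generate(token, max_len=6):\n", "    out = [token]\n", "    while token in next_counts and len(out) < max_len:\n", "        token = next_counts[token].most_common(1)[0][0]  # greedy pick\n", "        out.append(token)\n", "    return out\n", "\n", "print(generate(\"Question\"))  # -> ['Question', 'Step1', 'Step2', 'Conclusion']\n", "```\n", "\n", "There is no reasoning module anywhere in this sketch, yet the generated sequence follows the familiar multi-step structure, simply because that structure dominates the (toy) training statistics.\n",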
"\n", "The LLM has absorbed the **problem → solution steps → conclusion** pattern that recurs throughout its corpus, the **code → comment → fix** pattern that fills code repositories, and many more.\n", "\n", "What the LLM has *not* learned are facts and rules in the way humans hold them; it has only picked up their statistical reflection in text.\n" ] }, { "cell_type": "markdown", "id": "c50aa072-03c7-4033-b626-0e4b685d3bb7", "metadata": {}, "source": [ "### Hands-on: Breaking the “Sense of Intelligence”" ] }, { "cell_type": "markdown", "id": "4abe3102-7ccf-4127-8fd1-36c3050d1394", "metadata": {}, "source": [ "Ask a question that “a human would find trivially easy”, such as “What time is it right now?”. **Without any external tools (system clock, browser, function calling)**, an LLM will typically:\n", "\n", "* give a **vague or generic answer**\n", "\n", "  > “I cannot access the current time, but you can check the clock on your device.”\n", "\n", "* or give a **plausible-sounding deflection that avoids the question**\n", "\n", "  > “The time depends on your time zone; you can confirm it via your system clock.”\n", "\n", "* or, in some cases, simply **guess a time**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.10" } }, "nbformat": 4, "nbformat_minor": 5 }