{ "cells": [ { "cell_type": "markdown", "id": "c109c0e7-1aad-42ab-88d8-0990559b59e5", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "88c613ef-f4e5-49c3-b19d-3cf36dce0bf1", "metadata": {}, "source": [ "# Appendix D: Using larger LLMs" ] }, { "cell_type": "markdown", "id": "9c1cd731-7e23-4430-8ec6-c4a86a177f81", "metadata": {}, "source": [ "Packages that are being used in this notebook:" ] }, { "cell_type": "code", "execution_count": 1, "id": "b6882804-a2c4-4c98-ad42-1b108cbffa5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.17\n", "torch version: 2.10.0\n", "tokenizers version: 0.21.4\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\", # for download functions\n", " \"torch\",\n", " \"tokenizers\"\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "9d13703b-f75b-43fe-9c8a-e7459a884f36", "metadata": {}, "source": [ "- The main chapters use the Qwen3 0.6B base model because it is the smallest model in the\n", "Qwen3 family and therefore the easiest to run on consumer hardware\n", "- However, the same `Qwen3Model` implementation from appendix C can also be used to load larger dense Qwen3 checkpoints with the same from-scratch PyTorch model code" ] }, { "cell_type": "markdown", "id": "1e0cfd0a-f08a-4196-adde-619a23ccc24b", "metadata": {}, "source": [ " \n", "## D.1 Larger dense Qwen3 configurations" ] }, { "cell_type": "markdown", "id": "97a9fcc6-f3f0-4447-a74a-715a800b1a76", "metadata": {}, "source": [ "The repository includes configuration dictionaries for several larger dense Qwen3 models (beyond the 0.6B model) in\n", "`reasoning_from_scratch.appendix_c` ([reasoning_from_scratch/appendix_c.py](https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/appendix_c.py)):\n", "\n", "| Model size | Configuration dictionary |\n", "| --- | --- |\n", "| 1.7B | `QWEN3_CONFIG_1_7B` |\n", "| 4B | `QWEN3_CONFIG_4B` 
|\n", "| 8B | `QWEN3_CONFIG_8B` |\n", "| 14B | `QWEN3_CONFIG_14B` |\n", "| 32B | `QWEN3_CONFIG_32B` |" ] }, { "cell_type": "markdown", "id": "3b82f50d-c04b-4a79-b3e1-73756781bb7d", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "5791527a-2c80-400e-aec2-aa7a271eb69d", "metadata": {}, "source": [ "- As mentioned in the figure above, these are the \"dense\" Qwen3 variants, which can run on single GPUs\n", "- There are also \"sparse\" Mixture-of-Experts variants of Qwen3, but they are not supported via this books' code; however, if you are interested in a from-scratch implementation, you can find one here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3\n", "- All of these use the same overall architecture pattern as the 0.6B model from appendix C\n", "- What changes are the embedding size, number of layers, number of attention heads, and\n", "feed-forward hidden dimension" ] }, { "cell_type": "markdown", "id": "aa4ae6b2-3ab4-4127-b925-16314cf2c90a", "metadata": {}, "source": [ "- As a rough lower bound, storing weights in bfloat16 requires about 2 bytes per parameter\n", "- This means that the checkpoint weights alone are on the order of:\n", "\n", "| Model size | Rough weight memory in bfloat16 |\n", "| --- | --- |\n", "| 1.7B | about 3.4 GB |\n", "| 4B | about 8 GB |\n", "| 8B | about 16 GB |\n", "| 14B | about 28 GB |\n", "| 32B | about 64 GB |\n" ] }, { "cell_type": "markdown", "id": "43c711a3-ea2d-4ab1-9bdc-9f6154c53af5", "metadata": {}, "source": [ "- In practice, the real runtime memory usage is higher because we also need memory for\n", "activations, temporary buffers, and often the KV cache" ] }, { "cell_type": "markdown", "id": "123c9c2c-da52-458e-a557-26f4f342e358", "metadata": {}, "source": [ " \n", "## D.2 Downloading larger checkpoints overview" ] }, { "cell_type": "markdown", "id": "e8812f03-17d9-4026-922a-3c33f27713d4", "metadata": {}, "source": [ "- Unlike the 0.6B checkpoints used in the main chapters, larger 
official Qwen3 models are\n", "typically distributed as `safetensors` files, sometimes split across multiple shards\n", "- The helper function `download_from_huggingface_from_snapshots` used to load these checkpoints requires some additional packages:" ] }, { "cell_type": "markdown", "id": "2364d29d-88d3-45a0-bd69-a69060164915", "metadata": {}, "source": [ "```bash\n", "!uv add huggingface_hub safetensors\n", "```\n", "\n", "or\n", "\n", "```bash\n", "!pip install huggingface_hub safetensors\n", "```" ] }, { "cell_type": "markdown", "id": "e5b7c66e-3afe-4f91-bab0-f127351fece8", "metadata": {}, "source": [ " \n", "## D.3 Loading a larger base model" ] }, { "cell_type": "markdown", "id": "4de72b45-2d0e-455c-b274-3ff2f7b681e6", "metadata": {}, "source": [ "- Download weights:" ] }, { "cell_type": "code", "execution_count": 2, "id": "a694d56a-d9ce-467a-b0a4-bf256547afec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using Apple Silicon GPU (MPS)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/sebastian/Developer/reasoning-from-scratch/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "Fetching 13 files: 100%|██████████████████████| 13/13 [00:00<00:00, 2616.79it/s]\n" ] } ], "source": [ "from pathlib import Path\n", "from reasoning_from_scratch.ch02 import get_device\n", "from reasoning_from_scratch.appendix_c import (\n", " download_from_huggingface_from_snapshots\n", ")\n", "\n", "\n", "device = get_device()\n", "local_dir = Path(\"qwen3-4b-base\")\n", "\n", "weights = download_from_huggingface_from_snapshots(\n", " repo_id=\"Qwen/Qwen3-4B-Base\",\n", " local_dir=local_dir,\n", ")" ] }, { "cell_type": "markdown", "id": "c6fd12a3-0b26-4052-8054-c8e3e61bce96", "metadata": {}, "source": [ "- Initialize model:" ] }, { "cell_type": "code", "execution_count": 3, "id": "56697fe2-bd0d-47c2-b53c-395d5b4da597", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model uses weight tying.\n" ] }, { "data": { "text/plain": [ "Qwen3Model(\n", " (tok_emb): Embedding(151936, 2560)\n", " (trf_blocks): ModuleList(\n", " (0-35): 36 x TransformerBlock(\n", " (att): GroupedQueryAttention(\n", " (W_query): Linear(in_features=2560, out_features=4096, bias=False)\n", " (W_key): Linear(in_features=2560, out_features=1024, bias=False)\n", " (W_value): Linear(in_features=2560, out_features=1024, bias=False)\n", " (out_proj): Linear(in_features=4096, out_features=2560, bias=False)\n", " (q_norm): RMSNorm()\n", " (k_norm): RMSNorm()\n", " )\n", " (ff): FeedForward(\n", " (fc1): Linear(in_features=2560, out_features=9728, bias=False)\n", " (fc2): Linear(in_features=2560, out_features=9728, bias=False)\n", " (fc3): Linear(in_features=9728, out_features=2560, bias=False)\n", " )\n", " (norm1): RMSNorm()\n", " (norm2): RMSNorm()\n", " )\n", " )\n", " (final_norm): RMSNorm()\n", " (out_head): Linear(in_features=2560, out_features=151936, bias=False)\n", ")" ] }, "execution_count": 3, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "from reasoning_from_scratch.qwen3 import (\n", " Qwen3Model, load_hf_weights_into_qwen\n", ")\n", "from reasoning_from_scratch.appendix_c import QWEN3_CONFIG_4B\n", "\n", "\n", "model = Qwen3Model(QWEN3_CONFIG_4B)\n", "load_hf_weights_into_qwen(\n", " model,\n", " param_config={\n", " \"n_layers\": QWEN3_CONFIG_4B[\"n_layers\"],\n", " \"hidden_dim\": QWEN3_CONFIG_4B[\"hidden_dim\"],\n", " },\n", " params=weights,\n", ")\n", "model.to(device)\n", "model.eval()" ] }, { "cell_type": "markdown", "id": "fb99f826-49eb-49d4-b649-3395c97d5266", "metadata": {}, "source": [ "- Load tokenizer:" ] }, { "cell_type": "code", "execution_count": 4, "id": "1f80ad58-2995-4952-8f90-f76a9aaba3ca", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.qwen3 import Qwen3Tokenizer\n", "import shutil\n", "\n", "# Note that the original base tokenizer is called \"tokenizer.json\"\n", "# We rename it to distinguish from the reasoning tokenizer (next section)\n", "tokenizer_src = local_dir / \"tokenizer.json\"\n", "tokenizer_path = local_dir / \"tokenizer-base.json\"\n", "\n", "if not tokenizer_path.exists():\n", " shutil.copyfile(tokenizer_src, tokenizer_path)\n", "\n", "tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)" ] }, { "cell_type": "markdown", "id": "2586d0c1-034e-4cd3-8aa0-e3c8e7ad2325", "metadata": {}, "source": [ "- Use model:" ] }, { "cell_type": "code", "execution_count": 5, "id": "943d671a-132f-49a3-932c-9b0eabdf3009", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Large language models are artificial intelligence systems that use deep learning techniques to understand and generate human-like text. They are trained on vast amounts of data and can perform a wide range of natural language processing tasks, such as translation, summarization, and question answering." 
] } ], "source": [ "import torch\n", "from reasoning_from_scratch.ch02 import (\n", " generate_text_basic_stream_cache,\n", ")\n", "\n", "prompt = \"Explain large language models in two sentences.\"\n", "input_ids = torch.tensor(\n", " tokenizer.encode(prompt),\n", " device=device,\n", ").unsqueeze(0)\n", "\n", "for token in generate_text_basic_stream_cache(\n", " model=model,\n", " token_ids=input_ids,\n", " max_new_tokens=64,\n", " eos_token_id=tokenizer.eos_token_id,\n", "):\n", " print(tokenizer.decode(token.squeeze(0).tolist()), end=\"\", flush=True)" ] }, { "cell_type": "markdown", "id": "acb9de1d-dc40-4e47-bd92-a3efd1f64fa0", "metadata": {}, "source": [ " \n", "## D.4 Loading a larger reasoning variant" ] }, { "cell_type": "markdown", "id": "e4e45fea-eaab-4067-aa8e-d103b790465a", "metadata": {}, "source": [ "- The same idea also works for larger reasoning-style Qwen3 models\n", "- The architecture for a given model size stays the same; only the checkpoint and tokenizer settings change" ] }, { "cell_type": "markdown", "id": "c4cdd902-7402-4142-b1bd-cbde3a453834", "metadata": {}, "source": [ "For example, to load the 4B reasoning variant instead of the 4B base variant, we would:\n", "\n", "- switch the repository ID from `Qwen/Qwen3-4B-Base` to `Qwen/Qwen3-4B`;\n", "- copy the `tokenizer.json` file to `tokenizer-reasoning.json`;\n", "- initialize the tokenizer as follows:" ] }, { "cell_type": "markdown", "id": "87c71dd6-ac91-4ca8-836e-d21a67469986", "metadata": {}, "source": [ "```python\n", "tokenizer = Qwen3Tokenizer(\n", " tokenizer_file_path=tokenizer_path,\n", " apply_chat_template=True,\n", " add_generation_prompt=True,\n", " add_thinking=True,\n", ")\n", "```" ] }, { "cell_type": "markdown", "id": "0ee741b9-fff2-4ebc-a6b5-d59fc11e7e63", "metadata": {}, "source": [ "- The rest of the model-loading and -usage code stays the same" ] }, { "cell_type": "markdown", "id": "23c3543a-66dd-4ff3-b837-131485f862ae", "metadata": {}, "source": [ " \n", "## D.5 
Practical recommendations" ] }, { "cell_type": "markdown", "id": "b4ab6b1e-72e7-4fa7-876d-e3aade38b9b1", "metadata": {}, "source": [ "- No code in this section" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }