{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "authorship_tag": "ABX9TyNdLyKezZrtEwnFExtqd4C4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# SGLang Profiling: Data Collection and Analysis\n", "\n", "Overview: this exercise walks through SGLang's profiling data collection workflow and shows how to analyze the resulting traces. Collection covers model download, image pull, container creation, running inference, and loading the trace data. The analysis is based on profiling data from the Qwen2.5-7B-Instruct model, including a detailed look at the Python-level and GPU-level timelines.\n", "\n", "Related article: [Getting Started with SGLang Profiling Data Collection and Analysis](https://zhuanlan.zhihu.com/p/2004605638760763526)\n", "\n", "Author: kaiyuan\n", "\n", "Email: kyxie@zju.edu.cn" ], "metadata": { "id": "RuofLhSfiTAt" } }, { "cell_type": "markdown", "source": [ "## 1 Setup\n", "\n", "### 1.1 Model download\n", "\n", "Download the model from Hugging Face to the server's local disk. Model page:\n", "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct\n", "\n", "Set the download mirror:\n", "```\n", "export HF_ENDPOINT=https://hf-mirror.com\n", "```\n", "\n", "Download script (Python):\n", "\n", "```\n", "from huggingface_hub import snapshot_download\n", "\n", "repo_id = \"Qwen/Qwen2.5-7B-Instruct\"\n", "local_dir = \"./models/Qwen2.5-7B-Instruct\"\n", "\n", "# Download with retry and resume support\n", "local_dir = snapshot_download(\n", "    repo_id=repo_id,\n", "    local_dir=local_dir,\n", "    local_dir_use_symlinks=False,  # avoid symlinks; sometimes more reliable\n", "    resume_download=True,          # key parameter: resume an interrupted download\n", "    force_download=False,          # do not re-download files that already exist\n", ")\n", "\n", "print(f\"Model fully downloaded to: {local_dir}\")\n", "```\n", "Note: tested with huggingface_hub version 1.4.1.\n", "\n", "\n", "### 1.2 Environment setup\n", "\n", "NVIDIA's prebuilt image is recommended; it avoids errors caused by environment mismatches.\n", "\n", "Pull the image:\n", "```\n", "docker pull nvcr.io/nvidia/sglang:26.01-py3\n", "```\n", "Release notes: [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/sglang?version=26.01-py3)\n", "\n", "Example container creation:\n", "```\n", "docker run -itd --rm --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \\\n", "-v /data/nfs/kaiyuan:/data/nfs/kaiyuan \\\n", "--name sglang-dev nvcr.io/nvidia/sglang:26.01-py3 bash\n", "```\n", "\n", "\n", "Attach to the container:\n", "\n", "```\n",
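"# Optional sanity check before attaching (assumes the container name sglang-dev used above):\n", "docker ps --filter name=sglang-dev\n", "\n",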
"docker exec -it sglang-dev bash\n", "```\n", "\n", "Check that the environment works:\n", "\n", "```\n", "python -c \"import torch; import sglang; print(torch.cuda.is_available())\"\n", "```\n", "\n", "Note: the image requires NVIDIA driver version >= 570.\n", "\n", "\n", "Test machine used in this exercise:\n", "- NVIDIA A100-SXM4-80GB\n", "- NVIDIA-SMI 570.172.08\n", "- Driver Version: 570.172.08\n", "- CUDA Version: 13.1\n" ], "metadata": { "id": "Qda9-o9mJ2j2" } }, { "cell_type": "markdown", "source": [ "## 2 Data Collection\n", "\n", "The official SGLang documentation describes several collection methods: [Benchmark and Profiling](https://docs.sglang.io/developer_guide/benchmark_and_profiling.html#benchmark-and-profiling).\n", "\n", "This exercise uses the \"HTTP API endpoints\" method.\n", "\n", "Two terminal (console) windows are needed:\n", "\n", "- Terminal 1: start the server;\n", "- Terminal 2: run the client script.\n", "\n", "### 2.1 Server startup\n", "\n", "Single node, single GPU:\n", "```\n", "# Set the profiling output directory (required):\n", "export SGLANG_TORCH_PROFILER_DIR=/data/kaiyuan/llm_infer/profiles\n", "\n", "# Launch the server\n", "python -m sglang.launch_server --model-path /data/kaiyuan/models/Qwen2.5-7B-Instruct\n", "\n", "```\n", "\n", "Single node, multiple GPUs:\n", "\n", "```\n", "SGLANG_TORCH_PROFILER_DIR=\"/data/kaiyuan/llm_infer/profiles\" \\\n", "python -m sglang.launch_server \\\n", "  --model-path /data/kaiyuan/models/Qwen2.5-7B-Instruct \\\n", "  --host 127.0.0.1 \\\n", "  --tp-size 4 \\\n", "  --port 30000\n", "\n", "```\n", "\n", "Single node, multiple GPUs (CUDA graphs disabled):\n", "\n", "```\n", "SGLANG_CUDA_GRAPH_MODE=0 \\\n", "SGLANG_CACHE_GRAPH=0 \\\n", "CUDA_LAUNCH_BLOCKING=1 \\\n", "SGLANG_TORCH_PROFILER_DIR=\"/data/kaiyuan/llm_infer/profiles\" \\\n", "python -m sglang.launch_server \\\n", "  --model-path /data/kaiyuan/models/Qwen2.5-7B-Instruct \\\n", "  --host 127.0.0.1 \\\n", "  --tp-size 4 \\\n", "  --port 30000\n", "\n", "```\n", "Notes:\n", "* The SGLANG_CUDA_GRAPH_MODE environment variable controls CUDA graph mode; 0 disables it.\n", "* When tp-size is greater than 1, multiple GPUs are used.\n", "* --model-path points to the directory of the downloaded model.\n", "\n", "### 2.2 Client\n", "\n", "Steps:\n", "- start profiling;\n", "- send 5 consecutive requests to the server;\n", "- stop profiling.\n", "\n", "Basic usage of the profiling endpoints:\n", "\n", "- Start: http://127.0.0.1:30000/start_profile\n", "- Stop: http://127.0.0.1:30000/stop_profile\n", "\n", "Collection parameters:\n", "\n", "```\n", "# Start profiling immediately for 10 steps\n", "curl -X POST http://127.0.0.1:30000/start_profile \\\n", "  -H \"Content-Type: application/json\" \\\n", "  -d '{\n", "    \"num_steps\": 10\n", "  }'\n", "```\n", "\n", "- output_dir (optional): directory where the profiling traces are saved. If not specified, the SGLANG_TORCH_PROFILER_DIR environment variable is used, falling back to /tmp.\n", "\n", "- num_steps (optional): number of steps to profile. If not specified, profiling continues until stopped manually via /stop_profile.\n", "\n", "- start_step (optional): step at which profiling begins (inclusive). Useful for skipping warm-up iterations.\n", "\n", "- activities (optional): which timelines to collect; for example, [\"CPU\"] collects only CPU-side data. Defaults to [\"CPU\", \"GPU\"].\n", "\n", "- merge_profiles (optional): whether to merge distributed traces. Defaults to false.\n", "\n", "\n", "A simplified Python version of the collection flow:" ], "metadata": { "id": "2uh4UT4kNR0v" } }, { "cell_type": "code", "source": [ "import requests\n", "import time\n", "\n", "base_url = \"http://127.0.0.1:30000\"\n", "\n", "# Start profiling\n", "requests.post(f\"{base_url}/start_profile\", timeout=5)\n", "\n", "# Send 5 inference requests\n", "for i in range(5):\n", "    requests.post(\n", "        f\"{base_url}/v1/completions\",\n", "        json={\n", "            \"model\": \"default\",\n", "            \"prompt\": f\"Test request {i+1}: explain the basic concepts of artificial intelligence\",\n", "            \"max_tokens\": 30,\n", "            \"temperature\": 0.1\n", "        },\n", "        timeout=15\n", "    )\n", "    time.sleep(0.5)\n", "\n", "# Wait for generation to finish, then stop profiling\n", "time.sleep(5)\n", "requests.post(f\"{base_url}/stop_profile\")\n", "\n" ], "metadata": { "id": "EWhtmwqGhb37" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Enhanced version (with failure handling):" ], "metadata": { "id": "WwSMkTR5hoGn" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gYuvJNGQh1VL" }, "outputs": [], "source": [ "import requests\n", "import time\n", "\n", "base_url = \"http://127.0.0.1:30000\"\n", "\n", "def send_stop_profile_with_timeout(timeout=30):\n", "    \"\"\"Send the stop_profile request with a timeout.\"\"\"\n", "    try:\n", "        print(f\"Sending /stop_profile with a {timeout}s timeout...\")\n", "        response = requests.post(f\"{base_url}/stop_profile\", timeout=timeout)\n", "        print(f\"/stop_profile finished: status {response.status_code}\")\n", "        return True\n", "    except requests.exceptions.Timeout:\n", "        print(f\"stop_profile timed out ({timeout}s), but it may still be processing in the background\")\n", "        print(\"This is usually normal; the trace files may still be being written\")\n", "        return True  # treat a timeout as success\n", "    except Exception as e:\n", "        print(f\"stop_profile error: {e}\")\n", "        return False\n", "\n", "def check_server_status():\n", "    \"\"\"Check whether the server is still running.\"\"\"\n", "    try:\n", "        resp = requests.get(f\"{base_url}/health\", timeout=2)\n", "        return resp.status_code == 200\n", "    except Exception:\n", "        return False\n", "\n", "if not check_server_status():\n", "    print(\"Warning: /health check failed; is the server running?\")\n", "\n", "print(\">>> Starting profiling session...\")\n", "try:\n", "    resp = requests.post(f\"{base_url}/start_profile\", timeout=5)\n", "    print(f\"/start_profile: status {resp.status_code}\")\n", "except Exception as e:\n", "    print(f\"/start_profile error: {e}\")\n", "\n", "# Run the inference test\n", "print(\"\\n>>> Running inference test (5 requests)...\")\n", "for i in range(5):\n", "    print(f\"  Request {i+1}/5\")\n", "    try:\n", "        resp = requests.post(\n", "            f\"{base_url}/v1/completions\",\n", "            json={\n", "                \"model\": \"default\",\n", "                \"prompt\": f\"Test request {i+1}: explain the basic concepts of artificial intelligence\",\n", "                \"max_tokens\": 30,\n", "                \"temperature\": 0.1\n", "            },\n", "            timeout=15\n", "        )\n", "        print(f\"  status: {resp.status_code}\")\n", "    except Exception as e:\n", "        print(f\"  error: {e}\")\n", "    time.sleep(0.5)\n", "\n", "print(\"\\n>>> Waiting for all inference requests to finish (5s)...\")\n", "time.sleep(5)\n", "print(\"\\n>>> Stopping profiling...\")\n", "# Send stop_profile with a 30-second timeout\n", "success = send_stop_profile_with_timeout(timeout=30)" ] }, { "cell_type": "markdown", "source": [ "## 3 Example Output\n", "\n", "### 3.1 Server-side log\n", "\n", "TP size = 1\n", "\n", "```\n", "[2026-01-01 08:50:18] INFO: 127.0.0.1:59572 - \"POST /start_profile HTTP/1.1\" 200 OK\n", "[2026-01-01 08:50:18] Prefill batch, #new-seq: 1, #new-token: 11, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 08:50:18] INFO: 127.0.0.1:59574 - \"POST /v1/completions HTTP/1.1\" 200 
OK\n", "[2026-01-01 08:50:19] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 08:50:19] Decode batch, #running-req: 1, #token: 14, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.99, #queue-req: 0,\n", "[2026-01-01 08:50:19] INFO: 127.0.0.1:59580 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 08:50:20] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 08:50:20] Decode batch, #running-req: 1, #token: 24, token usage: 0.00, cuda graph: True, gen throughput (token/s): 35.99, #queue-req: 0,\n", "[2026-01-01 08:50:20] INFO: 127.0.0.1:59582 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 08:50:21] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 08:50:21] Decode batch, #running-req: 1, #token: 34, token usage: 0.00, cuda graph: True, gen throughput (token/s): 36.09, #queue-req: 0,\n", "[2026-01-01 08:50:21] INFO: 127.0.0.1:59588 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 08:50:22] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 08:50:22] INFO: 127.0.0.1:59594 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 08:50:28] Stop profiling...\n", "[2026-01-01 08:55:53] Profiling done. 
Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "\n", "```\n", "\n", "TP size = 4\n", "\n", "```\n", "[2026-01-01 09:19:37] INFO: 127.0.0.1:36936 - \"POST /start_profile HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:37 TP0] Prefill batch, #new-seq: 1, #new-token: 11, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 09:19:37] INFO: 127.0.0.1:36940 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:38 TP0] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 09:19:38 TP0] Decode batch, #running-req: 1, #token: 14, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.54, #queue-req: 0,\n", "[2026-01-01 09:19:38] INFO: 127.0.0.1:36946 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:39 TP0] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 09:19:39 TP0] Decode batch, #running-req: 1, #token: 24, token usage: 0.00, cuda graph: True, gen throughput (token/s): 38.68, #queue-req: 0,\n", "[2026-01-01 09:19:39] INFO: 127.0.0.1:36950 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:40 TP0] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 09:19:40 TP0] Decode batch, #running-req: 1, #token: 34, token usage: 0.00, cuda graph: True, gen throughput (token/s): 38.90, #queue-req: 0,\n", "[2026-01-01 09:19:40] INFO: 127.0.0.1:36956 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:41 TP0] Prefill batch, #new-seq: 1, #new-token: 8, #cached-token: 3, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 09:19:41] INFO: 127.0.0.1:36958 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 09:19:47 TP1] Stop profiling...\n", "[2026-01-01 09:19:47 TP3] Stop profiling...\n", "[2026-01-01 09:19:47 TP0] Stop 
profiling...\n", "[2026-01-01 09:19:47 TP2] Stop profiling...\n", "[2026-01-01 09:21:07 TP0] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 09:21:07 TP2] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 09:21:07 TP3] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 09:21:07 TP1] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "\n", "```\n", "\n", "### 3.2 Client output\n", "\n", "```\n", ">>> Starting profiling session...\n", "/start_profile: status 200\n", "\n", ">>> Running inference test (5 requests)...\n", "  Request 1/5\n", "  status: 200\n", "  Request 2/5\n", "  status: 200\n", "  Request 3/5\n", "  status: 200\n", "  Request 4/5\n", "  status: 200\n", "  Request 5/5\n", "  status: 200\n", "\n", ">>> Waiting for all inference requests to finish (5s)...\n", "\n", ">>> Stopping profiling...\n", "Sending /stop_profile with a 30s timeout...\n", "stop_profile timed out (30s), but it may still be processing in the background\n", "This is usually normal; the trace files may still be being written\n", "\n", "```\n", "\n", "Note: writing out the profiling traces usually takes more than 30s, especially when the output directory is on an NFS mount.\n" ], "metadata": { "id": "l541upRTWu82" } }, { "cell_type": "markdown", "source": [ "## 4 Reading the Profiles\n", "\n", "Open [https://ui.perfetto.dev/](https://ui.perfetto.dev/) and import a profiling trace file, e.g. xxxx_trace.json" ], "metadata": { "id": "DhzRzNgGYQQe" } }, { "cell_type": "markdown", "source": [ "## 5 Merged Multi-TP Collection\n", "\n", "Merge the profiling traces from multiple TP ranks into a single file.\n" ], "metadata": { "id": "xYTpv8G1n7NY" } }, { "cell_type": "markdown", "source": [ "### Server startup\n", "\n", "```\n", "SGLANG_TORCH_PROFILER_DIR=\"/data/kaiyuan/llm_infer/profiles\" \\\n", "python -m sglang.launch_server \\\n", "  --model-path /data/kaiyuan/models/Qwen2.5-7B-Instruct \\\n", "  --host 127.0.0.1 \\\n", "  --tp-size 4 \\\n", "  --port 30000\n", "```\n", "\n", "### Client code" ], "metadata": { "id": "5jVk0rUTeXum" } }, { "cell_type": "code", "source": [ "import requests\n", "import time\n", "\n", "base_url = \"http://127.0.0.1:30000\"\n", "\n", "# Start profiling with multi-TP trace merging enabled\n", "url = f\"{base_url}/start_profile\"\n", "headers = 
{\"Content-Type\": \"application/json\"}\n", "data = {\"merge_profiles\": True}  # merge the traces from all TP ranks\n", "\n", "response = requests.post(url, headers=headers, json=data)\n", "\n", "# Send 5 inference requests\n", "for i in range(5):\n", "    requests.post(\n", "        f\"{base_url}/v1/completions\",\n", "        json={\n", "            \"model\": \"default\",\n", "            \"prompt\": f\"Test request {i+1}: explain the basic concepts of artificial intelligence\",\n", "            \"max_tokens\": 30,\n", "            \"temperature\": 0.1\n", "        },\n", "        timeout=15\n", "    )\n", "    time.sleep(0.5)\n", "\n", "# Wait for generation to finish, then stop profiling\n", "time.sleep(5)\n", "requests.post(f\"{base_url}/stop_profile\")" ], "metadata": { "id": "hhSEWTR-oDif" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Example log:\n", "\n", "```\n", "[2026-01-01 12:53:47 TP0] Profiling starts. Traces will be saved to: /data/kaiyuan/llm_infer/profiles (with profile id: 1770728027.861476)\n", "[2026-01-01 12:53:47 TP3] Profiling starts. Traces will be saved to: /data/kaiyuan/llm_infer/profiles (with profile id: 1770728027.861476)\n", "[2026-01-01 12:53:47 TP1] Profiling starts. Traces will be saved to: /data/kaiyuan/llm_infer/profiles (with profile id: 1770728027.861476)\n", "[2026-01-01 12:53:47 TP2] Profiling starts. 
Traces will be saved to: /data/kaiyuan/llm_infer/profiles (with profile id: 1770728027.861476)\n", "[2026-01-01 12:53:47] INFO: 127.0.0.1:47480 - \"POST /start_profile HTTP/1.1\" 200 OK\n", "[2026-01-01 12:53:47 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 10, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 12:53:53 TP0] Decode batch, #running-req: 1, #token: 14, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.03, #queue-req: 0,\n", "[2026-01-01 12:53:53] INFO: 127.0.0.1:47482 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 12:53:54 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 10, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 12:53:54 TP0] Decode batch, #running-req: 1, #token: 24, token usage: 0.00, cuda graph: True, gen throughput (token/s): 38.91, #queue-req: 0,\n", "[2026-01-01 12:53:54] INFO: 127.0.0.1:47504 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 12:53:55 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 10, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 12:53:55 TP0] Decode batch, #running-req: 1, #token: 34, token usage: 0.00, cuda graph: True, gen throughput (token/s): 39.15, #queue-req: 0,\n", "[2026-01-01 12:53:55] INFO: 127.0.0.1:47510 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 12:53:56 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 10, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 12:53:56] INFO: 127.0.0.1:47512 - \"POST /v1/completions HTTP/1.1\" 200 OK\n", "[2026-01-01 12:53:57 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 10, token usage: 0.00, #running-req: 0, #queue-req: 0,\n", "[2026-01-01 12:53:57 TP0] Decode batch, #running-req: 1, #token: 14, token usage: 0.00, cuda graph: True, gen throughput (token/s): 24.92, #queue-req: 0,\n", "[2026-01-01 12:53:57] INFO: 127.0.0.1:47518 - \"POST /v1/completions HTTP/1.1\" 200 
OK\n", "[2026-01-01 12:54:02 TP0] Stop profiling...\n", "[2026-01-01 12:54:02 TP1] Stop profiling...\n", "[2026-01-01 12:54:02 TP2] Stop profiling...\n", "[2026-01-01 12:54:02 TP3] Stop profiling...\n", "[2026-01-01 12:55:25 TP0] Starting profile merge...\n", "[2026-01-01 12:55:25 TP2] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 12:55:25 TP3] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 12:55:25 TP1] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles\n", "[2026-01-01 12:55:25 TP0] Found 4 trace files to merge\n", "[2026-01-01 12:55:25 TP0] Processing /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-0.trace.json.gz with rank info: {'tp_rank': 0}\n", "[2026-01-01 12:55:25 TP0] Processing file: /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-0.trace.json.gz\n", "[2026-01-01 12:55:37 TP0] Processing /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-1.trace.json.gz with rank info: {'tp_rank': 1}\n", "[2026-01-01 12:55:37 TP0] Processing file: /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-1.trace.json.gz\n", "[2026-01-01 12:55:46 TP0] Processing /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-2.trace.json.gz with rank info: {'tp_rank': 2}\n", "[2026-01-01 12:55:46 TP0] Processing file: /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-2.trace.json.gz\n", "[2026-01-01 12:55:56 TP0] Processing /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-3.trace.json.gz with rank info: {'tp_rank': 3}\n", "[2026-01-01 12:55:56 TP0] Processing file: /data/kaiyuan/llm_infer/profiles/1770728027.861476-TP-3.trace.json.gz\n", "\n", "[2026-01-01 12:58:58 TP0] Merged profile saved to: /data/kaiyuan/llm_infer/profiles/merged-1770728027.861476.trace.json.gz\n", "[2026-01-01 12:58:58 TP0] Total events merged: 11548182\n", "[2026-01-01 12:59:42 TP0] Profile merge completed: /data/kaiyuan/llm_infer/profiles/merged-1770728027.861476.trace.json.gz\n", 
"[2026-01-01 12:59:42 TP0] Profiling done. Traces are saved to: /data/kaiyuan/llm_infer/profiles Merged trace: /data/kaiyuan/llm_infer/profiles/merged-1770728027.861476.trace.json.gz (Events: 11548182, Files: 4)\n", "\n", "```" ], "metadata": { "id": "IpKkMABnesWO" } } ] }