{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluating Multimodal Models with VLMEvalKit\n", "\n", "VLMEvalKit (Python package name: vlmeval) is an open-source toolkit designed for evaluating large vision-language models (LVLMs). It supports one-command evaluation of LVLMs on a wide range of benchmarks without heavy data preparation, which simplifies the evaluation process.\n", "\n", "This notebook shows two ways to evaluate a model:\n", "1. Through the VLMEvalKit evaluation interface wrapped by EvalScope\n", "2. Using the VLMEvalKit evaluation interface directly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Using the VLMEvalKit Evaluation Interface Wrapped by EvalScope\n", "\n", "[EvalScope](https://github.com/modelscope/evalscope) is the official model evaluation and performance benchmarking framework of the ModelScope community. It includes many commonly used benchmarks and metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval, and it supports evaluating several types of models, including LLMs, multimodal LLMs, embedding models, and reranker models. EvalScope also covers a variety of evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. In addition, its seamless integration with the ms-swift training framework lets you launch an evaluation with one command, providing end-to-end support from model training to evaluation.\n", "\n", "User guide: [EvalScope User Guide](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/vlmevalkit_backend.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "!pip install evalscope[vlmeval] -U\n", "!pip install ms-swift -U" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deploy the Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "!CUDA_VISIBLE_DEVICES=0 swift deploy --model_type internvl2-8b --port 8000" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-11-25 19:48:03,083 - evalscope - INFO - ** Args: Task config is provided with dictionary type. 
**\n", "2024-11-25 19:48:03,084 - evalscope - INFO - Check VLM Evaluation Kit: Installed\n", "2024-11-25 19:48:03,085 - evalscope - INFO - *** Run task with config: Arguments(data=['SEEDBench_IMG', 'ChartQA_TEST'], model=['internvl2-8b'], fps=-1, nframe=8, pack=False, use_subtitle=False, work_dir='outputs', mode='all', nproc=16, retry=None, judge='exact_matching', verbose=False, ignore=False, reuse=False, limit=30, config=None, OPENAI_API_KEY='EMPTY', OPENAI_API_BASE=None, LOCAL_LLM=None) \n", "\n", "[2024-11-25 19:48:03,085] WARNING - RUN - run.py: run_task - 145: --reuse is not set, will not reuse previous (before one day) temporary files\n", "2024-11-25 19:48:03,085 - RUN - WARNING - --reuse is not set, will not reuse previous (before one day) temporary files\n", "[2024-11-25 19:48:07,410] INFO - ChatAPI - gpt.py: __init__ - 135: Using API Base: http://localhost:8000/v1/chat/completions; API Key: EMPTY\n", "2024-11-25 19:48:07,410 - ChatAPI - INFO - Using API Base: http://localhost:8000/v1/chat/completions; API Key: EMPTY\n", " 0%| | 0/10 [00:00> Start to get the report with summarizer ...\n", "\n", ">> The report list: [{'internvl2-8b_SEEDBench_IMG_acc': {'split': 'none', 'Overall': '0.6333333333333333', 'Instance Attributes': '0.8571428571428571', 'Instance Identity': '0.3333333333333333', 'Instance Interaction': '1.0', 'Instance Location': '0.0', 'Instances Counting': '0.5', 'Scene Understanding': '0.75', 'Visual Reasoning': '1.0'}}, {'internvl2-8b_ChartQA_TEST_acc': {'test_human': '53.333333333333336', 'Overall': '53.333333333333336'}}]\n" ] } ], "source": [ "task_cfg_dict = {\n", " 'eval_backend': 'VLMEvalKit',\n", " 'eval_config': {\n", " 'data': ['SEEDBench_IMG', 'ChartQA_TEST'],\n", " 'limit': 30,\n", " 'mode': 'all',\n", " 'model': [{\n", " 'api_base': 'http://localhost:8000/v1/chat/completions',\n", " 'key': 'EMPTY',\n", " 'name': 'CustomAPIModel',\n", " 'temperature': 0.0,\n", " 'type': 'internvl2-8b'\n", " }],\n", " 'reuse': False,\n", " 'work_dir': 
'outputs',\n", " 'judge': 'exact_matching'\n", " }\n", "}\n", "\n", "from evalscope.run import run_task\n", "from evalscope.summarizer import Summarizer\n", "\n", "\n", "def run_eval():\n", "    # Option 1: Python dictionary\n", "    task_cfg = task_cfg_dict\n", "\n", "    # Option 2: YAML configuration file\n", "    # task_cfg = 'eval_openai_api.yaml'\n", "\n", "    run_task(task_cfg=task_cfg)\n", "\n", "    print('>> Start to get the report with summarizer ...')\n", "    report_list = Summarizer.get_report_from_cfg(task_cfg)\n", "    print(f'\\n>> The report list: {report_list}')\n", "\n", "\n", "run_eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Using VLMEvalKit Directly\n", "\n", "To use VLMEvalKit directly, set `VLMEVALKIT_USE_MODELSCOPE=1` to enable downloading datasets from ModelScope. The following video evaluation datasets are currently supported:\n", "- MVBench_MP4\n", "- MLVU_OpenEnded\n", "- MLVU_MCQ\n", "- LongVideoBench\n", "- TempCompass_MCQ\n", "- TempCompass_Captioning\n", "- TempCompass_YorN\n", "- Video-MME\n", "- MVBench\n", "- MMBench-Video" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "!git clone https://github.com/open-compass/VLMEvalKit.git\n", "%cd VLMEvalKit\n", "!pip install -e ."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VLM configuration: all VLMs are configured in `vlmeval/config.py`. Some VLMs (such as MiniGPT-4 and LLaVA-v1-7B) require additional configuration (setting the code or model weights root directory in the config file). During evaluation, select the VLM by the model name specified in `supported_VLM` in `vlmeval/config.py`. Before starting the evaluation, make sure you can run inference with the VLM successfully by running `vlmutil check {MODEL_NAME}`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "# Run the following command from the VLMEvalKit directory to start the evaluation.\n", "# VLMEVALKIT_USE_MODELSCOPE=1 enables dataset download from ModelScope, as described above.\n", "!VLMEVALKIT_USE_MODELSCOPE=1 python run.py --data TempCompass --model InternVL2-8B" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }