{ "cells": [ { "cell_type": "markdown", "id": "license-header", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cosmos3 Generator Audiovisual with vLLM Omni\n", "\n", "This notebook calls already-running vLLM Omni Cosmos3 servers with direct `curl` requests from Python.\n", "\n", "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint.\n" ], "id": "d88fe9a8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Prerequisites\n", "\n", "Use a running vLLM Omni server and set endpoint environment variables before the setup cell if you are not using the local default. Text-to-image uses `/v1/images/generations`; video modes use `/v1/videos/sync`.\n", "\n", "```bash\n", "export COSMOS3_VLLM_BASE_URL=http://localhost:8000\n", "export COSMOS3_VLLM_NANO_BASE_URL=http://localhost:8000\n", "export COSMOS3_VLLM_SUPER_BASE_URL=http://localhost:8000\n", "```\n" ], "id": "49df4e61" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Start the Server\n", "\n", "Run the vLLM Omni server before running the request cells. Use the Docker image for every modality on this page. Mount any directory that contains local media or action files you want the server to read.\n", "\n", "### Docker Image: Cosmos3-Nano\n", "\n", "```bash\n", "docker run --runtime nvidia --gpus all \\\n", " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", " -v \"$(pwd):/workspace\" \\\n", " -p 8000:8000 \\\n", " --ipc=host \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Nano \\\n", " --omni \\\n", " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", " --allowed-local-media-path / \\\n", " --port 8000 \\\n", " --init-timeout 1800\n", "```\n", "\n", "### Docker Image: Cosmos3-Super\n", "\n", "`Cosmos3-Super` is the larger 64B model, so it usually needs more GPU memory than `Cosmos3-Nano`. 
`--tensor-parallel-size` splits model weights across multiple GPUs and reduces per-GPU memory use. `--enable-layerwise-offload` reduces peak GPU memory further by offloading transformer blocks between CPU and GPU, with a latency tradeoff and additional CPU RAM use. Set `--tensor-parallel-size` to the number of GPUs you want to use.\n", "\n", "For example, on four GPUs:\n", "\n", "```bash\n", "docker run --runtime nvidia --gpus all \\\n", " -v ~/.cache/huggingface:/root/.cache/huggingface \\\n", " -v \"$(pwd):/workspace\" \\\n", " -p 8000:8000 \\\n", " --ipc=host \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Super \\\n", " --omni \\\n", " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", " --allowed-local-media-path / \\\n", " --tensor-parallel-size 4 \\\n", " --enable-layerwise-offload \\\n", " --port 8000 \\\n", " --init-timeout 1800\n", "```\n", "\n", "### CFG Parallel\n", "\n", "Use `--cfg-parallel-size 2` to run the positive and negative CFG branches in parallel on two GPUs:\n", "\n", "```bash\n", "vllm serve nvidia/Cosmos3-Nano \\\n", " --omni \\\n", " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", " --allowed-local-media-path / \\\n", " --cfg-parallel-size 2 \\\n", " --port 8000 \\\n", " --init-timeout 1800\n", "```\n", "\n", "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n", "" ], "id": "26776c50" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Configure Paths and Endpoints\n", "\n", "This setup cell only configures repo/output paths and vLLM endpoint settings.\n" ], "id": "4412f2f9" }, { "cell_type": "code", "metadata": {}, "source": [ "from pathlib import Path\n", "import os\n", "\n", "\n", "def find_repo_root(start: Path) -> Path:\n", "    for path in [start, *start.parents]:\n", "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", "            return path\n", "    return start\n", "\n", "\n", "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", "COSMOS3_AUDIOVISUAL_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"generator\" / \"audiovisual\"\n", "COSMOS3_AUDIOVISUAL_OUTPUT_ROOT = Path(\n", "    os.environ.get(\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\", COSMOS3_AUDIOVISUAL_ROOT / \"outputs\" / \"notebooks\")\n", ").resolve()\n", "DEFAULT_VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8000\")\n", "VLLM_ENDPOINTS = {\n", "    \"Cosmos3-Nano\": os.environ.get(\"COSMOS3_VLLM_NANO_BASE_URL\", DEFAULT_VLLM_BASE_URL),\n", "    \"Cosmos3-Super\": os.environ.get(\"COSMOS3_VLLM_SUPER_BASE_URL\", DEFAULT_VLLM_BASE_URL),\n", "}\n", "\n", "os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"] = str(COSMOS3_AUDIOVISUAL_OUTPUT_ROOT)\n", "os.environ.setdefault(\"COSMOS3_VLLM_API_KEY\", \"\")\n", "\n", "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", "print(\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT:\", COSMOS3_AUDIOVISUAL_OUTPUT_ROOT)\n", "for model, endpoint in VLLM_ENDPOINTS.items():\n", "    print(f\"{model} endpoint: {endpoint}\")\n" ], "execution_count": null, "outputs": [], "id": "23f04a90" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Verify Endpoint Configuration\n" ], "id": "73369e7f" }, { "cell_type": "code", "metadata": {}, "source": [ "from urllib.parse import urlparse\n", "\n", "for model, base_url in VLLM_ENDPOINTS.items():\n", "    api_root = base_url.rstrip(\"/\")\n", "    if not api_root.endswith(\"/v1\"):\n", "        api_root = f\"{api_root}/v1\"\n", "    parsed = urlparse(api_root)\n", "    print(model)\n", "    print(\" api root:\", api_root)\n", "    print(\" images generations:\", f\"{api_root}/images/generations\")\n", "    print(\" videos sync:\", f\"{api_root}/videos/sync\")\n", "    print(\" scheme:\", parsed.scheme)\n", "    print(\" host:\", parsed.netloc)\n" ], "execution_count": null, "outputs": [], "id": "1c50e183" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Preview Available Inputs\n" ], "id": "c1d161a6" }, { "cell_type": "code", "metadata": {}, "source": [ "from pathlib import Path\n", "import json\n", "from IPython.display import Image, display\n", "\n", "assets_dir = COSMOS3_AUDIOVISUAL_ROOT / \"assets\"\n", "for prompt_dir in sorted((assets_dir / \"prompts\").iterdir()):\n", "    if not prompt_dir.is_dir():\n", "        continue\n", "    print(f\"{prompt_dir.relative_to(assets_dir)}:\")\n", "    for prompt_path in sorted(prompt_dir.glob(\"*.json\")):\n", "        data = json.loads(prompt_path.read_text())\n", "        caption = (\n", "            data.get(\"temporal_caption\")\n", "            or data.get(\"comprehensive_t2i_caption\")\n", "            or data.get(\"extra\", {}).get(\"prompt\", \"\")\n", "        )\n", "        print(f\" {prompt_path.name}: {caption[:180]}{'...' 
if len(caption) > 180 else ''}\")\n", "    print()\n", "\n", "for image_dir in sorted((assets_dir / \"images\").iterdir()):\n", "    if not image_dir.is_dir():\n", "        continue\n", "    print(f\"{image_dir.relative_to(assets_dir)}:\")\n", "    for image_path in sorted(image_dir.iterdir()):\n", "        if image_path.suffix.lower() in {\".jpg\", \".jpeg\", \".png\", \".webp\", \".bmp\"}:\n", "            print(f\" {image_path.name}\")\n", "            display(Image(filename=str(image_path), width=420))\n" ], "execution_count": null, "outputs": [], "id": "973ea472" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Define Asset Sets, Payload Helpers, Request Helpers, and Viewer Helpers\n" ], "id": "cfa34351" }, { "cell_type": "code", "metadata": {}, "source": [ "import json\n", "import os\n", "from pathlib import Path\n", "from IPython.display import Image, display\n", "\n", "IMAGE_EXTENSIONS = {\".jpg\", \".jpeg\", \".png\", \".webp\", \".bmp\"}\n", "\n", "FIXED_SAMPLING = {\n", "    \"num_steps\": 35,\n", "    \"guidance\": 6.0,\n", "    \"shift\": 10.0,\n", "    \"fps\": 24,\n", "    \"num_frames\": 189,\n", "    \"resolution\": \"720\",\n", "    \"aspect_ratio\": \"16,9\",\n", "    \"seed\": 0,\n", "}\n", "\n", "# All asset paths are repo-relative under cookbooks/cosmos3/generator/audiovisual.\n", "# Model and sound choices live in this manifest; folders are organized only by modality.\n", "ASSET_SETS = {\n", "    \"t2i\": {\n", "        \"model\": \"Cosmos3-Nano\",\n", "        \"mode\": \"text2image\",\n", "        \"prompt\": \"assets/prompts/text2image/robot_draping.json\",\n", "        \"enable_sound\": False,\n", "    },\n", "    \"t2i_super\": {\n", "        \"model\": \"Cosmos3-Super\",\n", "        \"mode\": \"text2image\",\n", "        \"prompt\": \"assets/prompts/text2image/robot_draping.json\",\n", "        \"enable_sound\": False,\n", "    },\n", "    \"t2v_nano_noaudio\": {\n", "        \"model\": \"Cosmos3-Nano\",\n", "        \"mode\": \"text2video\",\n", "        \"prompt\": \"assets/prompts/text2video/robot_kitchen.json\",\n", "        \"enable_sound\": False,\n", "    },\n", "    \"t2vs\": {\n", "        \"model\": \"Cosmos3-Nano\",\n", "        \"mode\": \"text2video\",\n", "        \"prompt\": \"assets/prompts/text2video/robot_pouring_water_audio.json\",\n", "        \"enable_sound\": True,\n", "    },\n", "    \"i2v_nano_noaudio\": {\n", "        \"model\": \"Cosmos3-Nano\",\n", "        \"mode\": \"image2video\",\n", "        \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", "        \"image\": \"assets/images/image2video/car_driving.jpg\",\n", "        \"enable_sound\": False,\n", "    },\n", "    \"i2vs\": {\n", "        \"model\": \"Cosmos3-Nano\",\n", "        \"mode\": \"image2video\",\n", "        \"prompt\": \"assets/prompts/image2video/coastal_road_audio.json\",\n", "        \"image\": \"assets/images/image2video/coastal_road_audio.jpg\",\n", "        \"enable_sound\": True,\n", "    },\n", "    \"t2v_super_noaudio\": {\n", "        \"model\": \"Cosmos3-Super\",\n", "        \"mode\": \"text2video\",\n", "        \"prompt\": \"assets/prompts/text2video/robot_kitchen.json\",\n", "        \"enable_sound\": False,\n", "    },\n", "    \"i2v_super_noaudio\": {\n", "        \"model\": \"Cosmos3-Super\",\n", "        \"mode\": \"image2video\",\n", "        \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", "        \"image\": \"assets/images/image2video/car_driving.jpg\",\n", "        \"enable_sound\": False,\n", "    },\n", "}\n", "\n", "\n", "def asset_path(relative_path: str) -> Path:\n", "    path = COSMOS3_AUDIOVISUAL_ROOT / relative_path\n", "    if not path.exists():\n", "        raise FileNotFoundError(path)\n", "    return path.resolve()\n", "\n", "\n", "def compact_json_file(path: Path) -> str:\n", "    return json.dumps(json.loads(path.read_text()), ensure_ascii=True, separators=(\",\", \":\"))\n", "\n", "\n", "def payload_dimensions(payload: dict) -> tuple[int, int]:\n", "    if payload.get(\"resolution\") == \"720\" and payload.get(\"aspect_ratio\") == \"16,9\":\n", "        return 720, 1280\n", "    if payload.get(\"resolution\") == \"256\" and payload.get(\"aspect_ratio\") == \"16,9\":\n", "        return 192, 320\n", "    raise ValueError(f\"Unsupported payload resolution/aspect ratio: {payload.get('resolution')} 
{payload.get('aspect_ratio')}\")\n", "\n", "\n", "def resolve_payload_path(payload_path: Path, value: str) -> Path:\n", "    path = Path(value)\n", "    if path.is_absolute():\n", "        return path\n", "    return (payload_path.parent / path).resolve()\n", "\n", "\n", "def create_payload(use_case: str, *, backend: str) -> tuple[Path, Path, str]:\n", "    spec = ASSET_SETS[use_case]\n", "    payload_dir = Path(os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"]) / backend / \"payloads\" / use_case\n", "    output_dir = Path(os.environ[\"COSMOS3_AUDIOVISUAL_OUTPUT_ROOT\"]) / backend / use_case\n", "    payload_dir.mkdir(parents=True, exist_ok=True)\n", "    output_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "    prompt_path = asset_path(spec[\"prompt\"])\n", "    negative_prompt = \"\"\n", "    if spec[\"mode\"] != \"text2image\":\n", "        negative_prompt_path = asset_path(f\"assets/negative_prompts/{spec['mode']}/neg_prompt.json\")\n", "        negative_prompt = compact_json_file(negative_prompt_path)\n", "    payload_path = payload_dir / f\"{use_case}.json\"\n", "    payload = {\n", "        \"model_mode\": spec[\"mode\"],\n", "        \"name\": use_case,\n", "        \"prompt\": compact_json_file(prompt_path),\n", "        \"negative_prompt\": negative_prompt,\n", "        \"enable_sound\": spec[\"enable_sound\"],\n", "        **FIXED_SAMPLING,\n", "    }\n", "    if spec[\"mode\"] == \"image2video\":\n", "        image_path = asset_path(spec[\"image\"])\n", "        payload[\"vision_path\"] = os.path.relpath(image_path, payload_path.parent)\n", "\n", "    payload_path.write_text(json.dumps(payload, indent=2) + \"\\n\")\n", "\n", "    os.environ[f\"COSMOS3_{backend.upper()}_{use_case.upper()}_INPUT\"] = str(payload_path)\n", "    os.environ[f\"COSMOS3_{backend.upper()}_{use_case.upper()}_OUTPUT\"] = str(output_dir)\n", "\n", "    print(f\"model: {spec['model']}\")\n", "    print(f\"payload: {payload_path}\")\n", "    print(f\"output: {output_dir}\")\n", "    print(f\"prompt: {prompt_path.relative_to(COSMOS_ROOT)}\")\n", "    if \"vision_path\" in payload:\n", "        image_display_path = 
resolve_payload_path(payload_path, payload[\"vision_path\"])\n", "        print(f\"image: {image_display_path.relative_to(COSMOS_ROOT)}\")\n", "        display(Image(filename=str(image_display_path), width=420))\n", "    print(json.dumps({k: payload[k] for k in [\"model_mode\", \"name\", \"enable_sound\", \"num_steps\", \"guidance\", \"shift\", \"fps\", \"num_frames\", \"resolution\", \"aspect_ratio\", \"seed\"]}, indent=2))\n", "    return payload_path, output_dir, spec[\"model\"]\n", "\n", "\n", "import base64\n", "import html\n", "import json\n", "import os\n", "import subprocess\n", "import time\n", "from pathlib import Path\n", "from IPython.display import HTML, display\n", "\n", "\n", "def api_root_url(base_url: str) -> str:\n", "    normalized = base_url.rstrip(\"/\")\n", "    if not normalized.endswith(\"/v1\"):\n", "        normalized = f\"{normalized}/v1\"\n", "    return normalized\n", "\n", "\n", "def video_api_url(base_url: str) -> str:\n", "    return f\"{api_root_url(base_url)}/videos/sync\"\n", "\n", "\n", "def image_api_url(base_url: str) -> str:\n", "    return f\"{api_root_url(base_url)}/images/generations\"\n", "\n", "\n", "def build_vllm_form(payload: dict) -> dict[str, str]:\n", "    height, width = payload_dimensions(payload)\n", "    extra_params = {\n", "        \"use_resolution_template\": False,\n", "        \"use_duration_template\": False,\n", "        \"guardrails\": True,\n", "    }\n", "    form = {\n", "        \"prompt\": payload[\"prompt\"],\n", "        \"negative_prompt\": payload[\"negative_prompt\"],\n", "        \"size\": f\"{width}x{height}\",\n", "        \"num_frames\": str(payload[\"num_frames\"]),\n", "        \"fps\": str(payload[\"fps\"]),\n", "        \"num_inference_steps\": str(payload[\"num_steps\"]),\n", "        \"guidance_scale\": str(payload[\"guidance\"]),\n", "        \"flow_shift\": str(payload[\"shift\"]),\n", "        \"seed\": str(payload[\"seed\"]),\n", "        \"extra_params\": json.dumps(extra_params, separators=(\",\", \":\")),\n", "    }\n", "    if payload[\"enable_sound\"]:\n", "        form[\"generate_sound\"] = \"true\"\n", "        form[\"sound_duration\"] 
= f\"{payload['num_frames'] / payload['fps']:.3f}\"\n", "    return form\n", "\n", "\n", "def build_vllm_image_body(payload: dict) -> dict:\n", "    height, width = payload_dimensions(payload)\n", "    return {\n", "        \"prompt\": payload[\"prompt\"],\n", "        \"size\": f\"{width}x{height}\",\n", "        \"n\": 1,\n", "        \"num_inference_steps\": payload[\"num_steps\"],\n", "        \"guidance_scale\": payload[\"guidance\"],\n", "        \"flow_shift\": payload[\"shift\"],\n", "        \"seed\": payload[\"seed\"],\n", "        \"extra_args\": {\n", "            \"use_resolution_template\": False,\n", "            \"guardrails\": True,\n", "        },\n", "    }\n", "\n", "\n", "def post_video(*, payload_path: Path, payload: dict, output_path: Path, model: str) -> None:\n", "    url = video_api_url(VLLM_ENDPOINTS[model])\n", "    api_key = os.environ.get(\"COSMOS3_VLLM_API_KEY\") or None\n", "    tmp_path = Path(f\"{output_path}.tmp\")\n", "    error_path = Path(f\"{output_path}.error.txt\")\n", "    if tmp_path.exists():\n", "        tmp_path.unlink()\n", "    if error_path.exists():\n", "        error_path.unlink()\n", "\n", "    cmd = [\n", "        \"curl\",\n", "        \"-sS\",\n", "        \"--fail-with-body\",\n", "        \"-X\",\n", "        \"POST\",\n", "        url,\n", "        \"-H\",\n", "        \"Accept: video/mp4\",\n", "    ]\n", "    if api_key is not None:\n", "        cmd += [\"-H\", f\"Authorization: Bearer {api_key}\"]\n", "\n", "    for key, value in build_vllm_form(payload).items():\n", "        cmd += [\"--form-string\", f\"{key}={value}\"]\n", "\n", "    if payload[\"model_mode\"] == \"image2video\":\n", "        image_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", "        cmd += [\"-F\", f\"input_reference=@{image_path}\"]\n", "\n", "    cmd += [\"-o\", str(tmp_path)]\n", "    result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", "    if result.returncode != 0:\n", "        error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", "        raise RuntimeError(f\"vLLM request failed with exit code {result.returncode}; see {error_path}\")\n", "    tmp_path.replace(output_path)\n", "\n", "\n", 
"def post_image(*, payload: dict, output_path: Path, model: str) -> None:\n", "    url = image_api_url(VLLM_ENDPOINTS[model])\n", "    api_key = os.environ.get(\"COSMOS3_VLLM_API_KEY\") or None\n", "    tmp_path = Path(f\"{output_path}.tmp\")\n", "    error_path = Path(f\"{output_path}.error.txt\")\n", "    if tmp_path.exists():\n", "        tmp_path.unlink()\n", "    if error_path.exists():\n", "        error_path.unlink()\n", "\n", "    cmd = [\n", "        \"curl\",\n", "        \"-sS\",\n", "        \"--fail-with-body\",\n", "        \"-X\",\n", "        \"POST\",\n", "        url,\n", "        \"-H\",\n", "        \"Content-Type: application/json\",\n", "    ]\n", "    if api_key is not None:\n", "        cmd += [\"-H\", f\"Authorization: Bearer {api_key}\"]\n", "    cmd += [\"-d\", json.dumps(build_vllm_image_body(payload), separators=(\",\", \":\"))]\n", "\n", "    result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", "    if result.returncode != 0:\n", "        error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", "        raise RuntimeError(f\"vLLM image request failed with exit code {result.returncode}; see {error_path}\")\n", "    try:\n", "        response = json.loads(result.stdout)\n", "        b64_json = response[\"data\"][0][\"b64_json\"]\n", "        tmp_path.write_bytes(base64.b64decode(b64_json))\n", "    except Exception as exc:\n", "        error_path.write_text((result.stdout or \"\") + (result.stderr or \"\"))\n", "        raise RuntimeError(f\"Could not decode vLLM image response; see {error_path}\") from exc\n", "    tmp_path.replace(output_path)\n", "\n", "\n", "def run_vllm_payload(payload_path: Path, output_dir: str | Path, *, model: str) -> Path:\n", "    payload_path = Path(payload_path)\n", "    output_dir = Path(output_dir)\n", "    output_dir.mkdir(parents=True, exist_ok=True)\n", "    payload = json.loads(payload_path.read_text())\n", "    output_ext = \".png\" if payload[\"model_mode\"] == \"text2image\" else \".mp4\"\n", "    output_path = output_dir / f\"{payload['name']}{output_ext}\"\n", "    endpoint = image_api_url(VLLM_ENDPOINTS[model]) if 
payload[\"model_mode\"] == \"text2image\" else video_api_url(VLLM_ENDPOINTS[model])\n", "    print(\"endpoint:\", endpoint)\n", "    print(\"payload:\", payload_path)\n", "    print(\"output:\", output_path)\n", "    if payload[\"model_mode\"] == \"image2video\":\n", "        print(\"input image:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n", "    t0 = time.time()\n", "    if payload[\"model_mode\"] == \"text2image\":\n", "        post_image(payload=payload, output_path=output_path, model=model)\n", "    else:\n", "        post_video(payload_path=payload_path, payload=payload, output_path=output_path, model=model)\n", "    print(f\"wrote {output_path} in {time.time() - t0:.1f}s\")\n", "    return output_path\n", "\n", "\n", "def display_video(path: Path, *, width: int = 720) -> None:\n", "    data = base64.b64encode(path.read_bytes()).decode(\"ascii\")\n", "    label = html.escape(str(path))\n", "    markup = f\"\"\"\n", "\n", "