{ "cells": [ { "cell_type": "markdown", "id": "license-header", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cosmos3 Nano Action: Inverse Dynamics with vLLM-Omni\n", "\n", "This notebook runs Cosmos3 Nano **action inverse-dynamics** inference through the vLLM-Omni OpenAI-compatible video API:\n", "\n", "```text\n", "POST /v1/videos\n", "```\n", "\n", "Inverse dynamics is the reverse of forward dynamics: given a video, it predicts the ego-motion (action) trajectory that produced it. This notebook builds the same custom input spec as [`run_id_with_cosmos_framework.ipynb`](./run_id_with_cosmos_framework.ipynb), keeps the same input-video preview and predicted-trajectory visualization, and only changes the environment setup plus the inference call.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start vLLM-Omni Server\n", "\n", "Start the server in a terminal from the `cosmos` repo root. The container listens on port `8000`; Docker publishes it to host port `8001`, so the notebook uses `http://localhost:8001`.\n", "\n", "```bash\n", "docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true\n", "\n", "docker run -d --name cosmos3-vllm-omni-notebook \\\n", " --runtime nvidia --gpus '\"device=0\"' \\\n", " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", " -v \"/mnt/sdb/.cache/huggingface:/root/.cache/huggingface\" \\\n", " -v \"$PWD:/workspace\" \\\n", " -p 8001:8000 --ipc=host \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Nano \\\n", " --omni \\\n", " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", " --allowed-local-media-path / \\\n", " --port 8000 \\\n", " --init-timeout 1800\n", "\n", "# Wait until this returns model metadata before running the inference cell.\n", "curl http://localhost:8001/v1/models\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "\n", "def 
find_repo_root(start: Path) -> Path:\n", "    \"\"\"Return the first directory at or above `start` that holds README.md and cookbooks/, else `start`.\"\"\"\n", "    for path in [start, *start.parents]:\n", "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", "            return path\n", "    return start\n", "\n", "\n", "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n", "COSMOS3_OUTPUT_ROOT = Path(\n", "    os.environ.get(\"COSMOS3_VLLM_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_vllm\")\n", ").resolve()\n", "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", "# Host-side URL of the vLLM-Omni server (container port 8000 is published on host port 8001).\n", "VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8001\").rstrip(\"/\")\n", "\n", "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n", "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n", "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n", "print(\"COSMOS3_VLLM_BASE_URL:\", VLLM_BASE_URL)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the Inverse-Dynamics Input Spec\n", "\n", "Inverse-dynamics inference is driven by a JSONL spec, one line per run. Unlike forward dynamics, each line provides only an input video (`vision_path`) and **no** `action_path`, because the action is what the model predicts.\n", "\n", "The next cell builds that spec from local AV videos and writes it to:\n", "\n", "```text\n", "outputs/cosmos3_action_vllm/inputs/action_inverse_dynamics_av_custom.jsonl\n", "```\n", "\n", "It mirrors the native PyTorch notebook's spec format. 
The `vision_path` is written as an absolute path because the vLLM request cell reads the spec records directly.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bd4b3ff8", "metadata": {}, "outputs": [], "source": [ "# `os` and the COSMOS3_* paths come from the configuration cell.\n", "import json\n", "\n", "# Local inputs, relative to the cosmos repo root.\n", "input_videos = {\n", "    \"av_inverse_0\": \"cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4\",\n", "    \"av_inverse_1\": \"cookbooks/cosmos3/generator/action/assets/videos/av_1.mp4\",\n", "}\n", "\n", "def resolve_input(rel_path: str) -> str:\n", "    path = (COSMOS_ROOT / rel_path).resolve()\n", "    assert path.exists(), f\"missing input: {path}\"\n", "    return str(path)\n", "\n", "records = [\n", "    {\n", "        \"action_chunk_size\": 60,\n", "        \"domain_name\": \"av\",\n", "        \"fps\": 10,\n", "        \"image_size\": 480,\n", "        \"view_point\": \"ego_view\",\n", "        \"model_mode\": \"inverse_dynamics\",\n", "        \"name\": name,\n", "        \"prompt\": \"You are an autonomous vehicle planning system.\",\n", "        \"seed\": 0,\n", "        \"vision_path\": resolve_input(video_rel),\n", "    }\n", "    for name, video_rel in input_videos.items()\n", "]\n", "\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "id_input_path = COSMOS3_INPUT_DIR / \"action_inverse_dynamics_av_custom.jsonl\"\n", "id_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in records))\n", "id_output_dir = COSMOS3_OUTPUT_ROOT / \"action_inverse_dynamics_av_custom\"\n", "\n", "# The inference cell reads `records` and `id_output_dir` directly; the paths are also\n", "# exported so shell cells or external tools can find the spec and outputs.\n", "os.environ[\"COSMOS3_ID_INPUT\"] = str(id_input_path)\n", "os.environ[\"COSMOS3_ID_OUTPUT\"] = str(id_output_dir)\n", "\n", "print(\"wrote spec:\", id_input_path)\n", "print(\"runs:\", list(input_videos))\n", "print(id_input_path.read_text())" ] }, { "cell_type": "markdown", "id": "0f17af65", "metadata": {}, "source": [ "## Preview the Input Video(s)\n", "\n", "Preview each input 
video before running inference. `Video(..., embed=True)` base64-inlines the file, and these AV clips are several MB each, so we first transcode a small preview (~150 KB) with the ffmpeg binary bundled in `imageio-ffmpeg` (installed by `uv sync`), then embed it. The original input videos are left untouched." ] }, { "cell_type": "code", "execution_count": null, "id": "293b1dfb", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import imageio_ffmpeg\n", "from IPython.display import Video, display\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "\n", "def make_preview(src: Path, dst: Path, crf: int = 28) -> Path:\n", "    \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n", "    if not dst.exists():\n", "        subprocess.run(\n", "            [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", "             \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", "             \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(dst)],\n", "            check=True,\n", "        )\n", "    return dst\n", "\n", "# `records` comes from the prepare cell; preview each input video.\n", "for record in records:\n", "    name = record[\"name\"]\n", "    src = Path(record[\"vision_path\"])\n", "    preview = make_preview(src, COSMOS3_INPUT_DIR / f\"{name}_input_preview.mp4\")\n", "    print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", "    display(Video(str(preview), embed=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Inverse-Dynamics Inference\n", "\n", "The next cell runs `Cosmos3-Nano` on every record in the spec through vLLM-Omni. Inverse dynamics predicts an action, so the cell writes a PyTorch-compatible result file for each run:\n", "\n", "```text\n", "<id_output_dir>/<name>/sample_outputs.json\n", "```\n", "\n", "The predicted action trajectory is stored under `outputs[0].content[\"action\"]`, matching the native notebook's visualization cell. 
The cell also saves the initial API response (`response.json`), the last polled job status (`final.json`), and the raw action payload (`action.json`) in the same directory for API debugging.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from pathlib import Path\n", "\n", "try:\n", "    import requests\n", "except ImportError as exc:\n", "    raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", "\n", "\n", "def check_vllm_server() -> None:\n", "    response = requests.get(f\"{VLLM_BASE_URL}/v1/models\", timeout=10)\n", "    response.raise_for_status()\n", "    print(response.json())\n", "\n", "\n", "def submit_inverse_dynamics(record: dict) -> dict:\n", "    run_dir = id_output_dir / record[\"name\"]\n", "    run_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "    video_path = Path(record[\"vision_path\"])\n", "    extra_params = {\n", "        \"action_mode\": \"inverse_dynamics\",\n", "        \"domain_name\": record[\"domain_name\"],\n", "        \"action_chunk_size\": record[\"action_chunk_size\"],\n", "        \"image_size\": record[\"image_size\"],\n", "        \"view_point\": record[\"view_point\"],\n", "        \"raw_action_dim\": 9,\n", "        \"guardrails\": False,\n", "    }\n", "    form = {\n", "        \"prompt\": record[\"prompt\"],\n", "        \"num_frames\": record[\"action_chunk_size\"] + 1,\n", "        \"fps\": record[\"fps\"],\n", "        \"num_inference_steps\": 30,\n", "        \"guidance_scale\": 1.0,\n", "        \"flow_shift\": 10.0,\n", "        \"seed\": record[\"seed\"],\n", "        \"extra_params\": json.dumps(extra_params),\n", "    }\n", "\n", "    # Submit the job: form fields as strings, the input video as a multipart upload.\n", "    with video_path.open(\"rb\") as video_file:\n", "        response = requests.post(\n", "            f\"{VLLM_BASE_URL}/v1/videos\",\n", "            data={key: str(value) for key, value in form.items()},\n", "            files={\"input_reference\": (video_path.name, video_file, \"video/mp4\")},\n", "            timeout=120,\n", "        )\n", "    response.raise_for_status()\n", "    initial = response.json()\n", "    (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", "\n", "    # Poll the job every 2 seconds until it completes, fails, or is cancelled.\n", "    while True:\n", "        response = 
requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", "        response.raise_for_status()\n", "        final = response.json()\n", "        (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", "        print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", "        if final.get(\"status\") == \"completed\":\n", "            break\n", "        if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", "            raise RuntimeError(json.dumps(final, indent=2))\n", "        time.sleep(2)\n", "\n", "    action = final.get(\"action\")\n", "    if not action or \"data\" not in action:\n", "        raise RuntimeError(f\"vLLM response did not include action data: {json.dumps(final, indent=2)}\")\n", "    (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", "\n", "    # Wrap the action in the layout that the visualization cell reads.\n", "    sample_outputs = {\"outputs\": [{\"content\": {\"action\": action[\"data\"]}}]}\n", "    (run_dir / \"sample_outputs.json\").write_text(json.dumps(sample_outputs, indent=2))\n", "\n", "    print(\"saved\", run_dir / \"sample_outputs.json\")\n", "    print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", "    return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"action\": action}\n", "\n", "\n", "check_vllm_server()\n", "results = []\n", "for record in records:\n", "    print(f\"\\nSubmitting {record['name']}\")\n", "    results.append(submit_inverse_dynamics(record))\n" ] }, { "cell_type": "markdown", "id": "324e6378", "metadata": {}, "source": [ "## Visualize the Predicted Action\n", "\n", "Plot the action the model predicted from each input video, as a 3D camera path (with frustums) and a top-down bird's-eye view. The action is read from each run's `sample_outputs.json` and interpreted with the AV pose convention." 
] }, { "cell_type": "code", "execution_count": null, "id": "e7808372", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import json\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib.collections import LineCollection\n", "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n", "\n", "# The notebook kernel may differ from the framework venv, so put the repo on the\n", "# path before importing `cosmos_framework`.\n", "if str(COSMOS3_REPO) not in sys.path:\n", "    sys.path.insert(0, str(COSMOS3_REPO))\n", "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n", "\n", "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n", "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n", "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]\n", "\n", "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n", "                   fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n", "                   title=None, save_path=None, show=True):\n", "    \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\n", "\n", "    AV convention: world Y is up, world +Z is the heading. `vertical_exaggeration`\n", "    stretches only the up-axis box (uniform world scaling, so frustums never skew);\n", "    1.0 = geometrically faithful. 
The 3D plot reorders world (X, Y, Z) -> (X, Z, Y)\n", "    so Y points up on screen.\n", "    \"\"\"\n", "    poses_abs = np.asarray(poses_abs)\n", "    pos = poses_abs[:, :3, 3]  # camera centers [T, 3]\n", "    fwd = poses_abs[:, :3, 2]  # heading (+Z) [T, 3]\n", "    T = len(pos)\n", "    colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n", "    scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n", "    step = max(1, T // max(n_frustums, 1))\n", "    xzy = [0, 2, 1]  # world (X,Y,Z) -> plot (X, Z, Y-up)\n", "\n", "    fig = plt.figure(figsize=(14, 6))\n", "\n", "    # (1) 3D perspective with frustums\n", "    ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n", "    path = pos[:, xzy]\n", "    ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n", "    lines, lcolors, allpts = [], [], [path]\n", "    for i in range(0, T, step):\n", "        cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n", "              @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy]  # frustum in plot coords\n", "        allpts.append(cw)\n", "        lines += [[cw[a], cw[b]] for a, b in _EDGES]\n", "        lcolors += [colors[i]] * len(_EDGES)\n", "    ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n", "    ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", "    ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", "    rng = np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n", "    ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n", "    ax.set_xlabel(\"X (m)\", labelpad=12); ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n", "    ax.set_zlabel(\"Y up (m)\", labelpad=10); ax.set_zticks([])\n", "    ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n", "    ax.legend(loc=\"upper left\"); ax.view_init(elev=22, azim=-70)\n", "\n", "    # (2) top-down bird's-eye view (X-Z ground plane)\n", "    ax2 = fig.add_subplot(1, 2, 2)\n", "    seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], axis=1)\n", "    lc = 
LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n", "    lc.set_array(np.arange(T - 1)); ax2.add_collection(lc)\n", "    ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n", "               color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n", "    ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", "    ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", "    ax2.set_xlabel(\"X (m)\"); ax2.set_ylabel(\"Z forward (m)\")\n", "    ax2.set_title(\"Top-down (bird's-eye view)\")\n", "    ax2.set_aspect(\"equal\", adjustable=\"datalim\"); ax2.autoscale_view(); ax2.legend()\n", "    fig.colorbar(lc, ax=ax2, label=\"frame index\")\n", "\n", "    plt.tight_layout(w_pad=6)\n", "    if save_path:\n", "        fig.savefig(save_path, dpi=120, bbox_inches=\"tight\"); print(\"saved\", save_path)\n", "    if show:\n", "        plt.show()\n", "\n", "# `records` and `id_output_dir` come from the prepare cell; read each run's\n", "# predicted action from its sample_outputs.json.\n", "for record in records:\n", "    name = record[\"name\"]\n", "    outputs = json.loads((id_output_dir / name / \"sample_outputs.json\").read_text())\n", "    poses_rel = np.array(outputs[\"outputs\"][0][\"content\"][\"action\"])  # [T-1, 9] = [translation(3), rot6d(6)]\n", "\n", "    # AV action convention (see cosmos_framework/data/vfm/action/av_dataset.py):\n", "    # rot6d rotation, backward_framewise, translation_scale = 1.35.\n", "    poses_abs = pose_rel_to_abs(\n", "        poses_rel,\n", "        rotation_format=\"rot6d\",\n", "        pose_convention=\"backward_framewise\",\n", "        translation_scale=1.35,\n", "    )  # [T, 4, 4] camera-to-world\n", "    print(name, poses_rel.shape, poses_abs.shape)\n", "    visualize_pose(poses_abs, title=f\"{name}: predicted camera trajectory + frustums ({len(poses_abs)} frames)\", show=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" 
}, "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }