{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "license-header",
   "metadata": {},
   "source": [
    "<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\nSPDX-License-Identifier: OpenMDW-1.1 -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cosmos3 Nano Action: Inverse Dynamics with vLLM-Omni\n",
    "\n",
    "This notebook runs Cosmos3 Nano **action inverse-dynamics** inference through the vLLM-Omni OpenAI-compatible video API:\n",
    "\n",
    "```text\n",
    "POST /v1/videos\n",
    "```\n",
    "\n",
    "Inverse dynamics is the reverse of forward dynamics: given a video, it predicts the ego-motion (action) trajectory that produced it. This notebook builds the same custom input spec as [`run_id_with_cosmos_framework.ipynb`](./run_id_with_cosmos_framework.ipynb), keeps the same input-video preview and predicted-trajectory visualization, and only changes the environment setup plus the inference call.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start vLLM-Omni Server\n",
    "\n",
    "Start the server in a terminal from the `cosmos` repo root. The container listens on port `8000`; Docker publishes it to host port `8001`, so the notebook uses `http://localhost:8001`.\n",
    "\n",
    "```bash\n",
    "docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true\n",
    "\n",
    "docker run -d --name cosmos3-vllm-omni-notebook \\\n",
    "  --runtime nvidia --gpus '\"device=0\"' \\\n",
    "  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n",
    "  -v \"/mnt/sdb/.cache/huggingface:/root/.cache/huggingface\" \\\n",
    "  -v \"$PWD:/workspace\" \\\n",
    "  -p 8001:8000 --ipc=host \\\n",
    "  vllm/vllm-omni:cosmos3 \\\n",
    "  vllm serve nvidia/Cosmos3-Nano \\\n",
    "    --omni \\\n",
    "    --model-class-name Cosmos3OmniDiffusersPipeline \\\n",
    "    --allowed-local-media-path / \\\n",
    "    --port 8000 \\\n",
    "    --init-timeout 1800\n",
    "\n",
    "# Wait until this returns model metadata before running the inference cell.\n",
    "curl http://localhost:8001/v1/models\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "\n",
    "def find_repo_root(start: Path) -> Path:\n",
    "    for path in [start, *start.parents]:\n",
    "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n",
    "            return path\n",
    "    return start\n",
    "\n",
    "\n",
    "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n",
    "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n",
    "COSMOS3_OUTPUT_ROOT = Path(\n",
    "    os.environ.get(\"COSMOS3_VLLM_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_vllm\")\n",
    ").resolve()\n",
    "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n",
    "VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8001\").rstrip(\"/\")\n",
    "\n",
    "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n",
    "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n",
    "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n",
    "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n",
    "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n",
    "print(\"COSMOS3_VLLM_BASE_URL:\", VLLM_BASE_URL)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the Inverse-Dynamics Input Spec\n",
    "\n",
    "Inverse-dynamics inference is driven by a JSONL spec, one line per run. Unlike forward dynamics, each line provides only an input video (`vision_path`) and **no** `action_path` — the action is what the model predicts.\n",
    "\n",
    "This cell builds that spec from local AV videos, writing it under:\n",
    "\n",
    "```text\n",
    "outputs/cosmos3_action_vllm/inputs/action_inverse_dynamics_av_custom.jsonl\n",
    "```\n",
    "\n",
    "It mirrors the native PyTorch notebook's spec format. The `vision_path` is written as an absolute path, because the vLLM request cell reads the spec records directly.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd4b3ff8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# `os` and the COSMOS3_* paths come from the configuration cell.\n",
    "import json\n",
    "\n",
    "# Local inputs, relative to the cosmos repo root.\n",
    "input_videos = {\n",
    "    \"av_inverse_0\": \"cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4\",\n",
    "    \"av_inverse_1\": \"cookbooks/cosmos3/generator/action/assets/videos/av_1.mp4\",\n",
    "}\n",
    "\n",
    "def resolve_input(rel_path: str) -> str:\n",
    "    path = (COSMOS_ROOT / rel_path).resolve()\n",
    "    assert path.exists(), f\"missing input: {path}\"\n",
    "    return str(path)\n",
    "\n",
    "records = [\n",
    "    {\n",
    "        \"action_chunk_size\": 60,\n",
    "        \"domain_name\": \"av\",\n",
    "        \"fps\": 10,\n",
    "        \"image_size\": 480,\n",
    "        \"view_point\": \"ego_view\",\n",
    "        \"model_mode\": \"inverse_dynamics\",\n",
    "        \"name\": name,\n",
    "        \"prompt\": \"You are an autonomous vehicle planning system.\",\n",
    "        \"seed\": 0,\n",
    "        \"vision_path\": resolve_input(video_rel),\n",
    "    }\n",
    "    for name, video_rel in input_videos.items()\n",
    "]\n",
    "\n",
    "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "id_input_path = COSMOS3_INPUT_DIR / \"action_inverse_dynamics_av_custom.jsonl\"\n",
    "id_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in records))\n",
    "id_output_dir = COSMOS3_OUTPUT_ROOT / \"action_inverse_dynamics_av_custom\"\n",
    "\n",
    "# The bash inference cell can only see the environment, so export the paths it needs.\n",
    "os.environ[\"COSMOS3_ID_INPUT\"] = str(id_input_path)\n",
    "os.environ[\"COSMOS3_ID_OUTPUT\"] = str(id_output_dir)\n",
    "\n",
    "print(\"wrote spec:\", id_input_path)\n",
    "print(\"runs:\", list(input_videos))\n",
    "print(id_input_path.read_text())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f17af65",
   "metadata": {},
   "source": [
    "## Preview the Input Video(s)\n",
    "\n",
    "Preview each input video before running inference. `Video(..., embed=True)` base64-inlines the file, and these AV clips are several MB each, so we first transcode a small preview (~150 KB) with the ffmpeg binary bundled in `imageio-ffmpeg` (installed by `uv sync`), then embed it. The original input videos are left untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "293b1dfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import imageio_ffmpeg\n",
    "from IPython.display import Video, display\n",
    "\n",
    "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n",
    "\n",
    "def make_preview(src: Path, dst: Path, crf: int = 28) -> Path:\n",
    "    \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n",
    "    if not dst.exists():\n",
    "        subprocess.run(\n",
    "            [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n",
    "             \"-c:v\", \"libx264\", \"-crf\", str(crf),\n",
    "             \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(dst)],\n",
    "            check=True,\n",
    "        )\n",
    "    return dst\n",
    "\n",
    "# `records` comes from the prepare cell; preview each input video.\n",
    "for record in records:\n",
    "    name = record[\"name\"]\n",
    "    src = Path(record[\"vision_path\"])\n",
    "    preview = make_preview(src, COSMOS3_INPUT_DIR / f\"{name}_input_preview.mp4\")\n",
    "    print(f\"{name}  ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n",
    "    display(Video(str(preview), embed=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run Inverse-Dynamics Inference\n",
    "\n",
    "Runs `Cosmos3-Nano` on every line of the spec through vLLM-Omni. Inverse dynamics predicts an action, and this cell writes a PyTorch-compatible result file for each run:\n",
    "\n",
    "```text\n",
    "<output>/<name>/sample_outputs.json\n",
    "```\n",
    "\n",
    "The predicted action trajectory is stored under `outputs[0].content[\"action\"]`, matching the native notebook's visualization cell. vLLM also returns `response.json`, `final.json`, and `action.json` for API debugging.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import time\n",
    "from pathlib import Path\n",
    "\n",
    "try:\n",
    "    import requests\n",
    "except ImportError as exc:\n",
    "    raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n",
    "\n",
    "\n",
    "def check_vllm_server() -> None:\n",
    "    response = requests.get(f\"{VLLM_BASE_URL}/v1/models\", timeout=10)\n",
    "    response.raise_for_status()\n",
    "    print(response.json())\n",
    "\n",
    "\n",
    "def submit_inverse_dynamics(record: dict) -> dict:\n",
    "    run_dir = id_output_dir / record[\"name\"]\n",
    "    run_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "    video_path = Path(record[\"vision_path\"])\n",
    "    extra_params = {\n",
    "        \"action_mode\": \"inverse_dynamics\",\n",
    "        \"domain_name\": record[\"domain_name\"],\n",
    "        \"action_chunk_size\": record[\"action_chunk_size\"],\n",
    "        \"image_size\": record[\"image_size\"],\n",
    "        \"view_point\": record[\"view_point\"],\n",
    "        \"raw_action_dim\": 9,\n",
    "        \"guardrails\": False,\n",
    "    }\n",
    "    form = {\n",
    "        \"prompt\": record[\"prompt\"],\n",
    "        \"num_frames\": record[\"action_chunk_size\"] + 1,\n",
    "        \"fps\": record[\"fps\"],\n",
    "        \"num_inference_steps\": 30,\n",
    "        \"guidance_scale\": 1.0,\n",
    "        \"flow_shift\": 10.0,\n",
    "        \"seed\": record[\"seed\"],\n",
    "        \"extra_params\": json.dumps(extra_params),\n",
    "    }\n",
    "\n",
    "    with video_path.open(\"rb\") as video_file:\n",
    "        response = requests.post(\n",
    "            f\"{VLLM_BASE_URL}/v1/videos\",\n",
    "            data={key: str(value) for key, value in form.items()},\n",
    "            files={\"input_reference\": (video_path.name, video_file, \"video/mp4\")},\n",
    "            timeout=120,\n",
    "        )\n",
    "    response.raise_for_status()\n",
    "    initial = response.json()\n",
    "    (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n",
    "\n",
    "    while True:\n",
    "        response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n",
    "        response.raise_for_status()\n",
    "        final = response.json()\n",
    "        (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n",
    "        print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n",
    "        if final.get(\"status\") == \"completed\":\n",
    "            break\n",
    "        if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n",
    "            raise RuntimeError(json.dumps(final, indent=2))\n",
    "        time.sleep(2)\n",
    "\n",
    "    action = final.get(\"action\")\n",
    "    if not action or \"data\" not in action:\n",
    "        raise RuntimeError(f\"vLLM response did not include action data: {json.dumps(final, indent=2)}\")\n",
    "    (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n",
    "\n",
    "    sample_outputs = {\"outputs\": [{\"content\": {\"action\": action[\"data\"]}}]}\n",
    "    (run_dir / \"sample_outputs.json\").write_text(json.dumps(sample_outputs, indent=2))\n",
    "\n",
    "    print(\"saved\", run_dir / \"sample_outputs.json\")\n",
    "    print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n",
    "    return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"action\": action}\n",
    "\n",
    "\n",
    "check_vllm_server()\n",
    "results = []\n",
    "for record in records:\n",
    "    print(f\"\\nSubmitting {record['name']}\")\n",
    "    results.append(submit_inverse_dynamics(record))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "324e6378",
   "metadata": {},
   "source": [
    "## Visualize the Predicted Action\n",
    "\n",
    "Plot the action the model predicted from each input video, as a 3D camera path (with frustums) and a top-down bird's-eye view. The action is read from each run's `sample_outputs.json` and interpreted with the AV pose convention."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7808372",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "import json\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.collections import LineCollection\n",
    "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n",
    "\n",
    "# The notebook kernel may differ from the framework venv, so put the repo on the\n",
    "# path before importing `cosmos_framework`.\n",
    "if str(COSMOS3_REPO) not in sys.path:\n",
    "    sys.path.insert(0, str(COSMOS3_REPO))\n",
    "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n",
    "\n",
    "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n",
    "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n",
    "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (4, 1)]\n",
    "\n",
    "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n",
    "                   fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n",
    "                   title=None, save_path=None, show=True):\n",
    "    \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\n",
    "\n",
    "    AV convention: world Y is up, world +Z is the heading. `vertical_exaggeration`\n",
    "    stretches only the up-axis box (uniform world scaling, so frustums never skew);\n",
    "    1.0 = geometrically faithful. The 3D plot reorders world (X, Y, Z) -> (X, Z, Y)\n",
    "    so Y points up on screen.\n",
    "    \"\"\"\n",
    "    poses_abs = np.asarray(poses_abs)\n",
    "    pos = poses_abs[:, :3, 3]                     # camera centers [T, 3]\n",
    "    fwd = poses_abs[:, :3, 2]                     # heading (+Z) [T, 3]\n",
    "    T = len(pos)\n",
    "    colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n",
    "    scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n",
    "    step = max(1, T // max(n_frustums, 1))\n",
    "    xzy = [0, 2, 1]                               # world (X,Y,Z) -> plot (X, Z, Y-up)\n",
    "\n",
    "    fig = plt.figure(figsize=(14, 6))\n",
    "\n",
    "    # (1) 3D perspective with frustums\n",
    "    ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n",
    "    path = pos[:, xzy]\n",
    "    ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n",
    "    lines, lcolors, allpts = [], [], [path]\n",
    "    for i in range(0, T, step):\n",
    "        cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n",
    "              @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy]  # frustum in plot coords\n",
    "        allpts.append(cw)\n",
    "        lines += [[cw[a], cw[b]] for a, b in _EDGES]\n",
    "        lcolors += [colors[i]] * len(_EDGES)\n",
    "    ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n",
    "    ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n",
    "    ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n",
    "    rng = np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n",
    "    ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n",
    "    ax.set_xlabel(\"X (m)\", labelpad=12); ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n",
    "    ax.set_zlabel(\"Y up (m)\", labelpad=10); ax.set_zticks([])\n",
    "    ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n",
    "    ax.legend(loc=\"upper left\"); ax.view_init(elev=22, azim=-70)\n",
    "\n",
    "    # (2) top-down bird's-eye view (X-Z ground plane)\n",
    "    ax2 = fig.add_subplot(1, 2, 2)\n",
    "    seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], axis=1)\n",
    "    lc = LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n",
    "    lc.set_array(np.arange(T - 1)); ax2.add_collection(lc)\n",
    "    ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n",
    "               color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n",
    "    ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n",
    "    ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n",
    "    ax2.set_xlabel(\"X (m)\"); ax2.set_ylabel(\"Z forward (m)\")\n",
    "    ax2.set_title(\"Top-down (bird's-eye view)\")\n",
    "    ax2.set_aspect(\"equal\", adjustable=\"datalim\"); ax2.autoscale_view(); ax2.legend()\n",
    "    fig.colorbar(lc, ax=ax2, label=\"frame index\")\n",
    "\n",
    "    plt.tight_layout(w_pad=6)\n",
    "    if save_path:\n",
    "        fig.savefig(save_path, dpi=120, bbox_inches=\"tight\"); print(\"saved\", save_path)\n",
    "    if show:\n",
    "        plt.show()\n",
    "\n",
    "# `records` and `id_output_dir` come from the prepare cell; read each run's\n",
    "# predicted action from its sample_outputs.json.\n",
    "for record in records:\n",
    "    name = record[\"name\"]\n",
    "    outputs = json.loads((id_output_dir / name / \"sample_outputs.json\").read_text())\n",
    "    poses_rel = np.array(outputs[\"outputs\"][0][\"content\"][\"action\"])  # [T-1, 9] = [translation(3), rot6d(6)]\n",
    "\n",
    "    # AV action convention (see cosmos_framework/data/vfm/action/av_dataset.py):\n",
    "    # rot6d rotation, backward_framewise, translation_scale = 1.35.\n",
    "    poses_abs = pose_rel_to_abs(\n",
    "        poses_rel,\n",
    "        rotation_format=\"rot6d\",\n",
    "        pose_convention=\"backward_framewise\",\n",
    "        translation_scale=1.35,\n",
    "    )  # [T, 4, 4] camera-to-world\n",
    "    print(name, poses_rel.shape, poses_abs.shape)\n",
    "    visualize_pose(poses_abs, title=f\"{name}: predicted camera trajectory + frustums ({len(poses_abs)} frames)\", show=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}