{ "cells": [ { "cell_type": "markdown", "id": "license-header", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "fdvl-title", "metadata": {}, "source": [ "# Cosmos3 Nano Action: Forward Dynamics with vLLM-Omni\n", "\n", "This notebook runs Cosmos3 Nano **action forward-dynamics** inference through the vLLM-Omni OpenAI-compatible video API:\n", "\n", "```text\n", "POST /v1/videos\n", "```\n", "\n", "Forward dynamics predicts future visual observations from an initial image and an action trajectory. This notebook contains separate AV and robotics sections that each build their own input spec, run inference, and visualize generated videos.\n", "\n", "Start the server in a terminal from the `cosmos` repo root. The container listens on port `8000`; Docker publishes it to host port `8001`, so the notebook uses `http://localhost:8001`.\n", "\n", "```bash\n", "docker rm -f cosmos3-vllm-omni-notebook 2>/dev/null || true\n", "\n", "docker run -d --name cosmos3-vllm-omni-notebook \\\n", " --runtime nvidia --gpus '\"device=0\"' \\\n", " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", " -v \"/mnt/sdb/.cache/huggingface:/root/.cache/huggingface\" \\\n", " -v \"$PWD:/workspace\" \\\n", " -p 8001:8000 --ipc=host \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Nano \\\n", " --omni \\\n", " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", " --allowed-local-media-path / \\\n", " --port 8000 \\\n", " --init-timeout 1800\n", "\n", "# Wait until this returns model metadata before running the inference cell.\n", "curl http://localhost:8001/v1/models\n", "```\n" ] }, { "cell_type": "markdown", "id": "fdvl-vars-md", "metadata": {}, "source": [ "## Configure Notebook Variables\n", "\n", "Run this cell after the vLLM-Omni server is available. 
It resolves local input/output paths and stores generated outputs under `outputs/cosmos3_action_vllm/` by default.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-vars-code", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "\n", "def find_repo_root(start: Path) -> Path:\n", "    for path in [start, *start.parents]:\n", "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", "            return path\n", "\n", "    return start\n", "\n", "\n", "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n", "COSMOS3_OUTPUT_ROOT = Path(\n", "    os.environ.get(\"COSMOS3_VLLM_OUTPUT_ROOT\", COSMOS_ROOT / \"outputs\" / \"cosmos3_action_vllm\")\n", ").resolve()\n", "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", "VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8001\").rstrip(\"/\")\n", "\n", "\n", "def resolve_input(rel_path: str) -> str:\n", "    path = (COSMOS_ROOT / rel_path).resolve()\n", "    assert path.exists(), f\"missing input: {path}\"\n", "    return str(path)\n", "\n", "\n", "COSMOS3_OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", "print(\"COSMOS3_REPO:\", COSMOS3_REPO)\n", "print(\"COSMOS3_INPUT_DIR:\", COSMOS3_INPUT_DIR)\n", "print(\"COSMOS3_OUTPUT_ROOT:\", COSMOS3_OUTPUT_ROOT)\n", "print(\"COSMOS3_VLLM_BASE_URL:\", VLLM_BASE_URL)\n" ] }, { "cell_type": "markdown", "id": "fdvl-av-md", "metadata": {}, "source": [ "## AV\n", "\n", "This example shows how to provide the ego poses of an autonomous vehicle together with a start image to generate driving videos with Cosmos3-Nano.\n" ] }, { "cell_type": "markdown", "id": "fdvl-av-spec-md", "metadata": {}, "source": [ "### Create the AV Forward-Dynamics Input Spec\n", "\n", "AV forward-dynamics inference is driven 
by a JSONL spec, one line per run. Each line shares the same start frame (`vision_path`) but uses a different ego trajectory (`action_path`), so we get one generated video per trajectory.\n", "\n", "The action input is a JSON file, which can be converted from camera poses (camera-to-world transformations, OpenCV convention, units in meters) via `pose_abs_to_rel`:\n", "\n", "```python\n", "import json\n", "import sys\n", "\n", "import numpy as np\n", "\n", "if str(COSMOS3_REPO) not in sys.path:\n", "    sys.path.insert(0, str(COSMOS3_REPO))\n", "from cosmos_framework.data.vfm.action.pose_utils import pose_abs_to_rel\n", "\n", "poses_abs = np.array([...])  # [T, 4, 4], camera-to-world transformations, OpenCV convention, units in meters\n", "poses_rel = pose_abs_to_rel(\n", "    poses_abs,\n", "    rotation_format=\"rot6d\",\n", "    pose_convention=\"backward_framewise\",\n", "    translation_scale=1.35,\n", ")  # [T-1, 9], translation(3), rot6d(6), framewise relative transformation\n", "\n", "with open(\"custom_traj.json\", \"w\") as f:\n", "    # Arrays are not JSON-serializable, so convert to nested lists first.\n", "    json.dump(np.asarray(poses_rel).tolist(), f)\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-av-spec-code", "metadata": {}, "outputs": [], "source": [ "# `resolve_input` and the COSMOS3_* paths come from the variables cell.\n", "import json\n", "import os\n", "\n", "# Local AV inputs, relative to the cosmos repo root.\n", "av_input_image = \"cookbooks/cosmos3/generator/action/assets/images/av_0.jpg\"\n", "av_input_actions = {\n", "    \"av_forward\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json\",\n", "    \"av_left\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json\",\n", "    \"av_right\": \"cookbooks/cosmos3/generator/action/assets/actions/av_traj_right.json\",\n", "}\n", "\n", "av_vision_path = resolve_input(av_input_image)\n", "av_records = [\n", "    {\n", "        \"action_chunk_size\": 60,\n", "        \"action_path\": resolve_input(action_rel),\n", "        \"domain_name\": \"av\",\n", "        \"fps\": 10,\n", "        \"image_size\": 480,\n", "        \"view_point\": \"ego_view\",\n", "        \"model_mode\": 
\"forward_dynamics\",\n", "        \"name\": name,\n", "        \"prompt\": \"You are an autonomous vehicle planning system.\",\n", "        \"seed\": 0,\n", "        \"vision_path\": av_vision_path,\n", "    }\n", "    for name, action_rel in av_input_actions.items()\n", "]\n", "\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "av_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_av_custom.jsonl\"\n", "av_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in av_records))\n", "av_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_av_custom\"\n", "\n", "os.environ[\"COSMOS3_AV_FD_INPUT\"] = str(av_fd_input_path)\n", "os.environ[\"COSMOS3_AV_FD_OUTPUT\"] = str(av_fd_output_dir)\n", "\n", "print(\"wrote AV spec:\", av_fd_input_path)\n", "print(\"AV runs:\", list(av_input_actions))\n", "print(av_fd_input_path.read_text())\n" ] }, { "cell_type": "markdown", "id": "fdvl-av-traj-md", "metadata": {}, "source": [ "### Visualize AV Input Trajectories\n", "\n", "Before generating any video, plot each input ego trajectory as a 3D camera path with frustums and a top-down bird's-eye view.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-av-traj-code", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import json\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib.collections import LineCollection\n", "from mpl_toolkits.mplot3d.art3d import Line3DCollection\n", "\n", "# The notebook kernel may differ from the framework venv, so put the repo on the\n", "# path before importing `cosmos_framework`.\n", "if str(COSMOS3_REPO) not in sys.path:\n", "    sys.path.insert(0, str(COSMOS3_REPO))\n", "from cosmos_framework.data.vfm.action.pose_utils import pose_rel_to_abs\n", "\n", "# frustum: apex + image-rectangle corners (camera +Z forward), and their edges\n", "_FRUSTUM = np.array([[0, 0, 0], [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]], float)\n", "_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 
4), (4, 1)]\n", "\n", "\n", "def visualize_pose(poses_abs, *, n_frustums=20, scale_frac=0.03, aspect=16 / 9,\n", " fov_deg=60.0, vertical_exaggeration=1.0, cmap=\"turbo\",\n", " title=None, save_path=None, show=True):\n", " \"\"\"3D camera trajectory (with frustums) + a top-down bird's-eye view.\"\"\"\n", " poses_abs = np.asarray(poses_abs)\n", " pos = poses_abs[:, :3, 3]\n", " fwd = poses_abs[:, :3, 2]\n", " T = len(pos)\n", " colors = plt.get_cmap(cmap)(np.arange(T) / max(T - 1, 1))\n", " scale = max(np.ptp(pos, axis=0).max() * scale_frac, 1e-3)\n", " step = max(1, T // max(n_frustums, 1))\n", " xzy = [0, 2, 1]\n", "\n", " fig = plt.figure(figsize=(14, 6))\n", "\n", " ax = fig.add_subplot(1, 2, 1, projection=\"3d\")\n", " path = pos[:, xzy]\n", " ax.plot(*path.T, color=\"0.6\", lw=1.0, alpha=0.7)\n", " lines, lcolors, allpts = [], [], [path]\n", " for i in range(0, T, step):\n", " cw = ((_FRUSTUM * [aspect, 1, 1] * scale * np.tan(np.radians(fov_deg) / 2))\n", " @ poses_abs[i, :3, :3].T + poses_abs[i, :3, 3])[:, xzy]\n", " allpts.append(cw)\n", " lines += [[cw[a], cw[b]] for a, b in _EDGES]\n", " lcolors += [colors[i]] * len(_EDGES)\n", " ax.add_collection3d(Line3DCollection(lines, colors=lcolors, linewidths=1.2))\n", " ax.scatter(*path[0], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", " ax.scatter(*path[-1], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", " rng = np.clip(np.ptp(np.concatenate(allpts), axis=0), 1e-9, None)\n", " ax.set_box_aspect((rng[0], rng[1], rng[2] * vertical_exaggeration))\n", " ax.set_xlabel(\"X (m)\", labelpad=12)\n", " ax.set_ylabel(\"Z forward (m)\", labelpad=12)\n", " ax.set_zlabel(\"Y up (m)\", labelpad=10)\n", " ax.set_zticks([])\n", " ax.set_title(title or f\"Camera trajectory + frustums ({T} frames)\")\n", " ax.legend(loc=\"upper left\")\n", " ax.view_init(elev=22, azim=-70)\n", "\n", " ax2 = fig.add_subplot(1, 2, 2)\n", " seg = np.stack([pos[:-1, [0, 2]], pos[1:, [0, 2]]], 
axis=1)\n", " lc = LineCollection(seg, cmap=cmap, norm=plt.Normalize(0, T - 1), linewidth=2.5)\n", " lc.set_array(np.arange(T - 1))\n", " ax2.add_collection(lc)\n", " ax2.quiver(pos[::step, 0], pos[::step, 2], fwd[::step, 0], fwd[::step, 2],\n", " color=colors[::step], angles=\"xy\", width=0.005, scale=22, zorder=3)\n", " ax2.scatter(*pos[0, [0, 2]], color=\"lime\", s=80, edgecolor=\"k\", label=\"first frame\", zorder=5)\n", " ax2.scatter(*pos[-1, [0, 2]], color=\"red\", s=80, edgecolor=\"k\", label=\"last frame\", zorder=5)\n", " ax2.set_xlabel(\"X (m)\")\n", " ax2.set_ylabel(\"Z forward (m)\")\n", " ax2.set_title(\"Top-down (bird's-eye view)\")\n", " ax2.set_aspect(\"equal\", adjustable=\"datalim\")\n", " ax2.autoscale_view()\n", " ax2.legend()\n", " fig.colorbar(lc, ax=ax2, label=\"frame index\")\n", "\n", " plt.tight_layout(w_pad=6)\n", " if save_path:\n", " fig.savefig(save_path, dpi=120, bbox_inches=\"tight\")\n", " print(\"saved\", save_path)\n", " if show:\n", " plt.show()\n", "\n", "\n", "for record in av_records:\n", " name = record[\"name\"]\n", " with open(record[\"action_path\"]) as f:\n", " poses_rel = np.array(json.load(f))\n", "\n", " # AV action convention: rot6d rotation, backward_framewise, translation_scale = 1.35.\n", " poses_abs = pose_rel_to_abs(\n", " poses_rel,\n", " rotation_format=\"rot6d\",\n", " pose_convention=\"backward_framewise\",\n", " translation_scale=1.35,\n", " )\n", " print(name, poses_rel.shape, poses_abs.shape)\n", " visualize_pose(poses_abs, title=f\"{name}: camera trajectory + frustums ({len(poses_abs)} frames)\", show=True)\n" ] }, { "cell_type": "markdown", "id": "fdvl-av-run-md", "metadata": {}, "source": [ "### Run AV Forward-Dynamics Inference\n", "\n", "Runs `Cosmos3-Nano` on every line of the AV spec through vLLM-Omni. 
Each run writes its video to:\n", "\n", "```text\n", "<COSMOS3_OUTPUT_ROOT>/action_forward_dynamics_av_custom/<name>/vision.mp4\n", "```\n", "\n", "The request sets top-level `size` to the conditioning image resolution so vLLM-Omni returns output at the input resolution without reflection padding.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-av-run-code", "metadata": {}, "outputs": [], "source": [ "import json\n", "import mimetypes\n", "import time\n", "from pathlib import Path\n", "\n", "from PIL import Image\n", "\n", "try:\n", "    import requests\n", "except ImportError as exc:\n", "    raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", "\n", "\n", "def check_vllm_server(timeout_s: int = 600, interval_s: int = 10) -> None:\n", "    deadline = time.time() + timeout_s\n", "    last_error: Exception | None = None\n", "    while time.time() < deadline:\n", "        try:\n", "            response = requests.get(f\"{VLLM_BASE_URL}/v1/models\", timeout=10)\n", "            response.raise_for_status()\n", "            print(response.json())\n", "            return\n", "        except requests.RequestException as exc:\n", "            last_error = exc\n", "            print(f\"Waiting for vLLM server at {VLLM_BASE_URL}: {exc}\")\n", "            time.sleep(interval_s)\n", "    raise RuntimeError(\n", "        f\"vLLM server did not become ready at {VLLM_BASE_URL} within {timeout_s}s. 
\"\n", " \"Check `docker logs -f cosmos3-vllm-omni-notebook`.\"\n", " ) from last_error\n", "\n", "\n", "def submit_forward_dynamics(record: dict, fd_output_dir: Path) -> dict:\n", " run_dir = fd_output_dir / record[\"name\"]\n", " run_dir.mkdir(parents=True, exist_ok=True)\n", "\n", " vision_path = Path(record[\"vision_path\"])\n", " input_width, input_height = Image.open(vision_path).size\n", " mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", " extra_params = {\n", " \"action_mode\": \"forward_dynamics\",\n", " \"domain_name\": record[\"domain_name\"],\n", " \"action_chunk_size\": record[\"action_chunk_size\"],\n", " \"image_size\": record[\"image_size\"],\n", " \"view_point\": record[\"view_point\"],\n", " \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", " \"guardrails\": False,\n", " }\n", " prompt = str(record.get(\"prompt\") or \"\").strip() or \"A robot manipulates an object.\"\n", " form = {\n", " \"prompt\": prompt,\n", " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", " \"fps\": record[\"fps\"],\n", " \"size\": f\"{input_width}x{input_height}\",\n", " \"num_inference_steps\": 30,\n", " \"guidance_scale\": 1.0,\n", " \"flow_shift\": 10.0,\n", " \"seed\": record[\"seed\"],\n", " \"extra_params\": json.dumps(extra_params),\n", " }\n", "\n", " with vision_path.open(\"rb\") as image_file:\n", " response = requests.post(\n", " f\"{VLLM_BASE_URL}/v1/videos\",\n", " data={key: str(value) for key, value in form.items()},\n", " files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", " timeout=120,\n", " )\n", " if not response.ok:\n", " (run_dir / \"error_response.txt\").write_text(response.text)\n", " print(\"vLLM request failed:\", response.status_code)\n", " print(response.text)\n", " print(\"form:\", json.dumps(form, indent=2))\n", " print(\"extra_params keys:\", sorted(extra_params))\n", " print(\"action shape:\", [len(extra_params[\"action\"]), 
len(extra_params[\"action\"][0]) if extra_params[\"action\"] else 0])\n", " response.raise_for_status()\n", " initial = response.json()\n", " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", "\n", " while True:\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", " response.raise_for_status()\n", " final = response.json()\n", " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", " if final.get(\"status\") == \"completed\":\n", " break\n", " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", " raise RuntimeError(json.dumps(final, indent=2))\n", " time.sleep(2)\n", "\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", " response.raise_for_status()\n", " video_path = run_dir / \"vision.mp4\"\n", " video_path.write_bytes(response.content)\n", "\n", " action = final.get(\"action\")\n", " if action is not None:\n", " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", "\n", " print(\"saved\", video_path)\n", " if action is not None:\n", " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path, \"action\": action}\n", "\n", "\n", "check_vllm_server()\n", "av_results = []\n", "for record in av_records:\n", " print(f\"\\nSubmitting {record['name']}\")\n", " av_results.append(submit_forward_dynamics(record, av_fd_output_dir))\n" ] }, { "cell_type": "markdown", "id": "fdvl-av-preview-md", "metadata": {}, "source": [ "### Visualize AV Generated Videos\n", "\n", "\n", "`Video(..., embed=True)` base64-inlines a file into the notebook, and embedding full-resolution runs can freeze the front-end. 
This cell first transcodes each video to a small preview using the ffmpeg binary bundled with `imageio-ffmpeg`, then embeds the previews. The full-resolution `vision.mp4` files are left untouched.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-av-preview-code", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import imageio_ffmpeg\n", "from IPython.display import Video, display\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "\n", "\n", "def make_preview(src: Path, crf: int = 28) -> Path:\n", "    \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n", "    preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", "    # Rebuild the preview when it is missing or older than the source video.\n", "    if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", "        subprocess.run(\n", "            [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", "             \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", "             \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(preview)],\n", "            check=True,\n", "        )\n", "    return preview\n", "\n", "\n", "for record in av_records:\n", "    name = record[\"name\"]\n", "    src = av_fd_output_dir / name / \"vision.mp4\"\n", "    assert src.exists(), f\"missing: {src}\"\n", "    preview = make_preview(src)\n", "    print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", "    display(Video(str(preview), embed=True))\n" ] }, { "cell_type": "markdown", "id": "fdvl-robotics-md", "metadata": {}, "source": [ "## Robotics\n", "\n", "This example shows how to start from a DROID dataset in LeRobot format and run **multiview** generation for robotics manipulation **autoregressively**.\n" ] }, { "cell_type": "markdown", "id": "fdvl-robotics-spec-md", "metadata": {}, "source": [ "### Create the Robotics Autoregressive Forward-Dynamics Plan\n", "\n", "Robotics forward-dynamics runs autoregressively over five contiguous 16-action DROID chunks. This cell writes the ground-truth (GT) first frame as the conditioning image for chunk 0 and one action JSON per chunk. 
Later chunks receive their conditioning image from the previous chunk's generated last frame during the inference loop.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-robotics-spec-code", "metadata": {}, "outputs": [], "source": [ "# `resolve_input` and the COSMOS3_* paths come from the variables cell.\n", "import json\n", "import os\n", "import sys\n", "\n", "from PIL import Image\n", "\n", "# The notebook kernel may differ from the framework venv, so put the repo on the\n", "# path before importing `cosmos_framework`.\n", "if str(COSMOS3_REPO) not in sys.path:\n", " sys.path.insert(0, str(COSMOS3_REPO))\n", "\n", "from cosmos_framework.data.vfm.action.datasets import DROIDLeRobotDataset\n", "\n", "robotics_dataset_root = resolve_input(\"cookbooks/cosmos3/generator/action/assets/droid_lerobot_example\")\n", "robotics_dataset = DROIDLeRobotDataset(root=robotics_dataset_root)\n", "robotics_num_chunks = 5\n", "robotics_chunk_length = 16\n", "robotics_chunk_starts = [chunk_idx * robotics_chunk_length for chunk_idx in range(robotics_num_chunks)]\n", "assert robotics_chunk_starts[-1] < len(robotics_dataset)\n", "\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "robotics_initial_vision_path = COSMOS3_INPUT_DIR / \"robotics_droid_autoregressive_input_chunk_00.png\"\n", "robotics_records = []\n", "\n", "for chunk_idx, sample_idx in enumerate(robotics_chunk_starts):\n", " robotics_sample = robotics_dataset[sample_idx]\n", " assert int(robotics_sample[\"action\"].shape[0]) == robotics_chunk_length\n", "\n", " chunk_name = f\"robotics_action_cond_chunk_{chunk_idx:02d}\"\n", " robotics_action_path = COSMOS3_INPUT_DIR / f\"robotics_droid_action_chunk_{chunk_idx:02d}.json\"\n", " robotics_action_path.write_text(json.dumps(robotics_sample[\"action\"].cpu().tolist()))\n", "\n", " if chunk_idx == 0:\n", " first_frame = robotics_sample[\"video\"][:, 0].permute(1, 2, 0).cpu().numpy()\n", " 
Image.fromarray(first_frame).save(robotics_initial_vision_path)\n", " vision_path = robotics_initial_vision_path\n", " else:\n", " vision_path = COSMOS3_INPUT_DIR / f\"robotics_droid_autoregressive_input_chunk_{chunk_idx:02d}.png\"\n", "\n", " robotics_records.append(\n", " {\n", " \"action_chunk_size\": robotics_chunk_length,\n", " \"action_path\": str(robotics_action_path),\n", " \"domain_name\": \"droid_lerobot\",\n", " \"fps\": int(robotics_sample[\"conditioning_fps\"]),\n", " \"image_size\": 480,\n", " \"view_point\": robotics_sample[\"viewpoint\"],\n", " \"model_mode\": \"forward_dynamics\",\n", " \"name\": chunk_name,\n", " \"prompt\": robotics_sample[\"ai_caption\"],\n", " \"seed\": 0,\n", " \"vision_path\": str(vision_path),\n", " }\n", " )\n", "\n", "robotics_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_robotics_custom.jsonl\"\n", "robotics_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in robotics_records))\n", "robotics_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_robotics_custom\"\n", "\n", "os.environ[\"COSMOS3_ROBOTICS_FD_INPUT\"] = str(robotics_fd_input_path)\n", "os.environ[\"COSMOS3_ROBOTICS_FD_OUTPUT\"] = str(robotics_fd_output_dir)\n", "\n", "print(\"loaded DROID samples from:\", robotics_dataset_root)\n", "print(\"chunk starts:\", robotics_chunk_starts)\n", "print(\"total action frames:\", robotics_num_chunks * robotics_chunk_length)\n", "print(\"wrote GT initial frame:\", robotics_initial_vision_path)\n", "print(\"wrote robotics autoregressive plan:\", robotics_fd_input_path)\n", "print(robotics_fd_input_path.read_text())\n" ] }, { "cell_type": "markdown", "id": "fdvl-robotics-run-md", "metadata": {}, "source": [ "### Run Robotics Autoregressive Forward-Dynamics Inference\n", "\n", "Runs `Cosmos3-Nano` once per robotics chunk through vLLM-Omni. Chunk 0 uses the DROID GT first frame. 
After each chunk finishes, the cell extracts that chunk's last generated frame and uses it as the conditioning image for the next chunk. Guardrails are disabled for this robotics run via `extra_params={\"guardrails\": false}`.\n", "\n", "Each request sets top-level `size` to the current conditioning image resolution so vLLM-Omni returns each autoregressive chunk without reflection padding.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-robotics-run-code", "metadata": {}, "outputs": [], "source": [ "import json\n", "import mimetypes\n", "import subprocess\n", "import time\n", "from pathlib import Path\n", "\n", "import imageio_ffmpeg\n", "from PIL import Image\n", "\n", "try:\n", " import requests\n", "except ImportError as exc:\n", " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "\n", "\n", "def check_vllm_server(timeout_s: int = 600, interval_s: int = 10) -> None:\n", " deadline = time.time() + timeout_s\n", " last_error: Exception | None = None\n", " while time.time() < deadline:\n", " try:\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/models\", timeout=10)\n", " response.raise_for_status()\n", " print(response.json())\n", " return\n", " except requests.RequestException as exc:\n", " last_error = exc\n", " print(f\"Waiting for vLLM server at {VLLM_BASE_URL}: {exc}\")\n", " time.sleep(interval_s)\n", " raise RuntimeError(\n", " f\"vLLM server did not become ready at {VLLM_BASE_URL} within {timeout_s}s. 
\"\n", " \"Check `docker logs -f cosmos3-vllm-omni-notebook`.\"\n", " ) from last_error\n", "\n", "\n", "def submit_forward_dynamics(record: dict, fd_output_dir: Path, *, disable_guardrails: bool = False) -> dict:\n", " run_dir = fd_output_dir / record[\"name\"]\n", " run_dir.mkdir(parents=True, exist_ok=True)\n", "\n", " vision_path = Path(record[\"vision_path\"])\n", " input_width, input_height = Image.open(vision_path).size\n", " mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", " extra_params = {\n", " \"action_mode\": \"forward_dynamics\",\n", " \"domain_name\": record[\"domain_name\"],\n", " \"action_chunk_size\": record[\"action_chunk_size\"],\n", " \"image_size\": record[\"image_size\"],\n", " \"view_point\": record[\"view_point\"],\n", " \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", " \"guardrails\": False,\n", " }\n", " if disable_guardrails:\n", " extra_params[\"guardrails\"] = False\n", "\n", " prompt = str(record.get(\"prompt\") or \"\").strip() or \" \"\n", " form = {\n", " \"prompt\": prompt,\n", " \"num_frames\": record[\"action_chunk_size\"] + 1,\n", " \"fps\": record[\"fps\"],\n", " \"size\": f\"{input_width}x{input_height}\",\n", " \"num_inference_steps\": 30,\n", " \"guidance_scale\": 1.0,\n", " \"flow_shift\": 10.0,\n", " \"seed\": record[\"seed\"],\n", " \"extra_params\": json.dumps(extra_params),\n", " }\n", "\n", " with vision_path.open(\"rb\") as image_file:\n", " response = requests.post(\n", " f\"{VLLM_BASE_URL}/v1/videos\",\n", " data={key: str(value) for key, value in form.items()},\n", " files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", " timeout=120,\n", " )\n", " if not response.ok:\n", " (run_dir / \"error_response.txt\").write_text(response.text)\n", " print(\"vLLM request failed:\", response.status_code)\n", " print(response.text)\n", " print(\"form:\", json.dumps(form, indent=2))\n", " print(\"extra_params keys:\", sorted(extra_params))\n", 
" print(\"action shape:\", [len(extra_params[\"action\"]), len(extra_params[\"action\"][0]) if extra_params[\"action\"] else 0])\n", " response.raise_for_status()\n", " initial = response.json()\n", " (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", "\n", " while True:\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", " response.raise_for_status()\n", " final = response.json()\n", " (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", " print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", " if final.get(\"status\") == \"completed\":\n", " break\n", " if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", " raise RuntimeError(json.dumps(final, indent=2))\n", " time.sleep(2)\n", "\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", " response.raise_for_status()\n", " video_path = run_dir / \"vision.mp4\"\n", " video_path.write_bytes(response.content)\n", "\n", " action = final.get(\"action\")\n", " if action is not None:\n", " (run_dir / \"action.json\").write_text(json.dumps(action, indent=2))\n", "\n", " print(\"saved\", video_path)\n", " if action is not None:\n", " print(\"action shape:\", action.get(\"shape\"), \"dtype:\", action.get(\"dtype\"))\n", " return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path, \"action\": action}\n", "\n", "\n", "check_vllm_server()\n", "robotics_results = []\n", "robotics_actual_records = []\n", "current_vision_path = Path(robotics_records[0][\"vision_path\"])\n", "assert current_vision_path.exists(), f\"missing initial conditioning image: {current_vision_path}\"\n", "\n", "for chunk_idx, base_record in enumerate(robotics_records):\n", " record = dict(base_record)\n", " record[\"vision_path\"] = str(current_vision_path)\n", " robotics_records[chunk_idx][\"vision_path\"] = 
str(current_vision_path)\n", " robotics_actual_records.append(record)\n", "\n", " print(f\"\\nSubmitting {record['name']}\")\n", " print(\"conditioning image:\", current_vision_path)\n", " result = submit_forward_dynamics(record, robotics_fd_output_dir, disable_guardrails=True)\n", " robotics_results.append(result)\n", "\n", " if chunk_idx + 1 < len(robotics_records):\n", " next_vision_path = COSMOS3_INPUT_DIR / f\"robotics_droid_autoregressive_input_chunk_{chunk_idx + 1:02d}.png\"\n", " subprocess.run(\n", " [\n", " FFMPEG,\n", " \"-y\",\n", " \"-loglevel\",\n", " \"error\",\n", " \"-i\",\n", " str(result[\"video_path\"]),\n", " \"-vf\",\n", " fr\"select=eq(n\\,{record['action_chunk_size']})\",\n", " \"-frames:v\",\n", " \"1\",\n", " str(next_vision_path),\n", " ],\n", " check=True,\n", " )\n", " assert next_vision_path.exists(), f\"failed to extract next conditioning image: {next_vision_path}\"\n", " current_vision_path = next_vision_path\n", "\n", "robotics_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in robotics_actual_records))\n", "print(\"wrote autoregressive run spec:\", robotics_fd_input_path)\n", "print(\"completed chunks:\", [record[\"name\"] for record in robotics_actual_records])\n" ] }, { "cell_type": "markdown", "id": "fdvl-robotics-stitch-md", "metadata": {}, "source": [ "### Stitch Robotics Generated Chunks\n", "\n", "Each autoregressive chunk video includes its conditioning frame at frame 0. 
This cell drops that first frame from every chunk and concatenates the remaining 16 generated frames per chunk into one 80-frame video.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-robotics-stitch-code", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import imageio_ffmpeg\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "robotics_stitch_dir = robotics_fd_output_dir / \"_stitched_segments\"\n", "robotics_stitch_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "segment_paths = []\n", "for record in robotics_records:\n", " src = robotics_fd_output_dir / record[\"name\"] / \"vision.mp4\"\n", " assert src.exists(), f\"missing: {src}\"\n", "\n", " segment = robotics_stitch_dir / f\"{record['name']}_generated.mp4\"\n", " subprocess.run(\n", " [\n", " FFMPEG,\n", " \"-y\",\n", " \"-loglevel\",\n", " \"error\",\n", " \"-i\",\n", " str(src),\n", " \"-vf\",\n", " r\"select=gte(n\\,1),setpts=N/FRAME_RATE/TB\",\n", " \"-an\",\n", " \"-r\",\n", " str(record[\"fps\"]),\n", " \"-c:v\",\n", " \"libx264\",\n", " \"-crf\",\n", " \"18\",\n", " \"-preset\",\n", " \"veryfast\",\n", " \"-pix_fmt\",\n", " \"yuv420p\",\n", " str(segment),\n", " ],\n", " check=True,\n", " )\n", " segment_paths.append(segment)\n", "\n", "concat_file = robotics_stitch_dir / \"concat.txt\"\n", "concat_file.write_text(\"\".join(f\"file '{path.as_posix()}'\\n\" for path in segment_paths))\n", "\n", "robotics_stitched_video_path = robotics_fd_output_dir / \"robotics_action_cond_stitched.mp4\"\n", "subprocess.run(\n", " [\n", " FFMPEG,\n", " \"-y\",\n", " \"-loglevel\",\n", " \"error\",\n", " \"-f\",\n", " \"concat\",\n", " \"-safe\",\n", " \"0\",\n", " \"-i\",\n", " str(concat_file),\n", " \"-c\",\n", " \"copy\",\n", " str(robotics_stitched_video_path),\n", " ],\n", " check=True,\n", ")\n", "\n", "print(\"stitched robotics video:\", robotics_stitched_video_path)\n", "print(\"expected generated frames:\", len(robotics_records) * robotics_chunk_length)\n", 
"print(\"size KB:\", robotics_stitched_video_path.stat().st_size // 1024)\n" ] }, { "cell_type": "markdown", "id": "fdvl-robotics-preview-md", "metadata": {}, "source": [ "### Visualize Robotics Generated Videos\n", "\n", "`Video(..., embed=True)` base64-inlines a file into the notebook, and embedding full-resolution runs can freeze the front-end. This cell first displays a compact preview of the stitched 80-frame video when available, then previews each per-chunk video. The full-resolution `vision.mp4` files and stitched mp4 are left untouched.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fdvl-robotics-preview-code", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import imageio_ffmpeg\n", "from IPython.display import Video, display\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "\n", "\n", "def make_preview(src: Path, crf: int = 28) -> Path:\n", " \"\"\"Re-encode `src` to a compact, browser-friendly mp4 (cached).\"\"\"\n", " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", " subprocess.run(\n", " [FFMPEG, \"-y\", \"-loglevel\", \"error\", \"-i\", str(src),\n", " \"-c:v\", \"libx264\", \"-crf\", str(crf),\n", " \"-preset\", \"veryfast\", \"-an\", \"-pix_fmt\", \"yuv420p\", str(preview)],\n", " check=True,\n", " )\n", " return preview\n", "\n", "\n", "if \"robotics_stitched_video_path\" in globals():\n", " assert robotics_stitched_video_path.exists(), f\"missing: {robotics_stitched_video_path}\"\n", " stitched_preview = make_preview(robotics_stitched_video_path)\n", " print(\n", " f\"stitched ({robotics_stitched_video_path.stat().st_size // 1024} KB -> \"\n", " f\"{stitched_preview.stat().st_size // 1024} KB preview)\"\n", " )\n", " display(Video(str(stitched_preview), embed=True))\n", "\n", "for record in robotics_records:\n", " name = record[\"name\"]\n", " src = robotics_fd_output_dir / name / \"vision.mp4\"\n", " assert 
src.exists(), f\"missing: {src}\"\n", "    preview = make_preview(src)\n", "    print(f\"{name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", "    display(Video(str(preview), embed=True))\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bb998490", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bd2edde3", "metadata": {}, "source": [ "## UMI\n", "\n", "This example runs UMI forward dynamics through vLLM-Omni autoregressively over all 16-action chunks in `assets/actions/umi.json`. The action file stores the raw UMI 10D action representation, so the setup cell validates the row dimension, writes one action JSON per chunk, and prepares a run plan." ] }, { "cell_type": "markdown", "id": "5abea6a3", "metadata": {}, "source": [ "### Create the UMI Autoregressive Forward-Dynamics Plan\n", "\n", "The UMI action file is stored as one JSON array with `16 * n` action rows. vLLM-Omni receives one request per 16-action chunk. Chunk 0 uses the checked-in UMI conditioning image; later chunks use conditioning images extracted from the previous generated video." 
] }, { "cell_type": "code", "execution_count": null, "id": "14cab52f", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "\n", "umi_input_image = \"cookbooks/cosmos3/generator/action/assets/images/umi.png\"\n", "umi_input_action = \"cookbooks/cosmos3/generator/action/assets/actions/umi.json\"\n", "umi_prompt = \"mouse arrangement\"\n", "umi_fps = 20\n", "umi_action_chunk_size = 16\n", "umi_raw_action_dim = 10\n", "\n", "umi_initial_vision_path = Path(resolve_input(umi_input_image))\n", "umi_source_action_path = Path(resolve_input(umi_input_action))\n", "umi_action = json.loads(umi_source_action_path.read_text())\n", "assert len(umi_action) % umi_action_chunk_size == 0, (\n", "    f\"expected action count to be divisible by {umi_action_chunk_size}, got {len(umi_action)}\"\n", ")\n", "assert all(len(row) == umi_raw_action_dim for row in umi_action), \"UMI action rows must be 10D\"\n", "\n", "umi_num_chunks = len(umi_action) // umi_action_chunk_size\n", "COSMOS3_INPUT_DIR.mkdir(parents=True, exist_ok=True)\n", "umi_records = []\n", "\n", "for chunk_idx in range(umi_num_chunks):\n", "    chunk_name = f\"umi_action_cond_chunk_{chunk_idx:02d}\"\n", "    start = chunk_idx * umi_action_chunk_size\n", "    end = start + umi_action_chunk_size\n", "    action_chunk_10d = umi_action[start:end]\n", "    umi_action_path = COSMOS3_INPUT_DIR / f\"umi_action_chunk_{chunk_idx:02d}_10d.json\"\n", "    umi_action_path.write_text(json.dumps(action_chunk_10d, indent=2) + \"\\n\")\n", "\n", "    if chunk_idx == 0:\n", "        vision_path = umi_initial_vision_path\n", "    else:\n", "        vision_path = COSMOS3_INPUT_DIR / f\"umi_autoregressive_input_chunk_{chunk_idx:02d}.png\"\n", "\n", "    umi_records.append(\n", "        {\n", "            \"action_chunk_size\": umi_action_chunk_size,\n", "            \"action_path\": str(umi_action_path),\n", "            \"domain_name\": \"umi\",\n", "            \"fps\": umi_fps,\n", "            \"image_size\": 256,\n", "            \"view_point\": \"ego_view\",\n", "            \"model_mode\": \"forward_dynamics\",\n", "            
\"name\": chunk_name,\n", " \"prompt\": umi_prompt,\n", " \"seed\": chunk_idx,\n", " \"vision_path\": str(vision_path),\n", " }\n", " )\n", "\n", "umi_fd_input_path = COSMOS3_INPUT_DIR / \"action_forward_dynamics_umi_custom.jsonl\"\n", "umi_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in umi_records))\n", "umi_fd_output_dir = COSMOS3_OUTPUT_ROOT / \"action_forward_dynamics_umi_custom\"\n", "\n", "print(\"UMI chunks:\", umi_num_chunks)\n", "print(\"wrote UMI spec:\", umi_fd_input_path)\n", "print(umi_fd_input_path.read_text())" ] }, { "cell_type": "markdown", "id": "50abf709", "metadata": {}, "source": [ "### Run UMI Autoregressive Forward-Dynamics Inference\n", "\n", "Runs one vLLM-Omni video request per UMI action chunk. After each chunk completes, the cell extracts that chunk's last generated frame and uses it as the conditioning image for the next chunk. Each request sets top-level `size` to the current conditioning image resolution to avoid reflection padding." ] }, { "cell_type": "code", "execution_count": null, "id": "6d8f37f5", "metadata": {}, "outputs": [], "source": [ "import json\n", "import mimetypes\n", "import subprocess\n", "import time\n", "from pathlib import Path\n", "\n", "import imageio_ffmpeg\n", "from PIL import Image\n", "\n", "try:\n", " import requests\n", "except ImportError as exc:\n", " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "\n", "\n", "def check_vllm_server_for_umi(timeout_s: int = 600, interval_s: int = 10) -> None:\n", " deadline = time.time() + timeout_s\n", " last_error: Exception | None = None\n", " while time.time() < deadline:\n", " try:\n", " response = requests.get(f\"{VLLM_BASE_URL}/v1/models\", timeout=10)\n", " response.raise_for_status()\n", " print(response.json())\n", " return\n", " except requests.RequestException as exc:\n", " last_error = exc\n", " print(f\"Waiting for vLLM server at 
{VLLM_BASE_URL}: {exc}\")\n", "            time.sleep(interval_s)\n", "    raise RuntimeError(\n", "        f\"vLLM server did not become ready at {VLLM_BASE_URL} within {timeout_s}s. \"\n", "        \"Check `docker logs -f cosmos3-vllm-omni-notebook`.\"\n", "    ) from last_error\n", "\n", "\n", "def submit_umi_forward_dynamics(record: dict, fd_output_dir: Path) -> dict:\n", "    run_dir = fd_output_dir / record[\"name\"]\n", "    run_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "    vision_path = Path(record[\"vision_path\"])\n", "    input_width, input_height = Image.open(vision_path).size\n", "    mime_type = mimetypes.guess_type(vision_path.name)[0] or \"application/octet-stream\"\n", "    extra_params = {\n", "        \"action_mode\": \"forward_dynamics\",\n", "        \"domain_name\": record[\"domain_name\"],\n", "        \"action_chunk_size\": record[\"action_chunk_size\"],\n", "        \"image_size\": record[\"image_size\"],\n", "        \"view_point\": record[\"view_point\"],\n", "        \"action\": json.loads(Path(record[\"action_path\"]).read_text()),\n", "        \"guardrails\": False,\n", "    }\n", "    prompt = str(record.get(\"prompt\") or \"\").strip() or \"A robot manipulates an object.\"\n", "    form = {\n", "        \"prompt\": prompt,\n", "        \"num_frames\": record[\"action_chunk_size\"] + 1,\n", "        \"fps\": record[\"fps\"],\n", "        \"size\": f\"{input_width}x{input_height}\",\n", "        \"num_inference_steps\": 30,\n", "        \"guidance_scale\": 1.0,\n", "        \"flow_shift\": 10.0,\n", "        \"seed\": record[\"seed\"],\n", "        \"extra_params\": json.dumps(extra_params),\n", "    }\n", "\n", "    with vision_path.open(\"rb\") as image_file:\n", "        response = requests.post(\n", "            f\"{VLLM_BASE_URL}/v1/videos\",\n", "            data={key: str(value) for key, value in form.items()},\n", "            files={\"input_reference\": (vision_path.name, image_file, mime_type)},\n", "            timeout=120,\n", "        )\n", "    if not response.ok:\n", "        (run_dir / \"error_response.txt\").write_text(response.text)\n", "        print(\"vLLM request failed:\", response.status_code)\n", "        print(response.text)\n", "        print(\"form:\", 
json.dumps(form, indent=2))\n", "        print(\"extra_params keys:\", sorted(extra_params))\n", "        print(\"action shape:\", [len(extra_params[\"action\"]), len(extra_params[\"action\"][0]) if extra_params[\"action\"] else 0])\n", "        response.raise_for_status()\n", "\n", "    initial = response.json()\n", "    (run_dir / \"response.json\").write_text(json.dumps(initial, indent=2))\n", "\n", "    while True:\n", "        response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}\", timeout=30)\n", "        response.raise_for_status()\n", "        final = response.json()\n", "        (run_dir / \"final.json\").write_text(json.dumps(final, indent=2))\n", "        print(initial[\"id\"], final.get(\"status\"), f\"{final.get('progress', 0)}%\")\n", "        if final.get(\"status\") == \"completed\":\n", "            break\n", "        if final.get(\"status\") in {\"failed\", \"cancelled\"}:\n", "            raise RuntimeError(json.dumps(final, indent=2))\n", "        time.sleep(2)\n", "\n", "    response = requests.get(f\"{VLLM_BASE_URL}/v1/videos/{initial['id']}/content\", timeout=300)\n", "    response.raise_for_status()\n", "    video_path = run_dir / \"vision.mp4\"\n", "    video_path.write_bytes(response.content)\n", "    print(\"saved\", video_path)\n", "    return {\"record\": record, \"initial\": initial, \"final\": final, \"run_dir\": run_dir, \"video_path\": video_path}\n", "\n", "\n", "check_vllm_server_for_umi()\n", "umi_results = []\n", "umi_actual_records = []\n", "current_vision_path = Path(umi_records[0][\"vision_path\"])\n", "assert current_vision_path.exists(), f\"missing initial conditioning image: {current_vision_path}\"\n", "\n", "for chunk_idx, base_record in enumerate(umi_records):\n", "    record = dict(base_record)\n", "    record[\"vision_path\"] = str(current_vision_path)\n", "    umi_records[chunk_idx][\"vision_path\"] = str(current_vision_path)\n", "    umi_actual_records.append(record)\n", "\n", "    print(f\"\\nSubmitting {record['name']}\")\n", "    print(\"conditioning image:\", current_vision_path)\n", "    result = submit_umi_forward_dynamics(record, 
umi_fd_output_dir)\n", "    umi_results.append(result)\n", "\n", "    if chunk_idx + 1 < len(umi_records):\n", "        next_vision_path = COSMOS3_INPUT_DIR / f\"umi_autoregressive_input_chunk_{chunk_idx + 1:02d}.png\"\n", "        subprocess.run(\n", "            [\n", "                FFMPEG,\n", "                \"-y\",\n", "                \"-loglevel\",\n", "                \"error\",\n", "                \"-i\",\n", "                str(result[\"video_path\"]),\n", "                \"-vf\",\n", "                fr\"select=eq(n\\,{record['action_chunk_size']})\",\n", "                \"-frames:v\",\n", "                \"1\",\n", "                str(next_vision_path),\n", "            ],\n", "            check=True,\n", "        )\n", "        assert next_vision_path.exists(), f\"failed to extract next conditioning image: {next_vision_path}\"\n", "        current_vision_path = next_vision_path\n", "\n", "umi_fd_input_path.write_text(\"\".join(json.dumps(r) + \"\\n\" for r in umi_actual_records))\n", "print(\"wrote autoregressive UMI run spec:\", umi_fd_input_path)\n", "print(\"completed UMI chunks:\", [record[\"name\"] for record in umi_actual_records])" ] }, { "cell_type": "markdown", "id": "cc77c77a", "metadata": {}, "source": [ "### Stitch and Visualize UMI Generated Chunks\n", "\n", "Each autoregressive chunk video includes its conditioning frame at frame 0. This cell drops that first frame from every chunk, concatenates the generated frames into one rollout video, transcodes a compact preview, and embeds it in the notebook." 
] }, { "cell_type": "code", "execution_count": null, "id": "93e49e34", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "import imageio_ffmpeg\n", "from IPython.display import Video, display\n", "\n", "FFMPEG = imageio_ffmpeg.get_ffmpeg_exe()\n", "umi_video_paths = [umi_fd_output_dir / record[\"name\"] / \"vision.mp4\" for record in umi_records]\n", "for path in umi_video_paths:\n", "    assert path.exists(), f\"missing UMI chunk video: {path}\"\n", "\n", "umi_stitch_dir = umi_fd_output_dir / \"_stitched_segments\"\n", "umi_stitch_dir.mkdir(parents=True, exist_ok=True)\n", "segment_paths = []\n", "for record, src in zip(umi_records, umi_video_paths, strict=True):\n", "    segment = umi_stitch_dir / f\"{record['name']}_generated.mp4\"\n", "    subprocess.run(\n", "        [\n", "            FFMPEG,\n", "            \"-y\",\n", "            \"-loglevel\",\n", "            \"error\",\n", "            \"-i\",\n", "            str(src),\n", "            \"-vf\",\n", "            r\"select=gte(n\\,1),setpts=N/FRAME_RATE/TB\",\n", "            \"-an\",\n", "            \"-r\",\n", "            str(record[\"fps\"]),\n", "            \"-c:v\",\n", "            \"libx264\",\n", "            \"-crf\",\n", "            \"18\",\n", "            \"-preset\",\n", "            \"veryfast\",\n", "            \"-pix_fmt\",\n", "            \"yuv420p\",\n", "            str(segment),\n", "        ],\n", "        check=True,\n", "    )\n", "    segment_paths.append(segment)\n", "\n", "concat_file = umi_stitch_dir / \"umi_concat.txt\"\n", "concat_file.write_text(\"\".join(f\"file '{path.as_posix()}'\\n\" for path in segment_paths))\n", "umi_stitched_video_path = umi_fd_output_dir / \"umi_action_cond_stitched.mp4\"\n", "subprocess.run(\n", "    [\n", "        FFMPEG,\n", "        \"-y\",\n", "        \"-loglevel\",\n", "        \"error\",\n", "        \"-f\",\n", "        \"concat\",\n", "        \"-safe\",\n", "        \"0\",\n", "        \"-i\",\n", "        str(concat_file),\n", "        \"-c\",\n", "        \"copy\",\n", "        str(umi_stitched_video_path),\n", "    ],\n", "    check=True,\n", ")\n", "\n", "\n", "def make_umi_preview(src: Path, crf: int = 28) -> Path:\n", "    preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", "    if not preview.exists() or preview.stat().st_mtime < 
src.stat().st_mtime:\n", "        subprocess.run(\n", "            [\n", "                FFMPEG,\n", "                \"-y\",\n", "                \"-loglevel\",\n", "                \"error\",\n", "                \"-i\",\n", "                str(src),\n", "                \"-c:v\",\n", "                \"libx264\",\n", "                \"-crf\",\n", "                str(crf),\n", "                \"-preset\",\n", "                \"veryfast\",\n", "                \"-an\",\n", "                \"-pix_fmt\",\n", "                \"yuv420p\",\n", "                str(preview),\n", "            ],\n", "            check=True,\n", "        )\n", "    return preview\n", "\n", "umi_preview_path = make_umi_preview(umi_stitched_video_path)\n", "print(\"stitched UMI video:\", umi_stitched_video_path)\n", "print(\"expected generated frames:\", len(umi_records) * umi_action_chunk_size)\n", "print(f\"UMI preview: {umi_preview_path}\")\n", "display(Video(str(umi_preview_path), embed=True))" ] }, { "cell_type": "code", "execution_count": null, "id": "668f11b6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.13" } }, "nbformat": 4, "nbformat_minor": 5 }