{ "cells": [ { "cell_type": "markdown", "id": "license-header", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "f1bb4ce5", "metadata": {}, "source": [ "# Cosmos3-Nano inference with Cosmos Framework\n", "\n", "This notebook runs Cosmos3 Reasoner Nano inference through the Cosmos Framework inference entrypoint:\n", "\n", "```bash\n", "python -m cosmos_framework.scripts.inference\n", "```\n", "\n", "It is intentionally written as a first-run cookbook: clone or locate the framework source, install dependencies from scratch, create working Reasoner input JSON files, run Nano text/image examples, and try several image-based capability prompts.\n", "\n", "Tested path from the audit:\n", "\n", "- Framework checkout: `packages/cosmos3`\n", "- Install command: `uv sync --all-extras --group=cu130-train`\n", "- Backend: Cosmos Framework / `cosmos_framework.scripts.inference`\n", "- Model: `Cosmos3-Nano`\n" ] }, { "cell_type": "markdown", "id": "nano-prerequisites", "metadata": {}, "source": [ "## 1. Prerequisites\n", "\n", "Before running the notebook:\n", "\n", "1. Use a Linux machine with NVIDIA GPU access.\n", "2. Make sure your Hugging Face account can access the Cosmos3 model repos.\n", "3. Authenticate with Hugging Face:\n", "\n", "```bash\n", "uvx hf@latest auth login\n", "```\n", "\n", "or set:\n", "\n", "```bash\n", "export HF_TOKEN=\n", "```\n", "\n", "4. Use a disk/cache location with enough free space. Nano downloads can use tens of GiB in the Hugging Face cache.\n" ] }, { "cell_type": "markdown", "id": "1867a10c", "metadata": {}, "source": [ "## 2. 
Configure Paths\n", "\n", "The defaults are intentionally relative to this `cosmos` checkout:\n", "\n", "```text\n", "/packages/cosmos3\n", "```\n", "\n", "You can override the important knobs before running the next cell:\n", "\n", "```bash\n", "export COSMOS3_REPO=/path/to/cosmos-framework\n", "export COSMOS3_GIT_URL=https://github.com/NVIDIA/cosmos-framework.git\n", "export COSMOS3_UV_GROUP=cu130-train  # CUDA 13 driver; use cu128-train for a CUDA 12.x driver\n", "export HF_HOME=/path/to/large/huggingface/cache\n", "export CUDA_VISIBLE_DEVICES=0\n", "```\n", "\n", "For SSH access, set `COSMOS3_GIT_URL=git@github.com:NVIDIA/cosmos-framework.git`.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "59a8c486", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "import socket\n", "\n", "def find_repo_root(start: Path) -> Path:\n", "    for path in [start, *start.parents]:\n", "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", "            return path\n", "    return start\n", "\n", "def free_local_port() -> str:\n", "    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:\n", "        sock.bind((\"127.0.0.1\", 0))\n", "        return str(sock.getsockname()[1])\n", "\n", "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", "COSMOS_REASONER_ASSETS = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"reasoner\" / \"assets\"\n", "COSMOS3_REPO = Path(os.environ.get(\"COSMOS3_REPO\", COSMOS_ROOT / \"packages\" / \"cosmos3\")).resolve()\n", "COSMOS3_GIT_URL = os.environ.get(\n", "    \"COSMOS3_GIT_URL\",\n", "    \"https://github.com/NVIDIA/cosmos-framework.git\",\n", ")\n", "COSMOS3_UV_GROUP = os.environ.get(\"COSMOS3_UV_GROUP\", \"cu130-train\")\n", "COSMOS3_OUTPUT_ROOT = Path(\n", "    os.environ.get(\"COSMOS3_OUTPUT_ROOT\", COSMOS3_REPO / \"outputs\" / \"cookbooks\" / \"cosmos3\" / \"reasoner\" / \"nano\")\n", ").resolve()\n", "COSMOS3_INPUT_DIR = COSMOS3_OUTPUT_ROOT / \"inputs\"\n", "\n", "# Keep these available to bash cells. 
Override any of them before running this cell.\n", "os.environ[\"COSMOS_ROOT\"] = str(COSMOS_ROOT)\n", "os.environ[\"COSMOS_REASONER_ASSETS\"] = str(COSMOS_REASONER_ASSETS)\n", "os.environ[\"COSMOS3_REPO\"] = str(COSMOS3_REPO)\n", "os.environ[\"COSMOS3_GIT_URL\"] = COSMOS3_GIT_URL\n", "os.environ[\"COSMOS3_UV_GROUP\"] = COSMOS3_UV_GROUP\n", "os.environ[\"COSMOS3_OUTPUT_ROOT\"] = str(COSMOS3_OUTPUT_ROOT)\n", "os.environ[\"COSMOS3_INPUT_DIR\"] = str(COSMOS3_INPUT_DIR)\n", "os.environ.setdefault(\"UV_CACHE_DIR\", str(Path.home() / \".cache\" / \"uv\"))\n", "os.environ.setdefault(\"HF_HOME\", str(Path.home() / \".cache\" / \"huggingface\"))\n", "os.environ.setdefault(\"CUDA_VISIBLE_DEVICES\", \"0\")\n", "os.environ.setdefault(\"COSMOS3_MASTER_ADDR\", \"127.0.0.1\")\n", "os.environ.setdefault(\"COSMOS3_NANO_TEXT_MASTER_PORT\", free_local_port())\n", "os.environ.setdefault(\"COSMOS3_NANO_IMAGE_MASTER_PORT\", free_local_port())\n", "os.environ.setdefault(\"COSMOS3_CAPABILITY_MASTER_PORT\", free_local_port())\n", "\n", "print(\"cosmos root:\", COSMOS_ROOT)\n", "print(\"Reasoner assets:\", COSMOS_REASONER_ASSETS)\n", "print(\"Cosmos Framework path:\", COSMOS3_REPO)\n", "print(\"Framework git URL:\", COSMOS3_GIT_URL)\n", "print(\"uv dependency group:\", COSMOS3_UV_GROUP)\n", "print(\"output root:\", COSMOS3_OUTPUT_ROOT)\n", "print(\"UV_CACHE_DIR:\", os.environ[\"UV_CACHE_DIR\"])\n", "print(\"HF_HOME:\", os.environ[\"HF_HOME\"])\n", "print(\"CUDA_VISIBLE_DEVICES:\", os.environ[\"CUDA_VISIBLE_DEVICES\"])\n" ] }, { "cell_type": "markdown", "id": "7a7cbee9", "metadata": {}, "source": [ "## 3. Clone or Reuse Cosmos Framework\n", "\n", "This cell creates `packages/` and clones the framework into `packages/cosmos3` if it is not already there. 
The default clone URL uses HTTPS so users do not need to configure an SSH key unless their access requires it.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0e7efe2c", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "mkdir -p \"$(dirname \"$COSMOS3_REPO\")\"\n", "\n", "if [ -d \"$COSMOS3_REPO/.git\" ]; then\n", " echo \"Using existing framework checkout: $COSMOS3_REPO\"\n", "else\n", " echo \"Cloning $COSMOS3_GIT_URL into $COSMOS3_REPO\"\n", " git clone \"$COSMOS3_GIT_URL\" \"$COSMOS3_REPO\"\n", "fi\n", "\n", "cd \"$COSMOS3_REPO\"\n", "git status --short --branch\n", "git remote -v" ] }, { "cell_type": "markdown", "id": "5f874488", "metadata": {}, "source": [ "## 4. Install Cosmos Framework Dependencies\n", "\n", "This is the full install path used for the Cosmos Framework audit. It is heavier than an inference-only install, but it avoids missing training-extra dependencies that are currently imported by the framework inference path.\n", "\n", "The dependency group selects the CUDA build of `torch`, and it must match your NVIDIA driver:\n", "\n", "| Driver CUDA | `COSMOS3_UV_GROUP` |\n", "| --- | --- |\n", "| 13.x | `cu130-train` (default) |\n", "| 12.x (most machines today) | `cu128-train` |\n", "\n", "The default `cu130-train` group installs CUDA 13 wheels, which need a CUDA 13 driver. On a CUDA 12.x driver, set `COSMOS3_UV_GROUP=cu128-train` before the configuration cell, otherwise the verify cell below reports `cuda available: False`. 
(These groups are defined in the framework's `pyproject.toml`; only `cu130-train` and `cu128-train` are provided.)\n", "\n", "Expected behavior:\n", "\n", "- Creates `.venv` inside `packages/cosmos3`.\n", "- Downloads CUDA/Torch dependencies.\n", "- May take several minutes.\n", "- May print a uv cache hardlink warning if your cache and repo are on different filesystems; this is usually harmless.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0b03c48b", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "if ! command -v uv >/dev/null 2>&1; then\n", " echo \"uv is not installed. Install it first: https://docs.astral.sh/uv/getting-started/installation/\"\n", " exit 1\n", "fi\n", "\n", "cd \"$COSMOS3_REPO\"\n", "uv sync --all-extras --group=\"$COSMOS3_UV_GROUP\"\n" ] }, { "cell_type": "markdown", "id": "f136a01e", "metadata": {}, "source": [ "## 5. Verify GPU and Python Environment\n", "\n", "The Cosmos Framework commands below use `CUDA_VISIBLE_DEVICES=0` by default. Adjust this if you want a different GPU." ] }, { "cell_type": "code", "execution_count": null, "id": "ceaf747f", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" .venv/bin/python - <<'PY'\n", "import torch\n", "print(\"torch:\", torch.__version__)\n", "print(\"torch cuda:\", torch.version.cuda)\n", "print(\"cuda available:\", torch.cuda.is_available())\n", "print(\"device count:\", torch.cuda.device_count())\n", "if torch.cuda.is_available():\n", " print(\"device 0:\", torch.cuda.get_device_name(0))\n", "PY\n" ] }, { "cell_type": "markdown", "id": "e33c18f1", "metadata": {}, "source": [ "## 6. Create Reasoner Input Files\n", "\n", "The current shipped Reasoner examples fail without `enable_sound=false`. 
This cell writes patched Nano smoke-test inputs and image-based capability inputs under:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/inputs/\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/inputs/capabilities/\n", "```\n", "\n", "These are local cookbook inputs; they do not modify the shipped framework examples. Cosmos Framework Reasoner currently treats `vision_path` as a PIL image input, so video Reasoner examples should be run with [`run_with_vllm.ipynb`](./run_with_vllm.ipynb).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8eb23779", "metadata": {}, "outputs": [], "source": [ "import json\n", "from pathlib import Path\n", "import os\n", "\n", "input_dir = Path(os.environ[\"COSMOS3_INPUT_DIR\"])\n", "assets_dir = Path(os.environ[\"COSMOS_REASONER_ASSETS\"])\n", "capability_dir = input_dir / \"capabilities\"\n", "input_dir.mkdir(parents=True, exist_ok=True)\n", "capability_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "\n", "def write_reasoner_input(path: Path, payload: dict) -> None:\n", " path.write_text(json.dumps(payload, indent=2) + \"\\n\")\n", " print(path)\n", " print(path.read_text())\n", "\n", "\n", "write_reasoner_input(\n", " input_dir / \"nano_text.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"nano_text\",\n", " \"prompt\": \"Describe a modern robotics research laboratory in one sentence.\",\n", " \"enable_sound\": False,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " input_dir / \"nano_image.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"nano_image\",\n", " \"prompt\": \"Describe what is happening in this image in one sentence.\",\n", " \"vision_path\": str((assets_dir / \"robot_153.jpg\").resolve()),\n", " \"enable_sound\": False,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " capability_dir / \"image_caption_detail.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"image_caption_detail\",\n", " 
\"prompt\": \"Caption the image in detail.\",\n", " \"vision_path\": str((assets_dir / \"robot_153.jpg\").resolve()),\n", " \"enable_sound\": False,\n", " \"max_new_tokens\": 4096,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " capability_dir / \"robot_planning.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"robot_planning\",\n", " \"prompt\": \"The task is to put flower into the red bottle. Generate a plan consisting of subtasks for accomplish the task.\",\n", " \"vision_path\": str((assets_dir / \"robot_planning.png\").resolve()),\n", " \"enable_sound\": False,\n", " \"max_new_tokens\": 4096,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " capability_dir / \"ground_load_bbox.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"ground_load_bbox\",\n", " \"prompt\": \"Locate the accurate bounding box of the load as a whole. Return a json.\",\n", " \"vision_path\": str((assets_dir / \"grounding_2d.png\").resolve()),\n", " \"enable_sound\": False,\n", " \"max_new_tokens\": 4096,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " capability_dir / \"describe_marked_subjects.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"describe_marked_subjects\",\n", " \"prompt\": 'Please caption the notable attributes in the provided image. List and describe all marked subjects in the image with their categories and detailed captions using a json with keyword \"subject_id\", \"category\" and \"caption\".',\n", " \"vision_path\": str((assets_dir / \"describe_anything.png\").resolve()),\n", " \"enable_sound\": False,\n", " \"max_new_tokens\": 4096,\n", " },\n", ")\n", "\n", "write_reasoner_input(\n", " capability_dir / \"trajectory_bowl.json\",\n", " {\n", " \"model_mode\": \"reasoner\",\n", " \"name\": \"trajectory_bowl\",\n", " \"prompt\": \"\"\"You are given the task \"Move the pink bowl to the right\". Specify the 2D trajectory your end effector should follow in pixel space. 
Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}.\n", "Answer the question using the following format:\n", "\n", "<think>\n", "Your reasoning.\n", "</think>\n", "\n", "Write your final answer immediately after the </think> tag.\n", "\"\"\",\n", "        \"vision_path\": str((assets_dir / \"action_cot_trajectory.png\").resolve()),\n", "        \"enable_sound\": False,\n", "        \"max_new_tokens\": 4096,\n", "        \"do_sample\": True,\n", "        \"temperature\": 0.6,\n", "        \"top_p\": 0.95,\n", "        \"top_k\": 20,\n", "        \"repetition_penalty\": 1.0,\n", "        \"presence_penalty\": 0.0,\n", "    },\n", ")\n", "\n", "write_reasoner_input(\n", "    capability_dir / \"trajectory_flower.json\",\n", "    {\n", "        \"model_mode\": \"reasoner\",\n", "        \"name\": \"trajectory_flower\",\n", "        \"prompt\": \"\"\"You are given the task \"Put flower into the red bottle\". Specify the 2D trajectory your end effector should follow in pixel space. Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}.\n", "Answer the question using the following format:\n", "\n", "<think> Your reasoning. </think>\n", "Write your final answer immediately after the </think> tag.\n", "\"\"\",\n", "        \"vision_path\": str((assets_dir / \"robot_planning.png\").resolve()),\n", "        \"enable_sound\": False,\n", "        \"max_new_tokens\": 4096,\n", "        \"do_sample\": True,\n", "        \"temperature\": 0.6,\n", "        \"top_p\": 0.95,\n", "        \"top_k\": 20,\n", "        \"repetition_penalty\": 1.0,\n", "        \"presence_penalty\": 0.0,\n", "    },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "media-display-helper", "metadata": {}, "outputs": [], "source": [ "from html import escape\n", "from pathlib import Path\n", "import json\n", "import re\n", "\n", "from IPython.display import HTML, display\n", "from PIL import Image as PILImage, ImageDraw\n", "\n", "IMAGE_EXTENSIONS = {\".jpg\", \".jpeg\", \".png\", \".webp\", \".gif\", \".bmp\"}\n", "VIDEO_EXTENSIONS = {\".mp4\", \".webm\", \".mov\", \".avi\", \".mkv\", \".m4v\"}\n", "\n", "\n", "def _media_kind(path_or_url: str) -> str:\n", "    suffix = Path(path_or_url.split(\"?\")[0]).suffix.lower()\n", "    if suffix in IMAGE_EXTENSIONS:\n", "        return \"image\"\n", "    if suffix in VIDEO_EXTENSIONS:\n", "        return \"video\"\n", "    return \"unknown\"\n", "\n", "\n", "def show_reasoner_io(sample: dict, output_text: str | None = None) -> None:\n", "    # Render prompt, media input, and model output together for notebook review.\n", "    # The inline styles below are a minimal layout choice; adjust them as needed.\n", "    prompt = escape(sample.get(\"prompt\", \"\"))\n", "    name = escape(sample.get(\"name\", \"sample\"))\n", "    media_path = sample.get(\"vision_path\") or sample.get(\"video_path\")\n", "\n", "    if media_path:\n", "        safe_media_path = escape(str(media_path), quote=True)\n", "        media_kind = _media_kind(str(media_path))\n", "        if media_kind == \"image\":\n", "            media_html = f'<img src=\"{safe_media_path}\" style=\"max-width: 100%;\">'\n", "        elif media_kind == \"video\":\n", "            media_html = f'<video src=\"{safe_media_path}\" controls style=\"max-width: 100%;\"></video>'\n", "        else:\n", "            media_html = f'<code>{safe_media_path}</code>'\n", "    else:\n", "        media_html = \"No image or video input for this sample.\"\n", "\n", "    output_block = \"\"\n", "    if output_text is not None:\n", "        output_block = (\n", "            '<div>'\n", "            '<div>Model output</div>'\n", "            f'<pre style=\"white-space: pre-wrap;\">{escape(output_text.strip())}</pre>'\n", "            '</div>'\n", "        )\n", "\n", "    html = (\n", "        '<div style=\"display: flex; gap: 16px; align-items: flex-start;\">'\n", "        '<div style=\"flex: 1;\">'\n", "        f'<div>Input media: {name}</div>'\n", "        f'{media_html}'\n", "        '</div>'\n", "        '<div style=\"flex: 1;\">'\n", "        '<div>Text prompt</div>'\n", "        f'<pre style=\"white-space: pre-wrap;\">{prompt}</pre>'\n", "        f'{output_block}'\n", "        '</div>'\n", "        '</div>'\n", "    )\n", "    display(HTML(html))\n", "\n", "\n", "def load_reasoner_result(input_path: Path, output_root: Path) -> tuple[dict, str | None]:\n", "    sample = json.loads(input_path.read_text())\n", "    text_path = output_root / sample[\"name\"] / \"reasoner_text.txt\"\n", "    output_text = text_path.read_text() if text_path.exists() else None\n", "    show_reasoner_io(sample, output_text)\n", "    print(\"output file:\", text_path)\n", "    return sample, output_text\n", "\n", "\n", "def extract_json_payload(text: str):\n", "    # Drop the reasoning section so only the final answer is parsed.\n", "    if \"</think>\" in text:\n", "        text = text.split(\"</think>\", 1)[1]\n", "    text = re.sub(r\"```(?:json)?\", \"\", text).strip().strip(\"`\").strip()\n", "    match = re.search(r\"\\[.*\\]|\\{.*\\}\", text, re.DOTALL)\n", "    if not match:\n", "        return None\n", "    try:\n", "        return json.loads(match.group(0))\n", "    except json.JSONDecodeError:\n", "        return None\n", "\n", "\n", "def draw_boxes_if_present(sample: dict, output_text: str | None) -> None:\n", "    if not output_text:\n", "        return\n", "    data = extract_json_payload(output_text)\n", "    if data is None:\n", "        return\n", "    items = data if isinstance(data, list) else [data]\n", "    boxes = []\n", "    for item in items:\n", "        if not isinstance(item, dict):\n", "            continue\n", "        box = item.get(\"bbox_2d\") or item.get(\"bbox\") or item.get(\"box\")\n", "        if box and len(box) == 4:\n", "            boxes.append((box, item.get(\"label\") or item.get(\"name\") or item.get(\"category\")))\n", "    if not boxes:\n", "        return\n", "\n", "    image_path = Path(sample[\"vision_path\"])\n", "    if not image_path.exists():\n", "        return\n", "    img = PILImage.open(image_path).convert(\"RGB\")\n", "    width, height = img.size\n", "    draw = ImageDraw.Draw(img)\n", "    for box, label in boxes:\n", "        x1, y1, x2, y2 = box\n", "        # Cosmos/Qwen grounding prompts commonly return normalized 0-1000 coordinates.\n", "        if max(abs(x1), abs(y1), abs(x2), abs(y2)) <= 1000:\n", "            x1, x2 = x1 / 1000 * width, x2 / 1000 * width\n", "            y1, y2 = y1 / 1000 * height, y2 / 1000 * height\n", "        draw.rectangle([x1, y1, x2, y2], outline=\"red\", width=3)\n", "        if label:\n", "            draw.text((x1, max(0, y1 - 14)), str(label), fill=\"red\")\n", "    img.thumbnail((768, 768))\n", "    display(img)\n", "\n", "\n", "def draw_trajectory_if_present(sample: dict, output_text: str | None) -> None:\n", "    if not output_text:\n", "        return\n", "    data = extract_json_payload(output_text)\n", "    if data is None:\n", "        return\n", "    items = data if isinstance(data, list) else [data]\n", "    points = []\n", "    for item in items:\n", "        if isinstance(item, dict) and \"point_2d\" in item and len(item[\"point_2d\"]) == 2:\n", "            points.append(tuple(item[\"point_2d\"]))\n", "    if not points:\n", "        return\n", "\n", "    image_path = Path(sample[\"vision_path\"])\n", "    if not image_path.exists():\n", "        return\n", "    img = PILImage.open(image_path).convert(\"RGB\")\n", "    width, height = img.size\n", "    scaled = []\n", "    for x, y in points:\n", "        # Same 0-1000 normalization heuristic as the bounding-box helper.\n", "        if max(abs(x), abs(y)) <= 1000:\n", "            scaled.append((x / 1000 * width, y / 1000 * height))\n", "        else:\n", "            scaled.append((x, y))\n", "    draw = ImageDraw.Draw(img)\n", "    if len(scaled) > 1:\n", "        draw.line(scaled, fill=\"lime\", width=5)\n", "    for idx, (x, y) in enumerate(scaled):\n", "        radius = 12\n", "        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=\"red\", outline=\"white\", width=3)\n", "        draw.text((x + 14, y - 14), str(idx), fill=\"yellow\")\n", "    img.thumbnail((900, 900))\n", "    display(img)" ] }, { "cell_type": "markdown", "id": "fb501705", "metadata": {}, "source": [ "## 7. 
Run Nano Text Inference\n", "\n", "This runs `Cosmos3-Nano` on a text-only Reasoner prompt.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_nano_text/nano_text/reasoner_text.txt\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "34136be0", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_NANO_TEXT_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/nano_text.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_nano_text\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9ab0e750", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os, json\n", "\n", "output_root = Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"])\n", "text_path = output_root / \"cosmos_framework_nano_text\" / \"nano_text\" / \"reasoner_text.txt\"\n", "benchmark_path = output_root / \"cosmos_framework_nano_text\" / \"benchmark.json\"\n", "\n", "print(text_path)\n", "print(text_path.read_text())\n", "if benchmark_path.exists():\n", " print(json.dumps(json.loads(benchmark_path.read_text()).get(\"average\", {}), indent=2))\n" ] }, { "cell_type": "markdown", "id": "5e7b2393", "metadata": {}, "source": [ "## 8. Run Nano Image Inference\n", "\n", "This runs `Cosmos3-Nano` on an image-conditioned Reasoner prompt. The result cell below renders the input image, text prompt, and model output side by side. 
The same display helper supports video URLs if future inputs use video files.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_nano_image/nano_image/reasoner_text.txt\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "859b1411", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_NANO_IMAGE_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/nano_image.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_nano_image\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark\n" ] }, { "cell_type": "code", "execution_count": null, "id": "93271819", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os, json\n", "\n", "input_dir = Path(os.environ[\"COSMOS3_INPUT_DIR\"])\n", "output_root = Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"])\n", "sample = json.loads((input_dir / \"nano_image.json\").read_text())\n", "text_path = output_root / \"cosmos_framework_nano_image\" / \"nano_image\" / \"reasoner_text.txt\"\n", "benchmark_path = output_root / \"cosmos_framework_nano_image\" / \"benchmark.json\"\n", "output_text = text_path.read_text()\n", "\n", "show_reasoner_io(sample, output_text)\n", "\n", "print(\"output file:\", text_path)\n", "if benchmark_path.exists():\n", " print(json.dumps(json.loads(benchmark_path.read_text()).get(\"average\", {}), indent=2))\n" ] }, { "cell_type": "markdown", "id": "image-caption-detail-section", "metadata": {}, "source": [ "## 9. Image Caption\n", "\n", "> **Note:** The Cosmos Framework Reasoner examples in this notebook are image-only. 
The current framework entrypoint treats `vision_path` as a PIL image source, so video Reasoner inputs should be run with [`run_with_vllm.ipynb`](./run_with_vllm.ipynb).\n", "\n", "Detailed image captioning example using the same robot image as the smoke test.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_image_caption_detail/image_caption_detail/reasoner_text.txt\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "id": "run-image-caption-detail", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/image_caption_detail.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_image_caption_detail\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-image-caption-detail", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"image_caption_detail.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_image_caption_detail\",\n", ")" ] }, { "cell_type": "markdown", "id": "robot-planning-section", "metadata": {}, "source": [ "## 10. 
Robot Planning\n", "\n", "Embodied planning example for moving the flower into the red bottle.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_robot_planning/robot_planning/reasoner_text.txt\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "run-robot-planning", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/robot_planning.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_robot_planning\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-robot-planning", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"robot_planning.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_robot_planning\",\n", ")" ] }, { "cell_type": "markdown", "id": "ground-load-section", "metadata": {}, "source": [ "## 11. 
2D Grounding\n", "\n", "Grounding example that asks the model to locate the load as a bounding box and renders the parsed box when the output is valid JSON.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_ground_load_bbox/ground_load_bbox/reasoner_text.txt\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "run-ground-load-bbox", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/ground_load_bbox.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_ground_load_bbox\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-ground-load-bbox", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"ground_load_bbox.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_ground_load_bbox\",\n", ")\n", "draw_boxes_if_present(sample, output_text)" ] }, { "cell_type": "markdown", "id": "describe-anything-section", "metadata": {}, "source": [ "## 12. 
Describe Anything\n", "\n", "Marked-subject description example that asks for a JSON list of subject IDs, categories, and captions.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_describe_marked_subjects/describe_marked_subjects/reasoner_text.txt\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "run-describe-marked-subjects", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/describe_marked_subjects.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_describe_marked_subjects\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-describe-marked-subjects", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"describe_marked_subjects.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_describe_marked_subjects\",\n", ")" ] }, { "cell_type": "markdown", "id": "trajectory-bowl-section", "metadata": {}, "source": [ "## 13. Action CoT: Trajectory Coordinates\n", "\n", "Action trajectory example that asks for 2D gripper coordinates for moving the pink bowl to the right. 
The display cell renders parsed points when the output is valid JSON.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_trajectory_bowl/trajectory_bowl/reasoner_text.txt\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "run-trajectory-bowl", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/trajectory_bowl.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_trajectory_bowl\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-trajectory-bowl", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"trajectory_bowl.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_trajectory_bowl\",\n", ")\n", "draw_trajectory_if_present(sample, output_text)" ] }, { "cell_type": "markdown", "id": "trajectory-flower-section", "metadata": {}, "source": [ "## 14. 
Action CoT: Robot Plan Trajectory\n", "\n", "Second trajectory example using the robot planning image and the flower-to-red-bottle task.\n", "\n", "Expected output file:\n", "\n", "```text\n", "packages/cosmos3/outputs/cookbooks/cosmos3/reasoner/nano/cosmos_framework_trajectory_flower/trajectory_flower/reasoner_text.txt\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "run-trajectory-flower", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "cd \"$COSMOS3_REPO\"\n", "COSMOS_TRAINING=false CUDA_VISIBLE_DEVICES=\"$CUDA_VISIBLE_DEVICES\" \\\n", "MASTER_ADDR=\"$COSMOS3_MASTER_ADDR\" MASTER_PORT=\"$COSMOS3_CAPABILITY_MASTER_PORT\" RANK=0 WORLD_SIZE=1 LOCAL_RANK=0 \\\n", ".venv/bin/python -m cosmos_framework.scripts.inference \\\n", " --parallelism-preset=latency \\\n", " -i \"$COSMOS3_INPUT_DIR/capabilities/trajectory_flower.json\" \\\n", " -o \"$COSMOS3_OUTPUT_ROOT/cosmos_framework_trajectory_flower\" \\\n", " --checkpoint-path Cosmos3-Nano \\\n", " --seed=0 \\\n", " --benchmark" ] }, { "cell_type": "code", "execution_count": null, "id": "display-trajectory-flower", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import os\n", "\n", "sample, output_text = load_reasoner_result(\n", " Path(os.environ[\"COSMOS3_INPUT_DIR\"]) / \"capabilities\" / \"trajectory_flower.json\",\n", " Path(os.environ[\"COSMOS3_OUTPUT_ROOT\"]) / \"cosmos_framework_trajectory_flower\",\n", ")\n", "draw_trajectory_if_present(sample, output_text)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }