{ "cells": [ { "cell_type": "markdown", "id": "license-header", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "b1f96e7d", "metadata": {}, "source": [ "### Cosmos3 Reasoner inference with NVIDIA NIM\n", "\n", "This notebook:\n", "1. Launches the [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container, a prebuilt, optimized, OpenAI-compatible server.\n", "2. Waits for it to become ready.\n", "3. Sends the same image and video reasoning requests with the `openai` client.\n", "\n", "Compared to the vLLM notebook, there is no Python environment or CUDA-pairing to set up — the NIM container ships everything. The container serves two sizes, selected with `NIM_MODEL_SIZE`:\n", "\n", "| `NIM_MODEL_SIZE` | Served model name |\n", "| --- | --- |\n", "| `nano` (default) | `nvidia/cosmos3-nano-reasoner` |\n", "| `super` | `nvidia/cosmos3-super-reasoner` |\n", "\n", "You can also try this same NIM interactively in your browser on the [cosmos3-nano-reasoner build page](https://build.nvidia.com/nvidia/cosmos3-nano-reasoner). See the [Cosmos Reason 3 NIM API reference](https://docs.nvidia.com/nim/vision-language-models/1.7.0/examples/cosmos-reason3/api.html) for the full request reference." ] }, { "cell_type": "markdown", "id": "99497707", "metadata": {}, "source": [ "## 1. Launch the NIM container\n", "\n", "You need an `NGC_API_KEY` (from [build.nvidia.com](https://build.nvidia.com/nvidia/cosmos3-nano-reasoner) or [NGC](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner)) set in the environment **before** starting Jupyter, or set it from Python with `os.environ[\"NGC_API_KEY\"] = \"nvapi-...\"`. Docker must also be logged in to `nvcr.io` (`docker login nvcr.io`, username `$oauthtoken`, password = your key).\n", "\n", "The first launch downloads and caches the model into `~/.cache/nim`, which can take a while; later launches start quickly." 
] }, { "cell_type": "code", "execution_count": null, "id": "b286f1b7", "metadata": {}, "outputs": [], "source": [ "import base64\n", "import mimetypes\n", "import os\n", "import subprocess\n", "from pathlib import Path\n", "\n", "\n", "def find_repo_root() -> Path:\n", "    \"\"\"Return the git top-level directory, or the current directory outside a repo.\"\"\"\n", "    try:\n", "        return Path(\n", "            subprocess.check_output([\"git\", \"rev-parse\", \"--show-toplevel\"], text=True).strip()\n", "        ).resolve()\n", "    except Exception:\n", "        # Not a git checkout (or git is unavailable): fall back to the working directory.\n", "        return Path.cwd().resolve()\n", "\n", "\n", "COSMOS_ROOT = find_repo_root()\n", "COSMOS_REASONER_ASSETS = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"reasoner\" / \"assets\"\n", "\n", "assert COSMOS_REASONER_ASSETS.exists(), COSMOS_REASONER_ASSETS\n", "\n", "\n", "def asset_path(name: str) -> Path:\n", "    \"\"\"Return the path of a bundled asset, failing early if it is missing.\"\"\"\n", "    path = COSMOS_REASONER_ASSETS / name\n", "    if not path.exists():\n", "        raise FileNotFoundError(path)\n", "    return path\n", "\n", "\n", "def asset_data_uri(name: str) -> str:\n", "    \"\"\"Read a local asset and return a base64 ``data:`` URI.\n", "\n", "    The NIM container does not see the host filesystem, so local images and\n", "    videos are inlined as base64 data URIs rather than passed as ``file://``\n", "    paths. 
The examples use local assets; public URLs work too.\n", "    \"\"\"\n", "    path = asset_path(name)\n", "    mime = mimetypes.guess_type(path.name)[0] or \"application/octet-stream\"\n", "    encoded = base64.b64encode(path.read_bytes()).decode(\"ascii\")\n", "    return f\"data:{mime};base64,{encoded}\"\n", "\n", "\n", "print(\"cosmos root:\", COSMOS_ROOT)\n", "print(\"Reasoner assets:\", COSMOS_REASONER_ASSETS)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d999ab52", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", ": \"${NGC_API_KEY:?Set NGC_API_KEY (from build.nvidia.com / NGC) before launching}\"\n", "\n", "export CONTAINER_NAME=\"${CONTAINER_NAME:-nvidia-cosmos3-reasoner}\"\n", "export IMG_NAME=\"${IMG_NAME:-nvcr.io/nim/nvidia/cosmos3-reasoner:1.7.0}\"\n", "# Set NIM_MODEL_SIZE=super for the larger Cosmos3-Super-Reasoner.\n", "export NIM_MODEL_SIZE=\"${NIM_MODEL_SIZE:-nano}\"\n", "export LOCAL_NIM_CACHE=\"${LOCAL_NIM_CACHE:-$HOME/.cache/nim}\"\n", "export NIM_PORT=\"${NIM_PORT:-8000}\"\n", "mkdir -p \"$LOCAL_NIM_CACHE\"\n", "\n", "# The container name must be free. If a previous run is still up, stop it first:\n", "# docker stop nvidia-cosmos3-reasoner\n", "# Detached (-d) so the notebook keeps control; logs are read by the next cell.\n", "docker run -d --rm --name=\"$CONTAINER_NAME\" \\\n", " --runtime=nvidia \\\n", " --gpus all \\\n", " --shm-size=32GB \\\n", " -e NGC_API_KEY=\"$NGC_API_KEY\" \\\n", " -e NIM_MODEL_SIZE=\"$NIM_MODEL_SIZE\" \\\n", " -v \"$LOCAL_NIM_CACHE:/opt/nim/.cache\" \\\n", " -u \"$(id -u)\" \\\n", " -p \"${NIM_PORT}:8000\" \\\n", " \"$IMG_NAME\"\n", "\n", "echo \"NIM container '$CONTAINER_NAME' (size=$NIM_MODEL_SIZE) launching on port $NIM_PORT\"\n" ] }, { "cell_type": "markdown", "id": "265a72d4", "metadata": {}, "source": [ "## 2. Wait for the server\n", "\n", "The next cell polls the NIM readiness endpoint and prints recent container logs if the container exits or the wait times out. 
The first run downloads the model, so allow several minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "a5f45c3d", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "set -euo pipefail\n", "\n", "PORT=\"${NIM_PORT:-8000}\"\n", "CONTAINER_NAME=\"${CONTAINER_NAME:-nvidia-cosmos3-reasoner}\"\n", "\n", "echo \"Waiting for NIM server on port ${PORT} (first run downloads the model)...\"\n", "\n", "for i in $(seq 1 3600); do\n", " if curl -fsS \"http://127.0.0.1:${PORT}/v1/health/ready\" >/dev/null 2>&1; then\n", " echo \"NIM server is ready.\"\n", " exit 0\n", " fi\n", " if ! docker ps --format '{{.Names}}' | grep -q \"^${CONTAINER_NAME}$\"; then\n", " echo \"Container '${CONTAINER_NAME}' is no longer running. Recent logs:\"\n", " docker logs --tail 120 \"$CONTAINER_NAME\" 2>&1 || true\n", " exit 1\n", " fi\n", " sleep 2\n", "done\n", "\n", "echo \"Timed out waiting for NIM server. Recent logs:\"\n", "docker logs --tail 120 \"$CONTAINER_NAME\" 2>&1 || true\n", "exit 1\n" ] }, { "cell_type": "markdown", "id": "8b6c2d78", "metadata": {}, "source": [ "## 3. Query the model\n", "\n", "Local images and videos are sent as base64 data URIs (the container does not see the host filesystem). Each request resolves the served model dynamically with `client.models.list()`, so the examples work unchanged for both the `nano` and `super` sizes." 
] }, { "cell_type": "markdown", "id": "410f97ba", "metadata": {}, "source": [ "### Image Caption" ] }, { "cell_type": "code", "execution_count": null, "id": "3022311f", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from IPython.display import Image, display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = asset_path(\"robot_153.jpg\")\n", "image_url = asset_data_uri(image_path.name)\n", "\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", " {\"type\": \"text\", \"text\": \"Caption the image in detail.\"},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " seed=0,\n", ")\n", "display(Image(filename=str(image_path), width=512))\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "77326560", "metadata": {}, "source": [ "### Video Caption" ] }, { "cell_type": "code", "execution_count": null, "id": "7f1ea8e7", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "prompt = \"Describe the video in detail.\"\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"video_caption.mp4\"))\n", "# base64 data URI (used for the model request)\n", "video_url = asset_data_uri(Path(video_path).name)\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " 
extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "# Display the input video (plain path, not the base64 data URI sent to the model)\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "87d32bd6", "metadata": {}, "source": [ "### Temporal Localization" ] }, { "cell_type": "code", "execution_count": null, "id": "010a21ca", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"temporal_localization_1.mp4\"))\n", "\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "import openai\n", "\n", "prompt = (\n", " \"\"\"List all action segments in the video. For each detected event, you must determine:\n", "\n", "Provide the result in json format with 'seconds' for time depiction for each event. Use keywords 'start', 'end' and 'caption' in the json output. 
Please list multiple events if applicable.\n", "\n", "```json\n", "[\n", "{\n", " \"start\": t_start,\n", " \"end\": t_end,\n", " \"caption\": EVENT1\n", "},\n", "{\n", " \"start\": t_start,\n", " \"end\": t_end,\n", " \"caption\": EVENT2\n", "},\n", "...\n", "]\n", "``` \"\"\"\n", ")\n", "video_url = asset_data_uri(\"temporal_localization_1.mp4\")\n", "\n", "client = openai.OpenAI(\n", " api_key=\"not-used\",\n", " base_url=\"http://localhost:8000/v1\",\n", ")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": null, "id": "0a76e9f5", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"temporal_localization_2.mp4\"))\n", "\n", "display(Video(video_path, embed=True, width=640))" ] }, { "cell_type": "markdown", "id": "809e9c57", "metadata": {}, "source": [ "#### Event Timeline" ] }, { "cell_type": "code", "execution_count": null, "id": "61b206c0", "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "prompt = (\n", " \"Describe the notable events in the provided video. 
Provide the result in json format with 'mm:ss.ff' format for time depiction for each event. \"\n", " \"Use keywords 'start', 'end' and 'caption' in the json output.\"\n", ")\n", "video_url = asset_data_uri(\"temporal_localization_2.mp4\")\n", "\n", "client = openai.OpenAI(\n", " api_key=\"not-used\",\n", " base_url=\"http://localhost:8000/v1\",\n", ")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "bbb6d43e", "metadata": {}, "source": [ "#### Timestamp Query" ] }, { "cell_type": "code", "execution_count": null, "id": "40bfb46e", "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "prompt = \"\"\"When is \"A man in a white sweater walks out of a room carrying a box, closes the door behind him, walks on the floor, and turns left at the end near the wall.\" depicted in the video? Please provide the result in json format with 'mm:ss.ff' format for time depiction for the event. 
Use keywords 'start', 'end' in the json output.\"\"\"\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "f1fbc5ed", "metadata": {}, "source": [ "#### Interval Question" ] }, { "cell_type": "code", "execution_count": null, "id": "b8d26975", "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "prompt = \"What happened between 00:05.64 and 00:17.49?\"\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "ca919a80", "metadata": {}, "source": [ "### Embodied Reasoning\n", "#### Robotics Next Action" ] }, { "cell_type": "code", "execution_count": null, "id": "05a7bbe6", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "prompt = \"What can be the next immediate action? Answer the question using the following format: <think> Your reasoning. </think> 
Write your final answer immediately after the </think> tag.\"\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"robotics_next_action.mp4\"))\n", "# base64 data URI (used for the model request)\n", "video_url = asset_data_uri(Path(video_path).name)\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "# Display the input video (plain path, not the base64 data URI sent to the model)\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "b79e6056", "metadata": {}, "source": [ "#### Drive Scene Next Action" ] }, { "cell_type": "code", "execution_count": null, "id": "c8393d28", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "prompt = \"You are an autonomous vehicle planning system. The video depicts the observation from the vehicle's camera. 
You need to observe the critical objects in the environment and reason your next action and the driving trajectory ahead.\"\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"drive_scene_next_action.mp4\"))\n", "# base64 data URI (used for the model request)\n", "video_url = asset_data_uri(Path(video_path).name)\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "# Display the input video (plain path, not the base64 data URI sent to the model)\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "1f97ac7c", "metadata": {}, "source": [ "#### Robot Planning" ] }, { "cell_type": "code", "execution_count": null, "id": "7a2377d3", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import openai\n", "from pathlib import Path\n", "from PIL import Image as PILImage, ImageDraw\n", "from IPython.display import display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = str(asset_path(\"robot_planning.png\"))\n", "image_url = asset_data_uri(Path(image_path).name) # base64 data URI for the model\n", "\n", "# Display the input image (scaled down to fit the cell)\n", "preview = PILImage.open(image_path).convert(\"RGB\")\n", "preview.thumbnail((768, 768))\n", "display(preview)\n", "\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " 
messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", " {\"type\": \"text\", \"text\": 'The task is to put flower into the red bottle. Generate a plan consisting of subtasks for accomplish the task.'},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " seed=0,\n", ")\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "4c163252", "metadata": {}, "source": [ "#### Assisted Task Next Action" ] }, { "cell_type": "code", "execution_count": null, "id": "3c6147d2", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "prompt = \"\"\"This is the overall task that the agent is trying to complete: \"The student exchanges the black ink cartridge of the printer.\"\n", " In the video, the agent is trying to follow the instruction (a single step out of many to complete the overall task): \"place old ink_cartridge.\"\n", " What should be the next action of the agent?\n", " Answer the question using the following format:\n", " <think>\n", " Your reasoning.\n", " </think>\n", " Write your final answer immediately after the </think> tag.\"\"\"\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"assisted_task_next_action.mp4\"))\n", "# base64 data URI (used for the model request)\n", "video_url = asset_data_uri(Path(video_path).name)\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "# Display the 
input video (plain path, not the base64 data URI sent to the model)\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "1b956864", "metadata": {}, "source": [ "### Common Sense Reasoning" ] }, { "cell_type": "code", "execution_count": null, "id": "50b25fa6", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "\n", "prompt = \"\"\"Can the countertop support the weight of the juicers?\n", " Answer the question using the following format:\n", "\n", " <think>\n", " Your reasoning.\n", " </think>\n", "\n", " Write your final answer immediately after the </think> tag.\"\"\"\n", "\n", "# Plain filesystem path (used for display)\n", "video_path = str(asset_path(\"common_sense_reasoning.mp4\"))\n", "# base64 data URI (used for the model request)\n", "video_url = asset_data_uri(Path(video_path).name)\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "\n", "response = client.chat.completions.create(\n", " model=client.models.list().data[0].id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " },\n", " ],\n", " max_tokens=4096,\n", " extra_body={\"media_io_kwargs\": {\"video\": {\"fps\": 4.0}}},\n", ")\n", "\n", "# Display the input video (plain path, not the base64 data URI sent to the model)\n", "display(Video(video_path, embed=True, width=640))\n", "\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "60835463", "metadata": {}, "source": [ "### 2D Grounding" ] }, { "cell_type": "code", "execution_count": null, "id": "53a97880", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import openai\n", "from pathlib import Path\n", "from PIL import Image as 
PILImage, ImageDraw\n", "from IPython.display import display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = str(asset_path(\"grounding_2d.png\"))\n", "image_url = asset_data_uri(Path(image_path).name)  # base64 data URI for the model\n", "\n", "response = client.chat.completions.create(\n", "    model=MODEL,\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", "                {\"type\": \"text\", \"text\": \"Locate the accurate bounding box of the load as a whole. Return a json.\"},\n", "            ],\n", "        }\n", "    ],\n", "    max_tokens=4096,\n", "    seed=0,\n", ")\n", "out = response.choices[0].message.content\n", "print(out)\n", "\n", "\n", "def parse_boxes(text):\n", "    \"\"\"Pull a JSON array/object of boxes out of the model text (handles ``` fences).\"\"\"\n", "    text = text.strip()\n", "    text = re.sub(r\"^```(?:json)?|```$\", \"\", text, flags=re.MULTILINE).strip()\n", "    m = re.search(r\"\\[.*\\]|\\{.*\\}\", text, re.DOTALL)\n", "    data = json.loads(m.group(0) if m else text)\n", "    return data if isinstance(data, list) else [data]\n", "\n", "\n", "# Draw boxes; coords are normalized to 0-1000\n", "img = PILImage.open(image_path).convert(\"RGB\")\n", "W, H = img.size\n", "draw = ImageDraw.Draw(img)\n", "\n", "for obj in parse_boxes(out):\n", "    box = obj.get(\"bbox_2d\") or obj.get(\"bbox\") or obj.get(\"box\")\n", "    if not box:\n", "        continue\n", "    x1, y1, x2, y2 = box\n", "    # Scale the 0-1000 normalized coordinates to pixels on each axis\n", "    x1, x2 = x1 / 1000 * W, x2 / 1000 * W\n", "    y1, y2 = y1 / 1000 * H, y2 / 1000 * H\n", "    draw.rectangle([x1, y1, x2, y2], outline=\"red\", width=3)\n", "    label = obj.get(\"label\") or obj.get(\"name\")\n", "    if label:\n", "        draw.text((x1, max(0, y1 - 12)), str(label), fill=\"red\")\n", "\n", "# Display scaled down so a large image fits the cell\n", "preview = img.copy()\n", "preview.thumbnail((768, 768))\n", 
"display(preview)" ] }, { "cell_type": "markdown", "id": "16ed27e9", "metadata": {}, "source": [ "### Describe Anything" ] }, { "cell_type": "code", "execution_count": null, "id": "8a73b225", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import openai\n", "from pathlib import Path\n", "from PIL import Image as PILImage, ImageDraw\n", "from IPython.display import display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = str(asset_path(\"describe_anything.png\"))\n", "image_url = asset_data_uri(Path(image_path).name) # base64 data URI for the model\n", "\n", "# Display the input image (scaled down to fit the cell)\n", "preview = PILImage.open(image_path).convert(\"RGB\")\n", "preview.thumbnail((768, 768))\n", "display(preview)\n", "\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", " {\"type\": \"text\", \"text\": 'Please caption the notable attributes in the provided image. 
List and describe all marked subjects in the image with their categories and detailed captions using a json with keyword \"subject_id\", \"category\" and \"caption\".'},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " seed=0,\n", ")\n", "print(response.choices[0].message.content)\n" ] }, { "cell_type": "markdown", "id": "303ae5af", "metadata": {}, "source": [ "### Action CoT" ] }, { "cell_type": "markdown", "id": "bd9e82ea", "metadata": {}, "source": [ "#### Trajectory Coordinates" ] }, { "cell_type": "code", "execution_count": null, "id": "7a020759", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import openai\n", "from pathlib import Path\n", "from PIL import Image as PILImage, ImageDraw\n", "from IPython.display import display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = str(asset_path(\"action_cot_trajectory.png\"))\n", "image_url = asset_data_uri(Path(image_path).name)\n", "\n", "prompt = \"\"\"You are given the task \"Move the pink bowl to the right\". Specify the 2D trajectory your end effector should follow in pixel space. 
Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}.\n", "Answer the question using the following format:\n", "\n", "<think>\n", "Your reasoning.\n", "</think>\n", "\n", "Write your final answer immediately after the </think> tag.\n", "\"\"\"\n", "\n", "response = client.chat.completions.create(\n", "    model=MODEL,\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", "                {\"type\": \"text\", \"text\": prompt},\n", "            ],\n", "        }\n", "    ],\n", "    max_tokens=4096,\n", "    temperature=0.6,\n", "    top_p=0.95,\n", "    presence_penalty=0.0,\n", "    extra_body={\"top_k\": 20, \"repetition_penalty\": 1.0},\n", ")\n", "out = response.choices[0].message.content\n", "print(out)\n", "\n", "\n", "def parse_points(text):\n", "    \"\"\"Grab the JSON list of {point_2d, label} after the </think> tag.\"\"\"\n", "    if \"</think>\" in text:\n", "        text = text.split(\"</think>\")[-1]\n", "    text = re.sub(r\"```(?:json)?\", \"\", text).strip().strip(\"`\").strip()\n", "    m = re.search(r\"\\[.*\\]\", text, re.DOTALL)\n", "    data = json.loads(m.group(0) if m else text)\n", "    return data if isinstance(data, list) else [data]\n", "\n", "\n", "# Visualize the trajectory on the input image\n", "img = PILImage.open(image_path).convert(\"RGB\")\n", "draw = ImageDraw.Draw(img)\n", "W, H = img.size\n", "\n", "# coords are normalized to 0-1000 (per-axis) -> scale to pixels\n", "pts = [(o[\"point_2d\"][0] / 1000 * W, o[\"point_2d\"][1] / 1000 * H)\n", "       for o in parse_points(out) if isinstance(o, dict) and \"point_2d\" in o]\n", "if len(pts) > 1:\n", "    draw.line(pts, fill=\"lime\", width=5)\n", "for i, (x, y) in enumerate(pts):\n", "    r = 12\n", "    draw.ellipse([x - r, y - r, x + r, y + r], fill=\"red\", outline=\"white\", width=3)\n", "    draw.text((x + 14, y - 14), str(i), fill=\"yellow\")\n", "preview = img.copy()\n", "preview.thumbnail((900, 900))\n", "display(preview)" ] }, { "cell_type": "code", 
"execution_count": null, "id": "3351c7e4", "metadata": {}, "outputs": [], "source": [ "import json\n", "import re\n", "import openai\n", "from pathlib import Path\n", "from PIL import Image as PILImage, ImageDraw\n", "from IPython.display import display\n", "\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "\n", "image_path = str(asset_path(\"robot_planning.png\"))\n", "image_url = asset_data_uri(Path(image_path).name)\n", "\n", "prompt = \"\"\"You are given the task \"Put flower into the red bottle\". Specify the 2D trajectory your end effector should follow in pixel space. Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}.\n", "Answer the question using the following format:\n", "\n", "<think> Your reasoning. </think>\n", "Write your final answer immediately after the </think> tag.\n", "\"\"\"\n", "\n", "response = client.chat.completions.create(\n", "    model=MODEL,\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n", "                {\"type\": \"text\", \"text\": prompt},\n", "            ],\n", "        }\n", "    ],\n", "    max_tokens=4096,\n", "    temperature=0.6,\n", "    top_p=0.95,\n", "    presence_penalty=0.0,\n", "    extra_body={\"top_k\": 20, \"repetition_penalty\": 1.0},\n", ")\n", "out = response.choices[0].message.content\n", "print(out)\n", "\n", "\n", "def parse_points(text):\n", "    \"\"\"Grab the JSON list of {point_2d, label} after the </think> tag.\"\"\"\n", "    if \"</think>\" in text:\n", "        text = text.split(\"</think>\")[-1]\n", "    text = re.sub(r\"```(?:json)?\", \"\", text).strip().strip(\"`\").strip()\n", "    m = re.search(r\"\\[.*\\]\", text, re.DOTALL)\n", "    data = json.loads(m.group(0) if m else text)\n", "    return data if isinstance(data, list) else [data]\n", "\n", "\n", "# Visualize the trajectory on the input image\n", "img = PILImage.open(image_path).convert(\"RGB\")\n", "draw 
= ImageDraw.Draw(img)\n", "W, H = img.size\n", "\n", "# coords are normalized to 0-1000 (per-axis) -> scale to pixels\n", "pts = [(o[\"point_2d\"][0] / 1000 * W, o[\"point_2d\"][1] / 1000 * H)\n", " for o in parse_points(out) if isinstance(o, dict) and \"point_2d\" in o]\n", "if len(pts) > 1:\n", " draw.line(pts, fill=\"lime\", width=5)\n", "for i, (x, y) in enumerate(pts):\n", " r = 12\n", " draw.ellipse([x - r, y - r, x + r, y + r], fill=\"red\", outline=\"white\", width=3)\n", " draw.text((x + 14, y - 14), str(i), fill=\"yellow\")\n", "preview = img.copy()\n", "preview.thumbnail((900, 900))\n", "display(preview)" ] }, { "cell_type": "markdown", "id": "6cb3d76e", "metadata": {}, "source": [ "#### Driving Scene" ] }, { "cell_type": "code", "execution_count": null, "id": "2d92a055", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "video_path = str(asset_path(\"action_cot_driving_scene.mp4\"))\n", "video_url = asset_data_uri(Path(video_path).name)\n", "prompt = \"\"\"The video depicts the observation from the vehicle's camera. 
You need to think step by step and identify the objects in the scene that are critical for safe navigation.\n", "Answer the question using the following format:\n", "<think>\n", "Your reasoning.\n", "</think>\n", "Write your final answer immediately after the </think> tag.\"\"\"\n", "# Show the input video\n", "display(Video(video_path, embed=True, width=640))\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " temperature=0.6,\n", " top_p=0.95,\n", " presence_penalty=0.0,\n", " extra_body={\n", " \"top_k\": 20,\n", " \"repetition_penalty\": 1.0,\n", " \"media_io_kwargs\": {\"video\": {\"fps\": 4.0}},\n", " },\n", ")\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "f9c09ca4", "metadata": {}, "source": [ "### Physical Plausibility Analysis" ] }, { "cell_type": "code", "execution_count": null, "id": "fd80d461", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "video_path = str(asset_path(\"physical_plausibility.mp4\"))\n", "video_url = asset_data_uri(Path(video_path).name)\n", "prompt = \"\"\"Is this video physically plausible/possible according to your understanding of e.g. object permanence, shape constancy (objects maintain shape over time), continuous trajectories of objects? Assume it is the normal laws of physics.\n", "Your answer should be based on the events in the video and ignore the quality of the simulation engine. 
The rising wall is part of the experiment setup and should not be judged for plausibility.\n", "(A) Possible\n", "(B) Impossible\"\"\"\n", "# Show the input video\n", "display(Video(video_path, embed=True, width=640))\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " extra_body={\n", " \"media_io_kwargs\": {\"video\": {\"fps\": 4.0}},\n", " },\n", ")\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "4fa05fab", "metadata": {}, "source": [ "### Situation Understanding" ] }, { "cell_type": "code", "execution_count": null, "id": "c663f0c4", "metadata": {}, "outputs": [], "source": [ "import openai\n", "from pathlib import Path\n", "from IPython.display import Video, display\n", "client = openai.OpenAI(api_key=\"not-used\", base_url=\"http://localhost:8000/v1\")\n", "MODEL = client.models.list().data[0].id\n", "video_path = str(asset_path(\"situation_understanding.mp4\"))\n", "video_url = asset_data_uri(Path(video_path).name)\n", "prompt = \"What is the person doing with the skillet? 
What will the person likely do next in this situation?\"\n", "# Show the input video\n", "display(Video(video_path, embed=True, width=640))\n", "response = client.chat.completions.create(\n", " model=MODEL,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n", " {\"type\": \"text\", \"text\": prompt},\n", " ],\n", " }\n", " ],\n", " max_tokens=4096,\n", " extra_body={\n", " \"media_io_kwargs\": {\"video\": {\"fps\": 4.0}},\n", " },\n", ")\n", "print(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": null, "id": "27b44dde", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }