{
"cells": [
{
"cell_type": "markdown",
"id": "license-header",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "b4e72c4a",
"metadata": {},
"source": [
"### Cosmos3-Super inference with vLLM\n",
"\n",
"This notebook:\n",
"1. Sets up an isolated environment (`vllm` + the `vllm-cosmos3` plugin).\n",
"2. Launches an OpenAI-compatible vLLM server in the background.\n",
"3. Sends image and video requests with the `openai` client.\n",
"\n",
"`Cosmos3-Super` is larger than `Cosmos3-Nano` and is served with tensor parallelism across 4 GPUs."
]
},
{
"cell_type": "markdown",
"id": "3f8b961e",
"metadata": {},
"source": [
"## 1. Environment setup\n",
"\n",
"Create a local `.venv` and install vLLM plus the `vllm-cosmos3` plugin, which registers the `Cosmos3ReasonerForConditionalGeneration` architecture. This can take a few minutes.\n",
"\n",
"The install below pins a CUDA build of `torch`/`vllm`, and it must match your NVIDIA driver:\n",
"\n",
"| Driver CUDA | Install |\n",
"| --- | --- |\n",
"| 13.x | `--torch-backend=cu130 \"vllm==0.21.0\"` (default below) |\n",
"| 12.x (most machines today) | `--torch-backend=cu128 \"vllm==0.19.1\"` |\n",
"\n",
"`cu130` wheels need a CUDA 13 driver; on a CUDA 12.x driver use the `cu128` pair instead, or `torch.cuda.is_available()` returns `False` and the server falls back/fails. vLLM does not publish a wheel for every CUDA minor version, so `--torch-backend=auto` is not reliable here.\n",
"\n",
"The model is gated on Hugging Face, so log in once before launching (a token with access is required):\n",
"\n",
"```bash\n",
"hf auth login\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abc67007",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import os\n",
"import subprocess\n",
"\n",
"\n",
"def find_repo_root() -> Path:\n",
" try:\n",
" return Path(\n",
" subprocess.check_output([\"git\", \"rev-parse\", \"--show-toplevel\"], text=True).strip()\n",
" ).resolve()\n",
" except Exception:\n",
" return Path.cwd().resolve()\n",
"\n",
"\n",
"COSMOS_ROOT = find_repo_root()\n",
"COSMOS_REASONER_ASSETS = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"reasoner\" / \"assets\"\n",
"COSMOS3_MEDIA_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\"\n",
"\n",
"assert COSMOS_REASONER_ASSETS.exists(), COSMOS_REASONER_ASSETS\n",
"\n",
"os.environ[\"COSMOS_ROOT\"] = str(COSMOS_ROOT)\n",
"os.environ[\"COSMOS_REASONER_ASSETS\"] = str(COSMOS_REASONER_ASSETS)\n",
"os.environ[\"COSMOS3_MEDIA_ROOT\"] = str(COSMOS3_MEDIA_ROOT)\n",
"\n",
"\n",
"def asset_path(name: str) -> Path:\n",
" path = COSMOS_REASONER_ASSETS / name\n",
" if not path.exists():\n",
" raise FileNotFoundError(path)\n",
" return path\n",
"\n",
"\n",
"def asset_url(name: str) -> str:\n",
" return asset_path(name).resolve().as_uri()\n",
"\n",
"\n",
"print(\"cosmos root:\", COSMOS_ROOT)\n",
"print(\"Reasoner assets:\", COSMOS_REASONER_ASSETS)\n",
"print(\"Allowed media root:\", COSMOS3_MEDIA_ROOT)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55d4497b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%bash\n",
"set -euo pipefail\n",
"\n",
": \"${COSMOS_ROOT:=$(git rev-parse --show-toplevel 2>/dev/null || pwd)}\"\n",
": \"${COSMOS3_REPO:=$COSMOS_ROOT/packages/cosmos3}\"\n",
": \"${COSMOS3_GIT_URL:=https://github.com/NVIDIA/cosmos-framework.git}\"\n",
"\n",
"mkdir -p \"$(dirname \"$COSMOS3_REPO\")\"\n",
"\n",
"if [ -d \"$COSMOS3_REPO/.git\" ]; then\n",
" echo \"Using existing framework checkout: $COSMOS3_REPO\"\n",
"else\n",
" echo \"Cloning $COSMOS3_GIT_URL into $COSMOS3_REPO\"\n",
" git clone \"$COSMOS3_GIT_URL\" \"$COSMOS3_REPO\"\n",
"fi\n",
"\n",
"if [ -x .venv/bin/python ]; then\n",
" echo \"Using existing venv: $PWD/.venv\"\n",
"else\n",
" uv venv --python 3.13 --seed --managed-python\n",
"fi\n",
"\n",
"uv pip install --python .venv/bin/python --torch-backend=cu130 \\\n",
" \"vllm==0.21.0\" \\\n",
" \"$COSMOS3_REPO/packages/transformers-cosmos3\" \\\n",
" \"$COSMOS3_REPO/packages/vllm-cosmos3\"\n"
]
},
{
"cell_type": "markdown",
"id": "7fc6b07a",
"metadata": {},
"source": [
"## 2. Launch the vLLM server\n",
"\n",
"Start the OpenAI-compatible server in the background (output goes to `vllm_server.log`). The first start compiles CUDA graphs and can take several minutes. The next cell waits for server readiness automatically while streaming new log lines.\n",
"\n",
"`Cosmos3-Super` is tested with 4 GPUs: it is launched with `--tensor-parallel-size 4` and `CUDA_VISIBLE_DEVICES=0,1,2,3`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d53a0ead",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# `setsid` is only needed in Jupyter: it detaches the server into its own session so it\n",
"# survives kernel interrupts/restarts. In a plain terminal you can drop `setsid` (and `& disown`)\n",
"# and just run `vllm serve ...`.\n",
"# Output goes to vllm_server.log (watched by the next cell).\n",
"# Cosmos3-Super is served across 4 GPUs (tensor-parallel-size 4).\n",
": \"${COSMOS3_MEDIA_ROOT:=$(dirname \"$(pwd)\")}\"\n",
"export TMPDIR=\"/tmp/${USER:-vllm}-vllm\"\n",
"export VLLM_PORT=\"${VLLM_PORT:-8001}\"\n",
"export VLLM_LOG_FILE=\"${VLLM_LOG_FILE:-vllm_server.log}\"\n",
"mkdir -p \"$TMPDIR\"\n",
"CUDA_VISIBLE_DEVICES=0,1,2,3 \\\n",
"setsid .venv/bin/vllm serve nvidia/Cosmos3-Super \\\n",
" --hf-overrides '{\"architectures\": [\"Cosmos3ReasonerForConditionalGeneration\"]}' \\\n",
" --tensor-parallel-size 4 \\\n",
" --mm-encoder-tp-mode data \\\n",
" --async-scheduling \\\n",
" --allowed-local-media-path \"$COSMOS3_MEDIA_ROOT\" \\\n",
" --media-io-kwargs '{\"video\": {\"num_frames\": -1}}' \\\n",
" --port \"$VLLM_PORT\" > \"$VLLM_LOG_FILE\" 2>&1 &\n",
"disown\n",
"echo \"Server launching in background; watch $VLLM_LOG_FILE\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a6a4314",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%bash\n",
"set -euo pipefail\n",
"\n",
"PORT=\"${VLLM_PORT:-8001}\"\n",
"LOG_FILE=\"${VLLM_LOG_FILE:-vllm_server.log}\"\n",
"\n",
"echo \"Waiting for vLLM server on port ${PORT}...\"\n",
"touch \"$LOG_FILE\"\n",
"\n",
"LAST_LINE=0\n",
"\n",
"for i in $(seq 1 1800); do\n",
" TOTAL_LINES=$(wc -l < \"$LOG_FILE\" || echo 0)\n",
"\n",
" if [ \"$TOTAL_LINES\" -gt \"$LAST_LINE\" ]; then\n",
" sed -n \"$((LAST_LINE + 1)),$TOTAL_LINES p\" \"$LOG_FILE\"\n",
" LAST_LINE=\"$TOTAL_LINES\"\n",
" fi\n",
"\n",
" if curl -fsS \"http://127.0.0.1:${PORT}/health\" >/dev/null 2>&1; then\n",
" echo \"vLLM server is ready.\"\n",
" exit 0\n",
" fi\n",
"\n",
" sleep 1\n",
"done\n",
"\n",
"echo \"Timed out waiting for vLLM server.\"\n",
"tail -n 120 \"$LOG_FILE\"\n",
"exit 1\n"
]
},
{
"cell_type": "markdown",
"id": "ce7c344c",
"metadata": {},
"source": [
"## 3. Query the model"
]
},
{
"cell_type": "markdown",
"id": "6b8dae47",
"metadata": {},
"source": [
"### Image Caption"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c9851ca",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from IPython.display import Image, display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = asset_path(\"robot_153.jpg\")\n",
"image_url = image_path.resolve().as_uri()\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": \"Caption the image in detail.\"},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" seed=0,\n",
")\n",
"display(Image(filename=str(image_path), width=512))\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "1611a4c5",
"metadata": {},
"source": [
"### Video Caption"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ecc3a0d6",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"prompt = \"Describe the video in detail.\"\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"video_caption.mp4\"))\n",
"# file:// URL (used for the model request)\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"# Display the input video (plain path, NOT the file:// URL)\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "76977e81-5502-432a-93e7-6c036a8d3ea0",
"metadata": {},
"source": [
"### Temporal Localization"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a23cde6-f552-4354-8509-8a914b0d0382",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"temporal_localization_1.mp4\"))\n",
"\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"import openai\n",
"\n",
"prompt = (\n",
" \"\"\"List all action segments in the video. For each detected event, you must determine:\n",
"\n",
"Provide the result in json format with 'seconds' for time depiction for each event. Use keywords 'start', 'end' and 'caption' in the json output. Please list multiple events if applicable.\n",
"\n",
"```json\n",
"[\n",
"{\n",
" \"start\": t_start,\n",
" \"end\": t_end,\n",
" \"caption\": EVENT1\n",
"},\n",
"{\n",
" \"start\": t_start,\n",
" \"end\": t_end,\n",
" \"caption\": EVENT2\n",
"},\n",
"...\n",
"]\n",
"``` \"\"\"\n",
")\n",
"video_url = asset_url(\"temporal_localization_1.mp4\")\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=\"EMPTY\",\n",
" base_url=\"http://localhost:8001/v1\",\n",
")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e394917-2872-440d-a514-8933a4704425",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"temporal_localization_2.mp4\"))\n",
"\n",
"display(Video(video_path, embed=True, width=640))"
]
},
{
"cell_type": "markdown",
"id": "30321842-baa1-4b4e-bf75-19e085ffe927",
"metadata": {},
"source": [
"#### Event Timeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db0d2a8b-216d-41c7-a121-e4fc35b65bb0",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openai\n",
"\n",
"prompt = (\n",
" \"Describe the notable events in the provided video. Provide the result in json format with 'mm:ss.ff' format for time depiction for each event.\"\n",
" \"Use keywords 'start', 'end' and 'caption' in the json output.\"\n",
")\n",
"video_url = asset_url(\"temporal_localization_2.mp4\")\n",
"\n",
"client = openai.OpenAI(\n",
" api_key=\"EMPTY\",\n",
" base_url=\"http://localhost:8001/v1\",\n",
")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "946e0382-b5ff-4cb7-a131-422681d7e63a",
"metadata": {},
"source": [
"#### Timestamp Query"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d6f2813-d243-4071-8d85-bcd26901a9ca",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"prompt = \"\"\"When is \"A man in a white sweater walks out of a room carrying a box, closes the door behind him, walks on the floor, and turns left at the end near the wall.\" depicted in the video? Please provide the result in json format with 'mm:ss.ff' format for time depiction for the event. Use keywords 'start', 'end' in the json output.\"\"\"\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "b8c4859e-4a4f-4ab2-846c-f116248295a5",
"metadata": {},
"source": [
"#### Interval Question"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e10b4d2-6b4e-4185-b6d5-06a439f2e8fe",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"prompt = \"What happened between 00:05.64 and 00:17.49?\"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "6b561503-fe30-4b15-bc59-7878fe1acc32",
"metadata": {},
"source": [
"### Embodied Reasoning\n",
"#### Robotics Next Action"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dade5f09-ee93-484f-9fbc-baa1a72aef0b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"prompt = \"What can be the next immediate action? Answer the question using the following format: Your reasoning. Write your final answer immediately after the tag.\"\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"robotics_next_action.mp4\"))\n",
"# file:// URL (used for the model request)\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"# Display the input video (plain path, NOT the file:// URL)\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "62acdd68-68e1-4dbe-a3d2-f31e0bceb023",
"metadata": {},
"source": [
"#### Drive Scene Next Action"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "080c7dfc-b392-49e6-8b94-6ff0868be0d7",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"prompt = \"You are an autonomous vehicle planning system. The video depicts the observation from the vehicle's camera. You need to observe the critical objects in the environment and reason your next action and the driving trajectory ahead.\"\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"drive_scene_next_action.mp4\"))\n",
"# file:// URL (used for the model request)\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"# Display the input video (plain path, NOT the file:// URL)\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "47597159-f8b1-4d4f-a3fe-87074af41e22",
"metadata": {},
"source": [
"#### Robot Planning"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61517853-1cf3-4f77-8dcb-e393f0851bbe",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"import openai\n",
"from pathlib import Path\n",
"from PIL import Image as PILImage, ImageDraw\n",
"from IPython.display import display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = str(asset_path(\"robot_planning.png\"))\n",
"image_url = Path(image_path).resolve().as_uri() # file:// URL for the model\n",
"\n",
"# Display the input image (scaled down to fit the cell)\n",
"preview = PILImage.open(image_path).convert(\"RGB\")\n",
"preview.thumbnail((768, 768))\n",
"display(preview)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": 'The task is to put flower into the red bottle. Generate a plan consisting of subtasks for accomplish the task.'},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" seed=0,\n",
")\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "08d33a4e-d5fb-4bc1-9fef-56e549a41ff8",
"metadata": {},
"source": [
"#### Assisted Task Next Action"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1ebcb33-0929-4e88-ac3a-85d29fb193a6",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"prompt = \"\"\"This is the overall task that the agent is trying to complete: \"The student exchanges the black ink cartridge of the printer.\"\n",
" In the video, the agent is trying to follow the instruction (a single step out of many to complete the overall task): \"place old ink_cartridge.\"\n",
" What should be the next action of the agent?\n",
" Answer the question using the following format:\n",
" \n",
" Your reasoning.\n",
" \n",
" Write your final answer immediately after the tag.\"\"\"\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"assisted_task_next_action.mp4\"))\n",
"# file:// URL (used for the model request)\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"# Display the input video (plain path, NOT the file:// URL)\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "5cc324bb-28f5-43d3-82d8-7d6fb8223af1",
"metadata": {},
"source": [
"### Common Sense Reasoning"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "941b992b-e04b-4cc5-8477-d552ffe77f10",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"\n",
"prompt = \"\"\"Can the countertop support the weight of the juicers?\n",
" Answer the question using the following format:\n",
"\n",
" \n",
" Your reasoning.\n",
" \n",
"\n",
" Write your final answer immediately after the tag.\"\"\"\n",
"\n",
"# Plain filesystem path (used for display)\n",
"video_path = str(asset_path(\"common_sense_reasoning.mp4\"))\n",
"# file:// URL (used for the model request)\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=client.models.list().data[0].id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" },\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True}},\n",
")\n",
"\n",
"# Display the input video (plain path, NOT the file:// URL)\n",
"display(Video(video_path, embed=True, width=640))\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "d7b596e7-6bce-4ce9-8e59-eff036e12c94",
"metadata": {},
"source": [
"### 2D Grounding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70441035-22da-4d38-9ff8-5e5e76be32d4",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"import openai\n",
"from pathlib import Path\n",
"from PIL import Image as PILImage, ImageDraw\n",
"from IPython.display import display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = str(asset_path(\"grounding_2d.png\"))\n",
"image_url = Path(image_path).resolve().as_uri() # file:// URL for the model\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": \"Locate the accurate bounding box of the load as a whole. Return a json.\"},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" seed=0,\n",
")\n",
"out = response.choices[0].message.content\n",
"print(out)\n",
"\n",
"\n",
"def parse_boxes(text):\n",
" \"\"\"Pull a JSON array/object of boxes out of the model text (handles ``` fences).\"\"\"\n",
" text = text.strip()\n",
" text = re.sub(r\"^```(?:json)?|```$\", \"\", text, flags=re.MULTILINE).strip()\n",
" m = re.search(r\"\\[.*\\]|\\{.*\\}\", text, re.DOTALL)\n",
" data = json.loads(m.group(0) if m else text)\n",
" return data if isinstance(data, list) else [data]\n",
"\n",
"\n",
"# Draw boxes; coords are normalized to 0-1000\n",
"img = PILImage.open(image_path).convert(\"RGB\")\n",
"W, H = img.size\n",
"draw = ImageDraw.Draw(img)\n",
"\n",
"for obj in parse_boxes(out):\n",
" box = obj.get(\"bbox_2d\") or obj.get(\"bbox\") or obj.get(\"box\")\n",
" if not box:\n",
" continue\n",
" x1, y1, x2, y2 = box\n",
" x1, x2 = x1 / 1000 * W, x2 / 1000 * W\n",
" y1, y2 = y1 / 1000 * H, y2 / 1000 * H\n",
" draw.rectangle([x1, y1, x2, y2], outline=\"red\", width=3)\n",
" label = obj.get(\"label\") or obj.get(\"name\")\n",
" if label:\n",
" draw.text((x1, max(0, y1 - 12)), str(label), fill=\"red\")\n",
"\n",
"# Display scaled down so a large image fits the cell\n",
"preview = img.copy()\n",
"preview.thumbnail((768, 768))\n",
"display(preview)"
]
},
{
"cell_type": "markdown",
"id": "02ec583a-b4d5-45fa-90b5-8651ffe1c543",
"metadata": {},
"source": [
"### Describe Anything"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1560bbcf-4067-4252-92bf-0b364e76e254",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"import openai\n",
"from pathlib import Path\n",
"from PIL import Image as PILImage, ImageDraw\n",
"from IPython.display import display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = str(asset_path(\"describe_anything.png\"))\n",
"image_url = Path(image_path).resolve().as_uri() # file:// URL for the model\n",
"\n",
"# Display the input image (scaled down to fit the cell)\n",
"preview = PILImage.open(image_path).convert(\"RGB\")\n",
"preview.thumbnail((768, 768))\n",
"display(preview)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": 'Please caption the notable attributes in the provided image. List and describe all marked subjects in the image with their categories and detailed captions using a json with keyword \"subject_id\", \"category\" and \"caption\".'},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" seed=0,\n",
")\n",
"print(response.choices[0].message.content)\n"
]
},
{
"cell_type": "markdown",
"id": "ae8c996b-ae6b-4ac9-acc4-5e9c09354c5f",
"metadata": {},
"source": [
"### Action CoT"
]
},
{
"cell_type": "markdown",
"id": "5e497778-d4f6-4db9-a1f9-ecb6363a7811",
"metadata": {},
"source": [
"#### Trajectory Coordinates"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8ef9a53-8dc2-47d0-a370-1403c51eeae5",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"import openai\n",
"from pathlib import Path\n",
"from PIL import Image as PILImage, ImageDraw\n",
"from IPython.display import display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = str(asset_path(\"action_cot_trajectory.png\"))\n",
"image_url = Path(image_path).resolve().as_uri()\n",
"\n",
"prompt = \"\"\"You are given the task \"Move the pink bowl to the right\". Specify the 2D trajectory your end effector should follow in pixel space. Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}.\n",
"Answer the question using the following format:\n",
"\n",
"\n",
"Your reasoning.\n",
"\n",
"\n",
"Write your final answer immediately after the tag.\n",
"\"\"\"\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" presence_penalty=0.0,\n",
" extra_body={\"top_k\": 20, \"repetition_penalty\": 1.0},\n",
")\n",
"out = response.choices[0].message.content\n",
"print(out)\n",
"\n",
"\n",
"def parse_points(text):\n",
" \"\"\"Grab the JSON list of {point_2d, label} after the tag.\"\"\"\n",
" if \"\" in text:\n",
" text = text.split(\"\")[-1]\n",
" text = re.sub(r\"```(?:json)?\", \"\", text).strip().strip(\"`\").strip()\n",
" m = re.search(r\"\\[.*\\]\", text, re.DOTALL)\n",
" data = json.loads(m.group(0) if m else text)\n",
" return data if isinstance(data, list) else [data]\n",
"\n",
"\n",
"# Visualize the trajectory (points are in pixel space)\n",
"img = PILImage.open(image_path).convert(\"RGB\")\n",
"draw = ImageDraw.Draw(img)\n",
"W, H = img.size\n",
"\n",
"# coords are normalized to 0-1000 (per-axis) -> scale to pixels\n",
"pts = [(o[\"point_2d\"][0] / 1000 * W, o[\"point_2d\"][1] / 1000 * H)\n",
" for o in parse_points(out) if isinstance(o, dict) and \"point_2d\" in o]\n",
"if len(pts) > 1:\n",
" draw.line(pts, fill=\"lime\", width=5)\n",
"for i, (x, y) in enumerate(pts):\n",
" r = 12\n",
" draw.ellipse([x - r, y - r, x + r, y + r], fill=\"red\", outline=\"white\", width=3)\n",
" draw.text((x + 14, y - 14), str(i), fill=\"yellow\")\n",
"preview = img.copy()\n",
"preview.thumbnail((900, 900))\n",
"display(preview)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3c3e0a0-288b-4080-b49a-b41e8a4ba14c",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"import re\n",
"import openai\n",
"from pathlib import Path\n",
"from PIL import Image as PILImage, ImageDraw\n",
"from IPython.display import display\n",
"\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"\n",
"image_path = str(asset_path(\"robot_planning.png\"))\n",
"image_url = Path(image_path).resolve().as_uri()\n",
"\n",
"prompt = \"\"\"You are given the task \"Put flower into the red bottle\". Specify the 2D trajectory your end effector should follow in pixel space. Return the trajectory coordinates in JSON format like this: {\"point_2d\": [x, y], \"label\": \"gripper trajectory\"}. \n",
"Answer the question using the following format:\n",
"\n",
" Your reasoning. \n",
"Write your final answer immediately after the tag.\n",
"\"\"\"\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"image_url\", \"image_url\": {\"url\": image_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" presence_penalty=0.0,\n",
" extra_body={\"top_k\": 20, \"repetition_penalty\": 1.0},\n",
")\n",
"out = response.choices[0].message.content\n",
"print(out)\n",
"\n",
"\n",
"def parse_points(text):\n",
" \"\"\"Grab the JSON list of {point_2d, label} after the tag.\"\"\"\n",
" if \"\" in text:\n",
" text = text.split(\"\")[-1]\n",
" text = re.sub(r\"```(?:json)?\", \"\", text).strip().strip(\"`\").strip()\n",
" m = re.search(r\"\\[.*\\]\", text, re.DOTALL)\n",
" data = json.loads(m.group(0) if m else text)\n",
" return data if isinstance(data, list) else [data]\n",
"\n",
"\n",
"# Visualize the trajectory (points are in pixel space)\n",
"img = PILImage.open(image_path).convert(\"RGB\")\n",
"draw = ImageDraw.Draw(img)\n",
"W, H = img.size\n",
"\n",
"# coords are normalized to 0-1000 (per-axis) -> scale to pixels\n",
"pts = [(o[\"point_2d\"][0] / 1000 * W, o[\"point_2d\"][1] / 1000 * H)\n",
" for o in parse_points(out) if isinstance(o, dict) and \"point_2d\" in o]\n",
"if len(pts) > 1:\n",
" draw.line(pts, fill=\"lime\", width=5)\n",
"for i, (x, y) in enumerate(pts):\n",
" r = 12\n",
" draw.ellipse([x - r, y - r, x + r, y + r], fill=\"red\", outline=\"white\", width=3)\n",
" draw.text((x + 14, y - 14), str(i), fill=\"yellow\")\n",
"preview = img.copy()\n",
"preview.thumbnail((900, 900))\n",
"display(preview)"
]
},
{
"cell_type": "markdown",
"id": "3a1ef608-59e9-4e33-9d6b-25e95fb9c3b8",
"metadata": {},
"source": [
"#### Driving Scene"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b5f1bd5-4dd1-443b-89fc-a92d46bdf82a",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"video_path = str(asset_path(\"action_cot_driving_scene.mp4\"))\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"prompt = \"\"\"The video depicts the observation from the vehicle's camera. You need to think step by step and identify the objects in the scene that are critical for safe navigation.\n",
"Answer the question using the following format:\n",
"\n",
"Your reasoning.\n",
"\n",
"Write your final answer immediately after the tag.\"\"\"\n",
"# Show the input video\n",
"display(Video(video_path, embed=True, width=640))\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" presence_penalty=0.0,\n",
" extra_body={\n",
" \"top_k\": 20,\n",
" \"repetition_penalty\": 1.0,\n",
" \"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True},\n",
" },\n",
")\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "3ba3fc77-3bcd-4509-a796-356ad20136ad",
"metadata": {},
"source": [
"### Physical Plausibility Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "192e522d-81d0-4ff0-a025-aba9636deeac",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"video_path = str(asset_path(\"physical_plausibility.mp4\"))\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"prompt = \"\"\"Is this video physically plausible/possible according to your understanding of e.g. object permanence, shape constancy (objects maintain shape over time), continuous trajectories of objects? Assume it is the normal laws of physics.\n",
"Your answer should be based on the events in the video and ignore the quality of the simulation engine. The rising wall is part of the experiment setup and should not be judged for plausibility.\n",
"(A) Possible\n",
"(B) Impossible\"\"\"\n",
"# Show the input video\n",
"display(Video(video_path, embed=True, width=640))\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\n",
" \"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True},\n",
" },\n",
")\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "80ff6302-8f3d-430a-b514-579aff17eb08",
"metadata": {},
"source": [
"### Situation Understanding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42d01738-02f5-45f6-aba2-f12eb53f722d",
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"from pathlib import Path\n",
"from IPython.display import Video, display\n",
"client = openai.OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8001/v1\")\n",
"MODEL = client.models.list().data[0].id\n",
"video_path = str(asset_path(\"situation_understanding.mp4\"))\n",
"video_url = Path(video_path).resolve().as_uri()\n",
"prompt = \"What is the person doing with the skillet? What will the person likely do next in this situation?\"\n",
"# Show the input video\n",
"display(Video(video_path, embed=True, width=640))\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"video_url\", \"video_url\": {\"url\": video_url}},\n",
" {\"type\": \"text\", \"text\": prompt},\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=4096,\n",
" extra_body={\n",
" \"mm_processor_kwargs\": {\"fps\": 4, \"do_sample_frames\": True},\n",
" },\n",
")\n",
"print(response.choices[0].message.content)"
]
  }
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}