{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "KeU5xC3JcvhG" }, "source": [ "##### Copyright 2026 Google LLC.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "yCm6RNmGcgzw" }, "outputs": [], "source": [ "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "Bgrop9U9VI08" }, "source": [ "# Pointing and 3D Spatial Understanding with Gemini (Experimental)" ] }, { "cell_type": "markdown", "metadata": { "id": "CeJyB7rG82ph" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "ATkdRllgTFdo" }, "source": [ "This colab highlights some of the exciting use cases for Gemini in spatial understanding. It focuses on [Gemini](https://ai.google.dev/gemini-api/docs/models/gemini-v2)'s image and real-world understanding capabilities, including pointing and 3D spatial understanding, as briefly teased in the [Building with Gemini 2.0: Spatial understanding](https://www.youtube.com/watch?v=-XmoDzDMqj4) video." ] }, { "cell_type": "markdown", "metadata": { "id": "ToH2aAmNTIxv" }, "source": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " 🚧\n", " \n", "

Points and 3D bounding boxes are experimental. Use 2D bounding boxes for higher accuracy.

\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dRicv4wSgGup" }, "source": [ "Pointing is an important capability for vision language models because it allows the model to refer to an entity precisely. Gemini Flash has improved accuracy on spatial understanding, with 2D point prediction as an experimental feature. Below you'll see that pointing can be combined with reasoning.\n", "\n", "\n", "\n", "Traditionally, a Vision Language Model (VLM) sees the world in 2D. However, [Gemini 2.0 Flash](https://ai.google.dev/gemini-api/docs/models/gemini-v2) can perform 3D detection. The model has a general sense of the space and knows where the objects are in 3D space.\n", "\n", "\n", "\n", "The model responds to spatial understanding requests in JSON format to facilitate parsing, and the coordinates always follow the same conventions. To make this example more readable, the notebook overlays the spatial signals on the image, and you can hover your cursor over the image to see the complete response. The coordinates are in the image frame and are normalized to integers between 0 and 1000. The top left is `(0,0)` and the bottom right is `(1000,1000)`. 
The point is in `[y, x]` order, and 2D bounding boxes are in `y_min, x_min, y_max, x_max` order.\n", "\n", "Additionally, 3D bounding boxes are represented with 9 numbers: the first 3 numbers represent the center of the object in the camera frame, in metric units; the next 3 numbers represent the size of the object in meters; and the last 3 numbers are Euler angles representing roll, pitch, and yaw, in degrees.\n", "\n", "To learn more about 2D spatial understanding, please take a look at the [2D examples](../quickstarts/Spatial_understanding.ipynb) and the [Spatial understanding example](https://aistudio.google.com/starter-apps/spatial) from [Google AI Studio](https://aistudio.google.com).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "Mfk6YY3G5kqp" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "p_BHH2THWYvf" }, "source": [ "### Install SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BAebsDJoWYvf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/105.8 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m105.8/105.8 kB\u001b[0m \u001b[31m4.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/168.2 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.2/168.2 kB\u001b[0m \u001b[31m11.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h" ] } ], "source": [ "%pip install -U -q google-genai" ] }, { "cell_type": "markdown", "metadata": { "id": "krRMl2V1WYvf" }, "source": [ "### Set up your API key\n", "\n", "To run the following cell, your API key must be stored in a Colab 
Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication ![image](https://storage.googleapis.com/generativeai-downloads/images/colab_icon16.png)](../quickstarts/Authentication.ipynb) for an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QSuq3-cGWYvf" }, "outputs": [], "source": [ "from google.colab import userdata\n", "\n", "GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')" ] }, { "cell_type": "markdown", "metadata": { "id": "3Hx_Gw9i0Yuv" }, "source": [ "### Initialize SDK client\n", "\n", "With the new SDK, you only need to initialize a client with your API key (or OAuth if using [Vertex AI](https://cloud.google.com/vertex-ai)). The model is now set in each call." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HghvVpbU0Uap" }, "outputs": [], "source": [ "from google import genai\n", "from google.genai import types\n", "\n", "from PIL import Image\n", "\n", "client = genai.Client(api_key=GOOGLE_API_KEY)" ] }, { "cell_type": "markdown", "metadata": { "id": "JFZRN4A6WYvg" }, "source": [ "### Select a model\n", "\n", "3D spatial understanding and pointing are two new capabilities introduced in the Gemini 2.0 Flash model. 
Later-generation models also support these capabilities.\n", "\n", "For more information about each Gemini model, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZVVChtaXWYvh" }, "outputs": [], "source": [ "MODEL_ID = \"gemini-3-flash-preview\" # @param [\"gemini-2.5-flash-lite\", \"gemini-2.5-flash\", \"gemini-2.5-pro\", \"gemini-2.5-flash-preview\", \"gemini-3.1-flash-lite-preview\", \"gemini-3.1-pro-preview\"] {\"allow-input\":true, isTemplate: true}" ] }, { "cell_type": "markdown", "metadata": { "id": "GoJWSi5NWuza" }, "source": [ "### Load sample images" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DictDi0UKwZn" }, "outputs": [], "source": [ "# Load sample images\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/kitchen.jpg -O kitchen.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/room-clock.jpg -O room.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/spill.jpg -O spill.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/tool.png -O tool.png -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/music_0.jpg -O music_0.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/music_1.jpg -O music_1.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/traj_00.jpg -O traj_00.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/traj_01.jpg -O traj_01.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/shoe_bench_0.jpg -O shoe_bench_0.jpg -q\n", "!wget https://storage.googleapis.com/generativeai-downloads/images/shoe_bench_1.jpg -O shoe_bench_1.jpg -q" ] }, { "cell_type": "markdown", "metadata": { "id": "12Nxm-AZ_ZfD" }, "source": [ "## Pointing to 
items using Gemini\n", "\n", "Instead of asking for [bounding boxes](../quickstarts/Spatial_understanding.ipynb), you can ask Gemini to point at things in the image. Depending on your use case, this might be sufficient and will clutter the image less.\n", "\n", "Note that the format Gemini knows best is (y, x), so it's better to stick to it.\n", "\n", "To prevent the model from repeating itself, it is recommended to use a temperature over 0, in this case 0.5. Limiting the number of items (10 in this case) is also a way to prevent the model from looping and to speed up the decoding of the coordinates. You can experiment with these parameters and find what works best for your use case." ] }, { "cell_type": "markdown", "metadata": { "id": "-ZbXwOaHKqpF" }, "source": [ "### Analyze the image using Gemini" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TyeZKUWkJVhc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "```json\n", "[\n", " {\"point\": [130, 760], \"label\": \"handle\"},\n", " {\"point\": [427, 517], \"label\": \"screw\"},\n", " {\"point\": [472, 201], \"label\": \"clamp arm\"},\n", " {\"point\": [466, 345], \"label\": \"clamp arm\"},\n", " {\"point\": [685, 312], \"label\": \"3 inch\"},\n", " {\"point\": [493, 659], \"label\": \"screw\"},\n", " {\"point\": [402, 474], \"label\": \"screw\"},\n", " {\"point\": [437, 664], \"label\": \"screw\"},\n", " {\"point\": [427, 784], \"label\": \"handle\"},\n", " {\"point\": [452, 852], \"label\": \"handle\"}\n", "]\n", "```\n" ] } ], "source": [ "# Load and resize image\n", "img = Image.open(\"tool.png\")\n", "img = img.resize((800, int(800 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)  # Resize to speed up rendering\n", "\n", "# Analyze the image using Gemini\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        \"\"\"\n", "        Point to no more than 10 items in the image, include spill.\n", "        The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000.\n", "        \"\"\"\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.5\n", "    )\n", ")\n", "\n", "# Check response\n", "print(image_response.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "nmypO7qkQn4F" }, "outputs": [], "source": [ "# @title Point visualization code\n", "\n", "import IPython\n", "\n", "def parse_json(json_output):\n", "    # Parsing out the markdown fencing\n", "    lines = json_output.splitlines()\n", "    for i, line in enumerate(lines):\n", "        if line == \"```json\":\n", "            json_output = \"\\n\".join(lines[i+1:])  # Remove everything before \"```json\"\n", "            json_output = json_output.split(\"```\")[0]  # Remove everything after the closing \"```\"\n", "            break  # Exit the loop once \"```json\" is found\n", "    return json_output\n", "\n", "def generate_point_html(pil_image, points_json):\n", "    # Convert PIL image to base64 string\n", "    import base64\n", "    from io import BytesIO\n", "    buffered = BytesIO()\n", "    pil_image.save(buffered, format=\"PNG\")\n", "    img_str = base64.b64encode(buffered.getvalue()).decode()\n", "    points_json = parse_json(points_json)\n", "\n", "    return f\"\"\"\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "afLsdH-k8exd" }, "source": [ "The script creates an HTML rendering of the image and the points. It is similar to the one used in the [Spatial understanding example](https://aistudio.google.com/starter-apps/spatial) from [Google AI Studio](https://aistudio.google.com).\n", "\n", "Of course, this is just an example, and you are free to write your own." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NqzxptOU04L7" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "bEjACc2RLEJk" }, "source": [ "### Pointing and reasoning\n", "\n", "You can use Gemini's reasoning capabilities on top of its pointing ones, as in the [2D bounding box](../quickstarts/Spatial_understanding.ipynb#scrollTo=GZbhjYkUA86w) example, and ask for more detailed labels.\n", "\n", "In this case, you can do so by adding this sentence to the prompt: \"Explain how to use each part, put them in the label field, remove duplicated parts and instructions\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OmFvmcQEXvqS" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load and resize image\n", "img = Image.open(\"tool.png\")\n", "img = img.resize((800, int(800 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)\n", "\n", "# Analyze the image using Gemini\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        \"\"\"\n", "        Pinpoint no more than 10 items in the image.\n", "        The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000. One element a line.\n", "        Explain how to use each part, put them in the label field, remove duplicated parts and instructions.\n", "        \"\"\"\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.5\n", "    )\n", ")\n", "\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "Yr3EwywZLnff" }, "source": [ "### More pointing and reasoning examples\n", "\n", "Expand this section to see more examples of images and prompts you can use. Experiment with them and find what works best for your use case." ] }, { "cell_type": "markdown", "metadata": { "id": "FX4ETBlGXRHU" }, "source": [ "#### Kitchen safety\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4uYEcgozLrXv" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load and resize image\n", "img = Image.open(\"kitchen.jpg\")\n", "img = img.resize((800, int(800 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)\n", "\n", "# Analyze the image using Gemini\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        \"\"\"\n", "        Point to no more than 10 items in the image.\n", "        The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000. One element a line.\n", "        Explain how to prevent kids from getting hurt, put them in the label field, remove duplicated parts and instructions.\n", "        \"\"\"\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.5\n", "    )\n", ")\n", "\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "3mP9aLEkXaeW" }, "source": [ "#### Office improvements" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MbuMlasKuv4-" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load and resize image\n", "img = Image.open(\"room.jpg\")\n", "img = img.resize((800, int(800 * img.size[1] / img.size[0])), Image.Resampling.LANCZOS)\n", "\n", "# Analyze the image using Gemini\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        \"\"\"\n", "        Point to no more than 10 items in the image.\n", "        The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000. One element a line.\n", "        Give advice on how to make this space more feng-shui, put it in the label field, remove duplicated parts and instructions.\n", "        \"\"\"\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.5\n", "    )\n", ")\n", "\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "7e758c4cdec6" }, "source": [ "#### Trajectories - Example 1\n", "\n", "Here are two examples of asking Gemini to predict a list of points that represent trajectories.\n", "This first example shows how to interpolate trajectories between a start and end point.\n", "\n", "The image used here is from [Ego4D](https://ego4d-data.org/), with the license available [here](https://ego4d-data.org/pdfs/Ego4D-Licenses-Draft.pdf)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "066c2ba195e0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "```json\n", "[\n", " {\"point\": [685, 671], \"label\": \"blue brush\"},\n", " {\"point\": [489, 338], \"label\": \"particles\"},\n", " {\"point\": [591, 250], \"label\": \"particles\"},\n", " {\"point\": [612, 443], \"label\": \"particles\"},\n", " {\"point\": [697, 186], \"label\": \"particles\"},\n", " {\"point\": [554, 518], \"label\": \"particles\"},\n", " {\"point\": [529, 175], \"label\": \"particles\"},\n", " {\"point\": [720, 393], \"label\": \"particles\"}\n", "]\n", "```\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "img = Image.open(\"traj_00.jpg\")\n", "img.thumbnail((800, 800))\n", "\n", "prompt = \"\"\"\n", "Point to the left hand and the handle of the blue screwdriver, and a trajectory of 6 points connecting them with no more than 10 items.\n", "The points should be labeled by order of the trajectory, from '0' (start point) to <n> (final point)\n", "The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...].\n", "The points are in [y, x] format normalized to 0-1000.\n", "\"\"\"\n", "\n", "# Analyze the image using Gemini.\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        # Text prompt\n", "        prompt\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.1\n", "    )\n", ")\n", "\n", "# Print the coordinates.\n", "print(image_response.text)\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "422b56f711a8" }, "source": [ "#### Trajectories - Example 2\n", "\n", "This second example shows how Gemini can predict a list of points that covers an area.\n", "\n", "The image used here is from [BridgeData v2](https://rail-berkeley.github.io/bridgedata/) with the license [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6c384919d527" }, "outputs": [], "source": [ "img = Image.open(\"traj_01.jpg\")\n", "img.thumbnail((800, 800))\n", "\n", "prompt = \"\"\"\n", "Point to the blue brush and a list of points covering the region of particles with no more than 10 items.\n", "The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...].\n", "The points are in [y, x] format normalized to 0-1000.\n", "\"\"\"\n", "\n", "# Analyze the image using Gemini.\n", "image_response = client.models.generate_content(\n", "    model=MODEL_ID,\n", "    contents=[\n", "        img,\n", "        # Text prompt\n", "        prompt\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.1\n", "    )\n", ")\n", "\n", "# Print the coordinates.\n", "print(image_response.text)\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img, image_response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "VRCN7NmQ4q8s" }, "source": [ "## Analyzing 3D scenes with Gemini 2.0 (Experimental)" ] }, { "cell_type": "markdown", "metadata": { "id": "dNLh9ff_Whrm" }, "source": [ "#### Multiview Correspondence\n", "Gemini can reason about different views of the same 3D scene.\n", "\n", "In these examples, you first ask Gemini to label some points of interest in a view of a 3D scene. Next, you provide these coordinates and scene view, along with a new view of the same scene, and ask Gemini to point at the same points in the new view.\n", "\n", "In these examples, you label the points as letters ('a','b','c' etc.) rather than semantic labels (e.g. 'guitar', 'drum'). 
This forces the model to use the coordinates and the image rather than relying on the labels alone.\n", "\n", "Note that multiview correspondence is an experimental feature and is expected to improve in future versions.\n", "This capability works best with the model ID `gemini-2.5-pro`.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "d9475d618e59" }, "source": [ "##### Musical Instruments step #1: Pointing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yKsGFn1SWhrn" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "```json\n", "[\n", " {\"point\": [641, 346], \"label\": \"a\"},\n", " {\"point\": [756, 353], \"label\": \"b\"},\n", " {\"point\": [535, 407], \"label\": \"c\"},\n", " {\"point\": [577, 751], \"label\": \"d\"}\n", "]\n", "```\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PRO_MODEL_ID = 'gemini-2.5-pro'\n", "\n", "# Load and resize the image.\n", "img_0 = Image.open(\"music_0.jpg\")\n", "img_0 = img_0.resize((800, int(800 * img_0.size[1] / img_0.size[0])), Image.Resampling.LANCZOS)\n", "\n", "# Analyze the image using Gemini.\n", "# Ask Gemini to point to the musical instruments.\n", "image_response_0 = client.models.generate_content(\n", "    model=PRO_MODEL_ID,\n", "    contents=[\n", "        # The first view of the scene.\n", "        img_0,\n", "        \"\"\"\n", "        Point to the following points in the image:\n", "        a. Dumbak top\n", "        b. Dumbak neck\n", "        c. Cajon\n", "        d. Guitar\n", "        The answer should follow the json format: [{\"point\": <point>, \"label\": <label1>}, ...]. The points are in [y, x] format normalized to 0-1000.\n", "        The point labels should be 'a', 'b', 'c' etc. based on the provided list.\n", "        \"\"\"\n", "    ],\n", "    config=types.GenerateContentConfig(\n", "        temperature=0.1\n", "    )\n", ")\n", "\n", "# Print the coordinates\n", "print(image_response_0.text)\n", "\n", "# Display the dots on the image\n", "IPython.display.HTML(generate_point_html(img_0, image_response_0.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "7af37a296ce4" }, "source": [ "##### Musical Instruments step #2: Multiview\n", "\n", "Now take a picture from another angle and check if the model can find the corresponding points in the novel view." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nBJgJTWmWhrn" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "```json\n", "[\n", " {'in_frame': true, 'point': [404, 883], 'label': \"a\"},\n", " {'in_frame': true, 'point': [481, 813], 'label': \"b\"},\n", " {'in_frame': false, 'label': \"c\"},\n", " {'in_frame': true, 'point': [618, 285], 'label': \"d\"}\n", "]\n", "```\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", " Point Visualization\n", " \n", "\n", "\n", "
\n", " \n", "
\n", "
\n", "\n", " \n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load and resize image of a different view of the same scene.\n", "img_1 = Image.open(\"music_1.jpg\")\n", "img_1 = img_1.resize((800, int(800 * img_1.size[1] / img_1.size[0])), Image.Resampling.LANCZOS)\n", "\n", "image_response_1 = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", "\n", " # The first view of the scene\n", " img_0,\n", "\n", " # The new prompt\n", " \"\"\"For the following images, predict if the points referenced in the first image are in frame.\n", " If they are, also predict their 2D coordinates.\n", " Each entry in the response should be a single line and have the following keys:\n", " If the point is out of frame: 'in_frame': false, 'label' :