
Pretrained models

Before using the models, please request access to the checkpoints here. Once your request is approved, you can download the checkpoints. Please note that access requests are reviewed by an automated process based on the information provided in the request.

| Model | Resolution | Text alignment | Download |
| --- | --- | --- | --- |
| VGGT-Omega-1B-512 | 512 | No | Link |
| VGGT-Omega-1B-256-Text-Alignment | 256 | Yes | Link |

The authors are not involved in the review process and cannot approve or reject individual applications. However, the 🤗 Hugging Face demo is available to everyone.

Quick Start

First, clone this repository and install the dependencies:

git clone git@github.com:facebookresearch/vggt-omega.git
cd vggt-omega
pip install -r requirements.txt
pip install -e .

Now, try the model with a few lines of code:

import torch

from vggt_omega.models import VGGTOmega
from vggt_omega.utils.load_fn import load_and_preprocess_images
from vggt_omega.utils.pose_enc import encoding_to_camera

checkpoint_path = "path/to/vggt_omega_1b_512.pt"
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]

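# Build the model on the GPU in eval mode, then load the checkpoint weights.
model = VGGTOmega().to("cuda").eval()
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

# Load and preprocess the input images, then move them to the GPU.
images = load_and_preprocess_images(image_names, image_resolution=512).to("cuda")

# Run inference without gradient tracking.
with torch.inference_mode():
    predictions = model(images)

# Decode the pose encoding into camera extrinsics and intrinsics,
# using the last two dimensions of the preprocessed images as the image size.
extrinsics, intrinsics = encoding_to_camera(
    predictions["pose_enc"],
    predictions["images"].shape[-2:],
)

# Depth prediction and its confidence.
depth = predictions["depth"]
depth_conf = predictions["depth_conf"]

# The first token along dim 2 is the camera token; the remaining ones are registers.
camera_and_register_tokens = predictions["camera_and_register_tokens"]
camera_tokens = camera_and_register_tokens[:, :, :1]
registers = camera_and_register_tokens[:, :, 1:]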

For the text-aligned checkpoint, use VGGTOmega(enable_alignment=True) with image_resolution=256 and read predictions["text_alignment_embedding"].
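
As a rough sketch of the text-aligned setup described above, the snippet below wraps the quick-start steps in a single function. The helper name `text_alignment_embedding` and its arguments are hypothetical and not part of the repository; only `VGGTOmega(enable_alignment=True)`, `image_resolution=256`, and the `text_alignment_embedding` key come from this README. The checkpoint path is left as a parameter because the file name of the text-aligned checkpoint is not stated here.

```python
def text_alignment_embedding(checkpoint_path, image_names, device="cuda"):
    """Hypothetical helper: run the text-aligned checkpoint and return its embedding."""
    # Imports are deferred so that defining this helper does not require
    # torch or vggt_omega to be installed.
    import torch
    from vggt_omega.models import VGGTOmega
    from vggt_omega.utils.load_fn import load_and_preprocess_images

    # Text-aligned variant of the model, as described in the README.
    model = VGGTOmega(enable_alignment=True).to(device).eval()
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    # The text-aligned checkpoint is used at resolution 256.
    images = load_and_preprocess_images(image_names, image_resolution=256).to(device)

    with torch.inference_mode():
        predictions = model(images)

    return predictions["text_alignment_embedding"]
```

The shape and downstream use of the returned embedding are not documented in this README, so inspect the tensor before relying on a particular layout.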

Interactive Demo

Install the demo dependencies:

pip install -r requirements_demo.txt

Launch the Gradio demo with a local checkpoint path:

python demo_gradio.py \
  --checkpoint checkpoints/VGGT-Omega-1B-512/model.pt \
  --image-resolution 512

The demo accepts uploaded images or a video, runs camera and depth inference, and visualizes the depth-unprojected point cloud and predicted cameras as a GLB scene.
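
The depth-unprojected point cloud mentioned above can be illustrated with a standard pinhole back-projection. This is a minimal NumPy sketch, not the demo's actual code: the function name `unproject_depth` is hypothetical, and it assumes that the depth map stores z-depth along the optical axis and that the intrinsics form a 3x3 pinhole matrix. The points come out in the camera frame; moving them into a shared world frame needs the extrinsics, whose convention this README does not state, so that step is left out.

```python
import numpy as np


def unproject_depth(depth, K):
    """Back-project an (H, W) z-depth map into (H, W, 3) camera-frame points.

    Assumes a pinhole camera with intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction); both are (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


# Toy example: a flat surface 2 units in front of the camera.
depth = np.full((2, 3), 2.0)
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
points = unproject_depth(depth, K)
```

In practice one would likely also drop low-confidence pixels using `depth_conf` before visualizing; a suitable threshold is not given in this README.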

Runtime and GPU Memory

We benchmark the end-to-end peak GPU memory usage of VGGT-Omega-1B-512 on a single NVIDIA A100 GPU with 624x416 input images. The measurement covers the full inference program, from loading the model weights onto the GPU through the forward pass, so it includes the model weights as well as the inference activations and buffers. In other words, a GPU with at least the listed amount of free memory should be able to process the corresponding number of input frames under this setup.

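| Input Frames | 1 | 10 | 25 | 50 | 100 | 200 | 300 | 400 | 500 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Peak Memory (GB) | 6.02 | 6.67 | 7.80 | 9.66 | 13.37 | 20.82 | 28.26 | 35.71 | 43.15 |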
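
The table above grows almost linearly, at roughly 0.074 GB per additional frame on top of about 6 GB for a single frame. As a back-of-the-envelope aid, the sketch below linearly interpolates between the measured points to estimate the peak memory for other frame counts. The function names are made up for illustration, and the estimates only apply to the benchmark setup (A100, 624x416 inputs); other GPUs, resolutions, or software versions may behave differently.

```python
# Measured points copied from the benchmark table (input frames -> peak GB).
FRAMES = [1, 10, 25, 50, 100, 200, 300, 400, 500]
PEAK_GB = [6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 35.71, 43.15]


def estimate_peak_gb(num_frames):
    """Linearly interpolate peak memory between neighbouring table entries."""
    if not FRAMES[0] <= num_frames <= FRAMES[-1]:
        raise ValueError("frame count is outside the benchmarked range")
    for i in range(len(FRAMES) - 1):
        f0, f1 = FRAMES[i], FRAMES[i + 1]
        if f0 <= num_frames <= f1:
            m0, m1 = PEAK_GB[i], PEAK_GB[i + 1]
            return m0 + (m1 - m0) * (num_frames - f0) / (f1 - f0)


def max_frames_for(budget_gb):
    """Largest frame count in the benchmarked range whose estimate fits the budget.

    Returns 0 if even a single frame does not fit.
    """
    best = 0
    for n in range(FRAMES[0], FRAMES[-1] + 1):
        if estimate_peak_gb(n) <= budget_gb:
            best = n
    return best
```

For example, `max_frames_for(24)` gives an estimate for a GPU with 24 GB of free memory; leave some headroom in practice, since the interpolation does not account for memory fragmentation or other processes on the GPU.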

The benchmark uses load_and_preprocess_images with the default mode="balanced" and image_resolution=512. For these roughly 3:2 landscape images, this produces 624x416 inputs. You can set mode="max_size" to resize the longest side to 512 instead; for the same aspect ratio, this gives about 512x336 inputs and uses less GPU memory.
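
The README does not spell out how the two modes compute their output size. One guess that happens to reproduce the numbers above is that `balanced` keeps the pixel area close to `image_resolution` squared while `max_size` pins the longest side, with both sides rounded down to a multiple of 16. The sketch below encodes that guess in two hypothetical helpers; the multiple of 16, the rounding direction, and the 3000x2000 example input are assumptions, not the behavior of `load_and_preprocess_images`.

```python
import math


def round_down(value, multiple):
    """Round a positive value down to the nearest multiple."""
    return int(value // multiple) * multiple


def balanced_size(width, height, image_resolution=512, multiple=16):
    """Assumed 'balanced' rule: preserve aspect ratio, target area = resolution**2."""
    area = image_resolution * image_resolution
    new_w = math.sqrt(area * width / height)
    new_h = math.sqrt(area * height / width)
    return round_down(new_w, multiple), round_down(new_h, multiple)


def max_size(width, height, image_resolution=512, multiple=16):
    """Assumed 'max_size' rule: longest side = resolution, other side scaled."""
    if width >= height:
        new_w, new_h = image_resolution, image_resolution * height / width
    else:
        new_w, new_h = image_resolution * width / height, image_resolution
    return round_down(new_w, multiple), round_down(new_h, multiple)


# Hypothetical 3:2 landscape input.
print(balanced_size(3000, 2000))  # (624, 416)
print(max_size(3000, 2000))       # (512, 336)
```

Under these assumptions, `max_size` yields about two thirds as many pixels as `balanced` for a 3:2 image (512 x 336 versus 624 x 416), which is consistent with the lower GPU memory usage noted above.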

License

See the LICENSE file for details about the license under which this code is made available.

@misc{wang2026vggtomega,
      title={VGGT-$\Omega$}, 
      author={Jianyuan Wang and Minghao Chen and Shangzhan Zhang and Nikita Karaev and Johannes Schönberger and Patrick Labatut and Piotr Bojanowski and David Novotny and Andrea Vedaldi and Christian Rupprecht},
      year={2026},
      eprint={2605.15195},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15195}, 
}

About

[CVPR 2026 Oral] VGGT Omega
