
WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Yisu Zhang1,2*   Chenjie Cao2*   Tengfei Wang2†   Xuhui Zuo2   Junta Wu2   Jianke Zhu1‡   Chunchao Guo2

1Zhejiang University    2Tencent Hunyuan
*Equal Contribution   †Project Lead   ‡Corresponding Author

arXiv HuggingFace CVPR 2026


📅 News


✅ TODO List

  • Release inference code and model weights of WorldStereo 2.0
  • Release data pre-processing pipelines for panoramic and multi-trajectory scenes

📖 Abstract

We propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules:

  • Global-Geometric Memory (GGM) enables precise camera control while injecting coarse structural priors through incrementally updated point clouds via a ControlNet branch.
  • Spatial-Stereo Memory (SSM) constrains the model's attention receptive fields with 3D correspondences to focus on fine-grained details from the memory bank.

Together, these components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. WorldStereo is also efficient: it leverages a distribution-matching distilled (DMD) video diffusion model (VDM) backbone without joint training.


🎬 Results

3D Reconstruction from a Single Image

Given a single reference image, WorldStereo generates multi-view consistent videos and reconstructs a dense 3D point cloud. Below are example results on two scenes.

Scene: Kitchen   |   Input image → Point cloud (5 views)

[Figure: Kitchen scene. Input image followed by five views of the reconstructed point cloud.]

Camera Control

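Camera metrics: RotErr, TransErr, ATE. Visual quality: Q-Align, CLIP-IQA+, Laion-Aes, CLIP-I.

| Methods | RotErr ↓ | TransErr ↓ | ATE ↓ | Q-Align ↑ | CLIP-IQA+ ↑ | Laion-Aes ↑ | CLIP-I ↑ |
|---|---|---|---|---|---|---|---|
| SEVA | 1.690 | 1.578 | 2.879 | 3.232 | 0.479 | 4.623 | 77.16 |
| Gen3C | 0.944 | 1.580 | 2.789 | 3.353 | 0.489 | 4.863 | 82.33 |
| WorldStereo | 0.762 | 1.245 | 2.141 | 4.149 | 0.547 | 5.257 | 89.05 |
| WorldStereo 2.0 | 0.492 | 0.968 | 1.768 | 4.205 | 0.544 | 5.266 | 89.43 |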

Single-View-Generated Reconstruction

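| Methods | Tanks-and-Temples Precision ↑ | Tanks-and-Temples Recall ↑ | Tanks-and-Temples F1-Score ↑ | Tanks-and-Temples AUC ↑ | MipNeRF360 Precision ↑ | MipNeRF360 Recall ↑ | MipNeRF360 F1-Score ↑ | MipNeRF360 AUC ↑ |
|---|---|---|---|---|---|---|---|---|
| SEVA | 33.59 | 35.34 | 36.73 | 51.03 | 22.38 | 55.63 | 28.75 | 46.81 |
| Gen3C | 46.73 | 25.51 | 31.24 | 42.44 | 23.28 | 75.37 | 35.26 | 52.10 |
| Lyra | 50.38 | 28.67 | 32.54 | 43.05 | 30.02 | 58.60 | 36.05 | 49.89 |
| FlashWorld | 26.58 | 20.72 | 22.29 | 30.45 | 35.97 | 53.77 | 42.60 | 53.86 |
| WorldStereo 2.0 | 43.62 | 41.02 | 41.43 | 58.19 | 43.19 | 65.32 | 51.27 | 65.79 |
| WorldStereo 2.0 (DMD) | 40.41 | 44.41 | 43.16 | 60.09 | 42.34 | 64.83 | 50.52 | 65.64 |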
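The README does not define how Precision, Recall, and F1-Score are computed. As a rough illustration only, the sketch below uses the common distance-threshold definition for point clouds: precision is the fraction of predicted points that lie within a threshold of some ground-truth point, and recall is the reverse. The function names, the brute-force nearest-neighbour search, and the threshold value are all assumptions and may differ from the evaluation protocol behind the table.

```python
import math


def nearest_distance(point, cloud):
    """Distance from `point` to its nearest neighbour in `cloud` (brute force)."""
    return min(math.dist(point, other) for other in cloud)


def precision_recall_f1(pred, gt, threshold):
    """Threshold-based point cloud metrics (assumed definition, not from the README).

    precision: share of predicted points closer than `threshold` to the ground truth
    recall:    share of ground-truth points closer than `threshold` to the prediction
    """
    precision = sum(nearest_distance(p, gt) < threshold for p in pred) / len(pred)
    recall = sum(nearest_distance(g, pred) < threshold for g in gt) / len(gt)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy clouds: two predicted points match the ground truth, one is an outlier.
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.05)]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(pred_points, gt_points, threshold=0.1)
```

For real point clouds a spatial index (for example a k-d tree) would replace the quadratic search used here.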

🆕 WorldStereo 2.0 vs. 1.0

WorldStereo 2.0 introduces four key improvements over the original version:

| | WorldStereo 1.0 | WorldStereo 2.0 |
|---|---|---|
| Latent Space | Standard video latent space | Keyframe latent space: encodes each frame independently, substantially improving the visual quality of generated novel views and fully supporting parallel encoding/decoding |
| Memory Mechanism | Cross-attention to retrieved reference frames | Stereo stitching in the main branch: reference views are spatially concatenated with target frames along the width dimension in the main DiT branch, enabling stronger and more direct memory fusion |
| Backbone Fine-tuning | Frozen backbone | Partial backbone fine-tuning: backbone weights are selectively updated to adapt to the keyframe latent space and improve overall generation quality |
| Training Data | Limited camera trajectories | Expanded UE rendering data: significantly more Unreal Engine rendered scenes with diverse and precise camera motions, leading to stronger camera control and memory capabilities |
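To make the stereo stitching idea concrete, here is a minimal sketch that concatenates a reference grid and a target grid along the width axis and later recovers the target part. It works on plain nested lists rather than latent tensors, and the left/right ordering (reference first, target second) as well as both function names are assumptions, not details confirmed by the repository.

```python
def stitch_along_width(reference, target):
    """Concatenate two H x W grids side by side (reference left, target right)."""
    if len(reference) != len(target):
        raise ValueError("reference and target must have the same height")
    return [list(ref_row) + list(tgt_row) for ref_row, tgt_row in zip(reference, target)]


def split_target(stitched, target_width):
    """Recover the target columns from a stitched grid."""
    return [row[len(row) - target_width:] for row in stitched]


reference_view = [[1, 2], [3, 4]]   # H=2, W=2
target_frame = [[5], [6]]           # H=2, W=1
stitched_grid = stitch_along_width(reference_view, target_frame)
recovered_target = split_target(stitched_grid, target_width=1)
```

Because both views share one token grid after stitching, ordinary self-attention in the main branch can mix reference and target tokens without a separate cross-attention path.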

See HY-World 2.0 for more details on WorldStereo 2.0.


⚙️ Installation

1. Clone the repository:

    git clone https://github.com/FuchengSu/WorldStereo.git
    cd WorldStereo

2. Install core dependencies:

    conda create -n worldstereo python=3.11
    conda activate worldstereo
    pip install -r requirements.txt

3. Install PyTorch3D (required for point cloud rendering):

pip install --no-build-isolation "git+https://github.com/facebookresearch/pytorch3d.git@stable"

4. Install MoGe (monocular depth estimation):

pip install git+https://github.com/microsoft/MoGe.git@0286b495230a074aadf1c76cc5c679e943e5d1c6

5. (Optional) Install third-party reconstruction module for WorldMirror reconstruction:

    mkdir third_party
    cd third_party
    git clone https://github.com/Tencent-Hunyuan/HY-World-2.0.git
    pip install -r HY-World-2.0/requirements.txt

Note: third_party/HY-World-2.0 is required only for apply_worldmirror post-processing (multi-view depth consistency and Gaussian Splatting reconstruction). You can skip it for basic video generation.
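After installation, a quick import check can confirm that the packages from steps 2 to 4 are visible to the active environment. The sketch below is only a convenience helper and is not part of the repository; it assumes the import names `torch`, `pytorch3d`, and `moge`, and it checks discoverability without importing the packages.

```python
import importlib.util

# Import names are assumptions based on the packages installed above.
EXPECTED_MODULES = {
    "torch": "core dependency",
    "pytorch3d": "point cloud rendering",
    "moge": "monocular depth estimation",
}


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=EXPECTED_MODULES):
    """Print one line per expected module and return the list of missing ones."""
    missing = missing_modules(names)
    for name, purpose in names.items():
        status = "MISSING" if name in missing else "ok"
        print(f"{name:<10} {status:<8} ({purpose})")
    return missing
```

Calling `report()` inside the `worldstereo` conda environment lists which of the three modules still need to be installed.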


🚀 Quick Start

Model Variants

WorldStereo ships three model variants, each suited to a different use case:

| Model Type | Entry Point | Description |
|---|---|---|
| worldstereo-camera | run_camera_control.py | Camera control only; single-view input |
| worldstereo-memory | run_camera_control.py / run_multi_traj.py | Full model with GGM + SSM; multi-view consistent generation; best quality |
| worldstereo-memory-dmd | run_camera_control.py / run_multi_traj.py | DMD distillation variant; 4-step inference, fastest |

Models are automatically downloaded from HuggingFace Hub (hanshanxue/WorldStereo) on first run.


Single-View Camera Control

Generate a video from a single image along a specified camera trajectory:

    python run_camera_control.py \
        --model_type worldstereo-camera \
        --input_path examples/images \
        --output_path outputs \
        --seed 1024

Multi-GPU Inference (Sequence Parallel)

Scale to multiple GPUs using Sequence Parallelism (SP) and FSDP:

    torchrun --nproc_per_node=8 run_camera_control.py \
        --model_type worldstereo-memory \
        --input_path examples/panorama \
        --output_path outputs \
        --fsdp

Multi-Trajectory Inference (Panorama / Reconstruction)

For panoramic scene generation or 3D reconstruction from multiple trajectories:

    # Panoramic scene generation
    torchrun --nproc_per_node=8 run_multi_traj.py \
        --model_type worldstereo-memory \
        --task_type panorama \
        --input_path examples/panorama \
        --output_path outputs \
        --fsdp

    # Panoramic scene generation (DMD fast variant)
    torchrun --nproc_per_node=8 run_multi_traj.py \
        --model_type worldstereo-memory-dmd \
        --task_type panorama \
        --input_path examples/panorama \
        --output_path outputs \
        --fsdp

    # 3D scene reconstruction
    torchrun --nproc_per_node=8 run_multi_traj.py \
        --model_type worldstereo-memory \
        --task_type reconstruction \
        --input_path examples/reconstruction \
        --output_path outputs \
        --fsdp

    # 3D scene reconstruction (DMD fast variant)
    torchrun --nproc_per_node=8 run_multi_traj.py \
        --model_type worldstereo-memory-dmd \
        --task_type reconstruction \
        --input_path examples/reconstruction \
        --output_path outputs \
        --fsdp
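The four commands above differ only in `--model_type`, `--task_type`, and `--input_path`, so they can be generated from a small helper when scripting batch runs. The sketch below builds the argument list without executing anything; the helper names are hypothetical, and only flags that appear in this README are used.

```python
import shlex

VALID_TASKS = {"panorama", "reconstruction"}


def build_multi_traj_command(model_type, task_type, input_path,
                             output_path="outputs", nproc_per_node=8, fsdp=True):
    """Return the torchrun argument list for run_multi_traj.py (nothing is executed)."""
    if task_type not in VALID_TASKS:
        raise ValueError(f"task_type must be one of {sorted(VALID_TASKS)}")
    command = [
        "torchrun", f"--nproc_per_node={nproc_per_node}", "run_multi_traj.py",
        "--model_type", model_type,
        "--task_type", task_type,
        "--input_path", input_path,
        "--output_path", output_path,
    ]
    if fsdp:
        command.append("--fsdp")
    return command


def command_string(command):
    """Render an argument list as a shell-safe single line."""
    return shlex.join(command)


example_command = build_multi_traj_command(
    "worldstereo-memory-dmd", "panorama", "examples/panorama"
)
```

The resulting list can be passed to `subprocess.run(example_command, check=True)` from the repository root.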

WorldMirror 3D Reconstruction (Optional)

After running run_multi_traj.py, the memory bank is automatically exported to a WorldMirror-compatible format under <output_path>/<scene>/world_mirror_data/<model_type>/. You can then run feedforward 3D reconstruction with HY-World 2.0:

    # Requires: pip install -r third_party/HY-World-2.0/requirements.txt
    cd third_party/HY-World-2.0
    torchrun --nproc_per_node=8 -m hyworld2.worldrecon.pipeline \
        --input_path ../../outputs/<scene>/world_mirror_data/<model_type>/images \
        --prior_cam_path ../../outputs/<scene>/world_mirror_data/<model_type>/cameras.json \
        --strict_output_path ../../outputs/<scene>/world_mirror_data/<model_type>/results \
        --target_size 832 --use_fsdp --enable_bf16 --no_save_normal --no_save_gs --no_sky_mask \
        --apply_edge_mask --apply_confidence_mask --confidence_percentile 15.0 --compress_pts --no_interactive \
        --disable_heads gs points

This produces metric-scale depth, surface normals, camera poses, a dense point cloud (.ply), and optionally Gaussian Splat renderings from the generated multi-view frames.
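The three paths passed to the reconstruction command all hang off the same export directory. As a small sketch, the helper below derives them from the output directory, scene name, and model type; the function name is hypothetical, while the `images`, `cameras.json`, and `results` names are taken from the command above.

```python
from pathlib import PurePosixPath


def world_mirror_paths(output_path, scene, model_type):
    """Build the export paths <output_path>/<scene>/world_mirror_data/<model_type>/..."""
    root = PurePosixPath(output_path) / scene / "world_mirror_data" / model_type
    return {
        "input_path": root / "images",
        "prior_cam_path": root / "cameras.json",
        "strict_output_path": root / "results",
    }


# "kitchen" is a placeholder scene name used only for illustration.
example_paths = world_mirror_paths("outputs", "kitchen", "worldstereo-memory")
```

Since the command is run from `third_party/HY-World-2.0`, each path needs the `../../` prefix shown above (or should be made absolute) before it is passed on.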


Python API

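    import torch
    from models.worldstereo_wrapper import WorldStereo

    device = torch.device("cuda:0")

    # Load one of the three variants from the HuggingFace Hub.
    worldstereo = WorldStereo.from_pretrained(
        "hanshanxue/WorldStereo",
        subfolder="worldstereo-memory",  # or "worldstereo-camera" / "worldstereo-memory-dmd"
        sp_world_size=1,
        fsdp=False,
        device=device,
    )

    # pipeline_inputs is not defined in this snippet; prepare it as a dict of
    # keyword arguments for the pipeline before calling the model.
    output = worldstereo(**pipeline_inputs)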

CLI Reference

run_camera_control.py

| Flag | Default | Description |
|---|---|---|
| --model_type | worldstereo-camera | Model variant to use |
| --input_path | examples/images | Input scene directory |
| --output_path | outputs | Output directory |
| --local_files_only | False | Use locally cached weights instead of downloading |
| --fsdp | False | Enable FSDP model sharding |
| --seed | 1024 | Random seed |

run_multi_traj.py (additional flags)

| Flag | Default | Description |
|---|---|---|
| --task_type | panorama | panorama or reconstruction |
| --align_nframe | 8 | Frames per clip saved for updating the memory bank |
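For readers who want to wrap the scripts, the flag tables above can be mirrored in a small `argparse` parser. This is a sketch that reproduces the documented flags and defaults; the real parsers in `run_camera_control.py` and `run_multi_traj.py` may define types, choices, or extra options differently.

```python
import argparse

MODEL_TYPES = ["worldstereo-camera", "worldstereo-memory", "worldstereo-memory-dmd"]


def build_parser(multi_traj=False):
    """Parser mirroring the documented flags; multi_traj adds the run_multi_traj.py extras."""
    parser = argparse.ArgumentParser(description="WorldStereo CLI sketch")
    parser.add_argument("--model_type", default="worldstereo-camera", choices=MODEL_TYPES,
                        help="Model variant to use")
    parser.add_argument("--input_path", default="examples/images", help="Input scene directory")
    parser.add_argument("--output_path", default="outputs", help="Output directory")
    parser.add_argument("--local_files_only", action="store_true",
                        help="Use locally cached weights instead of downloading")
    parser.add_argument("--fsdp", action="store_true", help="Enable FSDP model sharding")
    parser.add_argument("--seed", type=int, default=1024, help="Random seed")
    if multi_traj:
        parser.add_argument("--task_type", default="panorama",
                            choices=["panorama", "reconstruction"])
        parser.add_argument("--align_nframe", type=int, default=8,
                            help="Frames per clip saved for updating the memory bank")
    return parser


default_args = build_parser(multi_traj=True).parse_args([])
custom_args = build_parser().parse_args(["--fsdp", "--seed", "7"])
```

Modelling the boolean flags as `store_true` matches their documented default of False and their use without a value in the commands above.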

📂 Input Data Format

Camera-Only Inference (examples/images/)

<scene>/
├── image.png                 # reference image
├── prompt.json               # text descriptions at three verbosity levels
│   # {"short caption": ..., "medium caption": ..., "long caption": ...}
└── camera.json               # camera trajectory
    # {"motion_list": [...], "extrinsic": [...], "intrinsic": [...]}
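Before launching inference, it can help to check that a scene folder matches the layout above. The sketch below verifies that the three files exist and that the JSON files contain the listed keys. It is not part of the repository, and treating every listed key as mandatory is an assumption.

```python
import json
import tempfile
from pathlib import Path

# Keys copied from the layout above; whether all of them are required is an assumption.
REQUIRED_KEYS = {
    "prompt.json": {"short caption", "medium caption", "long caption"},
    "camera.json": {"motion_list", "extrinsic", "intrinsic"},
}


def check_scene(scene_dir):
    """Return a list of human-readable problems found in a camera-only scene folder."""
    scene = Path(scene_dir)
    problems = []
    if not (scene / "image.png").is_file():
        problems.append("missing image.png")
    for name, keys in REQUIRED_KEYS.items():
        path = scene / name
        if not path.is_file():
            problems.append(f"missing {name}")
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        absent = sorted(keys - set(data))
        if absent:
            problems.append(f"{name} lacks keys: {', '.join(absent)}")
    return problems


# Demo on a throwaway folder whose camera.json has no "motion_list" entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_scene = Path(tmp)
    (demo_scene / "image.png").touch()
    captions = {"short caption": "a", "medium caption": "b", "long caption": "c"}
    (demo_scene / "prompt.json").write_text(json.dumps(captions), encoding="utf-8")
    (demo_scene / "camera.json").write_text(
        json.dumps({"extrinsic": [], "intrinsic": []}), encoding="utf-8"
    )
    demo_problems = check_scene(demo_scene)
```

An empty list means the folder has the expected files and keys; it says nothing about whether the camera matrices themselves are valid.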

Memory-Augmented Multi-Trajectory (examples/panorama/, examples/reconstruction/)

<scene>/
├── panorama.png              # (optional) full panorama — triggers VLM single-path inference
├── meta_info.json            # {"scene_type": "perspective" | "panorama"}
├── start_frame.png           # reference start image for depth initialization
└── render_results/
    └── <view_id>/
        └── <traj_id>/
            ├── render.mp4         # pre-rendered geometry video (point cloud warp)
            ├── render_mask.mp4    # binary occlusion mask video
            └── camera.json        # {"extrinsic": [...], "intrinsic": [...]}
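The nested `render_results/<view_id>/<traj_id>/` layout can be enumerated with `pathlib`. The following sketch lists every trajectory folder that contains all three expected files; the function name and the demo folder names (`view_0`, `traj_1`) are made up for illustration.

```python
import tempfile
from pathlib import Path

TRAJECTORY_FILES = ("render.mp4", "render_mask.mp4", "camera.json")


def list_trajectories(scene_dir):
    """Return sorted (view_id, traj_id) pairs whose folders hold all expected files."""
    root = Path(scene_dir) / "render_results"
    found = []
    for camera_file in sorted(root.glob("*/*/camera.json")):
        traj_dir = camera_file.parent
        if all((traj_dir / name).is_file() for name in TRAJECTORY_FILES):
            found.append((traj_dir.parent.name, traj_dir.name))
    return found


# Demo: one complete trajectory and one that lacks its mask video.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "render_results" / "view_0" / "traj_1"
    partial = Path(tmp) / "render_results" / "view_0" / "traj_2"
    complete.mkdir(parents=True)
    partial.mkdir(parents=True)
    for name in TRAJECTORY_FILES:
        (complete / name).touch()
    (partial / "camera.json").touch()
    (partial / "render.mp4").touch()
    demo_trajectories = list_trajectories(tmp)
```

Incomplete folders are skipped silently here; a stricter variant could report them instead.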

🔧 Architecture

Model Variants

WorldStereo defines two transformer architectures in models/worldstereo.py, both extending WanTransformer3DModel from diffusers:

  • WorldStereoModel — Wan DiT backbone + ControlNet. Used by worldstereo-camera. The ControlNet encodes rendered point cloud geometry and camera embeddings, injecting residuals at each transformer block.
  • WorldStereoRefSModel — Extends WorldStereoModel with WanTransformerSparseSpatialBlock layers. These SSM blocks perform sparse attention over retrieved reference frames, guided by 3D correspondences. Used by worldstereo-memory and worldstereo-memory-dmd.
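To illustrate what "sparse attention guided by 3D correspondences" can mean, the sketch below runs scaled dot-product attention in which each target token may only attend to the reference tokens flagged in a boolean mask. It is a conceptual toy in pure Python, not the `WanTransformerSparseSpatialBlock` implementation; the `allowed` mask stands in for correspondences whose computation is not described in this README.

```python
import math


def masked_attention(queries, keys, values, allowed):
    """Scaled dot-product attention restricted by a boolean mask.

    allowed[i][j] is True when target token i may attend to reference token j
    (hypothetically derived from 3D correspondences). Rows with no allowed
    reference token produce a zero vector.
    """
    dim_out = len(values[0])
    outputs = []
    for query, allow_row in zip(queries, allowed):
        scale = 1.0 / math.sqrt(len(query))
        scores = [
            scale * sum(a * b for a, b in zip(query, key)) if ok else -math.inf
            for key, ok in zip(keys, allow_row)
        ]
        peak = max(scores)
        if peak == -math.inf:
            outputs.append([0.0] * dim_out)
            continue
        # Subtracting the peak keeps exp() numerically stable; masked entries become 0.
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim_out)]
        )
    return outputs


toy_output = masked_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
    allowed=[[False, True]],
)
```

In the toy call the query matches the first key best, yet the mask forces all attention onto the second reference token, which is the behaviour a correspondence-restricted receptive field is meant to produce.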

Inference Pipelines

Three pipelines are provided under models/pipelines/, selected automatically based on model_type in the config:

| Pipeline | Class | Mode |
|---|---|---|
| pipeline_pcd_keyframe.py | KFPCDControllerPipeline | Camera; standard DDIM sampling |
| pipeline_ref_keyframe.py | KFPCDControllerRefPipeline | Camera + GGM + SSM; standard DDIM sampling |
| pipeline_dmd_keyframe.py | RefKFDMDGeneratorPipeline | Camera + GGM + SSM; 4-step DMD distillation |
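The selection logic can be pictured as a lookup from `model_type` to a pipeline class. The mapping below is inferred by matching the Mode column against the model variant descriptions; it is an assumption about how the repository wires things together, and the function name is hypothetical.

```python
# Inferred mapping: model_type -> (module file under models/pipelines/, class name).
PIPELINE_BY_MODEL_TYPE = {
    "worldstereo-camera": ("pipeline_pcd_keyframe.py", "KFPCDControllerPipeline"),
    "worldstereo-memory": ("pipeline_ref_keyframe.py", "KFPCDControllerRefPipeline"),
    "worldstereo-memory-dmd": ("pipeline_dmd_keyframe.py", "RefKFDMDGeneratorPipeline"),
}


def select_pipeline(model_type):
    """Return (module file, class name) for a model type, or raise on unknown types."""
    try:
        return PIPELINE_BY_MODEL_TYPE[model_type]
    except KeyError:
        known = ", ".join(sorted(PIPELINE_BY_MODEL_TYPE))
        raise ValueError(f"unknown model_type {model_type!r}; expected one of: {known}") from None


selected_module, selected_class = select_pipeline("worldstereo-memory-dmd")
```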

3D Memory Bank

The memory bank (src/retrieval_wm.py) manages the growing 3D representation across trajectories:

  1. Init — MoGe depth estimation on the start frame lifts it to a point cloud.
  2. Retrieve — For each new target trajectory, the most relevant reference frames are selected via FOV-overlap scoring combined with DINOv2 image features and quality-aware furthest-point sampling.
  3. Update — After generation, new frames and their estimated depths are appended to the bank.
  4. Reconstruction — Feedforward reconstruction via HY-World 2.0 WorldMirror enforces multi-view depth consistency; final global alignment produces a unified point cloud.
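Two of the steps above rest on standard geometric operations that are easy to sketch: lifting a depth map to a point cloud with pinhole intrinsics (Init) and furthest-point sampling (used in Retrieve). The code below shows plain versions of both in pure Python. It is a simplification: the repository's retrieval additionally uses FOV-overlap scoring, DINOv2 features, and quality awareness, none of which is modelled here, and the function names are hypothetical.

```python
import math


def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an H x W depth map to camera-space 3D points with a pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip invalid (non-positive) depth values
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def farthest_point_sampling(points, k, start=0):
    """Greedily pick up to k indices, each maximising distance to those already chosen."""
    chosen = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


single_point = unproject_depth([[2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
camera_centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
spread_out = farthest_point_sampling(camera_centres, k=3)
```

Sampling over camera centres, as in the toy call, favours reference frames that cover different parts of the scene rather than near-duplicates.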

Distributed Inference

WorldStereo supports two parallelism strategies:

  • Sequence Parallel (SP): the sequence dimension is sharded across the SP group at each attention layer (models/attention.py). Controlled by torchrun --nproc_per_node.
  • FSDP: Fully Sharded Data Parallel wraps both the transformer and the auxiliary encoders. Enabled with --fsdp. Requires a device_mesh with ("rep", "shard") dimensions.
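The idea behind sequence parallelism can be shown without any GPUs: each rank keeps a slice of the query tokens, gathers the full keys and values, and the concatenated per-rank outputs equal ordinary full attention. The sketch below simulates the ranks in a single process with pure Python. It shows one common scheme only; the actual communication pattern in `models/attention.py` is not described in this README and may differ.

```python
import math


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    outputs = []
    for query in queries:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim_out)]
        )
    return outputs


def shard(sequence, world_size):
    """Split a sequence into world_size contiguous chunks (the last ones may be shorter)."""
    size = math.ceil(len(sequence) / world_size)
    return [sequence[rank * size:(rank + 1) * size] for rank in range(world_size)]


def sequence_parallel_attention(queries, keys, values, world_size):
    """Simulate SP: every rank attends from its query shard to the gathered keys/values."""
    outputs = []
    for query_shard in shard(queries, world_size):
        # On real hardware the keys and values would arrive via an all-gather.
        outputs.extend(attention(query_shard, keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_values = [[1.0], [2.0], [3.0]]
parallel_result = sequence_parallel_attention(tokens, tokens, token_values, world_size=2)
reference_result = attention(tokens, tokens, token_values)
```

Because attention rows are independent of each other, sharding the queries changes where the work happens but not the result.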

🤝 Acknowledgements

WorldStereo builds upon the following excellent works:

  • Wan: video DiT backbone
  • HunyuanVideo-1.5: sequence-parallel components and video generation model
  • MoGe: monocular geometry estimation
  • HY-World 2.0: WorldMirror reconstruction module
  • diffusers: pipeline and model utilities

📝 Citation

If you find WorldStereo useful in your research, please cite:

    @article{zhang2026worldstereo,
      title={WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories},
      author={Zhang, Yisu and Cao, Chenjie and Wang, Tengfei and Zuo, Xuhui and Wu, Junta and Zhu, Jianke and Guo, Chunchao},
      journal={arXiv preprint arXiv:2603.02049},
      year={2026}
    }

About

[CVPR 2026] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories (WorldExpand of HY-World 2.0)