PointWorld
Training and Evaluation Pipeline for "PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation".
Wenlong Huang1,†,
Yu-Wei Chao2,
Arsalan Mousavian2,
Ming-Yu Liu2,
Dieter Fox2,
Kaichun Mo2,*,
Li Fei-Fei1,*
1Stanford University, 2NVIDIA
*Equal advising | †Work done partly at NVIDIA
PointWorld is a large pre-trained 3D world model that predicts full-scene 3D point flows from partially observable RGB-D captures and robot actions, also represented as 3D point flows.
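The point-flow formulation can be illustrated with a minimal sketch (hypothetical names, not the PointWorld API): the model predicts a per-point 3D displacement for every observed point, and advancing the scene means adding that flow to the points.

```python
import numpy as np

# Illustrative sketch of the point-flow formulation; advance_points is a
# hypothetical helper, not part of the PointWorld codebase.
def advance_points(points: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """points: (N, 3) observed 3D points; flow: (N, 3) predicted displacements."""
    assert points.shape == flow.shape and points.shape[-1] == 3
    return points + flow

points = np.zeros((4, 3))
flow = np.full((4, 3), 0.1)
next_points = advance_points(points, flow)  # each point shifted by (0.1, 0.1, 0.1)
```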
If you find this work useful in your research, please cite using the following BibTeX:
```bibtex
@article{huang2026pointworld,
  title={PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation},
  author={Huang, Wenlong and Chao, Yu-Wei and Mousavian, Arsalan and Liu, Ming-Yu and Fox, Dieter and Mo, Kaichun and Li, Fei-Fei},
  journal={arXiv preprint arXiv:2601.03782},
  year={2026}
}
```
🗂️ Table of Contents
- Important Notes
- Setup
- Training
- Evaluation
- Visualization
- Known Limitations
- Acknowledgements
- Contributing
📌 Important Notes
- Precomputed datasets and pretrained checkpoints are still under internal review at NVIDIA and are expected to be released in the next 1-2 months.
- `main` is the training/evaluation code branch for release; `data` is the dataset preparation pipeline branch.
- Please first prepare the data using the `data` branch, then return to `main` for training and evaluation.
🛠️ Setup
Environment
The main branch provides a self-contained conda setup with no local editable dependencies.
Recommended baseline for reproducibility on `main`:
- Linux `x86_64`
- Python `3.10`
- NVIDIA driver compatible with CUDA 12.4 wheels
Recommended setup:
```shell
# from repo root
conda env create -n pointworld-env -f environments/train_eval.yml
conda activate pointworld-env

# timm is used for PTv3 DropPath; install without pulling extra transitive deps
python -m pip install timm==1.0.19 --no-deps

# keep urdfpy-compatible graph deps on a Python 3.10-safe networkx release
python -m pip install networkx==3.4.2 --no-deps
```
If you also need visualization extras:
```shell
conda env update -n pointworld-env -f environments/train_eval_viz.yml --prune

# timm is used for PTv3 DropPath; install without pulling extra transitive deps
python -m pip install timm==1.0.19 --no-deps

# keep urdfpy-compatible graph deps on a Python 3.10-safe networkx release
python -m pip install networkx==3.4.2 --no-deps
```
Dependency layout:
- `environments/requirements.txt`: canonical base dependency list for train/eval.
- `environments/train_eval_viz.yml`: optional visualization extras (`matplotlib`, `open3d`, `viser`).
Third-Party Dependency (DINOv3)
Request access via the official DINOv3 release page first, then use the provided download URL.
```shell
git submodule update --init --recursive
mkdir -p third_party/dinov3/checkpoints
wget -O third_party/dinov3/checkpoints/<dinov3_vitl16_pretrain_*.pth> \
  "<URL_FROM_DINOV3_ACCESS_EMAIL>"
```
Dataset Path Convention
Use this directory layout for generated datasets consumed by main:
- DROID WDS: `/path/to/droid/wds`
- BEHAVIOR WDS: `/path/to/behavior/wds`
The `arguments.py` defaults now follow this convention under `LOCAL_DATASET_DIR`:

- `droid` -> `${LOCAL_DATASET_DIR}/droid/wds`
- `behavior` -> `${LOCAL_DATASET_DIR}/behavior/wds`
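A minimal sketch of this path convention (the actual defaults live in `arguments.py`; the helper below is hypothetical):

```python
import os

# Hypothetical helper mirroring the default path convention described above.
LOCAL_DATASET_DIR = os.environ.get("LOCAL_DATASET_DIR", "/path/to/datasets")

def default_data_dir(domain: str) -> str:
    """Map a domain name ('droid' or 'behavior') to its WDS directory."""
    return os.path.join(LOCAL_DATASET_DIR, domain, "wds")

print(default_data_dir("droid"))  # ${LOCAL_DATASET_DIR}/droid/wds
```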
🏋️ Training
PTv3 Architecture Variant
The PointWorld release supports three PTv3 variants:

- `small`
- `base` (default)
- `large`

Set the variant explicitly with `--ptv3_size=<small|base|large>` in training/evaluation commands when needed.
Single-Domain Training (DROID)
```shell
python train.py \
  --domains=droid \
  --data_dirs=/path/to/droid/wds \
  --norm_stats_path=stats/droid \
  --batch_size=<BATCH_SIZE> \
  --num_workers=<NUM_WORKERS> \
  --eval_num_workers=<EVAL_NUM_WORKERS> \
  --eval_freq=-1
```
Replace `/path/to/droid/wds` and the worker/batch placeholders with values that match your machine.
Single-Domain Training (BEHAVIOR)
```shell
python train.py \
  --domains=behavior \
  --data_dirs=/path/to/behavior/wds \
  --norm_stats_path=stats/droid_behavior \
  --batch_size=<BATCH_SIZE> \
  --num_workers=<NUM_WORKERS> \
  --eval_num_workers=<EVAL_NUM_WORKERS> \
  --eval_freq=-1
```
Multi-Domain Training (DROID + BEHAVIOR)
```shell
python train.py \
  --domains=droid,behavior \
  --data_dirs=/path/to/droid/wds,/path/to/behavior/wds \
  --norm_stats_path=stats/droid_behavior \
  --batch_size=<BATCH_SIZE> \
  --num_workers=<NUM_WORKERS> \
  --eval_num_workers=<EVAL_NUM_WORKERS> \
  --eval_freq=-1
```
DDP Training Template
```shell
torchrun \
  --standalone \
  --nproc_per_node=<NUM_GPUS> \
  train.py \
  --distributed=true \
  <your_train_args>
```
📊 Evaluation
By default, release evaluation targets the test split.
Expert Model Training For DROID Filtered Metrics (Optional)
This step is only required if you want reliable filtered metrics on the DROID domain (`full_eval/test/filtered_l2_moved/mean`) and to reproduce the results in the paper.
```shell
python train.py \
  --domains=droid \
  --data_dirs=/path/to/droid/wds \
  --norm_stats_path=stats/droid \
  --train_splits=test \
  --exp_name=droid-test-expert \
  --batch_size=<BATCH_SIZE> \
  --num_workers=<NUM_WORKERS> \
  --eval_num_workers=<EVAL_NUM_WORKERS> \
  --eval_freq=-1
```
1. DROID Evaluation (Annotation-Aware)
The key paper metric is:
`full_eval/test/filtered_l2_moved/mean`
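A hedged sketch of what a confidence-filtered L2 metric over moved points could compute; the exact definition is implemented in `eval.py`, and all names and thresholds below are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration of a filtered L2 metric: average flow error
# over points that actually moved in the ground truth AND whose expert
# confidence exceeds a threshold.
def filtered_l2_moved(pred_flow, gt_flow, confidence,
                      conf_thres=0.8, move_thres=1e-3):
    moved = np.linalg.norm(gt_flow, axis=-1) > move_thres
    keep = moved & (confidence > conf_thres)
    err = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    return float(err[keep].mean())

gt = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])
pred = gt + np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.0, 0.0]])
conf = np.array([0.9, 0.9, 0.5])
# Only the second point is both moved and high-confidence.
metric = filtered_l2_moved(pred, gt, conf)
```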
To evaluate filtered metrics, generate expert confidence locally first.
- Set the expert checkpoint path (for example, from the `--train_splits=test` run above):

```shell
EXPERT_MODEL_PATH=/path/to/train_logs/droid-test-expert/model-last.pt
```
- Generate confidence annotations on the DROID test split:

```shell
python eval.py \
  --model_path "${EXPERT_MODEL_PATH}" \
  --domains=droid \
  --data_dirs=/path/to/droid/wds \
  --run_confidence_annotation=true \
  --confidence_thres=0.8 \
  --batch_size=1 \
  --eval_num_batches=-1
```
This writes `expert_confidence-seed=42.h5` under `/path/to/droid/wds/test/`.
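The output location follows a simple naming convention; a hypothetical helper (not part of the codebase) mirroring it:

```python
import os

# Hypothetical helper: the expert confidence annotation is written as
# expert_confidence-seed=<SEED>.h5 inside the test split directory.
def expert_confidence_path(data_dir: str, seed: int = 42) -> str:
    return os.path.join(data_dir, "test", f"expert_confidence-seed={seed}.h5")

print(expert_confidence_path("/path/to/droid/wds"))
```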
- Evaluate a target checkpoint using the generated confidence annotation:

```shell
MODEL_PATH=/path/to/train_logs/<run_name>/model-last.pt

python eval.py \
  --model_path "${MODEL_PATH}" \
  --domains=droid \
  --data_dirs=/path/to/droid/wds \
  --confidence_thres=0.8 \
  --batch_size=1 \
  --eval_num_batches=-1
```
For quicker iteration, you can set `--eval_num_batches=<N>` (for example, 100) instead of running full-dataset evaluation.
2. BEHAVIOR Evaluation (Simulation-Only)
BEHAVIOR evaluation does not require the expert-confidence annotation because the data is noiseless.
```shell
MODEL_PATH=/path/to/train_logs/<run_name>/model-last.pt

python eval.py \
  --model_path "${MODEL_PATH}" \
  --domains=behavior \
  --data_dirs=/path/to/behavior/wds \
  --norm_stats_path=stats/droid_behavior \
  --batch_size=1 \
  --eval_num_batches=-1
```
🎞️ Visualization
PointWorld visualization is built on top of viser, which provides the live 3D viewer and GUI controls.
Enable evaluation-time visualization by setting `--eval_viz_num > 0`:

```shell
python eval.py \
  --model_path "${MODEL_PATH}" \
  --domains=droid \
  --data_dirs=/path/to/droid/wds \
  --batch_size=1 \
  --eval_num_batches=100 \
  --eval_viz_num=8 \
  --viewer_port=8080
```
When running, open http://localhost:8080 in your browser.
Visualization includes these controls:
- `Frame`: step through temporal evolution (frame-by-frame) across the sequence.
- `Ground-truth`: switch between model prediction and GT trajectories.
- `Upsample`: toggle between coarse and upsampled point rendering.
- `Scene flow density` and `Robot flow density`: reduce/increase the number of rendered flow vectors.
- `Scene Flow Thickness` and `Robot Flow Thickness`: adjust vector thickness for readability.
- `Point size`: adjust rendered point cloud size.
- `Full overlay opacity`: control overlay transparency.
Runtime behavior:
- After each visualized sample, the CLI prompts `Press ENTER to continue ...` (type `q` to stop).
- This prompt requires an interactive TTY (a real terminal stdin). If stdin is redirected or captured, the prompt may fail.
- In headless setups, SSH with a terminal attached and forward the viewer port if needed.
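For the headless case, a typical port-forwarding invocation looks like this (user and host are placeholders for your own machines):

```shell
# Forward the remote viser port 8080 to localhost:8080 on your machine.
ssh -L 8080:localhost:8080 <user>@<remote-host>
# Then open http://localhost:8080 in a local browser.
```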
To run evaluation without visualization, set `--eval_skip_viz=true` (or leave `--eval_viz_num=-1`).
⚠️ Known Limitations
- Eval outputs are not deterministic on GPU; small run-to-run variation is expected even with fixed seeds.
- Partial-batch comparisons (`eval_num_batches` < full dataset) are sensitive to `num_workers` and `eval_num_workers`; match these settings when comparing runs.
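The partial-batch caveat can be illustrated with a toy example: a metric averaged over only the first N samples depends on sample ordering (which worker settings influence), while the full-dataset mean does not.

```python
import numpy as np

# Toy illustration (not PointWorld code): partial means depend on ordering,
# full-dataset means do not.
rng = np.random.default_rng(0)
metric = rng.normal(size=1000)      # stand-in per-sample metric values

order_a = np.arange(1000)           # one loader ordering
order_b = rng.permutation(1000)     # a different loader ordering

partial_a = metric[order_a[:100]].mean()   # differs between orderings
partial_b = metric[order_b[:100]].mean()

full_a = metric[order_a].mean()            # agrees (up to float error)
full_b = metric[order_b].mean()
```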
🙏 Acknowledgements
We gratefully acknowledge the authors and maintainers of third-party projects that this repository depends on or adapts. Modifications have been made where noted, and the original license terms remain in effect.
Third-party OSS attribution and license references for distributed or adapted code are documented in THIRD_PARTY_LICENSES.md.
| Repository / Project | Usage in this repo | License |
|---|---|---|
| facebookresearch/dinov3 | Scene encoder backbone submodule (third_party/dinov3/) | DINOv3 License |
| Pointcept/PointTransformerV3 | Vendored/adapted PTv3 components (ptv3/) | MIT |
| facebookresearch/sonata | PTv3 lineage reference for adapted components | Apache-2.0 |
| StanfordVL/OmniGibson | Adapted transform utilities (transform_utils.py, deploy/transform_utils_torch.py) | MIT |
| UT-Austin-RPL/deoxys_control | Additional adapted transform routines noted in transform_utils.py | Apache-2.0 |
🤝 Contributing
All external contributions must follow CONTRIBUTING.md in this repository.
In particular, commits must be signed off (git commit -s) to satisfy DCO requirements.