Awesome Spatial Reasoning with MVLMs

This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).

Feel free to open a Pull Request to add new work!

📑 Table of Contents

Introduction
Papers
Resources
Contributing
Citation
Star History
License

Introduction

In this survey, we provide a comprehensive review of existing tasks in multimodal spatial reasoning with large models, categorize recent progress in multimodal large language models (MLLMs), and introduce open benchmarks for evaluating these models. We start by reviewing general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D scenarios, we systematically review spatial relationship reasoning, scene and layout reasoning, and visual question answering and grounding in 3D space.

We also discuss recent advances in embodied AI tasks, such as vision-language navigation and action models. Audio and egocentric video are also covered, as they support distinct and emerging forms of spatial understanding with novel sensors. We believe this survey establishes a solid foundation and offers valuable insights into the critical field of multimodal spatial reasoning.

Existing reasoning surveys are in Reasoning_survey.md.

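Papers

Papers are grouped into five categories: 3D Vision, Embodied AI, General MLLM, Video / Audio / Egocentric, and Spatial Benchmark.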

Resources

Workshops and Tutorials

TBD

Contributing

Contributions are welcome! To contribute:

Fork this repository
Add your paper/resource in the appropriate markdown file or create a new one
Update the link list in README.md if needed
Submit a Pull Request 🎉

Citation

If you find this project helpful, please cite:

@article{zheng2025multimodal,
  title={Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks},
  author={Zheng, Xu and Dongfang, Zihao and Jiang, Lutao and Zheng, Boyuan and Guo, Yulong and Zhang, Zhenquan and Albanese, Giuliano and Yang, Runyi and Ma, Mengjiao and Zhang, Zixin and others},
  journal={arXiv preprint arXiv:2510.25760},
  year={2025}
}

Star History

License

This project is licensed under the MIT License; see the LICENSE file for details.
