
Awesome Spatial Reasoning with MVLMs


This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).

Feel free to open a Pull Request to add new work!




Introduction

In this survey, we review existing tasks in multimodal spatial reasoning with large models, categorize recent progress in multimodal large language models (MLLMs), and introduce open benchmarks for evaluating them. We start with general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D scenarios, we systematically review spatial relationship reasoning, scene and layout reasoning, and visual question answering and grounding in 3D space.

We also discuss recent advances in embodied AI tasks, such as vision-language navigation and action models. Audio and egocentric video are covered as well, since these modalities bring distinct and emerging forms of spatial understanding with novel sensors. We hope this survey provides a solid foundation and useful insights for the field of multimodal spatial reasoning.

Existing reasoning surveys are listed in Reasoning_survey.md.


Papers

3D Vision

🔗 3D_Vision.md

Embodied AI

🔗 Embodied_AI.md

General MLLM

🔗 General_MLLM.md

Video / Audio / Egocentric

🔗 Video_Audio_Egocentric.md

Spatial Benchmark

🔗 Spatial_Benchmark.md


Resources

Workshops and Tutorials

TBD


Contributing

Contributions are welcome! To contribute:

  1. Fork this repository
  2. Add your paper/resource in the appropriate markdown file or create a new one
  3. Update the link list in README.md if needed
  4. Submit a Pull Request 🎉

Citation

If you find this project helpful, please cite:

@article{zheng2025multimodal,
  title={Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks},
  author={Zheng, Xu and Dongfang, Zihao and Jiang, Lutao and Zheng, Boyuan and Guo, Yulong and Zhang, Zhenquan and Albanese, Giuliano and Yang, Runyi and Ma, Mengjiao and Zhang, Zixin and others},
  journal={arXiv preprint arXiv:2510.25760},
  url={https://arxiv.org/abs/2510.25760},
  year={2025}
}



License

This project is licensed under the MIT License; see the LICENSE file for details.
