GEM: A Generalist Model for Human Motion
Monocular whole-body 3D human pose estimation using the SOMA body model
2D Keypoint Overlay | In-Camera Mesh | Global Mesh | Retargeted G1 Motion
📰 News
- [2025-06] 📢 Added ONNX/TensorRT accelerated demo (demo_soma_onnx.py) — see Demo docs
- [2025-06] 📢 Added humanoid robot retargeting to Unitree G1 (--retarget)
- [2025-05] 📢 The GEM codebase is released!
🔎 Overview
GEM is a video-based 3D human pose estimation model developed by NVIDIA. It recovers full-body 77-joint motion — including body, hands, and face — from monocular video using the SOMA parametric body model. The pipeline handles dynamic cameras and recovers global motion trajectories. GEM includes a bundled 2D pose estimation model that detects 77 SOMA keypoints, making the system fully self-contained. Licensed under Apache 2.0 for commercial use.
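The bundled detector produces 77 SOMA keypoints per frame. As a minimal illustration of working with such output, the sketch below filters keypoints by confidence; the (77, 3) layout of [x, y, confidence] rows is an assumption for illustration only — the actual output format is documented in the Demo docs.

```python
import numpy as np

# Illustrative only: assume the 2D detector yields a (77, 3) array of
# [x, y, confidence] rows per frame (assumed layout, not GEM's documented API).
keypoints = np.random.default_rng(0).uniform(size=(77, 3))

def visible(kps, conf_thresh=0.5):
    """Keep only the keypoints whose confidence clears the threshold."""
    return kps[kps[:, 2] >= conf_thresh]

print(visible(keypoints).shape)  # (n_visible, 3) — rows with confidence >= 0.5
```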
✨ Key Features
- 77-joint SOMA body model — full body, hands, and face articulation
- Bundled 2D keypoint detector — a 2D pose estimator trained on the SOMA 77-joint skeleton, so no external detector is required
- Camera-space motion recovery — estimates human motion in the camera coordinate frame from monocular video
- World-space motion recovery — recovers global, world-frame motion trajectories even when the camera itself is moving
- Humanoid robot retargeting — retarget recovered motion to Unitree G1 robot via SOMA Retargeter
- Apache 2.0 licensed — commercially usable, trained on NVIDIA-owned data only
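The difference between camera-space and world-space recovery above comes down to a rigid transform: given per-frame camera extrinsics, camera-space joints map into world coordinates. A minimal sketch (the function name and data are illustrative, not GEM's API):

```python
import numpy as np

def camera_to_world(joints_cam, R_wc, t_wc):
    """Map camera-space 3D joints (J, 3) into world space.

    R_wc (3, 3) and t_wc (3,) are the camera's rotation and translation in
    the world frame, i.e. x_world = R_wc @ x_cam + t_wc. Illustrative only —
    GEM's actual output format is described in the Demo docs.
    """
    return joints_cam @ R_wc.T + t_wc

# Example: a camera sitting at world position (0, 0, 2), axis-aligned.
joints_cam = np.array([[0.0, 0.0, 1.0]])        # one joint 1 m in front of the camera
R_wc = np.eye(3)
t_wc = np.array([0.0, 0.0, 2.0])
print(camera_to_world(joints_cam, R_wc, t_wc))  # [[0. 0. 3.]]
```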
Research Version: Multi-Modal Conditioning
Looking for multi-modal motion generation (text, audio, music conditioning)? Check out GEM-SMPL, our research model using the SMPL body model that supports both motion estimation and generation from diverse input modalities. Presented at ICCV 2025 (Highlight).
🚀 Quick Start
```bash
# 1. Clone
git clone --recursive https://github.com/NVlabs/GEM-X.git && cd GEM-X

# 2. Setup environment
pip install uv && uv venv .venv --python 3.12 && source .venv/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
uv pip install -e third_party/soma && cd third_party/soma && git lfs pull && cd ../..
bash scripts/install_env.sh

# 3. Run demo
python scripts/demo/demo_soma.py --video path/to/video.mp4 --ckpt inputs/pretrained/gem_soma.ckpt

# 4. (Optional) Retarget to G1 robot
uv pip install -e third_party/soma-retargeter
python scripts/demo/demo_soma.py --video path/to/video.mp4 --retarget
```
See docs/INSTALL.md for detailed installation instructions.
Mac users: For real-time webcam demos on Apple Silicon, see docs/INSTALL_MACOS.md instead.
📚 Documentation
| Document | Description |
|---|---|
| Installation | Prerequisites, step-by-step setup, Docker, troubleshooting |
| Installation (macOS) | Apple Silicon setup for the real-time webcam demo |
| Demo | Full 3D pipeline, ONNX/TRT accelerated demo, 2D keypoint-only demo, output formats |
| Training & Evaluation | Dataset preparation, training commands, config system |
| Model Overview | Architecture, SOMA body model, bundled 2D pose model |
| Related Projects | GENMO, SOMA, ecosystem cross-references |
📦 Pretrained Models
| Model | Body Model | Joints | Download |
|---|---|---|---|
| GEM (SOMA) | SOMA | 77 (body + hands + face) | gem_soma.ckpt |
Place checkpoints under inputs/pretrained/ or pass the path via --ckpt. The demo scripts will automatically download the checkpoint from HuggingFace if --ckpt is not provided.
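The lookup order described above — an explicit --ckpt path first, then the default location, falling back to a download — might be sketched as follows. resolve_ckpt is a hypothetical helper for illustration, not the repo's actual code; the real demo script handles the HuggingFace download itself.

```python
from pathlib import Path

# Default location from the table above.
DEFAULT_CKPT = Path("inputs/pretrained/gem_soma.ckpt")

def resolve_ckpt(ckpt_arg=None):
    """Return (path, needs_download) for the checkpoint to load.

    Hypothetical sketch: prefer an explicit --ckpt argument, then the
    default local path; otherwise signal that a download is needed.
    """
    if ckpt_arg is not None:
        return Path(ckpt_arg), False   # user-supplied path, no download
    if DEFAULT_CKPT.exists():
        return DEFAULT_CKPT, False     # found locally
    return DEFAULT_CKPT, True          # caller should fetch it first

path, needs_download = resolve_ckpt("my_models/gem_soma.ckpt")
print(path, needs_download)  # my_models/gem_soma.ckpt False
```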
🤝 Related Humanoid Work at NVIDIA
GEM is part of a broader effort to produce humanoid motion data for robotics, physical AI, and other applications. See Related Projects for cross-references to these related works.
📖 Citation
```bibtex
@inproceedings{genmo2025,
  title     = {GENMO: A GENeralist Model for Human MOtion},
  author    = {Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```
📄 License
This project is released under Apache 2.0. This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use. See ATTRIBUTIONS.md for specifics.
GOVERNING TERMS:
Use of the source code is governed by the Apache License, Version 2.0. Use of the associated model is governed by the NVIDIA Open Model License Agreement.