# Setup Guide
> **Skill:** `.agents/skills/cosmos3-setup/SKILL.md`
______________________________________________________________________
**Table of Contents**
- [System Requirements](#system-requirements)
- [Recommended Base Image](#recommended-base-image)
- [Installation](#installation)
- [Quickstart: From the Recommended Base Image](#quickstart-from-the-recommended-base-image)
- [Docker Container](#docker-container)
- [Advanced](#advanced)
- [Virtual Environment](#virtual-environment)
- [CUDA Variants](#cuda-variants)
- [Environment Variables](#environment-variables)
- [Downloading Base Checkpoints](#downloading-base-checkpoints)
- [Troubleshooting](#troubleshooting)
- [PyTorch Import Issue](#pytorch-import-issue)
- [Dependency Issue](#dependency-issue)
- [Python Issue](#python-issue)
- [CUDA Issue](#cuda-issue)
______________________________________________________________________
## System Requirements
- NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer — Hopper (H100) or Blackwell (B200) recommended for full training throughput
- NVIDIA driver compatible with [CUDA version](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)
- [NVIDIA CUDA >=12.8](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu)
- Linux x86-64/aarch64
- glibc >=2.35 (e.g. Ubuntu >=22.04)
- Python >=3.10
- Multi-node training additionally requires a working NCCL setup (IB/RoCE recommended) and a shared filesystem visible to all ranks for checkpoint I/O
- Free disk: ~150 GiB recommended for a first-run inference or training workflow (Hugging Face cache ~90 GiB, uv cache ~20 GiB, run outputs ~30 GiB). See [FAQ → Expected disk footprint](./faq.md#q-how-much-disk-space-do-i-need) for the breakdown and how to relocate caches.
Recommended Base Image
### Recommended Base Image
For CUDA 13 builds, the [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is the recommended starting point — it bundles PyTorch + CUDA 13 + cuDNN + NCCL tuned for NVIDIA hardware, plus Apex, TransformerEngine, and Megatron utilities that training infra users commonly need.
```dockerfile
FROM nvcr.io/nvidia/pytorch:25.09-py3
```
For CUDA 12.8 builds, pin to an earlier NGC tag (e.g. `nvcr.io/nvidia/pytorch:25.06-py3`) that still ships CUDA 12.
## Installation
If you encounter issues, see [Troubleshooting](#troubleshooting).
Clone the repository:
```bash
git clone git@github.com:NVIDIA/cosmos-framework.git
cd cosmos-framework
```
The two supported install paths are the recommended base image and the Docker container. For other paths (standalone venv, custom torch/cuda) see [Advanced](#advanced).
Quickstart: From the Recommended Base Image
### Quickstart: From the Recommended Base Image
If you started from the [recommended base image](#recommended-base-image) (`nvcr.io/nvidia/pytorch:25.09-py3`), the following commands set up the full environment in one go. Run them **from the root of this repository** (i.e. inside the `Cosmos/` directory you just cloned):
```shell
apt-get update
apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# CUDA 13.0 (recommended); for CUDA 12.8 use `--group=cu128-train`
uv sync --all-extras --group=cu130-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
```
Docker Container
### Docker Container
Please make sure you have access to Docker on your machine and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) is installed.
Build the container:
```bash
image_tag=$(docker build -q .)
```
Run the container:
```bash
docker run -it --runtime=nvidia --ipc=host --rm \
-v .:/workspace -v /workspace/.venv \
-v /root/.cache:/root/.cache \
-e HF_TOKEN="$HF_TOKEN" \
$image_tag
```
For multi-node training, also bind-mount your shared dataset and checkpoint directories so all ranks see the same filesystem.
Optional arguments:
- `--ipc=host`: Use host system's shared memory, since parallel torchrun consumes a large amount of shared memory. If not allowed by security policy, increase `--shm-size` ([documentation](https://docs.docker.com/engine/containers/run/#runtime-constraints-on-resources)).
- `-v /root/.cache:/root/.cache`: Mount host cache to avoid re-downloading cache entries.
If you get `docker: Error response from daemon: unknown or invalid runtime name: nvidia`, you need to [configure docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker):
```shell
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
See [docker/README.md](../docker/README.md) for additional images and build options.
Advanced
### Advanced
Use these paths only when the recommended base image or Docker container are not viable for your environment.
Virtual Environment
#### Virtual Environment
Install system dependencies:
```shell
sudo apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget
```
Install [uv](https://docs.astral.sh/uv/getting-started/installation/):
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```
Install the package using one of the following methods:
UV Sync: fully reproducible environment
Choose the dependency group that matches your CUDA toolkit (see [CUDA Variants](#cuda-variants)):
```shell
# CUDA 13.0 (recommended)
uv sync --all-extras --group=cu130-train
# Or, for CUDA 12.8:
# uv sync --all-extras --group=cu128-train
source .venv/bin/activate && export LD_LIBRARY_PATH=
```
UV Pip: virtual environment
```shell
# Create virtual environment (skip if using an existing environment)
uv venv --clear && source .venv/bin/activate && export LD_LIBRARY_PATH=
uv pip install -r pyproject.toml --all-extras --group=cu130-train
uv pip install -e .
```
UV Pip: system environment
```shell
uv pip install --system --break-system-packages -r pyproject.toml --all-extras --group=cu130-train
```
Custom torch/cuda versions
```shell
cuda_name=cu130
torch_name=torch210
# 1. Create and activate the virtual environment
uv venv --clear && source .venv/bin/activate
# 2. Install the desired torch/cuda versions
uv pip install "torch==2.10.0" "torchvision" --torch-backend=$cuda_name
# 3. Install the package with desired extras
uv pip install -r pyproject.toml --all-extras --group=cu130-train
# 4. Install one of the following attention backends:
# * Blackwell
uv pip install "natten==0.21.6.dev6+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/natten
# * Hopper
uv pip install "flash-attn-3-nv==1.0.3+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn-3-nv
# * Ada/Ampere
uv pip install "flash-attn==2.7.4.post1+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn
```
If there is no attention backend wheel for your torch/cuda versions, you can build one using [cosmos-dependencies](https://github.com/nvidia-cosmos/cosmos-dependencies).
Optional package extras:
- `train`: Training infrastructure (FSDP, parallelism, checkpointing, datasets)
- `eval`: Evaluation harnesses for trained checkpoints
#### CUDA Variants
This repository is **training-focused**, so the `*-train` dependency groups are the supported install path. Inference-only groups exist for evaluating trained checkpoints in-tree but are not required for training.
| CUDA Version | Training (recommended) | Notes |
| --------------------------- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| **CUDA 13.0 (recommended)** | `--group=cu130-train` | [NVIDIA Driver](https://docs.nvidia.com/cuda/archive/13.0.0/cuda-toolkit-release-notes/index.html#cuda-toolkit-major-component-versions) |
| CUDA 12.8 | `--group=cu128-train` | [NVIDIA Driver](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html#cuda-toolkit-major-component-versions) |
## Environment Variables
Export the following before downloading checkpoints or launching training. See [environment_variables.md](environment_variables.md) for the full reference.
| Variable | Purpose |
| ------------------------ | ------------------------------------------------------------------------------------------------------ |
| `HF_TOKEN` | Hugging Face access token for gated model/dataset downloads. Alternative to `uvx hf auth login`. |
| `HF_HOME` | Cache directory for Hugging Face models and datasets. Recommend ≥ 1 TB free. |
| `IMAGINAIRE_OUTPUT_ROOT` | Output root for training DCP checkpoints and logs. Recommend ≥ 1 TB free. |
| `UV_CACHE_DIR` | Cache directory for `uv`-managed dependencies. |
| `LD_LIBRARY_PATH=` | Clear (set to empty) after sourcing the venv to avoid host library bleed-through into PyTorch imports. |
## Downloading Base Checkpoints
Training in this repo typically starts from a pretrained base checkpoint that you fine-tune or post-train. The recommended source is the Hugging Face Hub.
1. Get a [Hugging Face Access Token](https://huggingface.co/settings/tokens) with `Read` permission.
2. Authenticate using **either** mechanism (they are equivalent — pick one, do not set both with different tokens):
- **`HF_TOKEN` environment variable** — preferred for Docker and non-interactive shells. Export it once and any `huggingface_hub` call (CLI or library) picks it up.
- **`uvx hf auth login`** — preferred for local interactive use. Writes the token to `~/.cache/huggingface/token`, persisted across sessions (and across Docker runs if you bind-mount `/root/.cache`).
3. Accept the license for any gated model you intend to use (e.g. the [NVIDIA Open Model License Agreement](https://huggingface.co/nvidia/Cosmos-Guardrail1) where applicable).
4. Test access:
```shell
uvx hf@latest download --repo-type model nvidia/Cosmos-Guardrail1 \
--revision d6d4bfa899a71454a700907664f3e88f503950cf --include "README.md"
```
If you encounter issues:
1. Check that you don't have conflicting [environment variables](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables) — e.g. an `HF_TOKEN` set to a different token than the one cached by `hf auth login`: `printenv | grep HF_`.
2. Check that your [token](https://huggingface.co/settings/tokens) has sufficient permissions.
Checkpoints are downloaded on demand during training and evaluation. To change the cache location, set [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome). See [training.md](training.md) for [DCP conversion](training.md#step-2--prepare-checkpoint) and [Hugging Face safetensors export](training.md#export-checkpoint-to-hugging-face-safetensors).
## Troubleshooting
### PyTorch Import Issue
Errors:
- `ImportError: cannot import name '_functionalization' from 'torch._C'`
Clear the library path in your current shell:
```shell
export LD_LIBRARY_PATH=
```
This applies to the current session only. To persist, add the line to your `Dockerfile` or `~/.bashrc`.
If this doesn't fix the issue, try [reinstalling venv](#dependency-issue).
### Dependency Issue
Errors:
- `ModuleNotFoundError: No module named `
Reinstall venv:
```shell
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
```
If this doesn't fix the issue, try [reinstalling uv](#python-issue).
### Python Issue
Errors:
- `fatal error: Python.h: No such file or directory`
Reinstall uv and venv:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python install --reinstall
rm -rf .venv
uv sync --all-extras --group=cu130-train --reinstall
source .venv/bin/activate && export LD_LIBRARY_PATH=
```
### CUDA Issue
- `OSError: : cannot open shared object file: No such file or directory`
Ensure you have [CUDA installed](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu). The major version must match between the system and virtual environment CUDA versions.
```shell
sudo apt-get install -y --no-install-recommends cuda-toolkit-
```
Alternatively, use the [Docker container](#docker-container).