# Setup Guide > **Skill:** `.agents/skills/cosmos3-setup/SKILL.md` ______________________________________________________________________ **Table of Contents** - [System Requirements](#system-requirements) - [Recommended Base Image](#recommended-base-image) - [Installation](#installation) - [Quickstart: From the Recommended Base Image](#quickstart-from-the-recommended-base-image) - [Docker Container](#docker-container) - [Advanced](#advanced) - [Virtual Environment](#virtual-environment) - [CUDA Variants](#cuda-variants) - [Environment Variables](#environment-variables) - [Downloading Base Checkpoints](#downloading-base-checkpoints) - [Troubleshooting](#troubleshooting) - [PyTorch Import Issue](#pytorch-import-issue) - [Dependency Issue](#dependency-issue) - [Python Issue](#python-issue) - [CUDA Issue](#cuda-issue) ______________________________________________________________________ ## System Requirements - NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer — Hopper (H100) or Blackwell (B200) recommended for full training throughput - NVIDIA driver compatible with [CUDA version](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) - [NVIDIA CUDA >=12.8](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu) - Linux x86-64/aarch64 - glibc >=2.35 (e.g. Ubuntu >=22.04) - Python >=3.10 - Multi-node training additionally requires a working NCCL setup (IB/RoCE recommended) and a shared filesystem visible to all ranks for checkpoint I/O - Free disk: ~150 GiB recommended for a first-run inference or training workflow (Hugging Face cache ~90 GiB, uv cache ~20 GiB, run outputs ~30 GiB). See [FAQ → Expected disk footprint](./faq.md#q-how-much-disk-space-do-i-need) for the breakdown and how to relocate caches.
Recommended Base Image ### Recommended Base Image For CUDA 13 builds, the [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is the recommended starting point — it bundles PyTorch + CUDA 13 + cuDNN + NCCL tuned for NVIDIA hardware, plus Apex, TransformerEngine, and Megatron utilities that training infra users commonly need. ```dockerfile FROM nvcr.io/nvidia/pytorch:25.09-py3 ``` For CUDA 12.8 builds, pin to an earlier NGC tag (e.g. `nvcr.io/nvidia/pytorch:25.06-py3`) that still ships CUDA 12.
## Installation If you encounter issues, see [Troubleshooting](#troubleshooting). Clone the repository: ```bash git clone git@github.com:NVIDIA/cosmos-framework.git cd cosmos-framework ``` The two supported install paths are the recommended base image and the Docker container. For other paths (standalone venv, custom torch/cuda) see [Advanced](#advanced).
Quickstart: From the Recommended Base Image ### Quickstart: From the Recommended Base Image If you started from the [recommended base image](#recommended-base-image) (`nvcr.io/nvidia/pytorch:25.09-py3`), the following commands set up the full environment in one go. Run them **from the root of this repository** (i.e. inside the `Cosmos/` directory you just cloned): ```shell apt-get update apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env # CUDA 13.0 (recommended); for CUDA 12.8 use `--group=cu128-train` uv sync --all-extras --group=cu130-train source .venv/bin/activate && export LD_LIBRARY_PATH= ```
Docker Container ### Docker Container Please make sure you have access to Docker on your machine and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) is installed. Build the container: ```bash image_tag=$(docker build -q .) ``` Run the container: ```bash docker run -it --runtime=nvidia --ipc=host --rm \ -v .:/workspace -v /workspace/.venv \ -v /root/.cache:/root/.cache \ -e HF_TOKEN="$HF_TOKEN" \ $image_tag ``` For multi-node training, also bind-mount your shared dataset and checkpoint directories so all ranks see the same filesystem. Optional arguments: - `--ipc=host`: Use host system's shared memory, since parallel torchrun consumes a large amount of shared memory. If not allowed by security policy, increase `--shm-size` ([documentation](https://docs.docker.com/engine/containers/run/#runtime-constraints-on-resources)). - `-v /root/.cache:/root/.cache`: Mount host cache to avoid re-downloading cache entries. If you get `docker: Error response from daemon: unknown or invalid runtime name: nvidia`, you need to [configure docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker): ```shell sudo nvidia-ctk runtime configure --runtime=docker sudo systemctl restart docker ``` See [docker/README.md](../docker/README.md) for additional images and build options.
Advanced ### Advanced Use these paths only when the recommended base image or Docker container are not viable for your environment.
Virtual Environment #### Virtual Environment Install system dependencies: ```shell sudo apt-get install -y --no-install-recommends curl ffmpeg git-lfs libx11-dev tree wget ``` Install [uv](https://docs.astral.sh/uv/getting-started/installation/): ```shell curl -LsSf https://astral.sh/uv/install.sh | sh source $HOME/.local/bin/env ``` Install the package using one of the following methods:
UV Sync: fully reproducible environment Choose the dependency group that matches your CUDA toolkit (see [CUDA Variants](#cuda-variants)): ```shell # CUDA 13.0 (recommended) uv sync --all-extras --group=cu130-train # Or, for CUDA 12.8: # uv sync --all-extras --group=cu128-train source .venv/bin/activate && export LD_LIBRARY_PATH= ```
UV Pip: virtual environment ```shell # Create virtual environment (skip if using an existing environment) uv venv --clear && source .venv/bin/activate && export LD_LIBRARY_PATH= uv pip install -r pyproject.toml --all-extras --group=cu130-train uv pip install -e . ```
UV Pip: system environment ```shell uv pip install --system --break-system-packages -r pyproject.toml --all-extras --group=cu130-train ```
Custom torch/cuda versions ```shell cuda_name=cu130 torch_name=torch210 # 1. Create and activate the virtual environment uv venv --clear && source .venv/bin/activate # 2. Install the desired torch/cuda versions uv pip install "torch==2.10.0" "torchvision" --torch-backend=$cuda_name # 3. Install the package with desired extras uv pip install -r pyproject.toml --all-extras --group=cu130-train # 4. Install one of the following attention backends: # * Blackwell uv pip install "natten==0.21.6.dev6+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/natten # * Hopper uv pip install "flash-attn-3-nv==1.0.3+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn-3-nv # * Ada/Ampere uv pip install "flash-attn==2.7.4.post1+$cuda_name.$torch_name" -f https://nvidia-cosmos.github.io/cosmos-dependencies/v1.5.0/flash-attn ``` If there is no attention backend wheel for your torch/cuda versions, you can build one using [cosmos-dependencies](https://github.com/nvidia-cosmos/cosmos-dependencies).
Optional package extras: - `train`: Training infrastructure (FSDP, parallelism, checkpointing, datasets) - `eval`: Evaluation harnesses for trained checkpoints #### CUDA Variants This repository is **training-focused**, so the `*-train` dependency groups are the supported install path. Inference-only groups exist for evaluating trained checkpoints in-tree but are not required for training. | CUDA Version | Training (recommended) | Notes | | --------------------------- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | | **CUDA 13.0 (recommended)** | `--group=cu130-train` | [NVIDIA Driver](https://docs.nvidia.com/cuda/archive/13.0.0/cuda-toolkit-release-notes/index.html#cuda-toolkit-major-component-versions) | | CUDA 12.8 | `--group=cu128-train` | [NVIDIA Driver](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html#cuda-toolkit-major-component-versions) |
## Environment Variables Export the following before downloading checkpoints or launching training. See [environment_variables.md](environment_variables.md) for the full reference. | Variable | Purpose | | ------------------------ | ------------------------------------------------------------------------------------------------------ | | `HF_TOKEN` | Hugging Face access token for gated model/dataset downloads. Alternative to `uvx hf auth login`. | | `HF_HOME` | Cache directory for Hugging Face models and datasets. Recommend ≥ 1 TB free. | | `IMAGINAIRE_OUTPUT_ROOT` | Output root for training DCP checkpoints and logs. Recommend ≥ 1 TB free. | | `UV_CACHE_DIR` | Cache directory for `uv`-managed dependencies. | | `LD_LIBRARY_PATH=` | Clear (set to empty) after sourcing the venv to avoid host library bleed-through into PyTorch imports. | ## Downloading Base Checkpoints Training in this repo typically starts from a pretrained base checkpoint that you fine-tune or post-train. The recommended source is the Hugging Face Hub. 1. Get a [Hugging Face Access Token](https://huggingface.co/settings/tokens) with `Read` permission. 2. Authenticate using **either** mechanism (they are equivalent — pick one, do not set both with different tokens): - **`HF_TOKEN` environment variable** — preferred for Docker and non-interactive shells. Export it once and any `huggingface_hub` call (CLI or library) picks it up. - **`uvx hf auth login`** — preferred for local interactive use. Writes the token to `~/.cache/huggingface/token`, persisted across sessions (and across Docker runs if you bind-mount `/root/.cache`). 3. Accept the license for any gated model you intend to use (e.g. the [NVIDIA Open Model License Agreement](https://huggingface.co/nvidia/Cosmos-Guardrail1) where applicable). 4. Test access: ```shell uvx hf@latest download --repo-type model nvidia/Cosmos-Guardrail1 \ --revision d6d4bfa899a71454a700907664f3e88f503950cf --include "README.md" ``` If you encounter issues: 1. Check that you don't have conflicting [environment variables](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables) — e.g. an `HF_TOKEN` set to a different token than the one cached by `hf auth login`: `printenv | grep HF_`. 2. Check that your [token](https://huggingface.co/settings/tokens) has sufficient permissions. Checkpoints are downloaded on demand during training and evaluation. To change the cache location, set [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome). See [training.md](training.md) for [DCP conversion](training.md#step-2--prepare-checkpoint) and [Hugging Face safetensors export](training.md#export-checkpoint-to-hugging-face-safetensors). ## Troubleshooting ### PyTorch Import Issue Errors: - `ImportError: cannot import name '_functionalization' from 'torch._C'` Clear the library path in your current shell: ```shell export LD_LIBRARY_PATH= ``` This applies to the current session only. To persist, add the line to your `Dockerfile` or `~/.bashrc`. If this doesn't fix the issue, try [reinstalling venv](#dependency-issue). ### Dependency Issue Errors: - `ModuleNotFoundError: No module named ` Reinstall venv: ```shell uv sync --all-extras --group=cu130-train --reinstall source .venv/bin/activate && export LD_LIBRARY_PATH= ``` If this doesn't fix the issue, try [reinstalling uv](#python-issue). ### Python Issue Errors: - `fatal error: Python.h: No such file or directory` Reinstall uv and venv: ```shell curl -LsSf https://astral.sh/uv/install.sh | sh uv python install --reinstall rm -rf .venv uv sync --all-extras --group=cu130-train --reinstall source .venv/bin/activate && export LD_LIBRARY_PATH= ``` ### CUDA Issue - `OSError: : cannot open shared object file: No such file or directory` Ensure you have [CUDA installed](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#ubuntu). The major version must match between the system and virtual environment CUDA versions. ```shell sudo apt-get install -y --no-install-recommends cuda-toolkit- ``` Alternatively, use the [Docker container](#docker-container).