
PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu¹·²   Wei Xiong¹†   Weili Nie¹   Yichen Sheng¹   Shiqiu Liu¹   Jiebo Luo²

¹NVIDIA   ²University of Rochester

†Project lead and main advising


PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer that eliminates the VAE entirely. It uses a dual-level architecture, pairing a patch-level DiT for global semantics with a pixel-level DiT for texture details, to generate images directly in pixel space.

  • 1.61 FID on ImageNet 256×256
  • 0.74 GenEval / 83.5 DPG-Bench on text-to-image at 1024×1024
  • No VAE, no latent space
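Because there is no VAE encode/decode step, the patch-level tokens come directly from splitting the image into pixel patches and the output is reassembled the same way. Below is a minimal grayscale patchify/unpatchify sketch for intuition only; it is not the released implementation, and the function names are ours:

```python
def patchify(img, p):
    """Split an H×W image (list of rows) into p*p-dim patch tokens,
    row-major over patches. Grayscale for simplicity."""
    H, W = len(img), len(img[0])
    tokens = []
    for bi in range(H // p):
        for bj in range(W // p):
            tok = [img[bi * p + i][bj * p + j]
                   for i in range(p) for j in range(p)]
            tokens.append(tok)
    return tokens

def unpatchify(tokens, H, W, p):
    """Inverse of patchify: rebuild the H×W image from patch tokens."""
    img = [[0] * W for _ in range(H)]
    for k, tok in enumerate(tokens):
        bi, bj = divmod(k, W // p)
        for i in range(p):
            for j in range(p):
                img[bi * p + i][bj * p + j] = tok[i * p + j]
    return img

img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)               # 4 tokens of length 4
assert tokens[0] == [0, 1, 4, 5]        # top-left 2×2 patch
assert unpatchify(tokens, 4, 4, 2) == img  # lossless round trip
```

The round trip is lossless by construction, which is the point of the pixel-space design: no reconstruction error is introduced by an autoencoder.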

🔥 News

  • [2025/11] Paper, training & inference code, and pre-trained models are released.

Performance

ImageNet 256×256 (PixelDiT-XL, 797M params)

All evaluations use FlowDPMSolver with 100 steps and 50K generated samples; metrics follow the ADM evaluation protocol.

| Epoch | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|-------|-------|-----------|-------|---------|------------|--------------|
| 80    | 2.36  | 3.25      | 100   | FlowDPMSolver | 1.0  | [0.1, 1.0] |
| 160   | 1.97  | 3.25      | 100   | FlowDPMSolver | 1.0  | [0.1, 1.0] |
| 320   | 1.61  | 2.75      | 100   | FlowDPMSolver | 1.0  | [0.1, 0.9] |
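The CFG Interval column means classifier-free guidance is applied only for normalized timesteps inside the given range. A minimal sketch of that logic, where the function and argument names are illustrative and not the repository's API:

```python
def guided_velocity(v_uncond, v_cond, t, cfg_scale, interval=(0.1, 1.0)):
    """Apply classifier-free guidance only when the normalized timestep t
    lies inside `interval`; otherwise fall back to the conditional
    prediction. Hypothetical sketch, not PixelDiT's actual code."""
    lo, hi = interval
    if lo <= t <= hi:
        return v_uncond + cfg_scale * (v_cond - v_uncond)
    return v_cond

# Inside the interval, guidance amplifies the conditional direction:
assert guided_velocity(0.0, 1.0, t=0.5, cfg_scale=2.75) == 2.75
# Outside it (e.g. t=0.95 with interval [0.1, 0.9]), plain conditional:
assert guided_velocity(0.0, 1.0, t=0.95, cfg_scale=2.75,
                       interval=(0.1, 0.9)) == 1.0
```

Restricting guidance to an interval is a common trick for avoiding over-saturation at the extremes of the sampling trajectory.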

ImageNet 512×512 (PixelDiT-XL, 797M params)

| Resolution | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|------------|-------|-----------|-------|---------|------------|--------------|
| 512×512    | 1.81  | 3.5       | 100   | FlowDPMSolver | 2.0  | [0.1, 1.0] |
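The Time Shift column (2.0 at 512×512 vs. 1.0 at 256×256) reweights sampling timesteps toward the noisier end at higher resolution. One common parameterization used by rectified-flow samplers is sketched below; the function name, and the assumption that PixelDiT uses this exact formula, are ours and not confirmed by the source:

```python
def shift_timestep(t, shift):
    """Map a uniform timestep t in [0, 1] through a resolution-dependent
    shift, concentrating steps at high noise for shift > 1.
    Illustrative formula; PixelDiT's implementation may differ."""
    return shift * t / (1.0 + (shift - 1.0) * t)

assert shift_timestep(0.0, 2.0) == 0.0          # endpoints are preserved
assert shift_timestep(1.0, 2.0) == 1.0
assert abs(shift_timestep(0.5, 2.0) - 2 / 3) < 1e-9  # midpoint pushed up
```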

Text-to-Image (PixelDiT-T2I, 1.3B params)

| Resolution | GenEval↑ | DPG-Bench↑ |
|------------|----------|------------|
| 512×512    | 0.78     | 83.7       |
| 1024×1024  | 0.74     | 83.5       |

Getting Started

Docker image (recommended): `nvcr.io/nvidia/pytorch:24.09-py3`

```bash
pip install -r requirements.txt
```

Tasks

Note: Our training jobs are restarted from a checkpoint every 4 hours, using the restart timestamp as the random seed each time. The final training result may therefore differ slightly from a single uninterrupted run.

Class-to-Image Generation (ImageNet)

Training and evaluation instructions for class-conditioned generation on ImageNet 256×256 and 512×512.

c2i/README.md

Text-to-Image Generation

Multi-stage training (512px → 1024px) and inference for text-to-image generation.

t2i/README.md

Repository Structure

```
├── pixdit_core/      # Shared PixelDiT model definitions (c2i & t2i)
├── tools/            # Shared utilities (checkpoint download, GFLOPs computation)
├── c2i/              # Class-to-image
└── t2i/              # Text-to-image
```

Compute GFLOPs

Measure single-forward-pass GFLOPs for any PixelDiT model (run from project root):

```bash
# C2I (ImageNet 256x256, default resolution)
python tools/compute_flops.py --config c2i/configs/pix256_xl.yaml

# T2I at 1024x1024
python tools/compute_flops.py --config t2i/configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml --height 1024 --width 1024
```

Citation

If you find this work useful, please cite:

```bibtex
@inproceedings{yu2025pixeldit,
  title     = {PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author    = {Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}
```

About

[CVPR 2026 Oral] Pixel Diffusion Transformers for Image Generation
