
PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu¹·²   Wei Xiong¹†   Weili Nie¹   Yichen Sheng¹   Shiqiu Liu¹   Jiebo Luo²

¹NVIDIA   ²University of Rochester

†Project lead and main advising


PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer that eliminates the VAE entirely. It uses a dual-level architecture, pairing a patch-level DiT for global semantics with a pixel-level DiT for texture details, to generate images directly in pixel space.

  • 1.61 FID on ImageNet 256×256
  • 0.74 GenEval / 83.5 DPG-Bench on text-to-image at 1024×1024
  • No VAE, no latent space
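Because there is no VAE encode/decode step, the patch-level tokens come directly from splitting the image into pixel patches and the output is reassembled the same way. Below is a minimal grayscale patchify/unpatchify sketch for intuition only; it is not the released implementation, and the function names are ours:

```python
def patchify(img, p):
    """Split an H×W image (list of rows) into p*p-dim patch tokens,
    row-major over patches. Grayscale for simplicity."""
    H, W = len(img), len(img[0])
    tokens = []
    for bi in range(H // p):
        for bj in range(W // p):
            tok = [img[bi * p + i][bj * p + j]
                   for i in range(p) for j in range(p)]
            tokens.append(tok)
    return tokens

def unpatchify(tokens, H, W, p):
    """Inverse of patchify: rebuild the H×W image from patch tokens."""
    img = [[0] * W for _ in range(H)]
    for k, tok in enumerate(tokens):
        bi, bj = divmod(k, W // p)
        for i in range(p):
            for j in range(p):
                img[bi * p + i][bj * p + j] = tok[i * p + j]
    return img

img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)               # 4 tokens of length 4
assert tokens[0] == [0, 1, 4, 5]        # top-left 2×2 patch
assert unpatchify(tokens, 4, 4, 2) == img  # lossless round trip
```

The round trip is lossless by construction, which is the point of the pixel-space design: no reconstruction error is introduced by an autoencoder.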

🔥 News

  • [2025/11] Paper, training & inference code, and pre-trained models are released.

Performance

ImageNet 256×256 (PixelDiT-XL, 797M params)

All evaluations use FlowDPMSolver with 100 steps and 50K generated samples; metrics follow the ADM evaluation protocol.

| Epoch | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|-------|-------|-----------|-------|---------|------------|--------------|
| 80    | 2.36  | 3.25      | 100   | FlowDPMSolver | 1.0  | [0.1, 1.0] |
| 160   | 1.97  | 3.25      | 100   | FlowDPMSolver | 1.0  | [0.1, 1.0] |
| 320   | 1.61  | 2.75      | 100   | FlowDPMSolver | 1.0  | [0.1, 0.9] |
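The CFG Interval column means classifier-free guidance is applied only for normalized timesteps inside the given range. A minimal sketch of that logic, where the function and argument names are illustrative and not the repository's API:

```python
def guided_velocity(v_uncond, v_cond, t, cfg_scale, interval=(0.1, 1.0)):
    """Apply classifier-free guidance only when the normalized timestep t
    lies inside `interval`; otherwise fall back to the conditional
    prediction. Hypothetical sketch, not PixelDiT's actual code."""
    lo, hi = interval
    if lo <= t <= hi:
        return v_uncond + cfg_scale * (v_cond - v_uncond)
    return v_cond

# Inside the interval, guidance amplifies the conditional direction:
assert guided_velocity(0.0, 1.0, t=0.5, cfg_scale=2.75) == 2.75
# Outside it (e.g. t=0.95 with interval [0.1, 0.9]), plain conditional:
assert guided_velocity(0.0, 1.0, t=0.95, cfg_scale=2.75,
                       interval=(0.1, 0.9)) == 1.0
```

Restricting guidance to an interval is a common trick for avoiding over-saturation at the extremes of the sampling trajectory.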

ImageNet 512×512 (PixelDiT-XL, 797M params)

| Resolution | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|------------|-------|-----------|-------|---------|------------|--------------|
| 512×512    | 1.81  | 3.5       | 100   | FlowDPMSolver | 2.0  | [0.1, 1.0] |
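The Time Shift column (2.0 at 512×512 vs. 1.0 at 256×256) reweights sampling timesteps toward the noisier end at higher resolution. One common parameterization used by rectified-flow samplers is sketched below; the function name, and the assumption that PixelDiT uses this exact formula, are ours and not confirmed by the source:

```python
def shift_timestep(t, shift):
    """Map a uniform timestep t in [0, 1] through a resolution-dependent
    shift, concentrating steps at high noise for shift > 1.
    Illustrative formula; PixelDiT's implementation may differ."""
    return shift * t / (1.0 + (shift - 1.0) * t)

assert shift_timestep(0.0, 2.0) == 0.0          # endpoints are preserved
assert shift_timestep(1.0, 2.0) == 1.0
assert abs(shift_timestep(0.5, 2.0) - 2 / 3) < 1e-9  # midpoint pushed up
```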

Text-to-Image (PixelDiT-T2I, 1.3B params)

| Resolution | GenEval↑ | DPG-Bench↑ |
|------------|----------|------------|
| 512×512    | 0.78     | 83.7       |
| 1024×1024  | 0.74     | 83.5       |

Getting Started

Docker image (recommended): `nvcr.io/nvidia/pytorch:24.09-py3`

```bash
pip install -r requirements.txt
```

Tasks

Note: Our training jobs are restarted from a checkpoint every 4 hours, using the restart timestamp as the random seed each time. The final training result may therefore differ slightly from a single uninterrupted run.

Class-to-Image Generation (ImageNet)

Training and evaluation instructions for class-conditioned generation on ImageNet 256×256 and 512×512.

c2i/README.md

Text-to-Image Generation

Multi-stage training (512px → 1024px) and inference for text-to-image generation.

t2i/README.md

Repository Structure

```
├── pixdit_core/      # Shared PixelDiT model definitions (c2i & t2i)
├── tools/            # Shared utilities (checkpoint download, GFLOPs computation)
├── c2i/              # Class-to-image
└── t2i/              # Text-to-image
```

Compute GFLOPs

Measure single-forward-pass GFLOPs for any PixelDiT model (run from project root):

```bash
# C2I (ImageNet 256x256, default resolution)
python tools/compute_flops.py --config c2i/configs/pix256_xl.yaml

# T2I at 1024x1024
python tools/compute_flops.py --config t2i/configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml --height 1024 --width 1024
```

Citation

If you find this work useful, please cite:

```bibtex
@inproceedings{yu2025pixeldit,
  title     = {PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author    = {Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}
```

About

[CVPR 2026 Oral] Pixel Diffusion Transformers for Image Generation
