
UniCorn

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Ruiyan Han*, Zhen Fang*, Xinyu Sun*, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, and Feng Zhao

Paper Benchmark Hugging Face Collection Project Page

contact: fazii@mail.ustc.edu.cn

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
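The Text-to-Image-to-Text loop behind UniCycle can be sketched schematically. Every function below is an illustrative toy stand-in (a real setup would use an actual T2I model, a captioner, and a learned text-similarity metric), not UniCycle's implementation:

```python
# Schematic sketch of a Text -> Image -> Text cycle-consistency check.
# `toy_t2i`, `toy_i2t`, and `token_overlap` are illustrative stand-ins,
# not UniCorn's actual models or metric.

def toy_t2i(prompt: str) -> str:
    """Stand-in text-to-image model: returns an 'image' tagged with the prompt."""
    return f"<image:{prompt}>"

def toy_i2t(image: str) -> str:
    """Stand-in captioner: recovers the prompt embedded in the toy image."""
    return image.removeprefix("<image:").removesuffix(">")

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between token sets, as a toy text-similarity score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cycle_consistency(prompt: str, t2i=toy_t2i, i2t=toy_i2t, sim=token_overlap) -> float:
    """Score how well the original prompt survives a T2I -> I2T round trip."""
    return sim(prompt, i2t(t2i(prompt)))
```

With real models plugged in for `t2i`, `i2t`, and `sim`, a low score indicates that information understood from the text was lost during generation.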

📢 News

We sincerely thank all contributors from the open community for their valuable support.

  • Apr. 12, 2026: We fully released our code.
  • Jan. 12, 2026: We released our checkpoint. Welcome to download and try it!
  • Jan. 7, 2026: We released the official report for UniCorn.

📝 To-Do List

This list tracks the progress of our open-source development and model optimization:

  • Release the code.
  • Release the checkpoint.

We appreciate the support from our contributors and the open-source community.

📮 Notice

Following BAGEL's original settings, pay attention to the following:

About Inference Hyperparameters:

  • cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0–8.0.
  • cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0–2.0.
  • cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
  • timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
  • num_timesteps: Total denoising steps. Typical: 50.
  • cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
  • cfg_renorm_type: CFG-Renorm method:
    • global: Normalize over all tokens and channels (default for T2I).
    • channel: Normalize across channels for each token.
    • text_channel: Like channel, but only applies to text condition (good for editing, may cause blur).
  • If edited images appear blurry, try global CFG-Renorm, or decrease cfg_renorm_min or cfg_scale.
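For quick reference, the hyperparameters above can be collected into a plain Python dict. The values shown are the typical values quoted above where given; `timestep_shift` is an arbitrary placeholder since no typical value is stated, and the dict itself is an illustrative sketch rather than BAGEL's or UniCorn's actual inference API:

```python
# Illustrative summary of the inference hyperparameters described above.
# This is a plain-dict sketch, not the actual BAGEL/UniCorn inference API.
t2i_inference_hyper = {
    "cfg_text_scale": 4.0,       # text guidance strength; 1.0 disables, typical 4.0-8.0
    "cfg_image_scale": 1.5,      # input-image preservation; 1.0 disables, typical 1.0-2.0
    "cfg_interval": [0.4, 1.0],  # fraction of denoising steps where CFG is applied
    "timestep_shift": 3.0,       # placeholder; higher favors layout, lower favors details
    "num_timesteps": 50,         # total denoising steps
    "cfg_renorm_min": 0.0,       # CFG-Renorm floor; 1.0 disables renorm
    "cfg_renorm_type": "global", # "global" (T2I default), "channel", or "text_channel"
}
```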

📊 Benchmarks

UniCorn

🎨 Visualization

UniCorn

🔗 QuickStart

Environment

Please refer to the official instructions of BAGEL.

Data Generation

We provide our code in data_generation, including prompt generation, image generation, reward generation and cognitive pattern reconstruction.
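The stages above compose into a Proposer/Solver/Judge self-play loop, which can be sketched as follows. Every function here is a toy stand-in for the corresponding stage in data_generation, not the actual implementation:

```python
# Toy sketch of the Proposer -> Solver -> Judge self-play loop.
# All three role functions are illustrative stand-ins for the real
# prompt-generation, image-generation, and reward-generation stages.

def proposer(seed: str) -> str:
    """Stand-in prompt generation: expands a seed concept into a T2I prompt."""
    return f"a detailed photo of {seed}"

def solver(prompt: str) -> str:
    """Stand-in image generation: returns a placeholder 'image' for the prompt."""
    return f"<image:{prompt}>"

def judge(prompt: str, image: str) -> float:
    """Stand-in reward generation: scores prompt-image alignment in [0, 1]."""
    return 1.0 if prompt in image else 0.0

def self_play(seeds, reward_threshold=0.5):
    """Keep only (prompt, image, reward) triples whose reward clears the bar."""
    dataset = []
    for seed in seeds:
        prompt = proposer(seed)
        image = solver(prompt)
        reward = judge(prompt, image)
        if reward >= reward_threshold:
            dataset.append((prompt, image, reward))
    return dataset
```

In the real pipeline, the retained triples then pass through cognitive pattern reconstruction before being used as training supervision.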

Training

The training script is provided in scripts/train/train_bagel_reward_self_mix.sh; the corresponding config file is provided in data/configs/self_mix_data_5k_gen.yaml.

🙏 Acknowledgments

This project is built upon several excellent open-source projects: BAGEL, IRG and SRUM. We sincerely thank the authors for their contributions. We are grateful to the broader research community for their open-source spirit and collaborative efforts.

✍️ Citation

@article{han2026unicorn,
  title={UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision},
  author={Han, Ruiyan and Fang, Zhen and Sun, XinYu and Ma, Yuchen and Wang, Ziheng and Zeng, Yu and Chen, Zehui and Chen, Lin and Huang, Wenxuan and Xu, Wei-Jie and others},
  journal={arXiv preprint arXiv:2601.03193},
  year={2026}
}

📜 License

UniCorn is licensed under the Apache License 2.0.
