Consistency in Diffusion-Based Visual Generation: A Survey

University of Science and Technology of China · Tsinghua University · Huazhong University of Science and Technology

Overview · Taxonomy · Evaluation & Optimization · Resources · Data Files · Contribute · Citation

Overview

This repository accompanies the survey:

Consistency in Diffusion-Based Visual Generation: A Survey
Song Yan, Wei Zhai, Chenfeng Wang, Ruixuan Li, Zhangping Yang, Yancheng Cai, Tao Zhang, Ling Wang, Yunwei Lan, Yujie He, Yang Cao, Min Li, and Zheng-Jun Zha.
Preprints, 2026 · Paper · DOI

Diffusion models now support text-to-image synthesis, editing, personalization, video generation, and 3D-aware content creation. Visual fidelity alone, however, does not guarantee that an output follows its prompt, preserves identity, remains coherent over time or viewpoint, or satisfies safety and physical-plausibility requirements.

The survey organizes these failures through a single question:

What should a generated visual output agree with?

Its answer is a relation-based taxonomy of external, internal, and normative consistency. The repository turns this taxonomy into a navigable literature map covering representative methods, benchmarks, evaluators, datasets, and diagnostic resources.

Key contributions

Relation-based taxonomy — organizes consistency according to the target of agreement rather than only by task or modality.
Evaluation protocol abstraction — separates observation units, agreement targets, evidence sources, and evaluation outputs.
Optimization-locus analysis — compares where consistency is imposed: before sampling, at the condition interface, during denoising, across coupled outputs, or after generation.
Machine-readable resource map — provides structured CSV and BibTeX files for maintenance, comparison, and downstream analysis.
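As a rough illustration of the downstream analysis that a machine-readable resource map allows, the sketch below counts resources per consistency relation from CSV text. The column names (`title`, `relation`, `resource_type`, `venue_year`) and the three sample rows are assumptions made for this example; the repository's actual CSV schema may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout: these column names are illustrative assumptions,
# not the repository's actual schema. The rows reuse entries listed in this README.
SAMPLE_CSV = """title,relation,resource_type,venue_year
GLIDE,external,method,arXiv 2022
BoxDiff,external,method,ICCV 2023
Attend-and-Excite,external,method,SIGGRAPH 2023
"""


def count_by(rows, key):
    """Count rows by the value stored under the given column name."""
    return Counter(row[key] for row in rows)


# DictReader maps each data row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
relation_counts = count_by(rows, "relation")
type_counts = count_by(rows, "resource_type")
print(relation_counts, type_counts)
```

To run the same counts over a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names to match.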

Taxonomy

_{Figure 1. Three consistency relations in diffusion-based visual generation.}

Relation	Agreement target	Representative failures	Typical settings	Resources
External consistency	Prompts, references, layouts, masks, poses, controls, and editing instructions	Prompt omission, attribute-binding error, counting error, control mismatch, over-editing	Text-to-image generation, structural control, editing, inpainting, virtual try-on, typography	125
Internal consistency	Generated subjects, views, frames, shots, instances, and story states	Identity drift, view inconsistency, temporal flicker, state forgetting, narrative discontinuity	Personalization, multi-view/3D generation, video generation, story visualization	123
Normative consistency	Preference, safety, fairness, physical plausibility, commonsense, and causal/world-state criteria	Low preference, unsafe output, benign-capability loss, physical violation, causal failure	Preference optimization, safety editing, concept erasure, physical and world-model evaluation	107

The categories are conceptually distinct but practically entangled. A method may address several relations simultaneously; the repository places it according to its primary agreement target while retaining its broader diagnostic role in the description and coverage files.

Evaluation and optimization


Evaluation view. A consistency claim should specify the observation unit, agreement target, evidence source, and evaluation output. This prevents sequence-level claims from being supported only by frame-level evidence, or specific alignment claims from being reduced to broad preference scores.

Optimization view. Consistency can be imposed before sampling, at the condition interface, during denoising, across coupled outputs, or after generation. Each locus creates different trade-offs among persistence, controllability, realism, diversity, memory cost, and modularity.
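One way to make the evaluation view concrete is to record each claim as a small structured object and flag cases where the evidence is finer-grained than the unit being judged. The sketch below is illustrative only: the class, the `evidence_level` field, and the three granularity names are assumptions introduced here, not definitions from the survey.

```python
from dataclasses import dataclass

# Granularity levels ordered from finest to coarsest.
# These names are illustrative assumptions, not terms fixed by the survey.
LEVELS = ["frame", "clip", "sequence"]


@dataclass(frozen=True)
class ConsistencyClaim:
    observation_unit: str   # what is being judged, e.g. "sequence"
    agreement_target: str   # what it should agree with, e.g. "subject identity"
    evidence_source: str    # where the evidence comes from, e.g. a learned scorer
    evaluation_output: str  # form of the result, e.g. "score" or "pass/fail"
    evidence_level: str     # hypothetical extra field: granularity of the evidence


def is_under_supported(claim: ConsistencyClaim) -> bool:
    """Return True when the evidence is finer-grained than the observation unit."""
    return LEVELS.index(claim.evidence_level) < LEVELS.index(claim.observation_unit)


claim = ConsistencyClaim(
    observation_unit="sequence",
    agreement_target="subject identity",
    evidence_source="per-frame face embedding similarity",
    evaluation_output="score",
    evidence_level="frame",
)
print(is_under_supported(claim))
```

Here a sequence-level identity claim backed only by frame-level similarity is flagged, which mirrors the mismatch described in the evaluation view.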

Resource collection

The collection is designed as a topic-oriented literature map, not a flat bibliography. Resources are first grouped by their primary consistency relation and then divided into focused research themes. All lists are fully expanded to support browser search, direct linking, and rapid visual scanning.

Consistency relation	Methods	Benchmarks & Evaluators	Datasets & Data	Total	Browse
External consistency	85	20	20	125	Open section
Internal consistency	83	20	20	123	Open section
Normative consistency	57	30	20	107	Open section
Collection	225	70	60	355	—
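The totals in the table can be checked mechanically. The short sketch below re-enters the numbers from the table and verifies that each row sums to its total and that the columns sum to the collection row; the variable names are chosen for this example only.

```python
# Counts copied from the table above: (methods, benchmarks & evaluators, datasets, total).
counts = {
    "External consistency": (85, 20, 20, 125),
    "Internal consistency": (83, 20, 20, 123),
    "Normative consistency": (57, 30, 20, 107),
}
collection = (225, 70, 60, 355)

# Each row's three resource-type counts should add up to its stated total.
for name, (methods, benchmarks, datasets, total) in counts.items():
    assert methods + benchmarks + datasets == total, name

# Summing each column across relations should reproduce the collection row.
column_sums = tuple(sum(column) for column in zip(*counts.values()))
assert column_sums == collection
print(column_sums)
```

A check of this kind could be run whenever entries are added, so the summary table does not drift from the lists.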

Entry format

Each resource is presented with a prominent title, compact venue/year metadata, and a separate one-line description:

Resource title _{venue / year}
The consistency issue, mechanism, or diagnostic role addressed by the resource.

[!NOTE] Recent papers may temporarily be labeled arXiv, project, or venue TBD until stable proceedings metadata becomes available. Official paper repositories and project pages are preferred over unofficial reimplementations.
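For readers who want to process the lists programmatically, the following sketch parses entries in the two-line format shown above. It assumes the plain-text rendering used on this page, where the venue/year metadata appears as `_{...}` after the title; the regular expression and function name are illustrative, and the source Markdown may mark up entries differently.

```python
import re

# Matches a title followed by "_{venue / year}" metadata at the end of the line.
ENTRY_HEADER = re.compile(r"^(?P<title>.+?)\s+_\{(?P<venue>[^}]*)\}$")


def parse_entries(text):
    """Pair each header line with the description line that follows it."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    entries = []
    i = 0
    while i < len(lines):
        match = ENTRY_HEADER.match(lines[i])
        if match and i + 1 < len(lines):
            entries.append({
                "title": match["title"],
                "venue": match["venue"],
                "description": lines[i + 1],
            })
            i += 2  # skip the description line that was just consumed
        else:
            i += 1  # not an entry header; move on
    return entries


sample = """GLIDE _{arXiv 2022}
Early text-guided diffusion model supporting prompt-conditioned generation and editing.
Imagen _{NeurIPS 2022 / arXiv}
High-fidelity text-to-image diffusion model emphasizing language understanding.
"""
entries = parse_entries(sample)
print(entries)
```

The sketch does not handle an entry whose description line is missing; such a header would absorb the next line as its description.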

01 · External consistency

Agreement target — Agreement with externally specified conditions.
Scope — Prompts, layouts, boxes, masks, depth maps, poses, reference images, editing instructions, and other user- or task-provided controls.

Resource type	Description	Jump
Methods	Architectures, objectives, inference procedures, and intervention mechanisms.	Browse 85
Benchmarks & Evaluators	Test suites, metrics, learned scorers, and evaluation protocols.	Browse 20
Datasets & Data Resources	Training corpora, annotations, prompt sets, and diagnostic data.	Browse 20

Methods

_{85 resources organized into 6 focused topics.}

Topic	Coverage
Prompt following & compositional generation	20
Spatial grounding & structural control	24
Guidance, inversion & image editing	26
Typography & visual text	5
Virtual try-on & dressing	7
Posters & graphic design	3

Prompt following & compositional generation ²⁰

_{Foundational text-conditioned models, semantic binding, prompt planning, and prompt refinement.}

GLIDE _{arXiv 2022}
Early text-guided diffusion model supporting prompt-conditioned generation and editing.
Imagen _{NeurIPS 2022 / arXiv}
High-fidelity text-to-image diffusion model emphasizing language understanding.
Latent Diffusion Models _{CVPR 2022}
Latent-space diffusion backbone widely used for controllable generation and editing.
Composable Diffusion Models _{ECCV 2022}
Combines multiple diffusion score functions for compositional generation.
Structured Diffusion Guidance _{arXiv 2022}
Uses structured guidance signals to improve prompt-object alignment. (same work as StructureDiffusion)
StructureDiffusion _{arXiv 2022}
Parses prompts into structured representations to improve compositional text-to-image generation.
Attend-and-Excite _{SIGGRAPH 2023}
Manipulates cross-attention maps to reduce missing objects and improve prompt coverage.
BoxDiff _{ICCV 2023}
Training-free box-constrained generation for spatially grounded text-to-image synthesis.
Composer _{ICML 2023}
Composes heterogeneous visual conditions for controllable image synthesis.
MultiDiffusion _{ICML 2023}
Fuses multiple diffusion paths to satisfy spatial and regional generation constraints.
LLM-grounded Diffusion _{TMLR 2024}
Uses LLM planning to turn complex prompts into layout-grounded generation constraints.
SynGen _{NeurIPS 2023}
Uses syntactic guidance to improve compositional text-to-image generation.
RPG: Recaption, Plan, and Generate _{arXiv 2024}
Uses MLLM-based recaptioning and planning for complex prompt following.
CONFORM _{arXiv / venue TBD}
Improves object-attribute alignment through contrastive or correspondence-driven prompt grounding.
Divide-and-Bind _{arXiv / venue TBD}
Decomposes complex prompts and binds objects to attributes or relations.
Linguistic Binding in Diffusion _{NeurIPS 2023}
Aligns attention maps of syntactically linked entity and modifier tokens to reduce attribute-binding errors. (same work as SynGen)
Promptist _{arXiv 2022}
Optimizes prompts to improve text-to-image generation quality and alignment.
BeautifulPrompt _{AAAI 2024 / arXiv}
Refines user prompts for stronger image generation quality and faithfulness.
Prompt Expansion for Text-to-Image _{topic / resource}
Expands underspecified prompts to reduce ambiguity in generation.
Prompt Decomposition for T2I _{topic / resource}
Decomposes prompts into atomic semantic constraints for evaluation or guidance.

Benchmarks & Evaluators

Topic	Coverage
General prompt fidelity & composition	11
Editing & learned-concept evaluation	2
Fine-grained semantic diagnostics	5
Domain-specific control evaluation	2

Datasets & Data Resources

Topic	Coverage
Instruction-guided editing	2
Captioning, grounding & compositional reasoning	7
Web-scale image–text pretraining	4
Segmentation & object-level control	2
Fashion, pose & typography	5

02 · Internal consistency

Resource type	Description	Jump
Methods	Architectures, objectives, inference procedures, and intervention mechanisms.	Browse 83
Benchmarks & Evaluators	Test suites, metrics, learned scorers, and evaluation protocols.	Browse 20
Datasets & Data Resources	Training corpora, annotations, prompt sets, and diagnostic data.	Browse 20

Methods

Topic	Coverage
Personalized concepts & subject identity	21
Characters, style & cross-instance consistency	11
Multi-view & 3D consistency	17
Video generation & temporal editing	21
Long-form stories & interactive video	7
Personalized video & human animation	6

Benchmarks & Evaluators

Topic	Coverage
Multi-view & 3D consistency	3
Video generation quality & temporal coherence	7
Story, character & long-horizon consistency	4
Editing, tracking & feature-based metrics	6

Datasets & Data Resources

Topic	Coverage
Video segmentation & tracking	8
Driving & dynamic scenes	3
3D objects & multi-view reconstruction	8
Synthetic controlled environments	1

03 · Normative consistency

Resource type	Description	Jump
Methods	Architectures, objectives, inference procedures, and intervention mechanisms.	Browse 57
Benchmarks & Evaluators	Test suites, metrics, learned scorers, and evaluation protocols.	Browse 30
Datasets & Data Resources	Training corpora, annotations, prompt sets, and diagnostic data.	Browse 20

Methods

Topic	Coverage
Preference models & reward optimization	20
Safety, unlearning & concept control	20
World models & physical consistency	17

Benchmarks & Evaluators

Topic	Coverage
Preference & aesthetics	8
Safety & concept erasure	7
Physics, causality & world-model evaluation	15

Datasets & Data Resources

Topic	Coverage
Preference & aesthetics	7
Safety & concept control	3
Physical reasoning & world dynamics	10

Coverage labels

Label	Meaning
P/C	prompt and compositional faithfulness
S/E	structural control and edit preservation
ID	subject/identity persistence
V/T	multi-view, temporal, or narrative coherence
N/S	preference, safety, or value alignment
P/W	physical, causal, or world-grounded plausibility
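When the labels are used in structured files, a small lookup table makes it easy to catch typos. The sketch below copies the six labels and meanings from the table above; the function name and the idea of validating a list of labels per entry are assumptions made for illustration.

```python
# Labels and meanings copied from the coverage-label table above.
COVERAGE_LABELS = {
    "P/C": "prompt and compositional faithfulness",
    "S/E": "structural control and edit preservation",
    "ID": "subject/identity persistence",
    "V/T": "multi-view, temporal, or narrative coherence",
    "N/S": "preference, safety, or value alignment",
    "P/W": "physical, causal, or world-grounded plausibility",
}


def unknown_labels(labels):
    """Return the labels that are not defined in the coverage-label table."""
    return [label for label in labels if label not in COVERAGE_LABELS]


# Hypothetical entry carrying two valid labels and one typo.
print(unknown_labels(["P/C", "ID", "V/S"]))
```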


