Consistency in Diffusion-Based Visual Generation: A Survey

University of Science and Technology of China · Tsinghua University · Huazhong University of Science and Technology · University of Cambridge

Paper · DOI · Awesome · MIT License

Overview · Taxonomy · Evaluation & Optimization · Resources · Data Files · Contribute · Citation


Overview

This repository accompanies the survey:

Consistency in Diffusion-Based Visual Generation: A Survey
Song Yan, Wei Zhai, Chenfeng Wang, Ruixuan Li, Zhangping Yang, Yancheng Cai, Tao Zhang, Ling Wang, Yunwei Lan, Yujie He, Yang Cao, Min Li, and Zheng-Jun Zha.
Preprints, 2026 · Paper · DOI

Diffusion models now support text-to-image synthesis, editing, personalization, video generation, and 3D-aware content creation. Visual fidelity alone, however, does not guarantee that an output follows its prompt, preserves identity, remains coherent over time or viewpoint, or satisfies safety and physical-plausibility requirements.

The survey organizes these failures through a single question:

What should a generated visual output agree with?

Its answer is a relation-based taxonomy of external, internal, and normative consistency. The repository turns this taxonomy into a navigable literature map covering representative methods, benchmarks, evaluators, datasets, and diagnostic resources.

Key contributions

  1. Relation-based taxonomy — organizes consistency according to the target of agreement rather than only by task or modality.
  2. Evaluation protocol abstraction — separates observation units, agreement targets, evidence sources, and evaluation outputs.
  3. Optimization-locus analysis — compares where consistency is imposed: before sampling, at the condition interface, during denoising, across coupled outputs, or after generation.
  4. Machine-readable resource map — provides structured CSV and BibTeX files for maintenance, comparison, and downstream analysis.

355 resources · 225 methods · 70 benchmarks and evaluators · 60 datasets and data resources
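
As a rough illustration of how the machine-readable resource map could be used downstream, the sketch below tallies resources per relation and resource type. The column names (`title`, `relation`, `resource_type`, `venue`) and the inline sample rows are assumptions made for this example; the repository's actual CSV schema may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout; the real data files may use other column names.
SAMPLE_CSV = """title,relation,resource_type,venue
ControlNet,external,method,ICCV 2023
TIFA,external,benchmark,ICCV 2023
DreamBooth,internal,method,CVPR 2023
VBench,internal,benchmark,CVPR 2024
MagicBrush,external,dataset,NeurIPS 2023 Datasets and Benchmarks
"""

def tally(csv_text: str) -> Counter:
    """Count resources per (relation, resource_type) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["relation"], row["resource_type"]) for row in reader)

counts = tally(SAMPLE_CSV)
for (relation, kind), n in sorted(counts.items()):
    print(f"{relation:<10} {kind:<10} {n}")
```

To run the same tally on a real file, pass its contents (for example `open(path, newline="").read()`) instead of `SAMPLE_CSV` and adjust the column names.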


Taxonomy

Figure 1. Three consistency relations in diffusion-based visual generation.

| Relation | Agreement target | Representative failures | Typical settings | Resources |
| --- | --- | --- | --- | --- |
| External consistency | Prompts, references, layouts, masks, poses, controls, and editing instructions | Prompt omission, attribute-binding error, counting error, control mismatch, over-editing | Text-to-image generation, structural control, editing, inpainting, virtual try-on, typography | 125 |
| Internal consistency | Generated subjects, views, frames, shots, instances, and story states | Identity drift, view inconsistency, temporal flicker, state forgetting, narrative discontinuity | Personalization, multi-view/3D generation, video generation, story visualization | 123 |
| Normative consistency | Preference, safety, fairness, physical plausibility, commonsense, and causal/world-state criteria | Low preference, unsafe output, benign-capability loss, physical violation, causal failure | Preference optimization, safety editing, concept erasure, physical and world-model evaluation | 107 |

The categories are conceptually distinct but practically entangled. A method may address several relations simultaneously; the repository places it according to its primary agreement target while retaining its broader diagnostic role in the description and coverage files.


Evaluation and optimization

[Figure: Evaluation protocol for consistency claims]

Evaluation view. A consistency claim should specify the observation unit, agreement target, evidence source, and evaluation output. This prevents sequence-level claims from being supported only by frame-level evidence, or specific alignment claims from being reduced to broad preference scores.

[Figure: Optimization loci for enforcing consistency]

Optimization view. Consistency can be imposed before sampling, at the condition interface, during denoising, across coupled outputs, or after generation. Each locus creates different trade-offs among persistence, controllability, realism, diversity, memory cost, and modularity.
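
The four fields named in the evaluation view can be written down as a small record, which makes the frame-level versus sequence-level mismatch mechanically checkable. This is only a sketch: the class name, the extra `evidence_unit` field, and the three-level granularity ordering are assumptions introduced here, not part of the survey's protocol.

```python
from dataclasses import dataclass

# Illustrative granularity levels, ordered from fine to coarse (an assumption).
GRANULARITY = {"frame": 0, "clip": 1, "sequence": 2}

@dataclass(frozen=True)
class ConsistencyClaim:
    observation_unit: str   # what is claimed to be consistent, e.g. "sequence"
    agreement_target: str   # what the output should agree with
    evidence_unit: str      # granularity at which the evidence was collected
    evidence_source: str    # e.g. a tracker, a VQA model, human raters
    evaluation_output: str  # e.g. a score, a pass/fail label, a ranking

def evidence_gaps(claim: ConsistencyClaim) -> list[str]:
    """Return reasons why the stated evidence may not support the claim."""
    gaps = []
    if GRANULARITY[claim.evidence_unit] < GRANULARITY[claim.observation_unit]:
        gaps.append(
            f"{claim.observation_unit}-level claim backed only by "
            f"{claim.evidence_unit}-level evidence"
        )
    if not claim.agreement_target.strip():
        gaps.append("agreement target is unspecified")
    return gaps

claim = ConsistencyClaim(
    observation_unit="sequence",
    agreement_target="subject identity across shots",
    evidence_unit="frame",
    evidence_source="per-frame face-embedding similarity",
    evaluation_output="mean cosine score",
)
print(evidence_gaps(claim))
```

A claim whose evidence is collected at the same or a coarser granularity than its observation unit returns an empty list.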

Resource collection

The collection is designed as a topic-oriented literature map, not a flat bibliography. Resources are first grouped by their primary consistency relation and then divided into focused research themes. All lists are fully expanded to support browser search, direct linking, and rapid visual scanning.

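| Consistency relation | Methods | Benchmarks & Evaluators | Datasets & Data | Total | Browse |
| --- | --- | --- | --- | --- | --- |
| External consistency | 85 | 20 | 20 | 125 | Open section |
| Internal consistency | 83 | 20 | 20 | 123 | Open section |
| Normative consistency | 57 | 30 | 20 | 107 | Open section |
| Collection | 225 | 70 | 60 | 355 | |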
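
The counts in the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table and verifies that each row sums to its total and that the columns sum to the collection row; the function and variable names are illustrative.

```python
# Counts copied from the table: (methods, benchmarks & evaluators, datasets, total).
ROWS = {
    "External consistency": (85, 20, 20, 125),
    "Internal consistency": (83, 20, 20, 123),
    "Normative consistency": (57, 30, 20, 107),
}
COLLECTION = (225, 70, 60, 355)

def check_totals(rows, collection):
    """Return human-readable mismatches; an empty list means the table is consistent."""
    problems = []
    for name, (methods, benchmarks, datasets, total) in rows.items():
        if methods + benchmarks + datasets != total:
            problems.append(f"{name}: {methods}+{benchmarks}+{datasets} != {total}")
    # Transpose the rows so that each tuple holds one column of the table.
    column_sums = tuple(sum(column) for column in zip(*rows.values()))
    if column_sums != collection:
        problems.append(f"column sums {column_sums} != {collection}")
    return problems

problems = check_totals(ROWS, COLLECTION)
print(problems or "table is internally consistent")
```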

Entry format

Each resource is presented with a prominent title, compact venue/year metadata, and a separate one-line description:

  • Resource title venue / year
    The consistency issue, mechanism, or diagnostic role addressed by the resource.

> [!NOTE]
> Recent papers may temporarily be labeled arXiv, project, or venue TBD until stable proceedings metadata becomes available. Official paper repositories and project pages are preferred over unofficial reimplementations.
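
The entry format and the temporary-venue convention can be sketched as a small rendering helper. The `Resource` class and `render_entry` function are hypothetical names used only for illustration; they are not taken from the repository's tooling.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Resource:
    title: str
    description: str
    venue: Optional[str] = None  # None until stable proceedings metadata exists

def render_entry(resource: Resource) -> str:
    """Render one resource as a title/venue line plus a one-line description."""
    venue = resource.venue or "arXiv / venue TBD"
    return f"  • {resource.title} {venue}\n    {resource.description}"

print(render_entry(Resource(
    title="GLIGEN",
    description="Grounds generation with boxes and phrase-level grounding tokens.",
    venue="CVPR 2023",
)))
print(render_entry(Resource(
    title="SceneComposer",
    description="Composes scene-level controls for complex generation.",
)))
```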


01 · External consistency

Agreement target — Agreement with externally specified conditions.
Scope — Prompts, layouts, boxes, masks, depth maps, poses, reference images, editing instructions, and other user- or task-provided controls.

85 methods · 20 benchmarks and evaluators · 20 datasets and data resources · 125 total resources

| Resource type | Description | Jump |
| --- | --- | --- |
| Methods | Architectures, objectives, inference procedures, and intervention mechanisms. | Browse 85 |
| Benchmarks & Evaluators | Test suites, metrics, learned scorers, and evaluation protocols. | Browse 20 |
| Datasets & Data Resources | Training corpora, annotations, prompt sets, and diagnostic data. | Browse 20 |

Methods

85 resources organized into 6 focused topics.

| Topic | Coverage |
| --- | --- |
| Prompt following & compositional generation | 20 |
| Spatial grounding & structural control | 24 |
| Guidance, inversion & image editing | 26 |
| Typography & visual text | 5 |
| Virtual try-on & dressing | 7 |
| Posters & graphic design | 3 |

Prompt following & compositional generation 20

Foundational text-conditioned models, semantic binding, prompt planning, and prompt refinement.

  • GLIDE arXiv 2022
    Early text-guided diffusion model supporting prompt-conditioned generation and editing.
  • Imagen NeurIPS 2022 / arXiv
    High-fidelity text-to-image diffusion model emphasizing language understanding.
  • Latent Diffusion Models CVPR 2022
    Latent-space diffusion backbone widely used for controllable generation and editing.
  • Composable Diffusion Models ECCV 2022
    Combines multiple diffusion score functions for compositional generation.
  • Structured Diffusion Guidance arXiv 2022
    Uses structured guidance signals to improve prompt-object alignment. (same work as StructureDiffusion)
  • StructureDiffusion arXiv 2022
    Parses prompts into structured representations to improve compositional text-to-image generation.
  • Attend-and-Excite SIGGRAPH 2023
    Manipulates cross-attention maps to reduce missing objects and improve prompt coverage.
  • BoxDiff ICCV 2023
    Training-free box-constrained generation for spatially grounded text-to-image synthesis.
  • Composer ICML 2023
    Composes heterogeneous visual conditions for controllable image synthesis.
  • MultiDiffusion ICML 2023
    Fuses multiple diffusion paths to satisfy spatial and regional generation constraints.
  • LLM-grounded Diffusion ICLR 2024
    Uses LLM planning to turn complex prompts into layout-grounded generation constraints.
  • SynGen NeurIPS 2023
    Uses syntactic guidance to improve compositional text-to-image generation.
  • RPG: Recaption, Plan, and Generate arXiv 2024
    Uses MLLM-based recaptioning and planning for complex prompt following.
  • CONFORM arXiv / venue TBD
    Improves object-attribute alignment through contrastive or correspondence-driven prompt grounding.
  • Divide-and-Bind arXiv / venue TBD
    Decomposes complex prompts and binds objects to attributes or relations.
  • Linguistic Binding in Diffusion arXiv / venue TBD
    Studies or improves language-binding failures in text-to-image diffusion.
  • Promptist arXiv 2022
    Optimizes prompts to improve text-to-image generation quality and alignment.
  • BeautifulPrompt AAAI 2024 / arXiv
    Refines user prompts for stronger image generation quality and faithfulness.
  • Prompt Expansion for Text-to-Image topic / resource
    Expands underspecified prompts to reduce ambiguity in generation.
  • Prompt Decomposition for T2I topic / resource
    Decomposes prompts into atomic semantic constraints for evaluation or guidance.

Back to methods ↑


Spatial grounding & structural control 24

Boxes, layouts, adapters, scene graphs, masks, poses, and other explicit control interfaces.

  • ControlNet ICCV 2023
    Adds trainable side branches for depth, edge, pose, segmentation, and other controls.
  • GLIGEN CVPR 2023
    Grounds generation with boxes and phrase-level grounding tokens.
  • T2I-Adapter AAAI 2024
    Uses lightweight adapters for structural conditions such as sketch, depth, and pose.
  • IP-Adapter arXiv 2023
    Adds image-prompt conditioning while preserving text compatibility.
  • AnyDoor CVPR 2024
    Performs zero-shot object-level customization and insertion.
  • FreeDoM ICCV 2023
    Applies training-free energy guidance for conditional diffusion tasks.
  • HumanSD ICCV 2023
    Generates human images under native skeleton guidance.
  • UniControl NeurIPS 2023 / arXiv
    Provides a unified framework for multiple controllable generation signals.
  • Uni-ControlNet arXiv 2023
    Unifies multi-condition ControlNet-style conditioning.
  • Ctrl-Adapter ICLR 2025
    Uses efficient adapters for diverse spatial and structural controls.
  • UniCon ICLR 2025
    Designs unidirectional information flow for stronger large-scale condition control.
  • InstanceDiffusion CVPR 2024
    Supports instance-level control over object placement and attributes.
  • ControlNet++ arXiv / venue TBD
    Improves ControlNet-style conditioning quality and efficiency.
  • ControlNet-XS arXiv / venue TBD
    Compresses controllable generation modules for efficient deployment.
  • ControlLoRA arXiv / venue TBD
    Uses LoRA-style lightweight control adaptation.
  • SparseCtrl arXiv / venue TBD
    Controls image/video generation from sparse visual conditions.
  • SemanticControl arXiv / venue TBD
    Handles loose or weakly aligned semantic controls.
  • LayoutDiffusion CVPR 2023 / arXiv
    Conditions diffusion generation on layout annotations.
  • LayoutDM CVPR 2023 / arXiv
    Models layout-to-image synthesis through diffusion.
  • SceneComposer arXiv / venue TBD
    Composes scene-level controls for complex generation.
  • Scene Graph Diffusion arXiv / venue TBD
    Uses scene graphs for relation-aware image synthesis.
  • DetDiffusion arXiv / venue TBD
    Integrates detection-like constraints into image generation.
  • Grounded Diffusion topic / resource
    General family of grounding-based diffusion methods.
  • SAM-guided Diffusion Editing topic / resource
    Uses segmentation masks to localize editing constraints.

Back to methods ↑


Guidance, inversion & image editing 26

Sampling-time guidance, inversion, instruction editing, drag editing, and localized inpainting.

  • Diffusion Posterior Sampling ICLR 2023
    Uses measurement likelihoods to guide inverse-problem diffusion.
  • Universal Guidance for Diffusion Models ICML 2023 Workshop
    Applies generic guidance losses during sampling.
  • Classifier Guidance NeurIPS 2021
    Uses classifier gradients to steer diffusion samples.
  • Classifier-Free Guidance NeurIPS 2021 workshop / arXiv
    Steers conditional generation without an external classifier.
  • SDEdit ICLR 2022
    Edits images by adding noise and denoising under new guidance.
  • Prompt-to-Prompt ICLR 2023
    Controls cross-attention to edit prompts while preserving layout/content.
  • Null-Text Inversion CVPR 2023
    Inverts real images for more faithful prompt-based editing.
  • DiffEdit ICLR 2023
    Computes semantic edit masks from prompt differences.
  • InstructPix2Pix CVPR 2023
    Trains a diffusion editor to follow natural-language instructions.
  • InstructDiffusion CVPR 2024
    Unifies several visual instruction tasks in diffusion models.
  • Imagic CVPR 2023
    Edits real images by optimizing text embeddings and model weights.
  • Paint-by-Example CVPR 2023
    Uses exemplar images to guide localized editing.
  • Plug-and-Play Diffusion Features CVPR 2023
    Injects diffusion features to preserve structure during editing.
  • Pix2Pix-Zero ICCV 2023
    Performs zero-shot image-to-image translation through cross-attention guidance.
  • MasaCtrl ICCV 2023
    Uses mutual self-attention to preserve structure across synthesis/editing.
  • LEDITS++ arXiv 2023
    Performs lightweight semantic editing and concept erasure.
  • DragonDiffusion ICLR 2024 / arXiv
    Supports object moving, resizing, and fine-grained interactive editing.
  • DragDiffusion CVPR 2024
    Enables point-based drag editing with diffusion priors.
  • FreeDrag CVPR 2024 / arXiv
    Improves drag editing without model finetuning.
  • DiffEditor arXiv / venue TBD
    Provides an editing pipeline for localized diffusion modifications.
  • SEGA arXiv 2023
    Steers semantic directions during diffusion sampling.
  • Emu Edit CVPR 2024 / arXiv
    Uses instruction data for high-quality image editing.
  • SmartEdit CVPR 2024 / arXiv
    Combines MLLMs and diffusion for instruction-based editing.
  • BrushNet ECCV 2024 / arXiv
    Adds a dedicated inpainting branch for masked image editing.
  • PowerPaint ECCV 2024 / arXiv
    Supports versatile object removal, insertion, and inpainting.
  • Inpaint Anything arXiv 2023
    Combines segmentation and diffusion inpainting.
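
As one concrete instance of the sampling-time guidance listed above, classifier-free guidance combines an unconditional and a conditional noise prediction at each denoising step. The sketch below shows only that combination rule on plain Python lists; `cfg_combine` and its arguments are illustrative names, and a real sampler would apply the same expression to model outputs (tensors) inside its denoising loop.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional toward the conditional prediction.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further in the conditional direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy two-dimensional "noise predictions" standing in for model outputs.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

print(cfg_combine(eps_uncond, eps_cond, 1.0))
print(cfg_combine(eps_uncond, eps_cond, 2.0))
```

This uses the common implementation convention; the original formulation writes the same rule as (1 + w) times the conditional prediction minus w times the unconditional one, which corresponds to `scale = 1 + w`.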

Back to methods ↑


Typography & visual text 5

Text rendering, multilingual typography, glyph conditioning, and text-aware editing.

  • TextDiffuser NeurIPS 2023
    Improves text rendering inside generated images.
  • TextDiffuser-2 arXiv 2023
    Improves multilingual and layout-aware text rendering.
  • AnyText ICLR 2024
    Generates and edits multilingual text in images.
  • GlyphDraw NeurIPS 2023 / arXiv
    Uses glyph-level information for visual text generation.
  • GlyphControl arXiv / venue TBD
    Adds explicit glyph constraints for controllable typography.

Back to methods ↑


Virtual try-on & dressing 7

Garment preservation, person–garment alignment, and generalized dressing constraints.

  • TryOnDiffusion CVPR 2023
    Uses diffusion for virtual try-on with garment-person consistency.
  • StableVITON CVPR 2024
    Adapts Stable Diffusion to virtual try-on.
  • IDM-VTON ECCV 2024
    Improves image-based virtual try-on with diffusion.
  • CatVTON arXiv 2024
    Provides a lightweight virtual try-on framework.
  • OOTDiffusion arXiv 2024
    Generates outfits and try-on images under reference constraints.
  • LaDI-VTON ACM MM 2023 / arXiv
    Uses latent diffusion for virtual try-on.
  • AnyDressing arXiv / venue TBD
    Handles generalized dressing and garment transfer constraints.

Back to methods ↑


Posters & graphic design 3

Layout-aware poster generation and structured graphic composition.

  • PosterCraft arXiv / venue TBD
    Studies layout- and text-aware poster generation.
  • CreatiPoster arXiv / venue TBD
    Generates visually structured poster layouts.
  • PosterMaker arXiv / venue TBD
    Uses diffusion for controllable poster design.

Back to methods ↑

Benchmarks & Evaluators

20 resources organized into 4 focused topics.

| Topic | Coverage |
| --- | --- |
| General prompt fidelity & composition | 11 |
| Editing & learned-concept evaluation | 2 |
| Fine-grained semantic diagnostics | 5 |
| Domain-specific control evaluation | 2 |

General prompt fidelity & composition 11

Broad prompt-following, semantic alignment, and compositional evaluation protocols.

  • TIFA ICCV 2023
    Evaluates prompt faithfulness using generated question-answer pairs.
  • GenEval NeurIPS 2023 workshop / arXiv
    Tests object presence, counting, colors, positions, and attribute binding.
  • T2I-CompBench NeurIPS 2023
    Measures compositional alignment across attributes, relations, and complex prompts.
  • GenEval2 arXiv / venue TBD
    Extends prompt-following evaluation with harder and less saturated cases.
  • HRS-Bench ICCV 2023
    Provides holistic evaluation of T2I capabilities, robustness, fairness, and bias.
  • DPG-Bench arXiv 2024
    Uses dense prompts to evaluate semantic and relation following.
  • GenAI-Bench / VQAScore ECCV 2024
    Evaluates text-to-visual generation through VQA-style image/video scoring.
  • DrawBench Imagen / NeurIPS 2022 resource
    Human-evaluation prompt suite for text-to-image generation.
  • PartiPrompts arXiv 2022
    Large prompt set for evaluating compositional and high-level prompt following.
  • DSG: Davidsonian Scene Graph evaluation arXiv / venue TBD
    Converts prompts to scene-graph-like checks for semantic consistency.
  • VIEScore arXiv / venue TBD
    Uses vision-language evaluators for image-text alignment.
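
Several evaluators in this list score prompt faithfulness by asking questions about a generated image. A minimal sketch of that idea, in the spirit of the question-answer evaluation described for TIFA above, is shown below. Both the question set and the answering function are stubs invented for this example; an actual pipeline would derive questions from the prompt and answer them with a VQA model.

```python
def qa_faithfulness(qa_pairs, answer_fn):
    """Fraction of question-answer pairs that the answerer gets right."""
    if not qa_pairs:
        raise ValueError("need at least one question")
    correct = sum(
        answer_fn(question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)

# Hypothetical questions for the prompt "two red apples on a table".
QA_PAIRS = [
    ("Are there apples?", "yes"),
    ("How many apples are there?", "two"),
    ("What color are the apples?", "red"),
    ("Is there a table?", "yes"),
]

# Stub standing in for a VQA model looking at one generated image.
STUB_ANSWERS = {
    "Are there apples?": "yes",
    "How many apples are there?": "three",  # simulated counting error
    "What color are the apples?": "Red",
    "Is there a table?": "yes",
}

score = qa_faithfulness(QA_PAIRS, STUB_ANSWERS.__getitem__)
print(score)
```

Because each question targets one atomic constraint, the per-question results also indicate which kind of failure occurred (here, counting) rather than only how large the overall score is.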

Back to benchmarks & evaluators ↑


Editing & learned-concept evaluation 2

Evaluation of edit preservation, instruction compliance, and reusable concept learning.

  • EditBench CVPR 2023
    Benchmarks text-guided image inpainting and edit preservation.
  • ConceptBed arXiv / venue TBD
    Evaluates concept learning and reusable concept binding.

Back to benchmarks & evaluators ↑


Fine-grained semantic diagnostics 5

Targeted tests for counting, spatial relations, attribute binding, relations, and typography.

  • CountBench resource / venue TBD
    Tests numerical object-counting consistency in generated images.
  • SpatialBench resource / venue TBD
    Tests spatial relation following.
  • ObjectAttributeBench resource / venue TBD
    Tests object-attribute binding.
  • RelationBench resource / venue TBD
    Tests relational semantics in text-to-image generation.
  • TypographyBench resource / venue TBD
    Evaluates rendered text accuracy in generated images.

Back to benchmarks & evaluators ↑


Domain-specific control evaluation 2

Specialized protocols for virtual try-on and pose-conditioned generation.

Back to benchmarks & evaluators ↑

Datasets & Data Resources

20 resources organized into 5 focused topics.

| Topic | Coverage |
| --- | --- |
| Instruction-guided editing | 2 |
| Captioning, grounding & compositional reasoning | 7 |
| Web-scale image–text pretraining | 4 |
| Segmentation & object-level control | 2 |
| Fashion, pose & typography | 5 |

Instruction-guided editing 2

Data for natural-language image editing and multi-turn edit supervision.

  • MagicBrush NeurIPS 2023 Datasets and Benchmarks
    Instruction-guided image editing dataset with multi-turn annotations.
  • InstructPix2Pix dataset CVPR 2023 resource
    Synthetic instruction-edit pairs for image editing.

Back to datasets & data resources ↑


Captioning, grounding & compositional reasoning 7

Caption, region, relation, referring-expression, VQA, and synthetic reasoning annotations.

  • COCO Captions ECCV 2014 / dataset
    Common image-caption source for prompt grounding.
  • Visual Genome IJCV 2017 / dataset
    Dense object, attribute, and relation annotations.
  • OpenImages dataset
    Large-scale object and visual relationship annotations.
  • ADE20K CVPR 2017 / dataset
    Scene parsing annotations for structural control.
  • RefCOCO dataset
    Referring-expression grounding resource.
  • GQA CVPR 2019 / dataset
    Visual-question-answering resource for compositional reasoning.
  • CLEVR CVPR 2017 / dataset
    Synthetic compositional reasoning dataset.

Back to datasets & data resources ↑


Web-scale image–text pretraining 4

Large image–text corpora and aesthetic-filtered pretraining resources.

  • LAION-5B NeurIPS 2022 Datasets and Benchmarks
    Web-scale image-text pretraining data.
  • LAION-Aesthetics dataset resource
    Aesthetic-filtered image-text data.
  • CC3M ACL 2018 / dataset
    Web image-caption data for vision-language pretraining.
  • CC12M CVPR 2021 / dataset
    Larger conceptual-caption dataset.

Back to datasets & data resources ↑


Segmentation & object-level control 2

Mask, instance, and long-tail object annotations for localized control.

  • SA-1B ICCV 2023 / dataset
    Large-scale segmentation masks for editing and control.
  • LVIS CVPR 2019 / dataset
    Long-tail instance annotations for object-level diagnostics.

Back to datasets & data resources ↑


Fashion, pose & typography 5

Domain-specific supervision for dressing, human pose, and visual text generation.

  • DeepFashion CVPR 2016 / dataset
    Fashion data for virtual try-on and garment consistency.
  • VITON-HD CVPR 2021 workshop / dataset
    High-resolution virtual try-on data.
  • DressCode CVPR 2022 / dataset
    Multi-category virtual try-on dataset.
  • OpenPose / pose datasets resource
    Pose supervision for human-conditioned generation.
  • OCR/text rendering corpora resource
    Text-image data for typography generation.

Back to datasets & data resources ↑

Back to resource index ↑ · Back to top ↑


02 · Internal consistency

Agreement target — Agreement among generated states.
Scope — Subjects, identities, views, frames, shots, scenes, instances, and story states that should remain mutually compatible.

83 methods · 20 benchmarks and evaluators · 20 datasets and data resources · 123 total resources

| Resource type | Description | Jump |
| --- | --- | --- |
| Methods | Architectures, objectives, inference procedures, and intervention mechanisms. | Browse 83 |
| Benchmarks & Evaluators | Test suites, metrics, learned scorers, and evaluation protocols. | Browse 20 |
| Datasets & Data Resources | Training corpora, annotations, prompt sets, and diagnostic data. | Browse 20 |

Methods

83 resources organized into 6 focused topics.

| Topic | Coverage |
| --- | --- |
| Personalized concepts & subject identity | 21 |
| Characters, style & cross-instance consistency | 11 |
| Multi-view & 3D consistency | 17 |
| Video generation & temporal editing | 21 |
| Long-form stories & interactive video | 7 |
| Personalized video & human animation | 6 |

Personalized concepts & subject identity 21

Concept inversion, parameter-efficient personalization, and identity-preserving generation.

  • Textual Inversion ICLR 2023
    Learns new textual tokens for personalized concepts.
  • DreamBooth CVPR 2023
    Finetunes T2I models for subject-driven generation.
  • Custom Diffusion CVPR 2023
    Efficiently customizes multiple concepts through parameter-efficient updates.
  • Perfusion SIGGRAPH 2023
    Uses key-locking to preserve personalized concept identity.
  • SVDiff arXiv 2023
    Parameter-efficient personalization via singular-vector updates.
  • P+ arXiv 2023
    Expands textual inversion representation capacity.
  • NeTI arXiv 2023
    Uses neural textual inversion for richer concept embedding.
  • ProSpect SIGGRAPH 2023 / arXiv
    Personalizes without heavy finetuning.
  • DisenBooth arXiv 2023
    Disentangles identity and context for personalization.
  • SuTI arXiv 2023
    Scalable subject-driven text-to-image personalization.
  • BLIP-Diffusion NeurIPS 2023
    Uses pretrained subject representations for controllable subject generation.
  • ELITE ICCV 2023
    Encodes visual concepts into textual embeddings for fast personalization.
  • FastComposer NeurIPS 2023
    Enables tuning-free multi-subject generation.
  • Subject-Diffusion ICCV 2023
    Supports open-domain personalized subject generation.
  • PhotoMaker CVPR 2024
    Uses stacked ID embeddings for realistic human personalization.
  • InstantID arXiv 2024
    Provides zero-shot identity-preserving generation.
  • IP-Adapter-FaceID arXiv 2023/2024
    Preserves face identity through image-prompt adapters.
  • PuLID arXiv 2024
    Supports pure and lightning ID customization.
  • InfiniteYou arXiv / venue TBD
    Explores scalable identity-consistent personalization.
  • RealCustom arXiv / venue TBD
    Focuses on realistic personalized concept generation.
  • InstantCharacter arXiv / venue TBD
    Builds fast character-consistent generation.

Back to methods ↑


Characters, style & cross-instance consistency 11

Consistent characters, shared style, stories, and repeated identity across generated sets.

  • ConsiStory arXiv 2024
    Training-free consistent character generation across images.
  • StoryDiffusion NeurIPS 2024
    Uses consistent self-attention for long-range image/video generation.
  • StyleAligned SIGGRAPH 2024
    Shares attention to preserve style across generated sets.
  • The Chosen One SIGGRAPH Asia 2024
    Generates consistent characters across text-to-image outputs.
  • ConsistentID arXiv 2024
    Preserves identity in portrait and character generation.
  • CharaConsist arXiv / venue TBD
    Studies fine-grained character consistency.
  • MagicID arXiv / venue TBD
    Provides ID-conditioned video customization.
  • PersonalVideo arXiv / venue TBD
    Customizes video generation with personalized identity.
  • Phantom arXiv / venue TBD
    Explores subject-consistent video generation.
  • Preserve and Personalize ICLR 2026
    Preserves distributional behavior while personalizing concepts.
  • ConceptPrism CVPR 2026 / project
    Disentangles concepts for personalized diffusion.

Back to methods ↑


Multi-view & 3D consistency 17

Novel-view synthesis, synchronized multi-view diffusion, and image-to-3D reconstruction.

  • Zero-1-to-3 ICCV 2023
    Generates novel views from one image.
  • One-2-3-45 arXiv 2023
    Produces multi-view images and 3D assets from a single image.
  • Zero123++ arXiv 2023
    Improves single-image novel-view generation.
  • Cascade-Zero123 arXiv 2023
    Cascades view generation for stronger 3D consistency.
  • Consistent123 arXiv 2023
    Encourages cross-view consistency in novel-view synthesis.
  • SyncDreamer ICLR 2024
    Synchronizes multi-view diffusion generation.
  • MVDream ICLR 2024
    Generates multi-view images with 3D-aware diffusion.
  • Wonder3D CVPR 2024
    Reconstructs 3D assets from single images through multi-view diffusion.
  • ViewDiff CVPR 2024
    Enforces 3D consistency for text-to-image multi-view generation.
  • EscherNet CVPR 2024
    Performs scalable view synthesis under camera changes.
  • DreamGaussian ICLR 2024
    Uses 3D Gaussians for fast text/image-to-3D generation.
  • LGM ECCV 2024
    Reconstructs 3D Gaussians from sparse or generated views.
  • GRM ECCV 2024
    Builds large Gaussian reconstruction models.
  • Instant3D arXiv / venue TBD
    Accelerates 3D generation from sparse visual evidence.
  • TripoSR arXiv 2024
    Fast feed-forward 3D reconstruction from a single image.
  • CRM arXiv / venue TBD
    Uses reconstruction priors for consistent 3D asset generation.
  • LRM ICLR 2024 / arXiv
    Learns large reconstruction models for image-to-3D.

Back to methods ↑


Video generation & temporal editing 21

Text-to-video models, motion modules, temporal feature propagation, and video editing.

  • VideoLDM CVPR 2023
    Extends latent diffusion to video generation.
  • Text2Video-Zero ICCV 2023
    Adapts image diffusion to zero-shot video generation.
  • Tune-A-Video ICCV 2023
    Tunes a T2I model for video generation from one video.
  • AnimateDiff ICLR 2024
    Adds motion modules to personalized image diffusion models.
  • FateZero ICCV 2023
    Uses attention fusion for zero-shot video editing.
  • Video-P2P arXiv 2023
    Extends Prompt-to-Prompt-style editing to videos.
  • TokenFlow ICLR 2024
    Propagates diffusion features to improve temporal video editing consistency.
  • CoDeF CVPR 2024
    Uses content deformation fields for temporally consistent video processing.
  • Rerender A Video SIGGRAPH Asia 2023
    Performs zero-shot text-guided video-to-video translation.
  • COVE arXiv 2024
    Uses correspondence guidance for video editing.
  • VideoCrafter arXiv 2023
    Open video diffusion framework.
  • VideoCrafter2 CVPR 2024
    Improves high-quality video diffusion generation.
  • ModelScopeT2V project / 2023
    Open text-to-video generation system.
  • Make-A-Video arXiv 2022
    Generates videos from text using image-text and video data.
  • Imagen Video arXiv 2022
    Cascaded video diffusion model.
  • Phenaki ICLR 2023 / arXiv
    Generates long videos from open-domain prompts.
  • VideoFusion CVPR 2023 / arXiv
    Uses decomposed diffusion for video generation.
  • Latte TMLR / arXiv
    Applies latent diffusion transformers to video generation.
  • VideoPoet ICML 2024 / arXiv
    Multimodal video generation and editing model.
  • Lumiere SIGGRAPH 2024 / arXiv
    Space-time diffusion model for coherent video generation.
  • Sora technical report technical report 2024
    Large-scale video generation model emphasizing world simulation properties.

Back to methods ↑


Long-form stories & interactive video 7

Long-horizon narratives, multi-scene planning, interactive control, and shot consistency.

  • MovieDreamer arXiv / venue TBD
    Studies hierarchical long visual sequence generation.
  • TaleCrafter arXiv / venue TBD
    Generates multi-character visual stories.
  • One-Prompt-One-Story arXiv / venue TBD
    Aims at consistent story generation from a single prompt.
  • Animate-A-Story arXiv 2023
    Generates storytelling videos with retrieval and control signals.
  • MotionStream ICLR 2026
    Supports real-time video generation with interactive motion control.
  • VideoDirectorGPT arXiv 2023
    Uses LLM planning for multi-scene video generation.
  • ShotAdapter arXiv / venue TBD
    Adapts video generation for multi-shot consistency.

Back to methods ↑


Personalized video & human animation 6

Subject-aware video customization, talking-human generation, and reference-driven animation.

  • VideoBooth arXiv 2023
    Customizes video generation to a subject or concept.
  • DreamVideo arXiv 2023
    Personalizes video generation with subject-aware priors.
  • Vlogger arXiv 2024
    Generates talking-head or human-centric video content.
  • MagicAnimate CVPR 2024
    Animates human images under motion guidance.
  • AnimateAnyone CVPR 2024 / arXiv
    Animates reference characters with strong identity preservation.
  • Champ arXiv 2024
    Enables controllable and consistent human animation.

Back to methods ↑

Benchmarks & Evaluators

20 resources organized into 4 focused topics.

| Topic | Coverage |
| --- | --- |
| Multi-view & 3D consistency | 3 |
| Video generation quality & temporal coherence | 7 |
| Story, character & long-horizon consistency | 4 |
| Editing, tracking & feature-based metrics | 6 |

Multi-view & 3D consistency 3

Benchmarks and metrics for geometric compatibility across generated viewpoints.

  • MVG-Bench arXiv 2024
    Evaluates multi-view generation consistency.
  • MET3R arXiv 2024
    Measures 3D-aware multi-view consistency from generated images.
  • Multi-view consistency metrics resource
    Measures cross-view geometric compatibility.

Back to benchmarks & evaluators ↑


Video generation quality & temporal coherence 7

General video quality, text–video alignment, motion, and physical-temporal diagnostics.

  • VBench CVPR 2024
    Comprehensive video generation benchmark including subject/background and temporal consistency.
  • Video-Bench CVPR 2025
    Human-aligned video generation benchmark.
  • EvalCrafter CVPR 2024
    Evaluates generated videos along visual, text-video, and motion dimensions.
  • FETV NeurIPS 2023 Datasets and Benchmarks
    Fine-grained open-domain text-to-video evaluation benchmark.
  • T2V-CompBench arXiv / venue TBD
    Tests compositional text-to-video generation.
  • VideoScore arXiv / venue TBD
    Provides learned or automatic video generation quality scoring.
  • VideoPhy temporal subset ICLR 2025
    Uses physical video checks as temporal/world consistency diagnostics.

Back to benchmarks & evaluators ↑


Story, character & long-horizon consistency 4

Identity persistence, story continuity, and long-video evaluation.

Back to benchmarks & evaluators ↑


Editing, tracking & feature-based metrics 6

Preservation metrics based on editing stability, CLIP/DINO features, face identity, and LPIPS.

Back to benchmarks & evaluators ↑

Datasets & Data Resources

20 resources organized into 4 focused topics.

| Topic | Coverage |
| --- | --- |
| Video segmentation & tracking | 8 |
| Driving & dynamic scenes | 3 |
| 3D objects & multi-view reconstruction | 8 |
| Synthetic controlled environments | 1 |

Video segmentation & tracking 8

Object persistence, occlusion, scene parsing, and long-term tracking supervision.

  • MeViS ICCV 2023
    Motion-expression video segmentation data useful for temporal grounding.
  • MOSE ICCV 2023 / dataset
    Video object segmentation data with complex occlusions.
  • TAO ECCV 2020
    Long-tail tracking data for object persistence diagnostics.
  • VSPW CVPR 2021
    Video scene parsing dataset for scene-state continuity.
  • DAVIS CVPR 2016
    Video object segmentation data for temporal preservation.
  • YouTube-VOS ECCV 2018 / dataset
    Large-scale video object segmentation data.
  • LaSOT CVPR 2019
    Long-term single-object tracking dataset.
  • TrackingNet ECCV 2018 / dataset
    Large-scale object tracking data.

Back to datasets & data resources ↑


Driving & dynamic scenes 3

Large-scale autonomous-driving data for geometry, motion, and state continuity.

  • nuScenes CVPR 2020
    Driving dataset useful for dynamic-scene consistency.
  • KITTI IJRR 2013 / dataset
    Autonomous-driving visual dataset for geometry and temporal checks.
  • Waymo Open Dataset CVPR 2020 / dataset
    Large-scale driving data for world and motion consistency.

Back to datasets & data resources ↑


3D objects & multi-view reconstruction 8

Object assets, camera trajectories, RGB-D scenes, scans, and multi-view imagery.

  • Objaverse CVPR 2023 / dataset
    Large 3D object dataset for view and 3D generation.
  • Objaverse-XL NeurIPS 2023 Datasets and Benchmarks
    Web-scale 3D object data.
  • CO3D ICCV 2021 / dataset
    Common objects in 3D data for view consistency.
  • RealEstate10K dataset
    Camera-trajectory video data for novel-view synthesis.
  • ScanNet CVPR 2017 / dataset
    RGB-D scene data for geometry-aware generation.
  • ShapeNet arXiv 2015 / dataset
    3D shape dataset for object-level 3D generation.
  • Google Scanned Objects dataset
    High-quality scanned object assets.
  • MVImgNet CVPR 2023 / dataset
    Multi-view image dataset for object-centric reconstruction.

Back to datasets & data resources ↑


Synthetic controlled environments 1

Procedural data generation for controlled scene and temporal diagnostics.

  • Kubric CVPR 2022 / dataset generator
    Synthetic video/scene data generation for controlled temporal diagnostics. Paper

Back to datasets & data resources ↑

Back to resource index ↑ · Back to top ↑


03 · Normative consistency

Agreement target — Agreement with evaluative or world-level criteria.
Scope — Human preference, aesthetics, safety, fairness, concept restrictions, physical plausibility, commonsense, causality, and world-state validity.

57 methods 30 benchmarks and evaluators 20 datasets and data resources 107 total resources

| Resource type | Description | Jump |
| --- | --- | --- |
| Methods | Architectures, objectives, inference procedures, and intervention mechanisms. | Browse 57 |
| Benchmarks & Evaluators | Test suites, metrics, learned scorers, and evaluation protocols. | Browse 30 |
| Datasets & Data Resources | Training corpora, annotations, prompt sets, and diagnostic data. | Browse 20 |

Methods

57 resources organized into 3 focused topics.

| Topic | Coverage |
| --- | --- |
| Preference models & reward optimization | 20 |
| Safety, unlearning & concept control | 20 |
| World models & physical consistency | 17 |

Preference models & reward optimization 20

Human-preference scorers, direct preference optimization, reinforcement learning, and multi-reward alignment.

  • Pick-a-Pic / PickScore NeurIPS 2023
    Collects pairwise preferences and trains a preference scorer.
  • ImageReward NeurIPS 2023
    Learns a general human preference reward model for T2I images. Paper
  • HPS ICCV 2023 / arXiv
    Scores generated images according to human preference.
  • HPSv2 arXiv 2023
    Refines human preference scoring and benchmark coverage.
  • HPSv3 arXiv 2025
    Extends preference evaluation to broader text-image distributions. Paper
  • MPS CVPR 2024
    Models multi-dimensional human preferences for T2I generation.
  • VisionReward AAAI 2026
    Learns multi-dimensional reward signals for image and video generation.
  • Diffusion-DPO NeurIPS 2023 / arXiv
    Applies direct preference optimization to diffusion models.
  • DDPO ICLR 2024
    Trains diffusion models with reinforcement learning rewards.
  • AlignProp ICLR 2024 / arXiv
    Backpropagates reward gradients through diffusion sampling.
  • DPOK arXiv 2023
    Applies KL-regularized policy optimization to diffusion models.
  • D3PO arXiv 2024
    Optimizes diffusion policies from preference data.
  • SPO CVPR 2025
    Performs step-by-step preference optimization for aesthetic post-training.
  • DSPO ICLR 2025
    Aligns diffusion models using direct score preference optimization.
  • RankDPO ICCV 2025
    Uses ranked preference data for scalable T2I preference optimization.
  • CMPO / CaPO CVPR 2025
    Calibrates multiple preferences for diffusion alignment.
  • Diffusion-NPO ICLR 2026
    Performs negative preference optimization for diffusion alignment.
  • BranchGRPO ICLR 2026
    Uses branch-level GRPO-style optimization for diffusion generation.
  • Flow-GRPO arXiv / venue TBD
    Applies GRPO-like preference learning to flow/diffusion sampling.
  • RLAIF for Diffusion topic / resource
    Uses AI feedback instead of human feedback for diffusion alignment.

Back to methods ↑


Safety, unlearning & concept control 20

Safety guidance, concept erasure, unlearning, red teaming, filtering, and adversarial robustness.

Back to methods ↑


World models & physical consistency 17

Interactive world models, driving simulation, physics-aware guidance, causality, and state transitions.

  • UniSim ICLR 2024
    Learns interactive real-world simulators for action-conditioned generation.
  • Genie ICML 2024
    Generates interactive environments from videos.
  • GAIA-1 arXiv / technical report 2023
    Builds a generative world model for autonomous driving.
  • WorldDreamer arXiv 2024
    Generates driving videos with world-model priors.
  • DriveDreamer ECCV 2024 / arXiv
    Generates driving scenarios with structured controls.
  • DriveDreamer-2 arXiv / venue TBD
    Extends driving world generation to longer/higher-quality videos.
  • Vista arXiv / venue TBD
    Studies video world models for controllable environments.
  • Pandora arXiv / venue TBD
    Explores world modeling through video generation.
  • Cosmos World Foundation Models technical report / arXiv
    Studies large-scale world foundation models.
  • HunyuanWorld / Hunyuan World technical report / arXiv
    Generates 3D/world environments using generative world modeling.
  • World-consistent Video Diffusion topic / resource
    Enforces geometry, dynamics, and state consistency in video generation.
  • Physics-guided Diffusion topic / resource
    Injects physical constraints into diffusion sampling or training.
  • Simulator-guided Diffusion topic / resource
    Uses simulators or constraints to steer generation.
  • Verifier-guided Generation topic / resource
    Uses post-hoc or in-loop verifiers to reject inconsistent samples.
  • Causal Video Generation topic / resource
    Studies causal state transitions in generated videos.
  • Object-state-change Generation topic / resource
    Focuses on object state changes and action consequences.
  • Embodied Diffusion World Models topic / resource
    Connects diffusion generation with embodied planning and control.

Back to methods ↑

Benchmarks & Evaluators

30 resources organized into 3 focused topics.

| Topic | Coverage |
| --- | --- |
| Preference & aesthetics | 8 |
| Safety & concept erasure | 7 |
| Physics, causality & world-model evaluation | 15 |

Preference & aesthetics 8

Learned reward models and multidimensional evaluators for human preference and visual quality.

  • Pick-a-Pic / PickScore NeurIPS 2023
    Pairwise preference data and reward model for T2I outputs.
  • ImageReward NeurIPS 2023
    Learned reward model for human preference evaluation.
  • HPSv2 arXiv 2023
    Human preference benchmark for T2I evaluation.
  • HPSv3 arXiv 2025
    Wide-spectrum preference benchmark and reward model.
  • VisionReward AAAI 2026
    Multi-dimensional image/video preference evaluator.
  • MPS evaluation CVPR 2024
    Evaluates multiple dimensions of human preference.
  • Aesthetic score models metric family
    Score visual aesthetics for generated images.
  • LAION aesthetic predictor dataset/model resource
    Provides aesthetic scores for LAION-like data.

Back to benchmarks & evaluators ↑


Safety & concept erasure 7

Unsafe-generation, concept-removal, benign-retention, and red-teaming protocols.

Back to benchmarks & evaluators ↑


Physics, causality & world-model evaluation 15

Static and dynamic physical reasoning, action consequences, and world-state validity.

  • PhyBench arXiv 2024
    Static physical commonsense benchmark for T2I.
  • VideoPhy ICLR 2025
    Physical commonsense benchmark for generated videos.
  • PhyCoBench arXiv 2024
    Optical-flow-guided physical coherence benchmark.
  • PhyGenBench arXiv 2024
    Physical-law benchmark for video generation.
  • VideoPhy-2 ICLR 2026
    Action-centric physical commonsense benchmark.
  • T2VPhysBench arXiv / venue TBD
    Tests first-principles physical consistency in T2V.
  • T2VWorldBench arXiv / venue TBD
    Evaluates world knowledge, commonsense, and causal plausibility.
  • Physics-IQ WACV 2026
    Tests physical principles in generative video models.
  • PhyWorldBench arXiv 2025
    Benchmarks physical realism in text-to-video generation.
  • VideoVerse arXiv 2025
    World-model-oriented T2V evaluation.
  • PhyEduVideo WACV 2026
    Physics-education-oriented video benchmark.
  • PhyWorld ICML 2025
    Studies how far video generation is from physical world models.
  • OSCBench arXiv / venue TBD
    Tests object state change and action consequence.
  • Morpheus arXiv / venue TBD
    Evaluates physical reasoning in video generation.
  • World-model Video Evaluation resource
    General benchmarks for video-as-world-model behavior.

Back to benchmarks & evaluators ↑

Datasets & Data Resources

20 resources organized into 3 focused topics.

| Topic | Coverage |
| --- | --- |
| Preference & aesthetics | 7 |
| Safety & concept control | 3 |
| Physical reasoning & world dynamics | 10 |

Preference & aesthetics 7

Pairwise preference annotations, reward-model training data, and aesthetic ratings.

Back to datasets & data resources ↑


Safety & concept control 3

Unsafe prompt sets, NSFW resources, and concept-erasure diagnostics.

Back to datasets & data resources ↑


Physical reasoning & world dynamics 10

Physical commonsense, video dynamics, driving, egocentric interaction, and synthetic physics data.

  • Physical commonsense prompts resource
    Prompts testing static physical plausibility.
  • Video physical prompts resource
    Prompts testing dynamic physical plausibility.
  • Driving world-model datasets resource
    Driving data for action-conditioned world models.
  • Ego4D CVPR 2022 / dataset
    Egocentric video data for embodied and action reasoning.
  • Something-Something V2 ICCV 2017 / dataset
    Human-object interaction videos for action/state understanding.
  • CLEVRER ICLR 2020 / dataset
    Synthetic videos for physical and causal reasoning.
  • PHYRE NeurIPS 2019 / benchmark
    Physical reasoning environments.
  • IntPhys arXiv 2018 / dataset
    Intuitive physics video dataset.
  • CLEVR CVPR 2017 / dataset
    Synthetic visual reasoning data.
  • Kubric CVPR 2022
    Synthetic scene/video generator useful for controlled physical diagnostics.

Back to datasets & data resources ↑

Back to resource index ↑ · Back to top ↑


Machine-readable resources

The repository includes structured companion files (CSV tables and BibTeX, such as resources/selected_bibtex.bib) for programmatic analysis and maintenance.

Coverage labels

| Label | Meaning |
| --- | --- |
| P/C | prompt and compositional faithfulness |
| S/E | structural control and edit preservation |
| ID | subject/identity persistence |
| V/T | multi-view, temporal, or narrative coherence |
| N/S | preference, safety, or value alignment |
| P/W | physical, causal, or world-grounded plausibility |

Coverage values: H = direct/dedicated coverage, M = partial/adaptable coverage, L = indirect/low coverage.
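
A minimal sketch of how the coverage labels might be queried programmatically. The column names and the sample rows below are assumptions made for illustration; the repository's actual CSV schema may differ, and the `Example-*` entries are placeholders rather than real resources.

```python
import csv
import io

# Hypothetical sample: column names and rows are illustrative only,
# not the repository's real schema or data.
SAMPLE_CSV = """title,P/C,S/E,ID,V/T,N/S,P/W
Example-A,H,L,L,M,L,L
Example-B,L,L,M,H,L,M
Example-C,L,L,L,M,L,H
"""

# Rank the coverage values so they can be compared: H > M > L.
RANK = {"L": 0, "M": 1, "H": 2}


def filter_by_coverage(rows, label, minimum="M"):
    """Return titles whose coverage for `label` is at least `minimum`."""
    threshold = RANK[minimum]
    return [row["title"] for row in rows if RANK[row[label]] >= threshold]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(filter_by_coverage(rows, "V/T"))       # → ['Example-A', 'Example-B', 'Example-C']
print(filter_by_coverage(rows, "P/W", "H"))  # → ['Example-C']
```

Ranking the three values numerically keeps "at least partial coverage" queries to a single comparison; swapping `io.StringIO(SAMPLE_CSV)` for an opened file would apply the same filter to a real table.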

Contribution guide

Contributions are welcome. Keep additions concise, verifiable, and aligned with the relation-based taxonomy. Each submission should include the resource title, BibTeX key, venue/year, official paper URL, official project/code URL (if available), resource type, modality, primary consistency relation, coverage values, and a short description of its diagnostic use and blind spots.

Use the issue template: Add or correct a resource.

> [!TIP]
> Prefer official paper pages, author-maintained repositories, and stable proceedings links. Include enough information for another maintainer to verify the entry without additional searching.
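
The checklist above could be enforced with a small pre-submission check. The following is a sketch under stated assumptions: the field names (`bibtex_key`, `venue_year`, `primary_relation`, and so on) are hypothetical and are not taken from the repository's issue template, and the example entry is invented purely for illustration.

```python
# Hypothetical field names; the repository's issue template may use different ones.
REQUIRED_FIELDS = (
    "title", "bibtex_key", "venue_year", "paper_url",
    "resource_type", "modality", "primary_relation", "description",
)
RELATIONS = {"external", "internal", "normative"}  # the survey's three relations
COVERAGE_LABELS = ("P/C", "S/E", "ID", "V/T", "N/S", "P/W")
COVERAGE_VALUES = {"H", "M", "L"}


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks complete."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not str(entry.get(name, "")).strip():
            problems.append(f"missing field: {name}")
    if entry.get("primary_relation") not in RELATIONS:
        problems.append("primary_relation must be external, internal, or normative")
    coverage = entry.get("coverage", {})
    for label in COVERAGE_LABELS:
        if coverage.get(label) not in COVERAGE_VALUES:
            problems.append(f"coverage for {label} must be H, M, or L")
    return problems


# Invented placeholder entry, not a real resource.
example = {
    "title": "Example Benchmark",
    "bibtex_key": "doe2026example",
    "venue_year": "arXiv 2026",
    "paper_url": "https://example.org/paper",
    "resource_type": "benchmark",
    "modality": "video",
    "primary_relation": "normative",
    "description": "Diagnoses physical plausibility; blind to identity drift.",
    "coverage": {"P/C": "L", "S/E": "L", "ID": "L", "V/T": "M", "N/S": "L", "P/W": "H"},
}
print(validate_entry(example))  # → []
```

The code URL is deliberately not required here, since the guide asks for it only when one is available.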

Maintenance notes

Some 2025–2026 papers may initially appear as arXiv or project-page entries before official proceedings metadata is stable. When official BibTeX becomes available, please update resources/selected_bibtex.bib and any corresponding table entries.

When adding links, prefer official repositories or project pages over unofficial reimplementations. If no stable official repository exists, leave the code URL blank in the CSV table. Entries with arXiv-search links are included as expansion placeholders and should be replaced by stable paper/project links when available.
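
One way to find placeholder entries that still need a stable link is to scan for arXiv search URLs. This sketch assumes that placeholders point at `arxiv.org/search` and that rows carry hypothetical `title` and `paper_url` keys; neither assumption is confirmed by the repository.

```python
from urllib.parse import urlparse

ARXIV_HOSTS = {"arxiv.org", "www.arxiv.org"}


def is_arxiv_search_placeholder(url):
    """True if `url` looks like an arXiv search page rather than a paper page."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host in ARXIV_HOSTS and parsed.path.startswith("/search")


def links_to_replace(rows):
    """Return titles of rows whose paper link is still a search placeholder."""
    return [
        row["title"]
        for row in rows
        if is_arxiv_search_placeholder(row.get("paper_url", ""))
    ]


# Invented rows for illustration; `title` and `paper_url` are assumed column names.
rows = [
    {"title": "Example-A", "paper_url": "https://arxiv.org/search/?query=example&searchtype=all"},
    {"title": "Example-B", "paper_url": "https://arxiv.org/abs/0000.00000"},
    {"title": "Example-C", "paper_url": ""},
]
print(links_to_replace(rows))  # → ['Example-A']
```

A blank URL is treated as acceptable rather than flagged, matching the note that the code URL may be left empty when no stable official link exists.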

Citation

If this survey or resource collection is useful in your research, please cite:

@article{yan2026consistency,
  title  = {Consistency in Diffusion-Based Visual Generation: A Survey},
  author = {Yan, Song and Zhai, Wei and Wang, Chenfeng and Li, Ruixuan and Yang, Zhangping and Cai, Yancheng and Zhang, Tao and Wang, Ling and Lan, Yunwei and He, Yujie and Cao, Yang and Li, Min and Zha, Zheng-Jun},
  year   = {2026},
  doi    = {10.20944/preprints202606.0870.v1},
  url    = {https://www.preprints.org/manuscript/202606.0870/v1},
  note   = {Preprints}
}

License

This repository is released under the MIT License.

Back to top ↑
