
Continuously Augmented Discrete Diffusion


This repository accompanies the ICLR 2026 paper: Continuously Augmented Discrete Diffusion Model for Categorical Generative Modeling.

*Figure: Left: standard discrete diffusion (mask-based denoising). Right: CADD augments the discrete process with a continuous flow-matching signal in embedding space.*

Overview

CADD (Continuously Augmented Discrete Diffusion) extends discrete diffusion language models by adding a continuous flow-matching component to the masked denoising process. At each diffusion step, a continuous embedding signal is added to the discrete mask-token embeddings, providing additional information about the clean data that helps guide the denoising process.

Key idea: During both training and inference, the model input at masked positions is embed(mask_token) + z_continuous, where z_continuous follows a linear flow-matching trajectory from noise to clean embeddings:

z_continuous = (1 - t) * z_0 + t * noise
  • At t = 1 (fully masked): z_continuous = noise (no signal)
  • At t = 0 (fully clean): z_continuous = z_0 (clean embedding)
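The augmented input above can be sketched in a few lines of PyTorch. This is a toy illustration of the construction, not the repository's actual API; the tensor names, shapes, and the `cadd_input` helper are assumptions for exposition:

```python
import torch

def cadd_input(mask_embed, z0, t):
    """Illustrative CADD input construction at masked positions.

    mask_embed: embedding of the mask token, shape (d,)
    z0: clean token embeddings, shape (batch, seq, d)
    t: diffusion time in [0, 1] (1 = fully masked, 0 = fully clean)
    """
    noise = torch.randn_like(z0)
    # Linear flow-matching trajectory between clean embeddings and noise
    z_continuous = (1.0 - t) * z0 + t * noise
    # Model input at masked positions: mask embedding plus continuous signal
    return mask_embed + z_continuous

# At t = 1 the continuous part is pure noise; at t = 0 it is the clean embedding.
d = 8
z0 = torch.zeros(1, 4, d)
x = cadd_input(torch.ones(d), z0, t=0.0)
```

At `t = 0` the noise term vanishes, so the masked-position input reduces to the mask embedding plus the clean embedding, exactly as in the trajectory above.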

This is orthogonal to the discrete unmasking strategy --- any MDM algorithm can be combined with CADD. This codebase currently presents our code-generation experiment as an example; CADD.md provides a quickstart for turning your MDMs into CADDs.

Results

CADD-Coder achieves the following results on code generation benchmarks:

| Model | HumanEval | HumanEval+ | MBPP | MBPP+ | BCB |
|---|---|---|---|---|---|
| DiffuCoder (baseline, no CADD) | 67.1 | 60.4 | 74.2 | 60.9 | 40.2 |
| CADD-Coder | 72.0 | 63.4 | 75.7 | 63.2 | 42.1 |

These results were produced with the settings `alg=entropy`, `temperature=0.1`, `steps=512`, `cadd_sampling_mode=weighted`. For full training and sampling details, see reproduce.md.

Codebase Structure

```
ms-CADD/
  README.md                  # This file
  CADD.md                    # Detailed algorithm description
  reproduce.md               # Training and sampling parameters for reproduction
  LICENSE                    # Apple software license
  LICENSE_MODELS             # Apple model license
  CONTRIBUTING.md            # Contribution guide
  requirements.txt           # Dependencies
  configuration_dream.py     # Model configuration (DreamConfig)
  modeling_dream.py          # Model architecture (DreamModel)
  generation_utils.py        # CADD-enabled sampling (DreamGenerationMixin)
  asset/                     # Visualizations
```
  • generation_utils.py --- The core file. Defines DreamGenerationMixin with the _sample() method that implements CADD sampling, including continuous flow-matching initialization, forward pass with inputs_embeds, and the flow-matching update loop.
  • modeling_dream.py --- Defines DreamModel, the transformer architecture with bidirectional attention for discrete diffusion. Supports both input_ids and inputs_embeds for the CADD forward pass.
  • configuration_dream.py --- Defines DreamConfig, the model configuration class.
  • CADD.md --- Detailed walkthrough of the CADD training and sampling algorithms with pseudocode.
  • reproduce.md --- Exact hyperparameters (training and sampling) to reproduce the reported results.

CADD Sampling Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_cadd` | bool | `True` | Enable CADD continuous augmentation |
| `cadd_sampling_mode` | str | `"argmax"` | How to estimate z_0 from logits: `"weighted"` (softmax-weighted) or `"argmax"` |
| `alg` | str | `"origin"` | Unmasking strategy: `"entropy"`, `"origin"`, `"maskgit_plus"`, `"topk_margin"` |
| `temperature` | float | `1.0` | Sampling temperature for token prediction |
| `steps` | int | `512` | Number of diffusion steps |
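The two `cadd_sampling_mode` options can be sketched as follows. This is a toy illustration assuming an embedding matrix `E` and per-position logits; the `estimate_z0` helper is hypothetical, and the real implementation lives in generation_utils.py:

```python
import torch

def estimate_z0(logits, embedding, mode="weighted"):
    """Toy sketch of estimating the clean embedding z_0 from logits.

    logits: (batch, seq, vocab) model outputs
    embedding: (vocab, d) token embedding matrix
    """
    if mode == "weighted":
        # Softmax-weighted average over the vocabulary embeddings
        probs = torch.softmax(logits, dim=-1)
        return probs @ embedding
    elif mode == "argmax":
        # Embedding of the single most likely token per position
        return embedding[logits.argmax(dim=-1)]
    raise ValueError(f"unknown mode: {mode}")

vocab, d = 16, 8
E = torch.randn(vocab, d)
logits = torch.randn(1, 4, vocab)
z0_weighted = estimate_z0(logits, E, "weighted")
z0_argmax = estimate_z0(logits, E, "argmax")
```

`"weighted"` keeps the estimate smooth in the logits, while `"argmax"` snaps it to a single token's embedding.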

Getting Started

1. Install Dependencies

```bash
pip install -r requirements.txt
```

2. Download the Model

The CADD-Coder checkpoint will be released on HuggingFace. Download the checkpoint weights to a path of your choice (e.g. `./cadd-coder/`) and overlay the CADD generation code:

```python
from huggingface_hub import snapshot_download

snapshot_download("apple/CADD-Base-7B", local_dir="./cadd-coder")
```

```bash
# Copy CADD code into the model directory
cp generation_utils.py modeling_dream.py configuration_dream.py ./cadd-coder/
```

3. Generate Code with CADD

```python
from transformers import AutoTokenizer, AutoModel
import torch

model_path = "./cadd-coder"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

# Generate with CADD
prompt = "def fibonacci(n):\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=512,
    steps=512,
    temperature=0.1,
    alg="entropy",
    alg_temp=0.0,
    use_cadd=True,
    cadd_sampling_mode="weighted",
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

4. Evaluate on Benchmarks

Use standard public evaluation tools:

```bash
# Install evaluation tools
pip install evalplus bigcodebench

# HumanEval (pass@1)
python -m evalplus.generate --model ./cadd-coder --backend diffusion \
    --temperature 0.1 --steps 512 --alg entropy \
    --use_cadd --cadd_sampling_mode weighted \
    --dataset humaneval --bs 1 --n_samples 1

# MBPP (pass@1)
python -m evalplus.generate --model ./cadd-coder --backend diffusion \
    --temperature 0.1 --steps 512 --alg entropy \
    --use_cadd --cadd_sampling_mode weighted \
    --dataset mbpp --bs 1 --n_samples 1
```

See reproduce.md for the complete evaluation setup.

Citation

```bibtex
@article{zheng2025continuously,
  title={Continuously augmented discrete diffusion model for categorical generative modeling},
  author={Zheng, Huangjie and Gong, Shansan and Zhang, Ruixiang and Chen, Tianrong and Gu, Jiatao and Zhou, Mingyuan and Jaitly, Navdeep and Zhang, Yizhe},
  journal={arXiv preprint arXiv:2510.01329},
  year={2025}
}
```

Acknowledgments

Our codebase is built upon DiffuCoder. We sincerely appreciate OpenCoder, LLaMA-Factory, Dream, and Qwen2.5-Coder for their open-sourcing efforts.

License

Please refer to the LICENSE file for details.
