Abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

1. Bibliographic Information

1.1. Title

Scalable Diffusion Models with Transformers

1.2. Authors

William Peebles: University of California, Berkeley (UC Berkeley).
Saining Xie: New York University.

1.3. Journal/Conference

The paper was initially published on arXiv on December 19, 2022. It is widely recognized as a seminal paper in the field of generative AI, later accepted at ICCV 2023 (International Conference on Computer Vision), a top-tier venue in computer vision.

1.4. Publication Year

2022 (Preprint), 2023 (Conference).

1.5. Abstract

This paper introduces Diffusion Transformers (DiTs), a new architecture for diffusion models that replaces the traditional convolutional U-Net backbone with a Transformer. Operating on latent patches of images, the authors demonstrate that DiTs adhere to scaling laws: performance (measured by Fréchet Inception Distance, FID) improves consistently with increased model complexity (Gflops). The largest model, DiT-XL/2, achieves state-of-the-art performance on ImageNet class-conditional generation benchmarks (256x256 and 512x512), outperforming prior U-Net-based diffusion models.

1.6. Original Source Link

Original Source: https://arxiv.org/abs/2212.09748
Status: Published (arXiv Preprint / ICCV 2023).

2. Executive Summary

2.1. Background & Motivation

The Problem: In the last few years, the Transformer architecture has revolutionized natural language processing (e.g., GPT, BERT) and computer vision (e.g., Vision Transformers). However, the leading class of image generation models, Diffusion Models (such as DALL-E 2's base, Stable Diffusion, etc.), largely relied on a convolutional neural network architecture known as the U-Net. The U-Net was inherited from earlier pixel-level models and had become the de-facto standard without much questioning of its necessity.

Motivation: The authors asked a fundamental question: Is the U-Net inductive bias (the specific design suited for images) actually necessary for diffusion models? Can diffusion models benefit from the massive scalability and standardization of Transformers? The goal was to "demystify" the architecture of diffusion models and see if a standard Transformer could replace the complex U-Net, thereby inheriting the ability to scale up performance simply by adding more compute (layers, width, tokens).

2.2. Main Contributions & Findings

Architecture Innovation (DiT): The paper proposes Diffusion Transformers (DiT). This architecture adheres strictly to the Vision Transformer (ViT) design principles but is adapted for the noise-prediction task of diffusion models. It operates in the latent space of a Variational Autoencoder (VAE).
Scalability Analysis: The authors provide a rigorous analysis of "scaling laws" for generative models. They discover a strong correlation between network complexity (measured in Gflops, giga floating-point operations) and image quality (measured by FID).
- Finding: Increasing the depth/width of the Transformer or increasing the number of input tokens (by using smaller patches) consistently lowers FID.
State-of-the-Art Results: The largest model, DiT-XL/2, achieved an FID of 2.27 on the ImageNet 256x256 benchmark, beating all previous diffusion models (like LDM and ADM) and even highly tuned GANs (StyleGAN-XL).

The following figure (Figure 2 from the original paper) summarizes this success. On the left, it shows how increasing model size (S to XL) and token count (patch size 8 to 2) lowers FID. On the right, it compares DiT against state-of-the-art competitors.

该图像是一个图表，展示了不同扩散变换器（DiTs）模型在ImageNet 256x256上的FID成绩。左侧的散点图显示了DiT-S、DiT-B、DiT-L和DiT-XL模型的性能，右侧的图表则比较了状态-of-the-art（SOTA）扩散模型及其引导性能，展示了DiT-XL/2-G模型的优越性。

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp four key pillars:

Diffusion Models (DDPMs):
- Concept: A generative model that learns to create images by reversing a noise process. Imagine taking a clear image and slowly adding static (Gaussian noise) until it is pure random noise (Forward Process). The model learns to do the reverse: start with pure noise and step-by-step remove it to reveal a clear image (Reverse Process).
- Training: The model is trained to predict the noise $\epsilon$ that was added to an image $x_0$ to get a noisy version $x_t$ at timestep $t$ .
Latent Diffusion Models (LDMs):
- Concept: Instead of operating directly on pixels (which is computationally expensive for high resolutions), LDMs first compress an image into a smaller, dense representation called a "latent code" using an Autoencoder (VAE). The diffusion process happens in this compressed latent space.
- DiT Context: DiT is an LDM. It does not generate pixels directly; it generates latent codes which are then decoded into images.
Transformers & Vision Transformers (ViT):
- Transformer: A neural network architecture based entirely on the Attention Mechanism, allowing the model to weigh the importance of different parts of the input data relative to each other.
- ViT (Vision Transformer): Adapts Transformers for images. It chops an image into square patches (e.g., 16x16 pixels), flattens them into vectors (tokens), and feeds them into a Transformer, treating the image patches just like words in a sentence.
- Key Insight: DiT treats the noisy latent representation as a sequence of patches.
Evaluation Metrics:
- FID (Fréchet Inception Distance): A score that compares the distribution of generated images to real images. Lower is better. It measures both realism and diversity.
- Gflops (Giga Floating-point Operations): A measure of the computational cost of running the model once. It serves as a proxy for model complexity.

3.2. Previous Works

U-Net in Diffusion (Ho et al., 2020; Dhariwal & Nichol, 2021): The standard architecture. It uses Convolutional layers (ResNet blocks), downsampling (making the image smaller), and upsampling layers, with "skip connections" linking the encoder and decoder parts.
Vision Transformers (Dosovitskiy et al., 2020): Proved that Transformers could beat CNNs (ResNets) in image classification if given enough data and compute.
Latent Diffusion (Rombach et al., 2022): The foundation of Stable Diffusion. It moved diffusion training from pixel space to latent space to save compute. It used a U-Net backbone.

3.3. Technological Evolution

Era of CNNs: For years, ResNets and U-Nets dominated computer vision.
Transformer Revolution: Transformers took over NLP (2017+) and then Vision (2020+).
Generative AI: Diffusion models beat GANs (2021) using U-Nets.
DiT (This Paper): Replaces the U-Net in diffusion with Transformers, completing the unification of architecture across domains.

3.4. Differentiation Analysis

Vs. Standard Diffusion (ADM, DDPM): Those use U-Nets. DiT uses a Transformer.
Vs. Stable Diffusion (LDM): LDM uses a U-Net operating on latents. DiT uses a Transformer operating on latents.
Innovation: DiT proves that the complex, domain-specific design of U-Net (downsampling, upsampling, skip connections) is not necessary. A plain, scalable Transformer works better.

4. Methodology

4.1. Principles

The core principle of DiT is simplicity and scalability. The authors hypothesize that by treating the diffusion process as a sequence modeling problem (predicting noise in a sequence of patches), they can leverage the proven scaling properties of Transformers. The architecture mimics the standard Vision Transformer (ViT) as closely as possible, minimizing custom design choices.

4.2. Core Methodology In-depth

The DiT architecture consists of a specific pipeline that processes latent representations. We break this down step-by-step. The following figure (Figure 3 from the original paper) provides a schematic overview of the architecture.

该图像是展示Diffusion Transformers (DiTs) 结构的示意图，包括基本的Latent Diffusion Transformer和不同类型的DiT Block，如带有adaLN-Zero、Cross-Attention和In-Context Conditioning的模块。图中详细描述了各个部分的功能和数据流动。

Step 1: Input and Patchify

The input to the model is a spatial latent representation $z$ , which comes from a pre-trained VAE encoder.

Input Shape: $z$ has a shape of $I \times I \times C$ . For a $256 \times 256$ image, the latent resolution is $32 \times 32$ ( $I=32$ ) with $C=4$ channels.
Patchify: The first layer converts this spatial input into a sequence of tokens. The spatial grid is divided into square patches of size $p \times p$ $p \times p$ .
- Formula: The number of tokens $T$ is calculated as: $T = \left( \frac{I}{p} \right)^2$
- Explanation: If $I=32$ and patch size $p=2$ , then $T = (32/2)^2 = 16^2 = 256$ tokens. Halving the patch size quadruples the number of tokens ( $T$ ). This is a key knob for scaling compute (Gflops).
- Each patch is linearly embedded into a vector of dimension $d$ (hidden size).
  
  The following figure (Figure 4 from the original paper) illustrates this "Patchify" process and how patch size affects sequence length.
  
  $Figure 4. Input specifications for DiT. Given patch size $p \\times p$ a spatial representation (the noised latent from the VAE) of shape $I \\times I \\times C$ is "patchified" into a sequence of length $T = ( I / p ) ^ { 2 }$ with hidden dimension $d$ A smaller patch size $p$ results in a longer sequence length and thus more Gflops.$ 该图像是DiT Block的示意图。图中展示了输入令牌的维度 T imes d 和噪声潜在表示的形状 1 imes I imes C，以及分块后生成的序列长度 $T = (I/p)^{2}$ ，其中较小的块大小 $p$ 会导致更长的序列长度，进而增加Gflops。

Step 2: Positional Embeddings

Since Transformers process tokens in parallel and have no inherent sense of spatial order (unlike convolutions), Positional Embeddings are added to the tokens.

DiT uses standard sine-cosine 2D positional embeddings (as used in ViT) to inform the model which patch corresponds to which part of the image.

Step 3: Diffusion Transformer Blocks

This is the engine of the model. The sequence of tokens passes through $N$ Transformer blocks. A crucial aspect of diffusion models is that they must be conditioned on:

Timestep $t$ : Telling the model how much noise is currently in the image.
Class Label $c$ : Telling the model what object to generate (e.g., "cat").

The authors explored four block designs to inject this conditional information. The best-performing design was the adaLN-Zero Block.

The adaLN-Zero Block (Adaptive Layer Norm Zero)

In standard Transformers, Layer Normalization (LayerNorm) learns fixed scale and shift parameters. In DiT, these parameters are dynamic (adaptive) based on $t$ and $c$ .

Conditioning Input: The timestep $t$ and label $c$ are embedded into vectors. Their sum is fed into a Multi-Layer Perceptron (MLP).
Regression of Modulation Parameters: The MLP outputs six parameters ( $\gamma_1, \beta_1, \alpha_1, \gamma_2, \beta_2, \alpha_2$ ) for each block.
Mechanism:
- Let $x$ be the input token sequence to the block.
- Attention Sub-block: $x = x + \alpha_1 \cdot \text{Attention}(\text{adaLN}(x, \gamma_1, \beta_1))$ Here, $\text{adaLN}(x, \gamma, \beta) = \gamma \cdot \text{LayerNorm}(x) + \beta$ . The parameter $\alpha_1$ is a gating factor applied before the residual connection.
- MLP (Feed-Forward) Sub-block: $x = x + \alpha_2 \cdot \text{MLP}(\text{adaLN}(x, \gamma_2, \beta_2))$
- Zero-Initialization: This is critical. The MLP predicting $\gamma, \beta, \alpha$ is initialized such that it outputs zero for the $\alpha$ parameters and the $\gamma$ parameters. This means the block initially acts as an identity function (passing the input through unchanged). This stabilizes training significantly.
  
  The impact of this design choice is significant. The following figure (Figure 5 from the original paper) shows that adaLN-Zero (the green line) achieves significantly lower FID (better quality) compared to other conditioning methods like Cross-Attention or In-Context tokens.
  
  该图像是一个示意图，比较了不同条件策略在训练过程中对FID-50K的影响。图中展示了XL/2在不同训练步骤下的表现，其中adaLN-Zero在所有训练阶段均优于其他策略，表明其在模型训练中的优势。

Step 4: Output and Depatchify

After passing through $N$ blocks:

The sequence of tokens is decoded using a standard linear layer.
Depatchify: The linear layer predicts the noise $\epsilon$ and the covariance (for the variance schedule). The output shape for each token matches the original patch size (pixel-wise).
The tokens are rearranged back into the spatial layout $I \times I \times C$ to form the predicted noise map.

4.3. Diffusion Formulation

DiT is trained using the standard Variational Lower Bound used in DDPMs.

Forward Process (Adding Noise): $q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t)\mathbf{I})$
- $x_t$ : Noisy image at step $t$ .
- $x_0$ : Original clean image.
- $\bar{\alpha}_t$ : Pre-defined noise schedule constants.
Reverse Process (Denoising): The model predicts the mean $\mu_\theta$ and covariance $\Sigma_\theta$ of the posterior: $p_\theta(x_{t-1} | x_t) = \mathcal{N}(\mu_\theta(x_t), \Sigma_\theta(x_t))$
Loss Function: The model is primarily trained to minimize the Mean Squared Error (MSE) between the true noise $\epsilon_t$ and the predicted noise $\epsilon_\theta$ : $\mathcal{L}_{simple}(\theta) = || \epsilon_\theta(x_t) - \epsilon_t ||_2^2$ (Note: The paper also includes a term for learning the covariance $\Sigma_\theta$ , following the method by Nichol & Dhariwal).
Classifier-Free Guidance: To improve image quality based on class labels, the model predicts noise twice: once conditioned on the class $c$ ( $\epsilon_\theta(x_t, c)$ ) and once unconditionally ( $\epsilon_\theta(x_t, \emptyset)$ ). The final prediction is a weighted combination: $\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$
- $s$ : The guidance scale. $s > 1$ amplifies the effect of the class condition.

5. Experimental Setup

5.1. Datasets

Dataset: ImageNet.
Resolutions: $256 \times 256$ and $512 \times 512$ .
Characteristics: ImageNet is the gold-standard benchmark for class-conditional image generation. It contains 1,000 distinct classes (e.g., "husky", "volcano", "balloon") and diverse natural images.
Latent Space: The images are downsampled by a factor of 8 ( $f=8$ ) using the VAE encoder from Stable Diffusion. $256 \times 256$ pixels become $32 \times 32$ latents.

5.2. Evaluation Metrics

5.2.1. Fréchet Inception Distance (FID)

Conceptual Definition: FID measures the distance between the distribution of real images and generated images in the feature space of a pre-trained Inception-v3 network. It captures both fidelity (how real the images look) and diversity (how varied they are). A score of 0 is perfect (identical distributions).
Mathematical Formula: $FID = ||\mu_r - \mu_g||_2^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$
Symbol Explanation:
- $\mu_r, \Sigma_r$ : The mean vector and covariance matrix of the features of real images.
- $\mu_g, \Sigma_g$ : The mean vector and covariance matrix of the features of generated images.
- $\text{Tr}$ : The trace operation (sum of diagonal elements).

5.2.2. Inception Score (IS)

Conceptual Definition: Measures how distinct the objects in the generated images are (sharpness of classification) and how diverse the classes are overall. Higher is better.
Mathematical Formula: $IS = \exp(\mathbb{E}_{x \sim p_g} [ D_{KL}( p(y|x) || p(y) ) ])$
Symbol Explanation:
- $x \sim p_g$ : Generated images.
- $p(y|x)$ : The conditional class distribution predicted by the Inception network for image $x$ (should be sharp for good images).
- p(y): The marginal class distribution (should be uniform for diverse images).
- $D_{KL}$ : Kullback-Leibler divergence.

5.2.3. Precision and Recall

Conceptual Definition:
- Precision: What fraction of generated images fall within the manifold of real images? (Measures quality/realism).
- Recall: What fraction of the real image manifold is covered by generated images? (Measures diversity).

5.3. Baselines

The paper compares DiT against the heavyweights of image generation:

ADM (Ablated Diffusion Model): The previous state-of-the-art pixel-space diffusion model by Dhariwal & Nichol (OpenAI). Uses a U-Net.
LDM (Latent Diffusion Model): The architecture behind Stable Diffusion. Uses a U-Net backbone on latents.
StyleGAN-XL: The strongest GAN model.

6. Results & Analysis

6.1. Core Results Analysis

The primary finding is that DiT-XL/2 (Extra Large model, Patch size 2) achieves a new state-of-the-art (SOTA) on ImageNet.

FID: DiT-XL/2 achieves an FID of 2.27 on ImageNet 256x256. This is significantly lower (better) than LDM-4 (3.60) and ADM (3.85).
Gflops vs. Quality: The paper establishes that Gflops are a strong predictor of quality. The following figure (Figure 8 from the original paper) plots Gflops against FID. Notice the clear negative correlation (diagonal line downwards): adding compute by making the model deeper/wider (S $\to$ B $\to$ L $\to$ XL) or using more tokens (patch size 8 $\to$ 4 $\to$ 2) directly reduces FID.

该图像是一个图表，展示了Transformer模型的Gflops与FID-50K之间的强相关性。图中的点表示不同DiT模型的Gflops与对应的FID值，相关性为-0.93，表明Gflops增大时FID值显著降低。

6.2. Detailed Benchmarks (Table 2)

The following are the results from Table 2 of the original paper, comparing DiT to other top models on ImageNet 256x256. Note how DiT-XL/2 dominates when Classifier-Free Guidance (cfg) is used.

Class-Conditional ImageNet 256×256
Model	FID↓	sFID↓	IS↑	Precision↑	Recall↑
BigGAN-deep	6.95	7.36	171.4	0.87	0.28
StyleGAN-XL	2.30	4.02	265.12	0.78	0.53
ADM	10.94	6.02	100.98	0.69	0.63
ADM-U	7.49	5.13	127.49	0.72	0.63
ADM-G	4.59	5.25	186.70	0.82	0.52
ADM-G, ADM-U	3.94	6.14	215.84	0.83	0.53
CDM	4.88	-	158.71	-	-
LDM-8	15.51	-	79.03	0.65	0.63
LDM-8-G	7.76	-	209.52	0.84	0.35
LDM-4	10.56	-	103.49	0.71	0.62
LDM-4-G (cfg=1.25)	3.95	-	178.22	0.81	0.55
LDM-4-G (cfg=1.50)	3.60	-	247.67	0.87	0.48
DiT-XL/2	9.62	6.85	121.50	0.67	0.67
DiT-XL/2-G (cfg=1.25)	3.22	5.28	201.77	0.76	0.62
DiT-XL/2-G (cfg=1.50)	2.27	4.60	278.24	0.83	0.57

6.3. Scaling Analysis & Efficiency

Model Size vs. Patch Size: The analysis shows that reducing patch size (e.g., from 4 to 2) is a very effective way to improve performance, often similar to making the model much deeper.
Compute Efficiency: The authors found that larger models are actually more compute-efficient.
- Explanation: A larger model (like XL) trained for fewer steps reaches a low FID faster than a small model trained for many steps.
- The following figure (Figure 9 from the original paper) demonstrates this. The larger models (lines ending on the left/lower side) reach lower FID with less total Training Gflops (x-axis) compared to smaller models that plateau.
  
  该图像是一个图表，展示了不同规模的DiT模型在总训练计算量与FID之间的关系。图中显示了多个不同配置的模型（如S/8、B/8等）在计算量（Gflops）变化时，FID值如何降低。插图部分放大了低FID区域的表现，突显了更大模型的计算效率。

6.4. Sampling Compute Analysis

An interesting question is: Can a small model match a large model if we just let it "think" longer during generation (i.e., use more sampling steps)?

Finding: No. Figure 10 shows that even if a small model uses 1000 sampling steps, it cannot beat a large model using only 128 steps. Model capacity (training Gflops) is more important than sampling compute.

$Figure 10. Scaling-up sampling compute does not compensate for a lack of model compute. For each of our DiT models trained for 400K iterations, we compute FID-10K using \[16, 32, 64, 128, 256, 1000\] sampling steps. For each number of steps, we plot the FID as well as the Gflops used to sample each image. Small models cannot close the performance gap with our large models, even if they sample with more test-time Gflops than the large models.$ 该图像是一个图表，展示了不同采样计算（Gflops）与FID-10K的关系。数据点表明，尽管小模型在测试时的Gflops可能更高，但它们无法弥补与大模型之间的性能差距。

6.5. Visual Samples

The visual quality of the samples is extremely high. The following images show samples from the 512x512 model.

Figure 15: Volcano

$Figure 15. Uncurated $5 1 2 \\times 5 1 2$ DiT-XL/2 samples. Classifier-free guidance scale $= 4 . 0$ Class label `=` "volcano" (980)$

Figure 20: Lion

$Figure 20. Uncurated $5 1 2 \\times 5 1 2$ DiT-XL/2 samples. Classifier-free guidance scale $= 4 . 0$ Class label `=` "lion" (291)$ 该图像是第20幅插图，展现了未经过滤的 512 imes 512 DiT-XL/2 样本，类别标签为“狮子”(291)，使用了无分类器引导，比例为4.0。图中展示了多种姿态和表情的狮子，突出其多样性和真实感。

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully demonstrates that the complex convolutional U-Net backbone, long considered essential for diffusion models, is not necessary. By replacing it with a standard Transformer (DiT) operating on latent patches, the authors achieved state-of-the-art performance. The work firmly establishes that diffusion models follow scaling laws: simply adding more compute (via model size or token count) yields better images. This aligns image generation with the massive scaling trends seen in Large Language Models (LLMs).

7.2. Limitations & Future Work

Computational Cost: The highest performing model (DiT-XL/2) is computationally heavy (118 Gflops per forward pass). While efficient relative to pixel-space models, it is still resource-intensive to train.
Latent Space Dependency: The model relies on a fixed, pre-trained VAE. The quality of the diffusion is upper-bounded by the quality of this VAE's compression.
Future Work: The authors suggest scaling DiT even further (larger models, more tokens) and applying it to text-to-image tasks (which later happened with OpenAI's Sora, which is widely believed to be based on DiT principles).

7.3. Personal Insights & Critique

The "Sora" Connection: This paper is widely cited as a foundational architecture for OpenAI's Sora video generation model. Sora treats video as "spacetime patches," a direct extension of DiT's image patches. Understanding DiT is crucial for understanding the current cutting edge of video AI.
Simplicity Wins: Just like in NLP, where complex LSTM/RNN architectures were replaced by the uniform Transformer, DiT shows the same trend in vision generation. The "Bitter Lesson" of AI (that general methods that leverage compute beat clever, hand-engineered methods) applies here perfectly.
Ablation Importance: The discovery of adaLN-Zero is a subtle but critical technical contribution. It highlights that while the macro architecture (Transformer) is robust, the micro details of how conditioning is injected matter immensely for performance.