Effective Diffusion Transformer Architecture for Image Super-Resolution
TL;DR Summary
DiT-SR introduces a U-shaped diffusion transformer with frequency-adaptive conditioning, enhancing multi-scale feature extraction and resource allocation, achieving superior super-resolution without pretraining compared to prior-based methods.
Abstract
Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods, but through a training-from-scratch manner. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of diffusion transformer in image super-resolution.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Effective Diffusion Transformer Architecture for Image Super-Resolution
1.2. Authors
The authors are Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, and Jie Hu. Their affiliations include:
- State Key Laboratory of Integrated Services Networks, Xidian University
- Huawei Noah's Ark Lab
- Consumer Business Group, Huawei
- Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications
1.3. Journal/Conference
This paper is published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference in itself, it is a widely recognized platform for sharing cutting-edge research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are often submitted to prestigious conferences or journals later. The publication date suggests it is a very recent work in the field.
1.4. Publication Year
2024
1.5. Abstract
This paper introduces DiT-SR (Diffusion Transformer for Image Super-Resolution), a novel diffusion model architecture designed to achieve high visual quality in image super-resolution tasks through training from scratch, rather than relying on pre-trained models. The core innovations of DiT-SR include a U-shaped architecture for multi-scale hierarchical feature extraction and a uniform isotropic design for all transformer blocks, which optimizes computational resource allocation to critical layers. Furthermore, the paper identifies limitations in the widely used Adaptive Layer Normalization (AdaLN) for time-step conditioning and proposes a new frequency-adaptive time-step conditioning module called Adaptive Frequency Modulation (AdaFM). This AdaFM module enhances the model's ability to process distinct frequency information at different time steps, crucial for image super-resolution. Extensive experiments demonstrate that DiT-SR significantly outperforms existing training-from-scratch diffusion-based SR methods and even surpasses some prior-based methods (which leverage large pre-trained models like Stable Diffusion), thereby proving the superiority of the diffusion transformer approach in image super-resolution.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2409.19589 PDF Link: https://arxiv.org/pdf/2409.19589v1.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the performance gap between training-from-scratch diffusion-based super-resolution (SR) methods and prior-based SR methods. While diffusion models (DMs) have shown great promise in image super-resolution, prior-based methods (which fine-tune large pre-trained generative models like Stable Diffusion) generally achieve superior visual quality due to their extensive training on vast datasets. However, these prior-based methods suffer from slow inference speeds and lack flexibility for architectural modifications without massive retraining. Training-from-scratch methods, on the other hand, offer significant flexibility and are ideal for lightweight applications but have historically struggled to match the performance of their prior-based counterparts.
The motivation is to bridge this performance gap. The paper asks: "Can we develop a diffusion architecture trained from scratch while rivaling the performance of prior-based methods, balancing both performance and flexibility?" The advent of the Diffusion Transformer (DiT) architecture, known for its scalability and performance in image generation, makes this question feasible to explore.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- DiT-SR Architecture: The introduction of DiT-SR, an effective Diffusion Transformer designed specifically for image super-resolution. It is the first work to seamlessly combine a U-shaped global architecture (for multi-scale feature extraction) with an isotropic design for its transformer blocks (for efficient allocation of computational resources to critical layers). This design allows it to achieve visual quality comparable to or better than prior-based methods while being trained from scratch.
- Adaptive Frequency Modulation (AdaFM): The development of an efficient and effective frequency-wise time-step conditioning module called AdaFM. This module replaces the widely used AdaLN and adaptively reweights different frequency components at different time steps, addressing the specific frequency-perception requirements of image super-resolution. AdaFM uses significantly fewer parameters than AdaLN while boosting performance.
- Superior Performance with Fewer Parameters: Extensive experiments demonstrate that DiT-SR dramatically outperforms existing training-from-scratch diffusion-based SR methods, and even surpasses some prior-based SR methods built on pretrained Stable Diffusion while using only about 5% of their parameters. This proves the superiority of the diffusion transformer in image super-resolution and achieves a better balance between performance and flexibility.

The key findings are that by carefully designing the transformer architecture to leverage U-shaped multi-scale processing and isotropic resource allocation, and by introducing a frequency-aware time-step conditioning mechanism, it is possible to train a diffusion model from scratch that rivals the performance of computationally intensive prior-based methods, leading to more flexible and efficient super-resolution solutions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of several key concepts in deep learning and image processing is essential:
- Image Super-Resolution (SR): The task of reconstructing a high-resolution (HR) image from a given low-resolution (LR) input image. It is a classic problem in computer vision, aiming to recover lost details and improve image clarity.
- Diffusion Models (DMs): A class of generative models that learn to reverse a gradual diffusion process.
  - Forward Diffusion Process: Noise is progressively added to a data sample (e.g., an image) over a series of time steps, transforming it into pure Gaussian noise. For an image $x_0$, a noisy version $x_t$ at time step $t$ is generated by $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)I)$, where $x_t$ is the noisy image at time $t$, $x_0$ is the original image, $\mathcal{N}$ denotes a normal distribution, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ with $\{\alpha_s\}$ a predefined variance schedule, and $I$ is the identity matrix. (A runnable sketch of this sampling step appears at the end of this subsection.)
  - Reverse Denoising Process: The model learns to reverse this process, starting from pure noise and iteratively recovering the data by removing noise at each time step. This is typically done by training a denoiser network $\epsilon_\theta(x_t, t)$ to predict the noise added at time $t$. The denoising step samples from $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$, where the mean $\mu_\theta$ is predicted by the denoiser.
- Transformers: An architecture introduced in 2017 primarily for natural language processing, later adapted for computer vision.
  - Self-Attention: The core mechanism of transformers, allowing the model to weigh the importance of different parts of the input sequence (or image patches) when processing a particular element. It computes a weighted sum of value (V) vectors, where the weights are determined by the similarity between a query (Q) vector and key (K) vectors.
  - Multi-Head Self-Attention (MHSA): Multiple self-attention mechanisms run in parallel, each learning different relationships, and their outputs are concatenated.
  - Multi-Layer Perceptron (MLP): A feed-forward neural network applied independently to each position in the sequence, typically consisting of two linear layers with an activation function (e.g., GELU) in between.
  - Vision Transformer (ViT): An adaptation of transformers to image data, where images are split into fixed-size patches, linearly embedded, and then processed as sequences by a transformer encoder.
  - Diffusion Transformer (DiT): A transformer variant used as the denoiser in diffusion models. It replaces the convolutional U-Net typically used in DMs with an isotropic, full-transformer architecture, often maintaining constant resolution and channel dimensions across layers.
- U-Net Architecture: A convolutional neural network (CNN) architecture characterized by its U-shaped encoder-decoder structure with skip connections.
  - Encoder: Downsamples the input feature maps, extracting hierarchical features at different scales.
  - Decoder: Upsamples the features back to the original resolution, combining them with corresponding encoder features via skip connections to recover fine-grained details. U-Nets are widely used in image segmentation and low-level vision tasks because they capture both contextual and fine-grained information.
- Latent Diffusion Models (LDM): A type of diffusion model that performs the diffusion process in a compressed latent space rather than directly in pixel space. This significantly reduces computational cost, especially for high-resolution images, by using an encoder-decoder (e.g., VQGAN) to map images to and from the latent space.
Frequency Analysis (Fourier Transform): A mathematical tool that decomposes a signal (like an image) into its constituent frequencies.
- Low Frequencies: Represent the overall structure and smooth variations in an image.
- High Frequencies: Represent fine details, textures, and edges in an image.
- Fast Fourier Transform (FFT): An efficient algorithm to compute the
Discrete Fourier Transform.
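To make the forward process above concrete, here is a minimal NumPy sketch of sampling $x_t \sim q(x_t \mid x_0)$ under an assumed linear variance schedule; the schedule values and function names are illustrative, not taken from the paper:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng=np.random.default_rng(0)):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

alpha_bar = make_alpha_bar()
x0 = np.zeros((3, 64, 64))                 # a dummy "image"
x500 = q_sample(x0, t=500, alpha_bar=alpha_bar)
```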
3.2. Previous Works
The paper contextualizes its work by discussing existing diffusion-based SR methods, categorizing them into train-from-scratch and prior-based approaches, and also reviewing diffusion model architectures.
- Diffusion-based Image Super-Resolution:
  - SR3 [38]: A pioneer in applying diffusion models to image super-resolution, demonstrating their potential.
  - LDM [36]: Improves efficiency by performing the diffusion process in a latent space, making it more practical for high-resolution images. It uses a U-Net as its denoiser.
  - ResShift [58]: Reformulates the diffusion process to build a Markov chain directly between HR and LR images, rather than between HR images and Gaussian noise. This reduces the number of denoising steps required, improving inference speed. The paper adopts this paradigm.
    - Formula context for ResShift: ResShift introduces a residual $e_0 = y - x_0$ between the LR image $y$ and the HR image $x_0$, and a monotonically increasing shifting sequence $\{\eta_t\}_{t=1}^{T}$. The forward process is $q(x_t \mid x_0, y) = \mathcal{N}(x_t; x_0 + \eta_t e_0, \kappa^2 \eta_t I)$, where $x_t$ is the noisy image at time $t$, $x_0$ is the clean HR image, $y$ is the LR image (this notation, $y$ for LR and $x_0$ for HR, follows Section 3.2 of the paper and is used consistently here), $\eta_t$ is the shifting coefficient, and $\kappa$ controls the noise variance. This formulation injects the LR image directly into the diffusion process. The reverse process predicts $x_0$ from $x_t$ using a denoiser $f_\theta$, effectively shortening the required Markov chain.
  - Prior-based Methods: These methods, including StableSR [45], DiffBIR [29], PASD [55], and SeeSR [52], exploit the generative prior of large pre-trained diffusion models such as Stable Diffusion [35, 36]. While achieving remarkable results, they suffer from slow inference and limited flexibility for architectural changes. The paper notes attempts such as SinSR [49] and AddSR [53] to reduce denoising steps via knowledge distillation, but these typically do not allow fundamental architecture alterations without retraining.
- Diffusion Model Architecture:
  - U-Net [37]: Traditionally, most diffusion models (e.g., [6, 16, 33, 36, 40]) have used the U-Net architecture as their denoiser, often incorporating ResBlocks [14] and transformer blocks [42]. Its U-shaped design is known for hierarchical feature extraction.
  - Diffusion Transformer (DiT) [34]: Marked a significant departure by proposing an isotropic full-transformer architecture for denoising. It maintains constant resolution and channel dimensions across its transformer blocks, demonstrating superior scalability and establishing a new paradigm.
    - Formula context for AdaLN (from the original DiT paper; not explicit in this paper but crucial for context): AdaLN (Adaptive Layer Normalization) is a common conditioning mechanism in DiT models. It modulates normalized features using learned scale and shift parameters derived from the conditioning input (e.g., the time step $t$). If $h$ is the input feature to a normalization layer, AdaLN computes $\gamma(t) \odot \mathrm{Norm}(h) + \beta(t)$, where $\gamma(\cdot)$ and $\beta(\cdot)$ are learned functions (often MLPs) of the time step $t$ that produce per-channel scale and shift parameters. (A minimal sketch of this modulation appears at the end of this subsection.)
  - Subsequent DiT-based works: [9, 10, 13, 27, 31, 32] have adopted or built upon the DiT architecture, showing strong performance across various tasks.
  - U-ViT [2]: A hybrid approach that retains U-Net's long skip connections but without explicit upsampling or downsampling operations within the main processing path.
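For context, a minimal PyTorch sketch of AdaLN-style channel-wise conditioning as described above; module and dimension names are illustrative, and the DiT paper's adaLN-Zero variant additionally produces a gating term:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Channel-wise scale/shift modulation derived from a time-step embedding."""
    def __init__(self, channels: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)  # gamma(t), beta(t)

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # h: (B, N, C) token features; t_emb: (B, t_dim) time-step embedding
        gamma, beta = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        # the same per-channel scale/shift is applied at every spatial position
        return self.norm(h) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

Note how the modulation is uniform across positions within a channel; this is precisely the property the paper later argues is ill-suited to frequency-sensitive SR.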
3.3. Technological Evolution
The field of image super-resolution has evolved significantly:
- Traditional Methods: Early methods relied on interpolation or hand-crafted priors.
- CNN-based Methods: The advent of Convolutional Neural Networks (CNNs) revolutionized SR, with models such as SRCNN and ESRGAN achieving impressive results.
- GAN-based Methods: Generative Adversarial Networks (GANs) further pushed visual quality, especially on perceptual metrics, by learning to generate realistic details.
- Diffusion Models (DMs): More recently, diffusion models have emerged as powerful generative models, offering superior sample quality and diversity compared to GANs. Their application to SR has shown exceptional potential.
- Latent Diffusion Models (LDMs): To address the computational cost of DMs on high-resolution images, LDMs were introduced, performing diffusion in a compressed latent space.
- Transformer Architectures: Transformers moved from NLP to vision (ViT, SwinIR) and were then adapted for diffusion models (DiT), replacing the U-Net denoiser with a purely transformer-based architecture known for scalability.
- Hybrid and Optimized Architectures: This paper's work (DiT-SR) belongs to the latest stage, seeking to combine the best aspects of U-Net (multi-scale feature extraction) and DiT (scalable, isotropic transformer design) within the diffusion framework, specifically for super-resolution, while also addressing limitations in time-step conditioning.
3.4. Differentiation Analysis
Compared to the main methods in related work, DiT-SR offers several core differences and innovations:
- From Standard DiT to U-shaped DiT with Isotropic Design:
  - Standard DiT: DiT [34] is an isotropic full-transformer architecture, typically maintaining constant resolution and channel dimensions throughout its layers. While scalable, it lacks the explicit multi-scale hierarchical feature extraction that U-Nets provide, which is often beneficial for low-level vision tasks such as SR.
  - U-shaped DiT (conventional): Conventional U-Net-based diffusion denoisers use CNNs or transformers in a U-shaped manner (downsampling, upsampling, skip connections), but their transformer blocks are not isotropically designed or optimized for resource reallocation across stages in the way DiT is.
  - DiT-SR's Innovation: DiT-SR is the first to marry these two paradigms: it adopts an overall U-shaped encoder-decoder structure for multi-scale handling, but crucially employs a uniform isotropic design for all transformer blocks across stages. While the feature-map resolution changes across stages (as in a U-Net), DiT-SR reallocates computational resources by standardizing channel dimensions in an optimized way: it uses a larger channel dimension for high-resolution stages and a smaller one for low-resolution stages than a typical schedule would assign. This is a key difference from both a purely isotropic DiT (constant channels) and a typical U-Net (channels scaling steadily with depth).
- From AdaLN to Adaptive Frequency Modulation (AdaFM):
  - AdaLN: The widely used Adaptive Layer Normalization in DiT-based models modulates features channel-wise. While effective for general image generation, the paper argues it is inefficient for SR, because SR requires strong frequency perception and AdaLN cannot adaptively handle different frequency components at different denoising stages (e.g., low frequencies early, high frequencies late).
  - DiT-SR's Innovation (AdaFM): AdaFM directly addresses this limitation by moving time-step conditioning from the spatial domain to the frequency domain. It adaptively reweights different frequency components based on the time step, better matching SR's progressive recovery of high-frequency details. This is a novel and more parameter-efficient approach to time-step conditioning.
- From Prior-based to Training-from-Scratch (with comparable performance):
  - Prior-based Methods: These methods (StableSR, DiffBIR, PASD, SeeSR) achieve high quality by fine-tuning massive diffusion models (Stable Diffusion) pre-trained on enormous datasets. Their advantage comes from generative priors.
  - DiT-SR's Innovation: DiT-SR is trained entirely from scratch. Its innovative architectural design and AdaFM allow it to significantly outperform other training-from-scratch methods and even beat some prior-based methods, all while using substantially fewer parameters (about 5% of prior-based models). This offers a highly flexible and efficient alternative without sacrificing quality.

In essence, DiT-SR innovates by combining architectural strengths, introducing a specialized conditioning mechanism, and achieving a new state of the art for training-from-scratch diffusion SR that challenges the dominance of prior-based methods.
4. Methodology
The proposed DiT-SR (Diffusion Transformer for Image Super-Resolution) model aims to combine the strengths of U-shaped architectures (for multi-scale processing) with isotropic transformer designs (for efficient resource allocation) and introduces a novel frequency-adaptive time-step conditioning module called AdaFM. This section details its principles and core methodology.
4.1. Principles
The core idea behind DiT-SR is to leverage the best of both U-Net and Diffusion Transformer (DiT) paradigms, tailored specifically for image super-resolution, while addressing a critical shortcoming in time-step conditioning.
- Multi-scale Feature Extraction: Image super-resolution inherently benefits from processing information at multiple scales, capturing both coarse structural context and fine-grained details. The U-shaped architecture is adept at this, progressively downsampling to extract high-level features and then upsampling to reconstruct details.
- Efficient Transformer Scaling: DiT architectures have shown remarkable scalability and performance. The paper observes that high-resolution DiTs (those processing larger feature maps) benefit more from scaled-up computational resources. By applying an isotropic design principle within a U-shaped framework, DiT-SR strategically reallocates computation, concentrating more capacity on critical high-resolution layers. This yields a powerful transformer architecture within a given computational budget, avoiding the tedious channel-scheduling policies often needed in traditional U-Nets.
- Frequency-Adaptive Time-Step Conditioning: Diffusion models for SR exhibit a temporal evolution in their reconstruction process: they first generate low-frequency components (structure) and then high-frequency components (details). Standard time-step conditioning mechanisms such as AdaLN, which operate channel-wise in the spatial domain, are not frequency-aware and thus inefficient for SR. DiT-SR introduces AdaFM to explicitly adapt the modulation by frequency, allowing the model to emphasize different frequency components at different denoising stages.
4.2. Core Methodology In-depth (Layer by Layer)
DiT-SR is a denoiser that follows the Residual Shifting (ResShift) paradigm for image super-resolution.
4.2.1. Diffusion Models and Residual Shifting Context
The goal of diffusion-based SR methods is to model the conditional distribution $q(x_0 \mid y)$, where $y$ is the low-resolution (LR) image and $x_0$ is its corresponding high-resolution (HR) image.
Forward Diffusion Process (Standard DMs):
The standard forward process gradually adds noise to $x_0$ to obtain $x_t$ at time step $t$. This can be expressed in a single step using the reparameterization trick:
$$ q(x_t \mid x_0) = \mathcal{N}\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\right) $$
Here, $x_t$ is the noisy image at time $t$, $x_0$ is the original HR image, $\mathcal{N}$ denotes a normal distribution, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the predefined variance-schedule parameters up to time $t$, and $I$ is the identity matrix. This equation describes how a clean image is transformed into a noisy one by adding Gaussian noise, with the mean scaled by $\sqrt{\bar{\alpha}_t}$ and a variance of $1 - \bar{\alpha}_t$.
Reverse Denoising Process (Standard DMs):
The reverse process starts from pure Gaussian noise and iteratively generates $x_{t-1}$ from $x_t$. The model learns to approximate the posterior distribution:
$$ p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\left(x_{t-1};\; \mu_\theta(x_t, y, t),\; \sigma_t^2 I\right) $$
In this equation, $\mu_\theta$ is the mean predicted by the denoiser (parameterized by $\theta$), which estimates the noise (or the clean image) given the current noisy image $x_t$, the LR condition $y$, and the time step $t$; $\sigma_t^2$ is a constant variance that depends on $t$.
Residual Shifting (ResShift) Paradigm:
DiT-SR adopts the ResShift paradigm, which constructs a Markov chain directly between HR and LR images. This reformulation is more efficient for SR tasks.
Let $e_0 = y - x_0$ represent the residual between the LR image $y$ and the HR image $x_0$. A shifting sequence $\{\eta_t\}_{t=1}^{T}$ is introduced, increasing from $\eta_1 \to 0$ to $\eta_T \to 1$.
The forward process in ResShift is formulated as:
$$ q(x_t \mid x_0, y) = \mathcal{N}\left(x_t;\; x_0 + \eta_t e_0,\; \kappa^2 \eta_t I\right) $$
Here, $x_t$ is the noisy state at time $t$, $x_0$ is the HR image, $y$ is the LR image, and $e_0$ is the residual $y - x_0$. The mean of the normal distribution, $x_0 + \eta_t e_0$, means the noisy image is shifted towards the LR image according to the shifting coefficient $\eta_t$. The variance is $\kappa^2 \eta_t$, where $\kappa$ is a hyperparameter controlling the noise variance. The per-step coefficient $\alpha_t$ is defined as $\alpha_t = \eta_t - \eta_{t-1}$ for $t > 1$ and $\alpha_1 = \eta_1$.
The reverse (denoising) process in ResShift is formulated to predict $x_0$ directly:
$$ p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\left(x_{t-1};\; \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} f_\theta(x_t, y, t),\; \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I\right) $$
In this equation, the denoiser $f_\theta(x_t, y, t)$ directly predicts the clean HR image $\hat{x}_0$, which is combined with $x_t$ to form the posterior mean. This design simplifies the Markov chain and reduces the number of time steps required for SR.
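To make the ResShift formulation concrete, here is a minimal NumPy sketch of the forward sampling step and one reverse step following the equations above; the denoiser is a stand-in callable, and the $\eta$ schedule and $\kappa$ value are illustrative:

```python
import numpy as np

def resshift_forward(x0, y, eta_t, kappa=2.0, rng=np.random.default_rng(0)):
    """Sample x_t ~ N(x0 + eta_t * e0, kappa^2 * eta_t * I), with e0 = y - x0."""
    e0 = y - x0
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * e0 + kappa * np.sqrt(eta_t) * noise

def resshift_reverse_step(x_t, y, t, etas, denoiser, kappa=2.0,
                          rng=np.random.default_rng(0)):
    """One step of p(x_{t-1} | x_t, y); the denoiser predicts x0 directly.

    `etas` is indexed so that etas[0] = 0 (eta_0 = 0), hence the final step
    collapses to returning the denoiser's prediction with zero variance.
    """
    x0_hat = denoiser(x_t, y, t)                      # f_theta(x_t, y, t)
    alpha_t = etas[t] - etas[t - 1]
    mean = (etas[t - 1] / etas[t]) * x_t + (alpha_t / etas[t]) * x0_hat
    std = kappa * np.sqrt((etas[t - 1] / etas[t]) * alpha_t)
    return mean + std * rng.standard_normal(x_t.shape)
```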
4.2.2. Overall Architecture (DiT-SR)
The DiT-SR architecture (depicted in Figure 3 from the original paper) is an encoder-decoder network with an overall U-shaped global framework. However, it uniquely combines this with an isotropic design for all transformer blocks within different stages.

(Translated caption: This image is a schematic of three Diffusion Transformer architectures in the paper: (a) the standard DiT, (b) the U-shaped DiT, and (c) the proposed architecture. It shows the hierarchy of transformer blocks and the changes in feature-map size for each, reflecting the differences in multi-scale feature extraction.)
Figure 3 from the original paper illustrates three Diffusion Transformer architectures: (a) Standard DiT, (b) U-shaped DiT, and (c) the proposed DiT-SR. The Standard DiT maintains constant feature map resolution. The U-shaped DiT (b) and DiT-SR (c) utilize downsampling and upsampling. DiT-SR's key visual difference is the strategic reallocation of channels (C1, C2, C3, C4) to high-resolution layers by making them wider, while other U-shaped DiTs might scale channels differently.
Input: The LR image $y$ and the noisy image $x_t$ (or a representation derived from it) are concatenated along the channel dimension. This concatenated input, along with the time step $t$, is fed into the denoiser, which predicts $\hat{x}_0$ and iteratively refines it following the ResShift reverse process (Eq. 4).
Transformer Block: As shown in Figure 5 from the original paper, the fundamental building block of DiT-SR is a transformer block.

(Translated caption: This image is the schematic shown in Figure 6 of the paper, presenting feature maps and their spectra at different time steps before and after applying AdaFM. AdaFM enhances low frequencies in early denoising stages (the periphery of the spectrum darkens) and high frequencies in late stages (the periphery brightens), improving the model's responsiveness to frequencies at different time steps.)
Figure 5 from the original paper depicts the internal structure of a transformer block in DiT-SR and the Adaptive Frequency Modulation (AdaFM) module. The block contains Multi-Head Self-Attention (MHSA) and an MLP. AdaFM is integrated after each normalization layer to inject the time step into the frequency domain.
Each transformer block consists of:
- Multi-Head Self-Attention (MHSA): acts as a spatial mixer. Because global self-attention is costly in computation and memory on high-resolution inputs, DiT-SR employs local attention with window shifting, inspired by the Swin Transformer [30, 42]: self-attention is restricted to non-overlapping local windows, and cross-window interaction is enabled through shifted window partitions in successive blocks.
- Multi-Layer Perceptron (MLP): serves as a channel mixer, composed of two fully-connected layers separated by a GELU activation function.
- Normalization Layers: group normalization is applied before both the MHSA and MLP operations.
- Adaptive Frequency Modulation (AdaFM): critically, the proposed AdaFM module is integrated immediately after each normalization layer to inject the time-step condition.

The computation within a transformer block can be formulated as:
$$ (t_1, t_2) = \mathrm{MLP}_t(t), \qquad X \leftarrow X + \mathrm{MHSA}\big(\mathrm{AdaFM}(\mathrm{Norm}(X), t_1)\big), \qquad X \leftarrow X + \mathrm{MLP}\big(\mathrm{AdaFM}(\mathrm{Norm}(X), t_2)\big) $$
In this formulation:
- $t$: the current time step.
- $\mathrm{MLP}_t$: a multi-layer perceptron that processes the time step to generate two distinct time-step feature vectors $t_1$ and $t_2$, used to condition the MHSA and MLP branches, respectively.
- $X$: the feature map input to and output from the transformer block.
- $\mathrm{Norm}(\cdot)$: a normalization layer (group normalization here) applied to the input feature map.
- $\mathrm{AdaFM}(\cdot, \cdot)$: the proposed Adaptive Frequency Modulation module, which takes the normalized feature map and a time-step feature vector ($t_1$ or $t_2$) and adaptively modulates the features in the frequency domain.
- $\mathrm{MHSA}(\cdot)$: the multi-head self-attention mechanism, operating on the AdaFM-conditioned features; the addition $X + \cdots$ denotes a residual connection.
- $\mathrm{MLP}(\cdot)$: the multi-layer perceptron, likewise wrapped in a residual connection.
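To make this block structure concrete, here is a minimal PyTorch sketch. Window attention is simplified to plain multi-head attention over all tokens (the paper uses shifted-window attention), `AdaFM` refers to the sketch given in Section 4.2.4 below, and all module and parameter names are illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class DiTSRBlock(nn.Module):
    """Norm -> AdaFM -> MHSA, then Norm -> AdaFM -> MLP, each with a residual."""
    def __init__(self, channels: int, heads: int, t_dim: int, window: int = 8):
        super().__init__()
        # GroupNorm with 32 groups assumes channels divisible by 32
        self.norm1 = nn.GroupNorm(32, channels)
        self.norm2 = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))
        # MLP_t maps the time embedding to two frequency-scale vectors (t1, t2)
        self.time_mlp = nn.Linear(t_dim, 2 * window * window)
        self.adafm1 = AdaFM(window)   # see the AdaFM sketch in Section 4.2.4
        self.adafm2 = AdaFM(window)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; t_emb: (B, t_dim) time-step embedding
        t1, t2 = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = self.adafm1(self.norm1(x), t1)
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # spatial mixing
        x = x + attn_out.transpose(1, 2).view(b, c, hh, ww)
        h = self.adafm2(self.norm2(x), t2)
        tokens = h.flatten(2).transpose(1, 2)
        x = x + self.mlp(tokens).transpose(1, 2).view(b, c, hh, ww)  # channel mixing
        return x
```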
4.2.3. Isotropic Design in U-shaped DiT
The paper's DiT-SR merges the advantages of U-shaped architectures and isotropic designs.
- U-shaped Structure: The encoder progressively reduces the spatial resolution of feature maps while increasing their channel dimensions to extract multi-scale contextual information. The decoder reverses this process, upsampling features and decreasing channels, and uses skip connections to integrate fine-grained details from the encoder for reconstruction.
- Isotropic Design Integration: Inspired by observations on DiT (e.g., DiT can handle various patch sizes with the same depth and channels, and high-resolution DiTs scale better), DiT-SR introduces an isotropic design into its multi-scale U-shaped framework.
  - Standardized Channel Dimension: Within each transformer stage (which operates at a fixed resolution), all transformer blocks are configured with the same channel dimension.
  - Reallocated Computational Resources: Instead of uniformly scaling channels, DiT-SR reallocates computation. The standardized channel dimension in high-resolution stages is set larger than typically seen in standard U-Net configurations at those resolutions, yet much smaller than the channel dimensions of low-resolution stages (which usually have the widest channels in U-Nets). This strategic reallocation directs more computational power to high-resolution features, which are critical for image super-resolution because they carry fine details.
  - Benefit: This approach boosts the capacity of the transformer architecture in multi-scale paradigms with a more efficient use of the computational budget, often with fewer parameters than conventional U-Nets.

The following figure (Figure 4 from the original paper) illustrates how this isotropic design, when applied to a U-shaped DiT, reallocates FLOPs and parameters.
(Translated caption: This image is the schematic of Figure 5 in the paper, showing the transformer block structure in DiT-SR and the Adaptive Frequency Modulation (AdaFM) module. By injecting the time step into the frequency domain, AdaFM adaptively reweights different frequency components.)

Figure 4 from the original paper displays the distribution of FLOPs and parameters across the stages of a U-shaped DiT, comparing a non-isotropic design with the isotropic design. The graphs show that the isotropic design effectively reallocates computational resources, increasing FLOPs and parameters in higher-resolution stages relative to lower-resolution stages, thereby focusing resources on critical layers. This figure is key to understanding how the isotropic principle directs computation. The labels C1, C2, C3, C4 in Figure 3(c) denote the channel dimensions at the different stages; the paper describes allocating computational resources to the high-resolution layers by adjusting the relative sizes of these channels to boost model capacity. Table 4 provides the specific reallocated channel dimension, which is 192 for the base version of DiT-SR.
4.2.4. Frequency-Adaptive Time Step Conditioning (AdaFM)
The paper identifies a limitation in Adaptive Layer Normalization (AdaLN) for SR tasks and proposes Adaptive Frequency Modulation (AdaFM).
Limitation of AdaLN:
AdaLN, commonly used in DiT models, modulates features channel-wise. This means it applies the same scale and shift parameters across all spatial locations within a channel. However, image super-resolution demands strong frequency perception, as the diffusion model's denoising process involves different frequency components at distinct denoising phases. As observed in Figure 2 from the original paper, the model first reconstructs low-frequency components (overall structure) and then progressively refines high-frequency details (textures, edges).

(Translated caption: This image is a schematic of three Diffusion Transformer architectures in the paper: (a) the standard DiT, (b) the U-shaped DiT, and (c) the proposed architecture. It shows the hierarchy of transformer blocks and the changes in feature-map size for each, reflecting the differences in multi-scale feature extraction.)
Figure 2 from the original paper visualizes images and their Fourier spectra at different denoising stages of a diffusion-based SR model. In early stages, low-frequency components (the center of the spectrum) dominate, while in later stages, high-frequency components (the periphery of the spectrum) are progressively refined. This demonstrates the model's reliance on different frequency information at different time steps.
AdaLN cannot adaptively modulate features based on their spatial frequency content (e.g., apply different modulation to a smooth region vs. an edge region) because it operates uniformly across spatial locations within a channel. Generating spatial-wise modulation parameters from a single time-step vector for distinct high/low-frequency spatial positions is challenging.
Adaptive Frequency Modulation (AdaFM):
To overcome this, AdaFM replaces AdaLN after each normalization layer and shifts time-step modulation from the spatial domain to the frequency domain.
The process of AdaFM can be written as:
$$ M = \mathrm{Reshape}(t_i) \in \mathbb{R}^{s \times s}, \qquad F' = \mathrm{Fold}\Big(\mathrm{IFFT}\big(M \odot \mathrm{FFT}(\mathrm{Unfold}(F))\big)\Big) $$
Breaking down each step and symbol:
- $t_i$: a time-step feature vector ($t_1$ or $t_2$ from the transformer block equations), a 1-dimensional vector of length $s^2$ derived from the time step.
- $\mathrm{Reshape}(t_i)$: reshapes the time-step feature vector into an $s \times s$ matrix $M$, which serves as the frequency scale matrix used to adaptively reweight different frequency components. The value $s$ is the FFT window size, empirically set to 8.
- $F$: the spatial-domain feature map after normalization (i.e., $\mathrm{Norm}(X)$).
- $\mathrm{Unfold}(\cdot)$: the patch-unfolding operation. To handle various input resolutions and enhance efficiency, the spatial-domain feature map is segmented into non-overlapping $s \times s$ windows.
- $\mathrm{FFT}(\cdot)$: the Fast Fourier Transform, which converts each window from the spatial domain into a frequency-domain spectrogram containing the frequency components of that window and channel.
- $M \odot (\cdot)$: the core adaptive-modulation step. The frequency scale matrix derived from the time step is element-wise multiplied ($\odot$) with the spectrograms, reweighting the frequency components according to the time step and enhancing the model's ability to emphasize specific frequencies.
- $\mathrm{IFFT}(\cdot)$: the inverse Fast Fourier Transform, which converts the modulated spectrograms back into the spatial domain.
- $\mathrm{Fold}(\cdot)$: the patch-folding operation, reassembling the processed windows into the full spatial-domain feature map.
- $F'$: the output feature map, adaptively modulated in the frequency domain according to the time step.

Frequency-Spatial Correspondence: In a spectrum, each pixel at a given spatial position corresponds to a predetermined frequency component; this relationship is defined solely by the feature map's spatial dimensions, not its content. The frequency corresponding to the pixel at position (u, v) of a spectrum is:
$$ f_u = \frac{u}{H} f_s, \qquad f_v = \frac{v}{W} f_s $$
where:
- $f_u, f_v$: the vertical and horizontal frequencies, respectively.
- u, v: the coordinates of the pixel in the spectrum.
- H, W: the height and width of the spectrum (the window size in AdaFM).
- $f_s$: the sampling frequency.

This consistency (the frequency at a given spectral position is fixed) allows the same frequency scale matrix to be applied across all windows and channels, making AdaFM highly efficient.
Efficiency: AdaLN must map the time embedding to per-channel scale, shift, and gate parameters for both the self-attention and MLP branches (6C values per block, with C the channel dimension), whereas AdaFM requires only an $s \times s$ frequency scale matrix per branch ($2s^2 = 128$ values for $s = 8$). This results in significantly fewer parameters for AdaFM while boosting performance. The paper also points out that, since different frequencies correspond to distinct spatial locations on the feature map in the frequency domain, AdaFM effectively provides spatial-wise modulation in an indirect but efficient manner.
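Here is a minimal PyTorch sketch of the AdaFM operation described above, assuming spatial dimensions divisible by the window size; the unfold/fold steps are implemented with reshapes, and using the complex `fft2`/`ifft2` pair (keeping the real part afterwards) is an implementation convenience, not necessarily the authors' choice:

```python
import torch
import torch.nn as nn

class AdaFM(nn.Module):
    """Reweight per-window frequency components with a time-dependent matrix."""
    def __init__(self, window: int = 8):
        super().__init__()
        self.s = window

    def forward(self, f: torch.Tensor, t_vec: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W), H and W divisible by s; t_vec: (B, s*s)
        b, c, h, w = f.shape
        s = self.s
        scale = t_vec.view(b, 1, 1, 1, s, s)            # frequency scale matrix M
        # unfold into non-overlapping s x s windows: (B, C, H/s, W/s, s, s)
        win = f.view(b, c, h // s, s, w // s, s).permute(0, 1, 2, 4, 3, 5)
        spec = torch.fft.fft2(win)                      # per-window spectrum
        out = torch.fft.ifft2(scale * spec).real        # modulate and invert
        # fold the windows back into the full feature map
        return out.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
```

Used together with the block sketch in Section 4.2.2, `adafm1 = AdaFM(8)` receives `t_vec` from the block's time MLP; because the scale matrix is shared across all windows and channels, the added parameter cost stays tiny.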
The following figure (Figure 6 from the original paper) visually demonstrates the effect of AdaFM at different time steps.

(Translated caption: This image is a figure comparing super-resolution reconstruction results of different methods on synthetic and real-world datasets. In each sub-figure, the left side is the low-resolution input (LR), and to the right are the outputs of each method followed by the high-resolution ground truth (HR) or a reference. The figure highlights the superior detail recovery of the Ours method.)
Figure 6 from the original paper visualizes feature maps and their spectra (after FFT) before and after applying AdaFM at different time steps. In early stages, AdaFM enhances low-frequency components (the peripheral part of the spectrum darkens, indicating suppression of high frequencies); in later stages, it enhances high-frequency components (the peripheral part of the spectrum brightens, indicating emphasis on high frequencies). This adaptive behavior establishes a correlation between time step and frequency.
5. Experimental Setup
5.1. Datasets
The experiments evaluate the proposed model on the x4 real-world SR task.
Training Data: The training dataset is a comprehensive collection comprising:
- LSDIR [26]: a large-scale dataset for image restoration.
- DIV2K [1]: a high-quality dataset widely used for super-resolution and image restoration.
- DIV8K [11]: another dataset, focusing on 8K-resolution images.
- OutdoorSceneTraining [46]: a dataset of outdoor scenes.
- Flicker2K [41]: a dataset of images from Flickr.
- FFHQ [20]: the first 10,000 face images from the Flickr-Faces-HQ dataset, which contains high-quality human faces.

During training, HR images are randomly cropped into fixed-size patches, and the RealESRGAN [48] degradation pipeline is used to synthesize LR/HR pairs from these HR images.
Test Datasets:
- LSDIR-Test: a synthetic dataset created from LSDIR. The test images are center-cropped and subjected to the same RealESRGAN degradation pipeline used during training.
- RealSR [4]: a real-world dataset comprising 100 images captured with Canon 5D3 and Nikon D810 cameras. It provides LR/HR pairs obtained under real camera settings, making it suitable for evaluating performance on real-world degradations.
- RealSet65 [58]: a collection of 65 low-resolution images gathered from widely used datasets and the internet, also serving as a real-world SR benchmark.
Blind Face Restoration (Appendix E):
For Blind Face Restoration, a separate set of datasets and a different degradation pipeline are used:
- Training Data: the FFHQ [20] dataset, containing 70,000 high-quality face images at 1024x1024 resolution. These images are resized to 512x512, and LQ images are synthesized using a typical degradation pipeline described in [47].
- Test Datasets:
  - CelebA-HQ [19]: 2,000 HR images randomly selected from its validation set, used to synthesize LQ images following GFPGAN [47]'s degradation.
  - LFW [17]: Labeled Faces in the Wild, a dataset of 1,711 face images collected from diverse real-world sources, used to evaluate face recognition in unconstrained environments.
  - WebPhoto [47]: a dataset of 407 web-crawled face images, including older photos with significant degradation.
  - WIDER [62]: a subset of 970 face images from the WIDER Face dataset, featuring heavy degradation such as occlusions, varying poses, scales, and lighting.
5.2. Evaluation Metrics
The paper employs a combination of reference-based and non-reference (no-reference) metrics to comprehensively evaluate the super-resolution performance. Reference-based metrics require a ground-truth HR image, while non-reference metrics do not, making them suitable for real-world scenarios where ground truth is unavailable.
5.2.1. Reference-Based Metrics (for synthetic datasets)
- Peak Signal-to-Noise Ratio (PSNR):
  - Conceptual Definition: PSNR measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR generally indicates a higher-quality reconstruction. It is widely used but does not always correlate well with human perceptual quality.
  - Mathematical Formula:
    $$ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right), \qquad \mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big(I(i,j) - K(i,j)\big)^2 $$
  - Symbol Explanation:
    - $I$: the original (ground-truth) HR image.
    - $K$: the reconstructed SR image.
    - M, N: the dimensions (height and width) of the images.
    - I(i,j), K(i,j): the pixel values at coordinates (i,j) in $I$ and $K$, respectively.
    - $\mathrm{MAX}_I$: the maximum possible pixel value of the image (255 for 8-bit images).
    - $\mathrm{MSE}$: the mean squared error between the original and reconstructed images.
- Learned Perceptual Image Patch Similarity (LPIPS) [61]:
  - Conceptual Definition: LPIPS measures the perceptual difference between two images. Instead of comparing pixels directly, it computes the distance between features extracted by a pre-trained deep neural network (such as VGG or AlexNet). A lower LPIPS score indicates higher perceptual similarity (i.e., the images look more alike to humans).
  - Mathematical Formula: LPIPS compares features from intermediate layers of a pre-trained network and is typically calculated as:
    $$ \mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \big(\phi_l(x)_{h,w} - \phi_l(x_0)_{h,w}\big) \right\|_2^2 $$
  - Symbol Explanation:
    - $x$: the reconstructed SR image.
    - $x_0$: the ground-truth HR image.
    - $l$: index over the selected layers of the pre-trained network.
    - $\phi_l$: feature maps extracted from the $l$-th layer of the pre-trained network.
    - $w_l$: learned channel-wise weights for the $l$-th layer.
    - $\odot$: element-wise multiplication.
    - $H_l, W_l$: height and width of the feature map at layer $l$.
    - $\|\cdot\|_2^2$: squared L2 norm (Euclidean distance).

(A small computational sketch of both metrics follows.)
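For reference, a small sketch computing PSNR with NumPy, plus the usual way to evaluate LPIPS with the reference `lpips` package (shown in comments; LPIPS inputs are expected in [-1, 1]):

```python
import numpy as np

def psnr(img_true: np.ndarray, img_test: np.ndarray, max_i: float = 255.0) -> float:
    """PSNR in dB for images with pixel values in [0, max_i]."""
    mse = np.mean((img_true.astype(np.float64) - img_test.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_i ** 2 / mse))

# LPIPS via the reference implementation (pip install lpips):
# import lpips, torch
# loss_fn = lpips.LPIPS(net='alex')               # AlexNet backbone
# d = loss_fn(torch.rand(1, 3, 64, 64) * 2 - 1,   # inputs scaled to [-1, 1]
#             torch.rand(1, 3, 64, 64) * 2 - 1)
```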
5.2.2. Non-Reference Metrics (for both synthetic and real-world datasets)
These metrics are used when ground truth images are unavailable (e.g., for real-world SR evaluation). They attempt to quantify image quality in a way that aligns with human perception.
- CLIPIQA [44]:
  - Conceptual Definition: CLIPIQA leverages the CLIP (Contrastive Language-Image Pre-training) model to assess image quality. It measures the "look and feel" of images by evaluating their alignment with semantic descriptions of quality, aiming for better correlation with human perception than traditional metrics. A higher CLIPIQA score indicates better image quality.
  - Mathematical Formula: the computation involves CLIP's internal feature representations and similarity scores between image embeddings and quality-related text embeddings; there is no simple algebraic formula. Conceptually, it quantifies how well an image aligns with "high quality" textual descriptions in CLIP's joint embedding space.
- MUSIQ [21]:
  - Conceptual Definition: MUSIQ (Multi-scale Image Quality Transformer) is a no-reference image quality assessment (NR-IQA) metric that uses a transformer network to evaluate image quality. It processes image patches at multiple scales and aggregates the information to predict a quality score that is highly consistent with human perception. A higher MUSIQ score indicates better image quality.
  - Mathematical Formula: MUSIQ is a deep learning model; its "formula" is the trained transformer architecture with its learned weights, mapping an image to a scalar quality score.
- MANIQA [54]:
  - Conceptual Definition: MANIQA (Multi-dimension Attention Network for No-Reference Image Quality Assessment) is another NR-IQA metric. It employs a multi-dimension attention network to capture quality-aware features across different dimensions (e.g., spatial, channel), aiming for robust quality predictions that align well with human judgments. A higher MANIQA score indicates better image quality.
  - Mathematical Formula: like MUSIQ, MANIQA is a neural network whose formula is implicitly defined by its architecture and learned parameters.
5.2.3. Blind Face Restoration Specific Metrics (Appendix E)
- Identity Score (IDS): measures how well the identity of the face is preserved after restoration. A lower IDS is generally better, indicating an identity closer to the original.
- Landmark Distance (LMD): quantifies the distance between facial landmarks (e.g., eyes, nose, mouth corners) in the restored image and the ground truth. A lower LMD indicates better structural alignment.
- Fréchet Inception Distance (FID) [15]:
  - Conceptual Definition: FID measures the similarity between the feature distributions of generated and real images and is often used to evaluate the quality of generative models. A lower FID score indicates that the generated images are more similar to real images (higher quality and diversity).
  - Mathematical Formula:
    $$ \mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right) $$
  - Symbol Explanation:
    - $\mu_1, \mu_2$: the mean feature vectors of real and generated images, respectively, extracted from a specific layer of a pre-trained Inception-v3 model.
    - $\Sigma_1, \Sigma_2$: the covariance matrices of the feature vectors for real and generated images, respectively.
    - $\|\cdot\|_2^2$: squared L2 norm.
    - $\mathrm{Tr}(\cdot)$: the trace of a matrix.

(A sketch of the FID computation follows.)
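A minimal sketch of the FID computation from pre-extracted Inception-v3 features, using `scipy.linalg.sqrtm` for the matrix square root; feature extraction itself is omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny
        covmean = covmean.real          # imaginary components
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```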
5.3. Baselines
The paper compares DiT-SR against several state-of-the-art SR methods categorized by their approach:
- GAN-based Methods: methods that use Generative Adversarial Networks for SR.
  - RealSR-JPEG [18]
  - BSRGAN [60]
  - RealESRGAN [48]
  - SwinIR [28]: while SwinIR uses transformers, it is often grouped with GAN-based or CNN-based perceptual SR methods due to its training objectives and typical performance characteristics.
- Prior-based Methods: methods that leverage large pre-trained diffusion models (such as Stable Diffusion) as a generative prior.
  - StableSR-200 [45]
  - DiffBIR-50 [29]
  - PASD-20 [55]
  - SeeSR-50 [52]
- Training-from-Scratch Diffusion-based Methods: diffusion models trained on SR data without relying on large pre-trained text-to-image models.
  - LDM-100 [36]
  - ResShift-15 [58]

The suffix numbers (e.g., -200, -50, -15, -100) indicate the number of denoising steps used by each method.
5.4. Implementation Details
- Latent Space Operation: Following LDM [36], the DiT-SR architecture operates in the latent space. It uses a Vector-Quantized GAN (VQGAN) [7] with a downsampling factor of 4 to encode and decode images to and from the latent space.
- Training Schedule: The model is trained for 300,000 iterations.
- Batch Size: A batch size of 64 is used.
- Hardware: Training is performed on 8 NVIDIA Tesla V100 GPUs.
- Optimizer: Adam [23] is used, with a fixed initial learning rate.
- FFT Window Size: The FFT window size for AdaFM is empirically set to 8 [24, 43].
- Transformer Block Configuration (from Appendix A):
  - The number of transformer blocks is set to 6 per stage (for the base model).
  - The base channel (initial channel dimension) is configured to 160.
  - The U-shaped global architecture has 4 stages, with channel increase factors [1, 2, 2, 4], giving stage channel dimensions of [160, 320, 320, 640].
- Blind Face Restoration (Appendix E):
  - A VQGAN with a downsampling factor of 8 is used, and diffusion steps are set to 4.
  - The learning rate is warmed up over the first 5,000 iterations and then decayed with an annealing cosine schedule; training ends at 200,000 iterations.
  - A diffusion loss [16] in latent space and an LPIPS [61] loss in pixel space are adopted.
6. Results & Analysis
This section presents the experimental results, comparing DiT-SR with state-of-the-art methods and analyzing its various components through ablation studies.
6.1. Core Results Analysis
The results demonstrate that DiT-SR significantly outperforms existing training-from-scratch diffusion-based SR methods and even achieves competitive or superior performance compared to some prior-based methods, despite using substantially fewer parameters.
The following are the results from Table 1 of the original paper, showing comparisons on RealSR and RealSet65 datasets:
| Methods | #Params | RealSR | RealSet65 | ||||
| CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | ||
| GAN based Methods | |||||||
| RealSR-JPEG | 17M | 0.3611 | 36.068 | 0.1772 | 0.5278 | 50.5394 | 0.2943 |
| BSRGAN | 17M | 0.5438 | 63.5819 | 0.3685 | 0.616 | 65.5774 | 0.3897 |
| RealESRGAN | 17M | 0.4898 | 59.6766 | 0.3679 | 0.5987 | 63.2228 | 0.3871 |
| SwinIR | 12M | 0.4653 | 59.6316 | 0.3454 | 0.5778 | 63.8212 | 0.3816 |
| Prior based Methods | |||||||
| StableSR-200 | 919M | 0.5207 | 59.4264 | 0.3563 | 0.5338 | 56.9207 | 0.3387 |
| DiffBIR-50 | 1670M | 0.7142 | 66.843 | 0.4802 | 0.7398 | 69.7260 | 0.5000 |
| PASD-20 | 1469M | 0.5170 | 58.4394 | 0.3682 | 0.5731 | 61.8813 | 0.3893 |
| SeeSR-50 | 1619M | 0.6819 | 66.3461 | 0.5035 | 0.7030 | 68.9803 | 0.5084 |
| Training-from-Scratch Diff. based Methods | |||||||
| LDM-100 | 114M | 0.5969 | 55.4359 | 0.3071 | 0.5936 | 56.112 | 0.356 |
| ResShift-15 | 119M | 0.6028 | 58.8790 | 0.3891 | 0.6376 | 58.0400 | 0.4048 |
| Ours-15 | 61M | 0.7161 | 65.8334 | 0.5022 | 0.7120 | 66.7413 | 0.4821 |
Analysis of Table 1:
- Training-from-Scratch vs. Prior-based: Ours-15 (DiT-SR with 15 steps) achieves the highest CLIPIQA score (0.7161 on RealSR, 0.7120 on RealSet65) among all listed methods, and performs very strongly on MUSIQ and MANIQA. This is a crucial finding: Ours-15 is a training-from-scratch method, yet it significantly outperforms LDM-100 and ResShift-15 and even surpasses several prior-based methods such as StableSR-200, PASD-20, and SeeSR-50 in CLIPIQA (while remaining competitive with DiffBIR-50, which has vastly more parameters).
- Parameter Efficiency: Ours-15 achieves this with only 61M parameters, remarkably efficient compared to prior-based methods such as DiffBIR-50 (1670M parameters), PASD-20 (1469M), or SeeSR-50 (1619M). Even LDM-100 and ResShift-15 have roughly twice the parameters (114M and 119M, respectively) yet yield inferior results. This validates the paper's claim of superior performance with significantly fewer parameters.
- Visual Quality (Non-Reference Metrics): the strong performance on CLIPIQA, MUSIQ, and MANIQA indicates that DiT-SR produces images of high perceptual quality that align well with human judgment, a key goal for generative SR.

The following figure (Figure 1 from the original paper) provides a visual comparison of CLIPIQA scores against parameters and FLOPs.
(Translated caption: This image is a chart comparing the proposed method with recent image super-resolution methods on the RealSR dataset, plotting CLIPIQA against the number of parameters (top) and FLOPs (bottom). The chart distinguishes GAN-based, diffusion-based, and prior-based methods, showing the advantage of the proposed method in both performance and resource consumption.)
Figure 1 from the original paper visually reinforces the quantitative results. The top graph (CLIPIQA vs. Parameters) clearly shows that DiT-SR (labeled as "Ours") achieves the highest CLIPIQA score with a significantly lower parameter count than prior-based methods and better performance than other diff-based SR methods. The bottom graph (CLIPIQA vs. FLOPs) shows a similar trend, where DiT-SR offers a strong CLIPIQA score for its FLOPs count, further highlighting its efficiency.
The following are the results from Table 3 of the original paper, showing performance comparison on the synthetic LSDIR-Test dataset:
| Methods | LSDIR-Test | ||||
| PSNR↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | |
| GAN based Methods | |||||
| RealSR-JPEG | 22.16 | 0.360 | 0.546 | 59.02 | 0.342 |
| BSRGAN | 23.74 | 0.274 | 0.570 | 67.94 | 0.394 |
| RealESRGAN | 23.15 | 0.259 | 0.568 | 68.23 | 0.414 |
| SwinIR | 23.17 | 0.247 | 0.598 | 68.20 | 0.414 |
| Prior based Methods | |||||
| StableSR-200 | 22.68 | 0.267 | 0.660 | 68.91 | 0.416 |
| DiffBIR-50 | 22.84 | 0.274 | 0.709 | 70.05 | 0.455 |
| PASD-20 | 23.57 | 0.279 | 0.624 | 69.07 | 0.440 |
| SeeSR-50 | 22.90 | 0.251 | 0.718 | 72.47 | 0.559 |
| Training-from-Scratch Diff. based Methods | |||||
| LDM-100 | 23.34 | 0.255 | 0.601 | 66.84 | 0.413 |
| ResShift-15 | 23.83 | 0.247 | 0.640 | 67.74 | 0.464 |
| Ours-15 | 23.60 | 0.244 | 0.646 | 69.32 | 0.483 |
Analysis of Table 3:
- PSNR vs. Perceptual Metrics: On LSDIR-Test, ResShift-15 achieves the highest PSNR (23.83), a pixel-wise fidelity metric. However, Ours-15 achieves the best LPIPS (0.244, lower is better), CLIPIQA (0.646), MUSIQ (69.32), and MANIQA (0.483), which are perceptual quality metrics. This indicates that DiT-SR excels at generating perceptually pleasing results, even if pixel-level fidelity (as measured by PSNR) is slightly lower than ResShift-15's, a common trade-off in generative SR methods.
- Comparison with Prior-based Methods: Ours-15 remains highly competitive, surpassing StableSR-200 and PASD-20 on all perceptual metrics and staying comparable to DiffBIR-50. SeeSR-50 shows a very strong MANIQA score (0.559) but is generally weaker than Ours-15 on the other perceptual metrics. Again, DiT-SR achieves this with significantly fewer parameters.

The following figure (Figure 7 from the original paper) provides qualitative comparisons.
(Translated caption: This image is a visual comparison of super-resolution results on real-world datasets, showing the LR image and the reconstructions of multiple methods. The red box on the left marks the region of interest; the methods include BSRGAN, StableSR, DiffBIR, PASD, SeeSR, RealESRGAN, SwinIR, LDM, ResShift, and the proposed method.)
Figure 7 from the original paper qualitatively demonstrates the superior performance of DiT-SR. The image displays several examples comparing DiT-SR against other methods on both synthetic and real-world datasets. It highlights that DiT-SR produces images with sharper details and better texture reproduction, reducing artifacts compared to baselines. For instance, in the example images, DiT-SR is shown to recover finer text details and more natural textures than other methods.
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to validate the effectiveness of its proposed architectural components and AdaFM.
The following are the results from Table 2 of the original paper, showing ablation study results on U-shaped DiT and time conditioning:
| Configuration | #Params | FLOPs | RealSR | RealSet65 | |||
| DiT Arch. | Time Conditioning | CLIPIQA↑ | MUSIQ↑ | CLIPIQA↑ | MUSIQ↑ | ||
| Isotropic | AdaLN | 42.38M | 122.99G | 0.655 | 64.194 | 0.664 | 64.263 |
| U-shape | AdaLN | 264.39M | 122.87G | 0.688 | 64.062 | 0.693 | 65.604 |
| Ours | AdaLN | 100.64M(-62%) | 93.11G(-24%) | 0.700 | 64.676 | 0.699 | 67.634 |
| Ours | AdaFM | 60.79M(-77%) | 93.03G(-24%) | 0.716 | 65.833 | 0.712 | 66.741 |
6.2.1. U-shaped DiT with Isotropic Design
- Isotropic (Standard DiT) with AdaLN: A reimplemented standard DiT (isotropic) achieves decent performance (e.g., CLIPIQA 0.655 on RealSR) with 42.38M parameters and 122.99G FLOPs, serving as a baseline for DiT-style architectures.
- U-shape with AdaLN: A U-shaped DiT (without DiT-SR's isotropic channel reallocation) shows improved CLIPIQA (0.688 on RealSR) at similar FLOPs (122.87G) but with a massive increase in parameters (264.39M), about six times more than the isotropic DiT. This highlights the U-shape's performance benefits but also its parameter inefficiency in a naive DiT integration.
- Ours with AdaLN: The proposed DiT-SR architecture (combining the U-shape with the isotropic design, using AdaLN for a fair comparison) achieves even better CLIPIQA (0.700 on RealSR) with significantly fewer parameters (100.64M, a 62% reduction from the U-shaped DiT baseline) and fewer FLOPs (93.11G, a 24% reduction). This clearly demonstrates the effectiveness of the isotropic design for resource reallocation: it boosts performance while being far more parameter-efficient than a straightforward U-shaped DiT.
6.2.2. Adaptive-Frequency Modulation (AdaFM)
- Ours with AdaFM vs. Ours with AdaLN: Replacing AdaLN with AdaFM in the DiT-SR architecture yields further gains (CLIPIQA rises from 0.700 to 0.716 on RealSR). Crucially, this improvement comes with a significant reduction in parameters (from 100.64M to 60.79M, a 77% reduction relative to the U-shaped DiT baseline and a 39.6% reduction relative to Ours with AdaLN) and slightly fewer FLOPs. This confirms AdaFM's effectiveness: it is a more parameter-efficient and better-performing way to inject time-step conditioning, especially for SR tasks that benefit from frequency-adaptive modulation. Figure 6 provides visual evidence of how AdaFM adaptively emphasizes low-frequency components in early denoising stages and high-frequency components in later stages.

The following are the results from Table 5 of the original paper, showing the results of compressing the U-shaped DiT on real-world datasets:
| Methods | #Params | FLOPs | RealSR | | RealSet65 | |
| | | | CLIPIQA↑ | MUSIQ↑ | CLIPIQA↑ | MUSIQ↑ |
| U-shaped DiT | 264.39M | 122.87G | 0.688 | 64.062 | 0.693 | 65.604 |
| Shallower U-DiT | 196.65M(-26%) | 96.30G(-22%) | 0.671 | 63.319 | 0.683 | 64.097 |
| Narrower U-DiT | 214.20M(-19%) | 99.56G(-19%) | 0.682 | 63.631 | 0.692 | 65.469 |
| Ours w/ AdaLN | 100.64M(-62%) | 93.11G(-24%) | 0.700 | 64.676 | 0.699 | 67.634 |
Analysis of Table 5 (Compressing U-shaped DiT):
This table explores whether a large U-shaped DiT is redundant.
- Shallower U-DiT: Reducing the number of transformer blocks from 6 to 4 per stage (a 26% parameter reduction) leads to a performance drop (e.g., CLIPIQA from 0.688 to 0.671 on RealSR).
- Narrower U-DiT: Decreasing the base channel from 160 to 144 (a 19% parameter reduction) also results in a performance drop (e.g., CLIPIQA from 0.688 to 0.682 on RealSR).

These results suggest that the baseline U-shaped DiT (d6c160) is not overly redundant and that naive compression degrades performance. In contrast, Ours w/ AdaLN achieves better performance with a much larger parameter and FLOPs reduction (62% and 24%, respectively), validating its strategic resource reallocation.
The following are the results from Table 6 of the original paper, showing the performance of the lightweight version:
| Methods | #Params | RealSR CLIPIQA↑ | RealSR MUSIQ↑ | RealSR MANIQA↑ | RealSet65 CLIPIQA↑ | RealSet65 MUSIQ↑ | RealSet65 MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LDM-100 | 114M | 0.5969 | 55.4359 | 0.3071 | 0.5936 | 56.1120 | 0.3560 |
| ResShift-15 | 119M | 0.6028 | 58.8790 | 0.3891 | 0.6376 | 58.0400 | 0.4048 |
| Ours-15 | 61M | 0.7161 | 65.8334 | 0.5022 | 0.7120 | 66.7413 | 0.4821 |
| Ours-Lite-15 | 31M | 0.6670 | 63.0544 | 0.4565 | 0.6694 | 64.3387 | 0.4420 |
| Ours-Lite-1 | 31M | 0.6993 | 63.3759 | 0.4262 | 0.7092 | 64.8329 | 0.4299 |
Analysis of Table 6 (Lightweight Version):
- Ours-Lite-15: This lightweight version (31M parameters) is created by reducing the transformer blocks per stage (from 6 to 4), shrinking the base channels (160 to 128), and removing the deepest stage. It significantly outperforms LDM-100 and ResShift-15 on all metrics, despite having only about 25% of their parameters. This highlights the substantial model capacity and efficiency of the DiT-SR design.
- Ours-Lite-1: This version uses step distillation (specifically SinSR [49]) to achieve single-step denoising from Ours-Lite-15 (a generic sketch of this pattern follows the list). It shows an increase in CLIPIQA and MUSIQ but a decrease in MANIQA. This indicates a trade-off: distillation can improve inference speed and some perceptual metrics, but might not generalize perfectly across all IQA metrics.
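For intuition only, here is a self-contained sketch of the regression-style step-distillation pattern: a one-step student is fit to the output of a multi-step teacher. The tiny conv nets and the 15-iteration loop are placeholders so the snippet runs end to end; SinSR's actual objective is more involved than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder stand-ins for the 15-step teacher (Ours-Lite-15) and the
# one-step student (Ours-Lite-1); these are NOT the actual DiT-SR networks.
teacher = nn.Conv2d(3, 3, kernel_size=3, padding=1).eval()
student = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

lr_img = torch.rand(2, 3, 64, 64)  # dummy low-resolution batch in [0, 1]

with torch.no_grad():
    target = lr_img
    for _ in range(15):  # stand-in for the teacher's 15 sampling steps
        target = teacher(target)

pred = student(lr_img)  # the student restores in a single forward pass
loss = F.mse_loss(pred, target)
opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```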
The following are the results from Table 4 of the original paper, showing the diffusion architecture hyper-parameters:

| DiT Arch. | Time Conditioning | #Params | FLOPs | Number of Blocks | Channels | Reallocated Channel |
| --- | --- | --- | --- | --- | --- | --- |
| Isotropic | AdaLN | 42.38M | 122.99G | [6,6,6,6,6] | 160 | - |
| U-shape | AdaLN | 264.39M | 122.87G | [6,6,6,6] | [160,320,320,640] | - |
| Ours | AdaLN | 100.64M | 93.11G | [6,6,6,6] | [160,320,320,640] | 192 |
| Ours | AdaFM | 60.79M | 93.03G | [6,6,6,6] | [160,320,320,640] | 192 |
| Ours-Lite | AdaFM | 30.89M | 49.17G | [4,4,4] | [128,256,256] | 160 |
Analysis of Table 4 (Diffusion Architecture Hyper-parameters):
This table provides the detailed architectural configurations for the various DiT models used in the ablation studies and main experiments.
- Isotropic (Standard DiT): Five stages with 6 blocks each and a constant channel width of 160. This is a purely isotropic DiT without U-shaped up/downsampling.
- U-shape: Four stages with 6 blocks each; channels increase hierarchically (160, 320, 320, 640). Note the parameter count (264.39M) is much higher than the isotropic DiT's, even at similar FLOPs.
- Ours (DiT-SR): Keeps the U-shape channel progression but applies the isotropic design, running the internal transformer-block operations at a reallocated channel width of 192 (see the sketch after this list). This yields fewer parameters (100.64M) and FLOPs (93.11G) than the U-shape baseline, demonstrating its efficiency.
- Ours with AdaFM: The AdaFM module further reduces parameters (60.79M) and slightly reduces FLOPs (93.03G) while improving performance, confirming its efficiency.
- Ours-Lite: A highly compressed version with fewer blocks (4 per stage over 3 stages), smaller base channels (128), and a reallocated channel of 160. This dramatically reduces parameters (30.89M) and FLOPs (49.17G), showcasing the model's scalability for lightweight applications.
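The "Reallocated Channel" column can be read as follows: whatever a stage's hierarchical width is, each transformer block internally operates at one shared width, so per-block compute stays uniform across stages. Below is a hedged sketch of this pattern, assuming simple 1×1 projections and a placeholder body in place of the real attention/FFN; the names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn


class ReallocatedBlock(nn.Module):
    """Hedged sketch of the uniform isotropic design inside a U-shaped stage.

    Whatever the stage's hierarchical width (160/320/320/640 in Table 4),
    the block internally runs at one shared "reallocated" width (192), so
    compute per block is uniform across stages. The inner body below is a
    stand-in; the real block uses window attention and an FFN.
    """

    def __init__(self, stage_channels: int, realloc_channels: int = 192):
        super().__init__()
        self.proj_in = nn.Conv2d(stage_channels, realloc_channels, kernel_size=1)
        self.body = nn.Sequential(  # placeholder for attention + FFN
            nn.GroupNorm(8, realloc_channels),
            nn.Conv2d(realloc_channels, realloc_channels, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.proj_out = nn.Conv2d(realloc_channels, stage_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the residual connection keeps the stage width unchanged
        return x + self.proj_out(self.body(self.proj_in(x)))


# even a deep 640-channel stage pays only the 192-wide block cost internally
feat = torch.randn(1, 640, 8, 8)
print(ReallocatedBlock(640)(feat).shape)  # torch.Size([1, 640, 8, 8])
```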
6.2.3. Experiments on Blind Face Restoration (Appendix E)
The paper also evaluates DiT-SR on Blind Face Restoration, demonstrating its generalization capability beyond general SR.
The following are the results from Table 7 of the original paper, showing quantitative results on CelebA-Test:
| Methods | LPIPS↓ | IDS↓ | LMD↓ | FID↓ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DFDNet | 0.739 | 86.323 | 20.784 | 76.118 | 0.619 | 51.173 | 0.433 |
| PSFRGAN | 0.475 | 74.025 | 10.168 | 60.748 | 0.630 | 69.910 | 0.477 |
| GFPGAN | 0.416 | 66.820 | 8.886 | 27.698 | 0.671 | 75.388 | 0.626 |
| VQFR | 0.411 | 65.538 | 8.910 | 25.234 | 0.685 | 73.155 | 0.568 |
| CodeFormer | 0.324 | 59.136 | 5.035 | 26.160 | 0.698 | 75.900 | 0.571 |
| DiffFace-100 | 0.338 | 63.033 | 5.301 | 23.212 | 0.527 | 66.042 | 0.475 |
| ResShift-4 | 0.309 | 59.623 | 5.056 | 17.564 | 0.613 | 73.214 | 0.541 |
| Ours-4 | 0.337 | 61.4644 | 5.235 | 19.648 | 0.725 | 75.848 | 0.634 |
Analysis of Table 7 (CelebA-Test):
- Ours-4 (DiT-SR with 4 diffusion steps) achieves the best CLIPIQA, MUSIQ, and MANIQA scores, indicating superior perceptual quality for face restoration.
- ResShift-4 shows slightly better LPIPS, IDS, LMD, and FID scores, suggesting slightly better pixel-level fidelity and identity preservation. However, Ours-4 is very competitive on these metrics as well, again demonstrating a trade-off in which DiT-SR leans towards higher perceptual quality.

The following are the results from Table 8 of the original paper, showing quantitative results on real-world datasets for blind face restoration:
| Methods | LFW CLIPIQA↑ | LFW MUSIQ↑ | LFW MANIQA↑ | WebPhoto CLIPIQA↑ | WebPhoto MUSIQ↑ | WebPhoto MANIQA↑ | WIDER CLIPIQA↑ | WIDER MUSIQ↑ | WIDER MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DFDNet | 0.716 | 73.109 | 0.6062 | 0.654 | 69.024 | 0.550 | 0.625 | 63.210 | 0.514 |
| PSFRGAN | 0.647 | 73.602 | 0.5148 | 0.637 | 71.674 | 0.476 | 0.648 | 71.507 | 0.489 |
| GFPGAN | 0.687 | 74.836 | 0.5908 | 0.651 | 73.367 | 0.577 | 0.663 | 74.694 | 0.602 |
| VQFR | 0.710 | 74.386 | 0.5488 | 0.677 | 70.904 | 0.511 | 0.707 | 71.411 | 0.520 |
| CodeFormer | 0.689 | 75.480 | 0.5394 | 0.692 | 74.004 | 0.522 | 0.699 | 73.404 | 0.510 |
| DiffFace-100 | 0.593 | 70.362 | 0.4716 | 0.555 | 65.379 | 0.436 | 0.561 | 64.970 | 0.436 |
| ResShift-4 | 0.626 | 70.643 | 0.4893 | 0.621 | 71.007 | 0.495 | 0.629 | 71.084 | 0.494 |
| Ours-4 | 0.727 | 73.187 | 0.564 | 0.717 | 73.921 | 0.571 | 0.743 | 74.477 | 0.589 |
Analysis of Table 8 (Real-World Face Restoration):
- Ours-4 consistently achieves the highest CLIPIQA, MUSIQ, and MANIQA scores across all three real-world face datasets (LFW, WebPhoto, WIDER). This strong performance on no-reference metrics is particularly important for real-world applications where ground truth is absent, and it indicates that DiT-SR generates highly realistic and perceptually pleasing facial details even under challenging real-world degradations (a sketch of computing these metrics follows).
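For readers who want to reproduce this style of evaluation, the three no-reference metrics used throughout Tables 6-8 are available in the open-source pyiqa toolbox. The snippet below is a reasonable setup, not the authors' evaluation script; default metric configurations in pyiqa may differ from the paper's.

```python
import torch
import pyiqa  # open-source IQA toolbox: pip install pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
# the three no-reference metrics reported in Tables 6-8
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("clipiqa", "musiq", "maniqa")}

restored = torch.rand(1, 3, 256, 256, device=device)  # stand-in SR output in [0, 1]
for name, metric in metrics.items():
    print(name, metric(restored).item())  # higher is better for all three
```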
The following figure (Figure 10 from the original paper) provides qualitative comparisons for blind face restoration.
(Image caption: predictions of the image super-resolution diffusion model at different diffusion steps. The first row shows the predicted clean image at each step; the second row shows the corresponding Fourier spectra, illustrating that the model generates low-frequency content first and high-frequency details progressively.)
Figure 10 from the original paper provides qualitative comparisons of facial super-resolution across different methods. It visually demonstrates that DiT-SR (Ours) consistently produces more realistic and detailed facial reconstructions, often recovering expressions and fine textures more accurately than other methods, especially for degraded inputs from LFW, WebPhoto, and WIDER datasets.
6.3. More Visualization Results
The following figure (Figure 8 from the original paper) presents additional visualization results on real-world datasets.

(Image caption: a comparison of multiple super-resolution methods on face images. Each row includes the low-resolution input, restorations by different algorithms (e.g., GFPGAN, VQFR, CodeFormer), and the high-resolution (HR) ground truth, visually presenting the differences in visual quality among the methods.)
Figure 8 from the original paper shows additional qualitative comparisons of super-resolution results on real-world datasets. The images highlight DiT-SR's ability to produce sharp textures and fine details, outperforming other methods in rendering realistic outputs for various natural scenes.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiT-SR, an effective diffusion transformer architecture specifically tailored for image super-resolution. The core innovation lies in its hybrid design, combining a U-shaped global architecture for multi-scale hierarchical feature extraction with a uniform isotropic design for transformer blocks across different stages. This strategic combination reallocates computational resources to critical high-resolution layers, enhancing performance and efficiency. Furthermore, DiT-SR addresses a key limitation of traditional time-step conditioning in diffusion models by proposing Adaptive Frequency Modulation (AdaFM). AdaFM efficiently and adaptively modulates frequency components based on the time step, better aligning with the frequency-dependent denoising process observed in SR tasks.
Extensive experiments demonstrate that DiT-SR sets a new state-of-the-art for training-from-scratch diffusion-based SR methods, significantly outperforming existing approaches. Remarkably, it even beats some prior-based methods (which rely on massive pre-trained Stable Diffusion models) while using only a fraction of their parameters. This success proves the superiority and efficiency of the diffusion transformer approach in image super-resolution, offering a flexible and high-performing solution without the heavy computational overhead of fine-tuning gigantic models.
7.2. Limitations & Future Work
The authors acknowledge a few limitations and suggest future research directions:
- Scalability: Image super-resolution models generally do not exhibit the same level of scalability as text-to-image models, owing to task differences and limited data. While DiT-SR is highly parameter-efficient and achieves competitive performance, it still has room to fully surpass the absolute upper-bound performance of the most powerful prior-based models (though it already beats some).
- AdaFM's Potential: The authors believe AdaFM holds significant potential to establish a new time-step conditioning paradigm for diffusion models. They suggest extending its application beyond SR to various low-level visual tasks and even to text-to-image generation, particularly in scenarios where the generation process also follows a low-frequency-to-high-frequency progression.
- Ethical Considerations: As with other content-generation methods, the authors highlight the need for cautious use of their approach to prevent potential misuse, acknowledging the broader ethical implications of powerful generative AI.
7.3. Personal Insights & Critique
This paper presents a highly insightful and effective approach to image super-resolution.
- Architectural Synergy: The core idea of combining the U-shaped architecture with isotropic transformer blocks is a clever hybrid that leverages the best of both worlds. U-Nets are proven in low-level vision thanks to multi-scale feature handling, while DiTs offer scalability and efficient token processing. The strategic reallocation of computational resources, as shown in the ablation studies, is the key enabler for this synergy, allowing better performance with fewer parameters. This moves beyond simply swapping CNNs for transformers in a U-Net and instead rethinks how transformer resources should be distributed across scales.
- Frequency-Aware Conditioning: The introduction of AdaFM is a significant contribution. It addresses a fundamental mismatch between the frequency-dependent nature of image super-resolution (and denoising in general) and the purely channel-wise modulation of AdaLN (a contrast sketch follows this list). By explicitly operating in the frequency domain, AdaFM provides a more semantically relevant and efficient way to guide the denoising process. The visual evidence in Figure 6 is compelling and clearly demonstrates its adaptive behavior; this frequency-aware conditioning paradigm could indeed have broader implications for many generative models.
- Efficiency and Flexibility: The ability to achieve state-of-the-art performance among training-from-scratch methods while significantly reducing parameter count compared to prior-based models is a major strength. This opens doors for more flexible research, architectural modification, and deployment in resource-constrained environments or on edge devices, where prior-based models are often too cumbersome. The Ours-Lite version further underscores this potential for lightweight applications.
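To see the mismatch concretely, here is a hedged sketch of conventional AdaLN-style conditioning, for contrast with the AdaFM sketch in Section 6.2.2: one scale and shift per channel, broadcast over all spatial positions and hence over all frequencies at once. Names are illustrative assumptions, not any paper's implementation.

```python
import torch
import torch.nn as nn


class AdaLNSketch(nn.Module):
    """Channel-wise time conditioning: the time embedding produces a single
    scale/shift per channel, applied identically everywhere in space, so it
    cannot selectively emphasize low- or high-frequency content."""

    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        # LayerNorm-like normalization over (C, H, W), no learned affine
        self.norm = nn.GroupNorm(1, channels, affine=False)
        self.to_scale_shift = nn.Linear(time_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=1)
        return (self.norm(x) * (1 + scale[:, :, None, None])
                + shift[:, :, None, None])
```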
Potential Areas for Improvement/Further Exploration:
- Generalizability of Isotropic Channel Allocation: While the paper demonstrates the benefit of the isotropic design's channel allocation strategy for SR, it would be interesting to see a more detailed theoretical analysis, or an empirical study of how this reallocation generalizes to other low-level vision tasks or even high-level generation tasks.
- AdaFM in Other Domains: The authors correctly identify AdaFM's potential for text-to-image generation. Exploring this in detail could be a valuable extension, especially for models that also progressively refine details.
- Adaptive Windowing for FFT: The FFT window size is empirically set to 8. Investigating whether an adaptive or learned windowing strategy could further optimize AdaFM might be beneficial.
- Comparison of Parameter Counts: While the paper highlights parameter efficiency against prior-based methods very well, a direct comparison of FLOPs or parameters for AdaFM vs. AdaLN within the context of the same full architecture (not just the module itself) could further solidify the efficiency argument. The current tables do show this for Ours with AdaLN vs. Ours with AdaFM, which is good.

Overall, DiT-SR represents a substantial advancement in diffusion-based image super-resolution, offering a robust, efficient, and perceptually superior solution built on solid architectural and conditioning innovations. Its training-from-scratch nature positions it as a highly flexible and valuable contribution to the field.