Effective Diffusion Transformer Architecture for Image Super-Resolution
TL;DR Summary
DiT-SR introduces a U-shaped diffusion transformer with frequency-adaptive conditioning, enhancing multi-scale feature extraction and resource allocation, achieving superior super-resolution without pretraining compared to prior-based methods.
Abstract
Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods, but through a training-from-scratch manner. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of diffusion transformer in image super-resolution.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Effective Diffusion Transformer Architecture for Image Super-Resolution
1.2. Authors
The authors are Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, and Jie Hu. Their affiliations include:
- State Key Laboratory of Integrated Services Networks, Xidian University
- Huawei Noah's Ark Lab
- Consumer Business Group, Huawei
- Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications
1.3. Journal/Conference
This paper is published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference in itself, it is a widely recognized platform for sharing cutting-edge research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers published on arXiv are often submitted to prestigious conferences or journals later. The publication date suggests it is a very recent work in the field.
1.4. Publication Year
2024
1.5. Abstract
This paper introduces DiT-SR (Diffusion Transformer for Image Super-Resolution), a novel diffusion model architecture designed to achieve high visual quality in image super-resolution tasks through training from scratch, rather than relying on pre-trained models. The core innovations of DiT-SR include a U-shaped architecture for multi-scale hierarchical feature extraction and a uniform isotropic design for all transformer blocks, which optimizes computational resource allocation to critical layers. Furthermore, the paper identifies limitations in the widely used Adaptive Layer Normalization (AdaLN) for time-step conditioning and proposes a new frequency-adaptive time-step conditioning module called Adaptive Frequency Modulation (AdaFM). This AdaFM module enhances the model's ability to process distinct frequency information at different time steps, crucial for image super-resolution. Extensive experiments demonstrate that DiT-SR significantly outperforms existing training-from-scratch diffusion-based SR methods and even surpasses some prior-based methods (which leverage large pre-trained models like Stable Diffusion), thereby proving the superiority of the diffusion transformer approach in image super-resolution.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2409.19589 PDF Link: https://arxiv.org/pdf/2409.19589v1.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the performance gap between training-from-scratch diffusion-based super-resolution (SR) methods and prior-based SR methods. While diffusion models (DMs) have shown great promise in image super-resolution, prior-based methods (which fine-tune large pre-trained generative models like Stable Diffusion) generally achieve superior visual quality due to their extensive training on vast datasets. However, these prior-based methods suffer from slow inference speeds and lack flexibility for architectural modifications without massive retraining. Training-from-scratch methods, on the other hand, offer significant flexibility and are ideal for lightweight applications but have historically struggled to match the performance of their prior-based counterparts.
The motivation is to bridge this performance gap. The paper asks: "Can we develop a diffusion architecture trained from scratch while rivaling the performance of prior-based methods, balancing both performance and flexibility?" The advent of the Diffusion Transformer (DiT) architecture, known for its scalability and performance in image generation, makes this question feasible to explore.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- DiT-SR Architecture: The introduction of DiT-SR, an effective Diffusion Transformer designed specifically for image super-resolution. It is the first work to seamlessly combine a U-shaped global architecture (for multi-scale feature extraction) with an isotropic design for its transformer blocks (for efficient allocation of computational resources to critical layers). This design allows it to achieve visual quality comparable to or better than prior-based methods while being trained from scratch.
- Adaptive Frequency Modulation (AdaFM): The development of an efficient and effective frequency-wise time-step conditioning module called AdaFM. This module replaces the widely used AdaLN and adaptively reweights different frequency components at different time steps, addressing the specific frequency-perception requirements of image super-resolution. AdaFM uses significantly fewer parameters than AdaLN while boosting performance.
- Superior Performance with Fewer Parameters: Extensive experiments demonstrate that DiT-SR dramatically outperforms existing training-from-scratch diffusion-based SR methods, and even surpasses some prior-based SR methods built on pretrained Stable Diffusion while using only about 5% of their parameters. This proves the superiority of the diffusion transformer in image super-resolution and achieves a better balance between performance and flexibility.

The key findings are that by carefully designing the transformer architecture to leverage U-shaped multi-scale processing and isotropic resource allocation, and by introducing a frequency-aware time-step conditioning mechanism, it is possible to train a diffusion model from scratch that rivals the performance of computationally intensive prior-based methods, leading to more flexible and efficient super-resolution solutions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of several key concepts in deep learning and image processing is essential:
- Image Super-Resolution (SR): The task of reconstructing a high-resolution (HR) image from a given low-resolution (LR) input image. It is a classic problem in computer vision, aiming to recover lost details and improve image clarity.
- Diffusion Models (DMs): A class of generative models that learn to reverse a gradual diffusion process.
  - Forward Diffusion Process: Noise is progressively added to a data sample (e.g., an image) over a series of time steps, transforming it into pure Gaussian noise. For an image $x_0$, a noisy version $x_t$ at time step $t$ is generated by $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)I)$, where $x_t$ is the noisy image at time $t$, $x_0$ is the original image, $\mathcal{N}$ denotes a normal distribution, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ with $\{\alpha_s\}$ a predefined variance schedule, and $I$ is the identity matrix. (A runnable sketch of this sampling step appears at the end of this subsection.)
  - Reverse Denoising Process: The model learns to reverse this process, starting from pure noise and iteratively recovering the data by removing noise at each time step. This is typically done by training a denoiser network $\epsilon_\theta(x_t, t)$ to predict the noise added at time $t$. The denoising step samples from $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$, where the mean $\mu_\theta$ is predicted by the denoiser.
- Transformers: An architecture introduced in 2017 primarily for natural language processing, later adapted for computer vision.
  - Self-Attention: The core mechanism of transformers, allowing the model to weigh the importance of different parts of the input sequence (or image patches) when processing a particular element. It computes a weighted sum of value (V) vectors, where the weights are determined by the similarity between a query (Q) vector and key (K) vectors.
  - Multi-Head Self-Attention (MHSA): Multiple self-attention mechanisms run in parallel, each learning different relationships, and their outputs are concatenated.
  - Multi-Layer Perceptron (MLP): A feed-forward neural network applied independently to each position in the sequence, typically consisting of two linear layers with an activation function (e.g., GELU) in between.
  - Vision Transformer (ViT): An adaptation of transformers to image data, where images are split into fixed-size patches, linearly embedded, and then processed as sequences by a transformer encoder.
  - Diffusion Transformer (DiT): A transformer variant used as the denoiser in diffusion models. It replaces the convolutional U-Net typically used in DMs with an isotropic, full-transformer architecture, often maintaining constant resolution and channel dimensions across layers.
- U-Net Architecture: A convolutional neural network (CNN) architecture characterized by its U-shaped encoder-decoder structure with skip connections.
  - Encoder: Downsamples the input feature maps, extracting hierarchical features at different scales.
  - Decoder: Upsamples the features back to the original resolution, combining them with corresponding encoder features via skip connections to recover fine-grained details. U-Nets are widely used in image segmentation and low-level vision tasks because they capture both contextual and fine-grained information.
- Latent Diffusion Models (LDM): A type of diffusion model that performs the diffusion process in a compressed latent space rather than directly in pixel space. This significantly reduces computational cost, especially for high-resolution images, by using an encoder-decoder (e.g., VQGAN) to map images to and from the latent space.
Frequency Analysis (Fourier Transform): A mathematical tool that decomposes a signal (like an image) into its constituent frequencies.
- Low Frequencies: Represent the overall structure and smooth variations in an image.
- High Frequencies: Represent fine details, textures, and edges in an image.
- Fast Fourier Transform (FFT): An efficient algorithm to compute the
Discrete Fourier Transform.
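To make the forward process above concrete, here is a minimal NumPy sketch of sampling $x_t \sim q(x_t \mid x_0)$ under an assumed linear variance schedule; the schedule values and function names are illustrative, not taken from the paper:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng=np.random.default_rng(0)):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

alpha_bar = make_alpha_bar()
x0 = np.zeros((3, 64, 64))                 # a dummy "image"
x500 = q_sample(x0, t=500, alpha_bar=alpha_bar)
```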
3.2. Previous Works
The paper contextualizes its work by discussing existing diffusion-based SR methods, categorizing them into train-from-scratch and prior-based approaches, and also reviewing diffusion model architectures.
- Diffusion-based Image Super-Resolution:
  - SR3 [38]: A pioneer in applying diffusion models to image super-resolution, demonstrating their potential.
  - LDM [36]: Improves efficiency by performing the diffusion process in a latent space, making it more practical for high-resolution images. It uses a U-Net as its denoiser.
  - ResShift [58]: Reformulates the diffusion process to build a Markov chain directly between HR and LR images, rather than between HR images and Gaussian noise. This reduces the number of denoising steps required, improving inference speed. The paper adopts this paradigm.
    - Formula context for ResShift: ResShift introduces a residual $e_0 = y - x_0$ between the LR image $y$ and the HR image $x_0$, and a monotonically increasing shifting sequence $\{\eta_t\}_{t=1}^{T}$. The forward process is $q(x_t \mid x_0, y) = \mathcal{N}(x_t; x_0 + \eta_t e_0, \kappa^2 \eta_t I)$, where $x_t$ is the noisy image at time $t$, $x_0$ is the clean HR image, $y$ is the LR image (this notation, $y$ for LR and $x_0$ for HR, follows Section 3.2 of the paper and is used consistently here), $\eta_t$ is the shifting coefficient, and $\kappa$ controls the noise variance. This formulation injects the LR image directly into the diffusion process. The reverse process predicts $x_0$ from $x_t$ using a denoiser $f_\theta$, effectively shortening the required Markov chain.
  - Prior-based Methods: These methods, including StableSR [45], DiffBIR [29], PASD [55], and SeeSR [52], exploit the generative prior of large pre-trained diffusion models such as Stable Diffusion [35, 36]. While achieving remarkable results, they suffer from slow inference and limited flexibility for architectural changes. The paper notes attempts such as SinSR [49] and AddSR [53] to reduce denoising steps via knowledge distillation, but these typically do not allow fundamental architecture alterations without retraining.
- Diffusion Model Architecture:
  - U-Net [37]: Traditionally, most diffusion models (e.g., [6, 16, 33, 36, 40]) have used the U-Net architecture as their denoiser, often incorporating ResBlocks [14] and transformer blocks [42]. Its U-shaped design is known for hierarchical feature extraction.
  - Diffusion Transformer (DiT) [34]: Marked a significant departure by proposing an isotropic full-transformer architecture for denoising. It maintains constant resolution and channel dimensions across its transformer blocks, demonstrating superior scalability and establishing a new paradigm.
    - Formula context for AdaLN (from the original DiT paper; not explicit in this paper but crucial for context): AdaLN (Adaptive Layer Normalization) is a common conditioning mechanism in DiT models. It modulates normalized features using learned scale and shift parameters derived from the conditioning input (e.g., the time step $t$). If $h$ is the input feature to a normalization layer, AdaLN computes $\gamma(t) \odot \mathrm{Norm}(h) + \beta(t)$, where $\gamma(\cdot)$ and $\beta(\cdot)$ are learned functions (often MLPs) of the time step $t$ that produce per-channel scale and shift parameters. (A minimal sketch of this modulation appears at the end of this subsection.)
  - Subsequent DiT-based works: [9, 10, 13, 27, 31, 32] have adopted or built upon the DiT architecture, showing strong performance across various tasks.
  - U-ViT [2]: A hybrid approach that retains U-Net's long skip connections but without explicit upsampling or downsampling operations within the main processing path.
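For context, a minimal PyTorch sketch of AdaLN-style channel-wise conditioning as described above; module and dimension names are illustrative, and the DiT paper's adaLN-Zero variant additionally produces a gating term:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Channel-wise scale/shift modulation derived from a time-step embedding."""
    def __init__(self, channels: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)  # gamma(t), beta(t)

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # h: (B, N, C) token features; t_emb: (B, t_dim) time-step embedding
        gamma, beta = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        # the same per-channel scale/shift is applied at every spatial position
        return self.norm(h) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

Note how the modulation is uniform across positions within a channel; this is precisely the property the paper later argues is ill-suited to frequency-sensitive SR.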
3.3. Technological Evolution
The field of image super-resolution has evolved significantly:
- Traditional Methods: Early methods relied on interpolation or hand-crafted priors.
- CNN-based Methods: The advent of Convolutional Neural Networks (CNNs) revolutionized SR, with models such as SRCNN and ESRGAN achieving impressive results.
- GAN-based Methods: Generative Adversarial Networks (GANs) further pushed visual quality, especially on perceptual metrics, by learning to generate realistic details.
- Diffusion Models (DMs): More recently, diffusion models have emerged as powerful generative models, offering superior sample quality and diversity compared to GANs. Their application to SR has shown exceptional potential.
- Latent Diffusion Models (LDMs): To address the computational cost of DMs on high-resolution images, LDMs were introduced, performing diffusion in a compressed latent space.
- Transformer Architectures: Transformers moved from NLP to vision (ViT, SwinIR) and were then adapted for diffusion models (DiT), replacing the U-Net denoiser with a purely transformer-based architecture known for scalability.
- Hybrid and Optimized Architectures: This paper's work (DiT-SR) belongs to the latest stage, seeking to combine the best aspects of U-Net (multi-scale feature extraction) and DiT (scalable, isotropic transformer design) within the diffusion framework, specifically for super-resolution, while also addressing limitations in time-step conditioning.
3.4. Differentiation Analysis
Compared to the main methods in related work, DiT-SR offers several core differences and innovations:
- From Standard DiT to U-shaped DiT with Isotropic Design:
  - Standard DiT: DiT [34] is an isotropic full-transformer architecture, typically maintaining constant resolution and channel dimensions throughout its layers. While scalable, it lacks the explicit multi-scale hierarchical feature extraction that U-Nets provide, which is often beneficial for low-level vision tasks such as SR.
  - U-shaped DiT (conventional): Conventional U-Net-based diffusion denoisers use CNNs or transformers in a U-shaped manner (downsampling, upsampling, skip connections), but their transformer blocks are not isotropically designed or optimized for resource reallocation across stages in the way DiT is.
  - DiT-SR's Innovation: DiT-SR is the first to marry these two paradigms: it adopts an overall U-shaped encoder-decoder structure for multi-scale handling, but crucially employs a uniform isotropic design for all transformer blocks across stages. While the feature-map resolution changes across stages (as in a U-Net), DiT-SR reallocates computational resources by standardizing channel dimensions in an optimized way: it uses a larger channel dimension for high-resolution stages and a smaller one for low-resolution stages than a typical schedule would assign. This is a key difference from both a purely isotropic DiT (constant channels) and a typical U-Net (channels scaling steadily with depth).
- From AdaLN to Adaptive Frequency Modulation (AdaFM):
  - AdaLN: The widely used Adaptive Layer Normalization in DiT-based models modulates features channel-wise. While effective for general image generation, the paper argues it is inefficient for SR, because SR requires strong frequency perception and AdaLN cannot adaptively handle different frequency components at different denoising stages (e.g., low frequencies early, high frequencies late).
  - DiT-SR's Innovation (AdaFM): AdaFM directly addresses this limitation by moving time-step conditioning from the spatial domain to the frequency domain. It adaptively reweights different frequency components based on the time step, better matching SR's progressive recovery of high-frequency details. This is a novel and more parameter-efficient approach to time-step conditioning.
- From Prior-based to Training-from-Scratch (with comparable performance):
  - Prior-based Methods: These methods (StableSR, DiffBIR, PASD, SeeSR) achieve high quality by fine-tuning massive diffusion models (Stable Diffusion) pre-trained on enormous datasets. Their advantage comes from generative priors.
  - DiT-SR's Innovation: DiT-SR is trained entirely from scratch. Its innovative architectural design and AdaFM allow it to significantly outperform other training-from-scratch methods and even beat some prior-based methods, all while using substantially fewer parameters (about 5% of prior-based models). This offers a highly flexible and efficient alternative without sacrificing quality.

In essence, DiT-SR innovates by combining architectural strengths, introducing a specialized conditioning mechanism, and achieving a new state of the art for training-from-scratch diffusion SR that challenges the dominance of prior-based methods.
4. Methodology
The proposed DiT-SR (Diffusion Transformer for Image Super-Resolution) model aims to combine the strengths of U-shaped architectures (for multi-scale processing) with isotropic transformer designs (for efficient resource allocation) and introduces a novel frequency-adaptive time-step conditioning module called AdaFM. This section details its principles and core methodology.
4.1. Principles
The core idea behind DiT-SR is to leverage the best of both U-Net and Diffusion Transformer (DiT) paradigms, tailored specifically for image super-resolution, while addressing a critical shortcoming in time-step conditioning.
- Multi-scale Feature Extraction: Image super-resolution inherently benefits from processing information at multiple scales, capturing both coarse structural context and fine-grained details. The U-shaped architecture is adept at this, progressively downsampling to extract high-level features and then upsampling to reconstruct details.
- Efficient Transformer Scaling: DiT architectures have shown remarkable scalability and performance. The paper observes that high-resolution DiTs (those processing larger feature maps) benefit more from scaled-up computational resources. By applying an isotropic design principle within a U-shaped framework, DiT-SR strategically reallocates computation, concentrating more capacity on critical high-resolution layers. This yields a powerful transformer architecture within a given computational budget, avoiding the tedious channel-scheduling policies often needed in traditional U-Nets.
- Frequency-Adaptive Time-Step Conditioning: Diffusion models for SR exhibit a temporal evolution in their reconstruction process: they first generate low-frequency components (structure) and then high-frequency components (details). Standard time-step conditioning mechanisms such as AdaLN, which operate channel-wise in the spatial domain, are not frequency-aware and thus inefficient for SR. DiT-SR introduces AdaFM to explicitly adapt the modulation by frequency, allowing the model to emphasize different frequency components at different denoising stages.
4.2. Core Methodology In-depth (Layer by Layer)
DiT-SR is a denoiser that follows the Residual Shifting (ResShift) paradigm for image super-resolution.
4.2.1. Diffusion Models and Residual Shifting Context
The goal of diffusion-based SR methods is to model the conditional distribution $q(x_0 \mid y)$, where $y$ is the low-resolution (LR) image and $x_0$ is its corresponding high-resolution (HR) image.
Forward Diffusion Process (Standard DMs):
The standard forward process gradually adds noise to $x_0$ to obtain $x_t$ at time step $t$. This can be expressed in a single step using the reparameterization trick:
$$ q(x_t \mid x_0) = \mathcal{N}\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\right) $$
Here, $x_t$ is the noisy image at time $t$, $x_0$ is the original HR image, $\mathcal{N}$ denotes a normal distribution, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the predefined variance-schedule parameters up to time $t$, and $I$ is the identity matrix. This equation describes how a clean image is transformed into a noisy one by adding Gaussian noise, with the mean scaled by $\sqrt{\bar{\alpha}_t}$ and a variance of $1 - \bar{\alpha}_t$.
Reverse Denoising Process (Standard DMs):
The reverse process starts from pure Gaussian noise and iteratively generates $x_{t-1}$ from $x_t$. The model learns to approximate the posterior distribution:
$$ p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\left(x_{t-1};\; \mu_\theta(x_t, y, t),\; \sigma_t^2 I\right) $$
In this equation, $\mu_\theta$ is the mean predicted by the denoiser (parameterized by $\theta$), which estimates the noise (or the clean image) given the current noisy image $x_t$, the LR condition $y$, and the time step $t$; $\sigma_t^2$ is a constant variance that depends on $t$.
Residual Shifting (ResShift) Paradigm:
DiT-SR adopts the ResShift paradigm, which constructs a Markov chain directly between HR and LR images. This reformulation is more efficient for SR tasks.
Let $e_0 = y - x_0$ represent the residual between the LR image $y$ and the HR image $x_0$. A shifting sequence $\{\eta_t\}_{t=1}^{T}$ is introduced, increasing from $\eta_1 \to 0$ to $\eta_T \to 1$.
The forward process in ResShift is formulated as:
$$ q(x_t \mid x_0, y) = \mathcal{N}\left(x_t;\; x_0 + \eta_t e_0,\; \kappa^2 \eta_t I\right) $$
Here, $x_t$ is the noisy state at time $t$, $x_0$ is the HR image, $y$ is the LR image, and $e_0$ is the residual $y - x_0$. The mean of the normal distribution, $x_0 + \eta_t e_0$, means the noisy image is shifted towards the LR image according to the shifting coefficient $\eta_t$. The variance is $\kappa^2 \eta_t$, where $\kappa$ is a hyperparameter controlling the noise variance. The per-step coefficient $\alpha_t$ is defined as $\alpha_t = \eta_t - \eta_{t-1}$ for $t > 1$ and $\alpha_1 = \eta_1$.
The reverse (denoising) process in ResShift is formulated to predict $x_0$ directly:
$$ p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\left(x_{t-1};\; \frac{\eta_{t-1}}{\eta_t} x_t + \frac{\alpha_t}{\eta_t} f_\theta(x_t, y, t),\; \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I\right) $$
In this equation, the denoiser $f_\theta(x_t, y, t)$ directly predicts the clean HR image $\hat{x}_0$, which is combined with $x_t$ to form the posterior mean. This design simplifies the Markov chain and reduces the number of time steps required for SR.
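To make the ResShift formulation concrete, here is a minimal NumPy sketch of the forward sampling step and one reverse step following the equations above; the denoiser is a stand-in callable, and the $\eta$ schedule and $\kappa$ value are illustrative:

```python
import numpy as np

def resshift_forward(x0, y, eta_t, kappa=2.0, rng=np.random.default_rng(0)):
    """Sample x_t ~ N(x0 + eta_t * e0, kappa^2 * eta_t * I), with e0 = y - x0."""
    e0 = y - x0
    noise = rng.standard_normal(x0.shape)
    return x0 + eta_t * e0 + kappa * np.sqrt(eta_t) * noise

def resshift_reverse_step(x_t, y, t, etas, denoiser, kappa=2.0,
                          rng=np.random.default_rng(0)):
    """One step of p(x_{t-1} | x_t, y); the denoiser predicts x0 directly.

    `etas` is indexed so that etas[0] = 0 (eta_0 = 0), hence the final step
    collapses to returning the denoiser's prediction with zero variance.
    """
    x0_hat = denoiser(x_t, y, t)                      # f_theta(x_t, y, t)
    alpha_t = etas[t] - etas[t - 1]
    mean = (etas[t - 1] / etas[t]) * x_t + (alpha_t / etas[t]) * x0_hat
    std = kappa * np.sqrt((etas[t - 1] / etas[t]) * alpha_t)
    return mean + std * rng.standard_normal(x_t.shape)
```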
4.2.2. Overall Architecture (DiT-SR)
The DiT-SR architecture (depicted in Figure 3 from the original paper) is an encoder-decoder network with an overall U-shaped global framework. However, it uniquely combines this with an isotropic design for all transformer blocks within different stages.

(Translated caption: This image is a schematic of three Diffusion Transformer architectures in the paper: (a) the standard DiT, (b) the U-shaped DiT, and (c) the proposed architecture. It shows the hierarchy of transformer blocks and the changes in feature-map size for each, reflecting the differences in multi-scale feature extraction.)
Figure 3 from the original paper illustrates three Diffusion Transformer architectures: (a) Standard DiT, (b) U-shaped DiT, and (c) the proposed DiT-SR. The Standard DiT maintains constant feature map resolution. The U-shaped DiT (b) and DiT-SR (c) utilize downsampling and upsampling. DiT-SR's key visual difference is the strategic reallocation of channels (C1, C2, C3, C4) to high-resolution layers by making them wider, while other U-shaped DiTs might scale channels differently.
Input: The LR image $y$ and the noisy image $x_t$ (or a representation derived from it) are concatenated along the channel dimension. This concatenated input, along with the time step $t$, is fed into the denoiser, which predicts $\hat{x}_0$ and iteratively refines it following the ResShift reverse process (Eq. 4).
Transformer Block: As shown in Figure 5 from the original paper, the fundamental building block of DiT-SR is a transformer block.

(Translated caption: This image is the schematic shown in Figure 6 of the paper, presenting feature maps and their spectra at different time steps before and after applying AdaFM. AdaFM enhances low frequencies in early denoising stages (the periphery of the spectrum darkens) and high frequencies in late stages (the periphery brightens), improving the model's responsiveness to frequencies at different time steps.)
Figure 5 from the original paper depicts the internal structure of a transformer block in DiT-SR and the Adaptive Frequency Modulation (AdaFM) module. The block contains Multi-Head Self-Attention (MHSA) and an MLP. AdaFM is integrated after each normalization layer to inject the time step into the frequency domain.
Each transformer block consists of:
- Multi-Head Self-Attention (MHSA): acts as a spatial mixer. Because global self-attention is costly in computation and memory on high-resolution inputs, DiT-SR employs local attention with window shifting, inspired by the Swin Transformer [30, 42]: self-attention is restricted to non-overlapping local windows, and cross-window interaction is enabled through shifted window partitions in successive blocks.
- Multi-Layer Perceptron (MLP): serves as a channel mixer, composed of two fully-connected layers separated by a GELU activation function.
- Normalization Layers: group normalization is applied before both the MHSA and MLP operations.
- Adaptive Frequency Modulation (AdaFM): critically, the proposed AdaFM module is integrated immediately after each normalization layer to inject the time-step condition.

The computation within a transformer block can be formulated as:
$$ (t_1, t_2) = \mathrm{MLP}_t(t), \qquad X \leftarrow X + \mathrm{MHSA}\big(\mathrm{AdaFM}(\mathrm{Norm}(X), t_1)\big), \qquad X \leftarrow X + \mathrm{MLP}\big(\mathrm{AdaFM}(\mathrm{Norm}(X), t_2)\big) $$
In this formulation:
- $t$: the current time step.
- $\mathrm{MLP}_t$: a multi-layer perceptron that processes the time step to generate two distinct time-step feature vectors $t_1$ and $t_2$, used to condition the MHSA and MLP branches, respectively.
- $X$: the feature map input to and output from the transformer block.
- $\mathrm{Norm}(\cdot)$: a normalization layer (group normalization here) applied to the input feature map.
- $\mathrm{AdaFM}(\cdot, \cdot)$: the proposed Adaptive Frequency Modulation module, which takes the normalized feature map and a time-step feature vector ($t_1$ or $t_2$) and adaptively modulates the features in the frequency domain.
- $\mathrm{MHSA}(\cdot)$: the multi-head self-attention mechanism, operating on the AdaFM-conditioned features; the addition $X + \cdots$ denotes a residual connection.
- $\mathrm{MLP}(\cdot)$: the multi-layer perceptron, likewise wrapped in a residual connection.
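To make this block structure concrete, here is a minimal PyTorch sketch. Window attention is simplified to plain multi-head attention over all tokens (the paper uses shifted-window attention), `AdaFM` refers to the sketch given in Section 4.2.4 below, and all module and parameter names are illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class DiTSRBlock(nn.Module):
    """Norm -> AdaFM -> MHSA, then Norm -> AdaFM -> MLP, each with a residual."""
    def __init__(self, channels: int, heads: int, t_dim: int, window: int = 8):
        super().__init__()
        # GroupNorm with 32 groups assumes channels divisible by 32
        self.norm1 = nn.GroupNorm(32, channels)
        self.norm2 = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))
        # MLP_t maps the time embedding to two frequency-scale vectors (t1, t2)
        self.time_mlp = nn.Linear(t_dim, 2 * window * window)
        self.adafm1 = AdaFM(window)   # see the AdaFM sketch in Section 4.2.4
        self.adafm2 = AdaFM(window)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; t_emb: (B, t_dim) time-step embedding
        t1, t2 = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = self.adafm1(self.norm1(x), t1)
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # spatial mixing
        x = x + attn_out.transpose(1, 2).view(b, c, hh, ww)
        h = self.adafm2(self.norm2(x), t2)
        tokens = h.flatten(2).transpose(1, 2)
        x = x + self.mlp(tokens).transpose(1, 2).view(b, c, hh, ww)  # channel mixing
        return x
```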
4.2.3. Isotropic Design in U-shaped DiT
The paper's DiT-SR merges the advantages of U-shaped architectures and isotropic designs.
- U-shaped Structure: The encoder progressively reduces the spatial resolution of feature maps while increasing their channel dimensions to extract multi-scale contextual information. The decoder reverses this process, upsampling features and decreasing channels, and uses skip connections to integrate fine-grained details from the encoder for reconstruction.
- Isotropic Design Integration: Inspired by observations on DiT (e.g., DiT can handle various patch sizes with the same depth and channels, and high-resolution DiTs scale better), DiT-SR introduces an isotropic design into its multi-scale U-shaped framework.
  - Standardized Channel Dimension: Within each transformer stage (which operates at a fixed resolution), all transformer blocks are configured with the same channel dimension.
  - Reallocated Computational Resources: Instead of uniformly scaling channels, DiT-SR reallocates computation. The standardized channel dimension in high-resolution stages is set larger than typically seen in standard U-Net configurations at those resolutions, yet much smaller than the channel dimensions of low-resolution stages (which usually have the widest channels in U-Nets). This strategic reallocation directs more computational power to high-resolution features, which are critical for image super-resolution because they carry fine details.
  - Benefit: This approach boosts the capacity of the transformer architecture in multi-scale paradigms with a more efficient use of the computational budget, often with fewer parameters than conventional U-Nets.

The following figure (Figure 4 from the original paper) illustrates how this isotropic design, when applied to a U-shaped DiT, reallocates FLOPs and parameters.
(Translated caption: This image is the schematic of Figure 5 in the paper, showing the transformer block structure in DiT-SR and the Adaptive Frequency Modulation (AdaFM) module. By injecting the time step into the frequency domain, AdaFM adaptively reweights different frequency components.)

Figure 4 from the original paper displays the distribution of FLOPs and parameters across the stages of a U-shaped DiT, comparing a non-isotropic design with the isotropic design. The graphs show that the isotropic design effectively reallocates computational resources, increasing FLOPs and parameters in higher-resolution stages relative to lower-resolution stages, thereby focusing resources on critical layers. This figure is key to understanding how the isotropic principle directs computation. The labels C1, C2, C3, C4 in Figure 3(c) denote the channel dimensions at the different stages; the paper describes allocating computational resources to the high-resolution layers by adjusting the relative sizes of these channels to boost model capacity. Table 4 provides the specific reallocated channel dimension, which is 192 for the base version of DiT-SR.
4.2.4. Frequency-Adaptive Time Step Conditioning (AdaFM)
The paper identifies a limitation in Adaptive Layer Normalization (AdaLN) for SR tasks and proposes Adaptive Frequency Modulation (AdaFM).
Limitation of AdaLN:
AdaLN, commonly used in DiT models, modulates features channel-wise. This means it applies the same scale and shift parameters across all spatial locations within a channel. However, image super-resolution demands strong frequency perception, as the diffusion model's denoising process involves different frequency components at distinct denoising phases. As observed in Figure 2 from the original paper, the model first reconstructs low-frequency components (overall structure) and then progressively refines high-frequency details (textures, edges).

(Translated caption: This image is a schematic of three Diffusion Transformer architectures in the paper: (a) the standard DiT, (b) the U-shaped DiT, and (c) the proposed architecture. It shows the hierarchy of transformer blocks and the changes in feature-map size for each, reflecting the differences in multi-scale feature extraction.)
Figure 2 from the original paper visualizes images and their Fourier spectra at different denoising stages of a diffusion-based SR model. In early stages, low-frequency components (the center of the spectrum) dominate, while in later stages, high-frequency components (the periphery of the spectrum) are progressively refined. This demonstrates the model's reliance on different frequency information at different time steps.
AdaLN cannot adaptively modulate features based on their spatial frequency content (e.g., apply different modulation to a smooth region vs. an edge region) because it operates uniformly across spatial locations within a channel. Generating spatial-wise modulation parameters from a single time-step vector for distinct high/low-frequency spatial positions is challenging.
Adaptive Frequency Modulation (AdaFM):
To overcome this, AdaFM replaces AdaLN after each normalization layer and shifts time-step modulation from the spatial domain to the frequency domain.
The process of AdaFM can be written as:
$$ M = \mathrm{Reshape}(t_i) \in \mathbb{R}^{s \times s}, \qquad F' = \mathrm{Fold}\Big(\mathrm{IFFT}\big(M \odot \mathrm{FFT}(\mathrm{Unfold}(F))\big)\Big) $$
Breaking down each step and symbol:
- $t_i$: a time-step feature vector ($t_1$ or $t_2$ from the transformer block equations), a 1-dimensional vector of length $s^2$ derived from the time step.
- $\mathrm{Reshape}(t_i)$: reshapes the time-step feature vector into an $s \times s$ matrix $M$, which serves as the frequency scale matrix used to adaptively reweight different frequency components. The value $s$ is the FFT window size, empirically set to 8.
- $F$: the spatial-domain feature map after normalization (i.e., $\mathrm{Norm}(X)$).
- $\mathrm{Unfold}(\cdot)$: the patch-unfolding operation. To handle various input resolutions and enhance efficiency, the spatial-domain feature map is segmented into non-overlapping $s \times s$ windows.
- $\mathrm{FFT}(\cdot)$: the Fast Fourier Transform, which converts each window from the spatial domain into a frequency-domain spectrogram containing the frequency components of that window and channel.
- $M \odot (\cdot)$: the core adaptive-modulation step. The frequency scale matrix derived from the time step is element-wise multiplied ($\odot$) with the spectrograms, reweighting the frequency components according to the time step and enhancing the model's ability to emphasize specific frequencies.
- $\mathrm{IFFT}(\cdot)$: the inverse Fast Fourier Transform, which converts the modulated spectrograms back into the spatial domain.
- $\mathrm{Fold}(\cdot)$: the patch-folding operation, reassembling the processed windows into the full spatial-domain feature map.
- $F'$: the output feature map, adaptively modulated in the frequency domain according to the time step.

Frequency-Spatial Correspondence: In a spectrum, each pixel at a given spatial position corresponds to a predetermined frequency component; this relationship is defined solely by the feature map's spatial dimensions, not its content. The frequency corresponding to the pixel at position (u, v) of a spectrum is:
$$ f_u = \frac{u}{H} f_s, \qquad f_v = \frac{v}{W} f_s $$
where:
- $f_u, f_v$: the vertical and horizontal frequencies, respectively.
- u, v: the coordinates of the pixel in the spectrum.
- H, W: the height and width of the spectrum (the window size in AdaFM).
- $f_s$: the sampling frequency.

This consistency (the frequency at a given spectral position is fixed) allows the same frequency scale matrix to be applied across all windows and channels, making AdaFM highly efficient.
Efficiency: AdaLN must map the time embedding to per-channel scale, shift, and gate parameters for both the self-attention and MLP branches (6C values per block, with C the channel dimension), whereas AdaFM requires only an $s \times s$ frequency scale matrix per branch ($2s^2 = 128$ values for $s = 8$). This results in significantly fewer parameters for AdaFM while boosting performance. The paper also points out that, since different frequencies correspond to distinct spatial locations on the feature map in the frequency domain, AdaFM effectively provides spatial-wise modulation in an indirect but efficient manner.
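Here is a minimal PyTorch sketch of the AdaFM operation described above, assuming spatial dimensions divisible by the window size; the unfold/fold steps are implemented with reshapes, and using the complex `fft2`/`ifft2` pair (keeping the real part afterwards) is an implementation convenience, not necessarily the authors' choice:

```python
import torch
import torch.nn as nn

class AdaFM(nn.Module):
    """Reweight per-window frequency components with a time-dependent matrix."""
    def __init__(self, window: int = 8):
        super().__init__()
        self.s = window

    def forward(self, f: torch.Tensor, t_vec: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W), H and W divisible by s; t_vec: (B, s*s)
        b, c, h, w = f.shape
        s = self.s
        scale = t_vec.view(b, 1, 1, 1, s, s)            # frequency scale matrix M
        # unfold into non-overlapping s x s windows: (B, C, H/s, W/s, s, s)
        win = f.view(b, c, h // s, s, w // s, s).permute(0, 1, 2, 4, 3, 5)
        spec = torch.fft.fft2(win)                      # per-window spectrum
        out = torch.fft.ifft2(scale * spec).real        # modulate and invert
        # fold the windows back into the full feature map
        return out.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
```

Used together with the block sketch in Section 4.2.2, `adafm1 = AdaFM(8)` receives `t_vec` from the block's time MLP; because the scale matrix is shared across all windows and channels, the added parameter cost stays tiny.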
The following figure (Figure 6 from the original paper) visually demonstrates the effect of AdaFM at different time steps.

(Translated caption: This image is a figure comparing super-resolution reconstruction results of different methods on synthetic and real-world datasets. In each sub-figure, the left side is the low-resolution input (LR), and to the right are the outputs of each method followed by the high-resolution ground truth (HR) or a reference. The figure highlights the superior detail recovery of the Ours method.)
Figure 6 from the original paper visualizes feature maps and their spectra (after FFT) before and after applying AdaFM at different time steps. In early stages, AdaFM enhances low-frequency components (the peripheral part of the spectrum darkens, indicating suppression of high frequencies); in later stages, it enhances high-frequency components (the peripheral part of the spectrum brightens, indicating emphasis on high frequencies). This adaptive behavior establishes a correlation between time step and frequency.
5. Experimental Setup
5.1. Datasets
The experiments evaluate the proposed model on the x4 real-world SR task.
Training Data: The training dataset is a comprehensive collection comprising:
- LSDIR [26]: a large-scale dataset for image restoration.
- DIV2K [1]: a high-quality dataset widely used for super-resolution and image restoration.
- DIV8K [11]: another dataset, focusing on 8K-resolution images.
- OutdoorSceneTraining [46]: a dataset of outdoor scenes.
- Flicker2K [41]: a dataset of images from Flickr.
- FFHQ [20]: the first 10,000 face images from the Flickr-Faces-HQ dataset, which contains high-quality human faces.

During training, HR images are randomly cropped into fixed-size patches, and the RealESRGAN [48] degradation pipeline is used to synthesize LR/HR pairs from these HR images.
Test Datasets:
- LSDIR-Test: a synthetic dataset created from LSDIR. The test images are center-cropped and subjected to the same RealESRGAN degradation pipeline used during training.
- RealSR [4]: a real-world dataset comprising 100 images captured with Canon 5D3 and Nikon D810 cameras. It provides LR/HR pairs obtained under real camera settings, making it suitable for evaluating performance on real-world degradations.
- RealSet65 [58]: a collection of 65 low-resolution images gathered from widely used datasets and the internet, also serving as a real-world SR benchmark.
Blind Face Restoration (Appendix E):
For Blind Face Restoration, a separate set of datasets and a different degradation pipeline are used:
- Training Data: the FFHQ [20] dataset, containing 70,000 high-quality face images at 1024x1024 resolution. These images are resized to 512x512, and LQ images are synthesized using a typical degradation pipeline described in [47].
- Test Datasets:
  - CelebA-HQ [19]: 2,000 HR images randomly selected from its validation set, used to synthesize LQ images following GFPGAN [47]'s degradation.
  - LFW [17]: Labeled Faces in the Wild, a dataset of 1,711 face images collected from diverse real-world sources, used to evaluate face recognition in unconstrained environments.
  - WebPhoto [47]: a dataset of 407 web-crawled face images, including older photos with significant degradation.
  - WIDER [62]: a subset of 970 face images from the WIDER Face dataset, featuring heavy degradation such as occlusions, varying poses, scales, and lighting.
5.2. Evaluation Metrics
The paper employs a combination of reference-based and non-reference (no-reference) metrics to comprehensively evaluate the super-resolution performance. Reference-based metrics require a ground-truth HR image, while non-reference metrics do not, making them suitable for real-world scenarios where ground truth is unavailable.
5.2.1. Reference-Based Metrics (for synthetic datasets)
- Peak Signal-to-Noise Ratio (PSNR):
  - Conceptual Definition: PSNR measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR generally indicates a higher-quality reconstruction. It is widely used but does not always correlate well with human perceptual quality.
  - Mathematical Formula:
    $$ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right), \qquad \mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big(I(i,j) - K(i,j)\big)^2 $$
  - Symbol Explanation:
    - $I$: the original (ground-truth) HR image.
    - $K$: the reconstructed SR image.
    - M, N: the dimensions (height and width) of the images.
    - I(i,j), K(i,j): the pixel values at coordinates (i,j) in $I$ and $K$, respectively.
    - $\mathrm{MAX}_I$: the maximum possible pixel value of the image (255 for 8-bit images).
    - $\mathrm{MSE}$: the mean squared error between the original and reconstructed images.
- Learned Perceptual Image Patch Similarity (LPIPS) [61]:
  - Conceptual Definition: LPIPS measures the perceptual difference between two images. Instead of comparing pixels directly, it computes the distance between features extracted by a pre-trained deep neural network (such as VGG or AlexNet). A lower LPIPS score indicates higher perceptual similarity (i.e., the images look more alike to humans).
  - Mathematical Formula: LPIPS compares features from intermediate layers of a pre-trained network and is typically calculated as:
    $$ \mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \big(\phi_l(x)_{h,w} - \phi_l(x_0)_{h,w}\big) \right\|_2^2 $$
  - Symbol Explanation:
    - $x$: the reconstructed SR image.
    - $x_0$: the ground-truth HR image.
    - $l$: index over the selected layers of the pre-trained network.
    - $\phi_l$: feature maps extracted from the $l$-th layer of the pre-trained network.
    - $w_l$: learned channel-wise weights for the $l$-th layer.
    - $\odot$: element-wise multiplication.
    - $H_l, W_l$: height and width of the feature map at layer $l$.
    - $\|\cdot\|_2^2$: squared L2 norm (Euclidean distance).

(A small computational sketch of both metrics follows.)
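For reference, a small sketch computing PSNR with NumPy, plus the usual way to evaluate LPIPS with the reference `lpips` package (shown in comments; LPIPS inputs are expected in [-1, 1]):

```python
import numpy as np

def psnr(img_true: np.ndarray, img_test: np.ndarray, max_i: float = 255.0) -> float:
    """PSNR in dB for images with pixel values in [0, max_i]."""
    mse = np.mean((img_true.astype(np.float64) - img_test.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_i ** 2 / mse))

# LPIPS via the reference implementation (pip install lpips):
# import lpips, torch
# loss_fn = lpips.LPIPS(net='alex')               # AlexNet backbone
# d = loss_fn(torch.rand(1, 3, 64, 64) * 2 - 1,   # inputs scaled to [-1, 1]
#             torch.rand(1, 3, 64, 64) * 2 - 1)
```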
5.2.2. Non-Reference Metrics (for both synthetic and real-world datasets)
These metrics are used when ground truth images are unavailable (e.g., for real-world SR evaluation). They attempt to quantify image quality in a way that aligns with human perception.
- CLIPIQA [44]:
  - Conceptual Definition: CLIPIQA leverages the CLIP (Contrastive Language-Image Pre-training) model to assess image quality. It measures the "look and feel" of images by evaluating their alignment with semantic descriptions of quality, aiming for better correlation with human perception than traditional metrics. A higher CLIPIQA score indicates better image quality.
  - Mathematical Formula: the computation involves CLIP's internal feature representations and similarity scores between image embeddings and quality-related text embeddings; there is no simple algebraic formula. Conceptually, it quantifies how well an image aligns with "high quality" textual descriptions in CLIP's joint embedding space.
- MUSIQ [21]:
  - Conceptual Definition: MUSIQ (Multi-scale Image Quality Transformer) is a no-reference image quality assessment (NR-IQA) metric that uses a transformer network to evaluate image quality. It processes image patches at multiple scales and aggregates the information to predict a quality score that is highly consistent with human perception. A higher MUSIQ score indicates better image quality.
  - Mathematical Formula: MUSIQ is a deep learning model; its "formula" is the trained transformer architecture with its learned weights, mapping an image to a scalar quality score.
- MANIQA [54]:
  - Conceptual Definition: MANIQA (Multi-dimension Attention Network for No-Reference Image Quality Assessment) is another NR-IQA metric. It employs a multi-dimension attention network to capture quality-aware features across different dimensions (e.g., spatial, channel), aiming for robust quality predictions that align well with human judgments. A higher MANIQA score indicates better image quality.
  - Mathematical Formula: like MUSIQ, MANIQA is a neural network whose formula is implicitly defined by its architecture and learned parameters.
5.2.3. Blind Face Restoration Specific Metrics (Appendix E)
- Identity Score (IDS): measures how well the identity of the face is preserved after restoration. A lower IDS is generally better, indicating an identity closer to the original.
- Landmark Distance (LMD): quantifies the distance between facial landmarks (e.g., eyes, nose, mouth corners) in the restored image and the ground truth. A lower LMD indicates better structural alignment.
- Fréchet Inception Distance (FID) [15]:
  - Conceptual Definition: FID measures the similarity between the feature distributions of generated and real images and is often used to evaluate the quality of generative models. A lower FID score indicates that the generated images are more similar to real images (higher quality and diversity).
  - Mathematical Formula:
    $$ \mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right) $$
  - Symbol Explanation:
    - $\mu_1, \mu_2$: the mean feature vectors of real and generated images, respectively, extracted from a specific layer of a pre-trained Inception-v3 model.
    - $\Sigma_1, \Sigma_2$: the covariance matrices of the feature vectors for real and generated images, respectively.
    - $\|\cdot\|_2^2$: squared L2 norm.
    - $\mathrm{Tr}(\cdot)$: the trace of a matrix.

(A sketch of the FID computation follows.)
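A minimal sketch of the FID computation from pre-extracted Inception-v3 features, using `scipy.linalg.sqrtm` for the matrix square root; feature extraction itself is omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two sets of Inception features, each of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny
        covmean = covmean.real          # imaginary components
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```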
5.3. Baselines
The paper compares DiT-SR against several state-of-the-art SR methods categorized by their approach:
- GAN-based Methods: methods that use Generative Adversarial Networks for SR.
  - RealSR-JPEG [18]
  - BSRGAN [60]
  - RealESRGAN [48]
  - SwinIR [28]: while SwinIR uses transformers, it is often grouped with GAN-based or CNN-based perceptual SR methods due to its training objectives and typical performance characteristics.
- Prior-based Methods: methods that leverage large pre-trained diffusion models (such as Stable Diffusion) as a generative prior.
  - StableSR-200 [45]
  - DiffBIR-50 [29]
  - PASD-20 [55]
  - SeeSR-50 [52]
- Training-from-Scratch Diffusion-based Methods: diffusion models trained on SR data without relying on large pre-trained text-to-image models.
  - LDM-100 [36]
  - ResShift-15 [58]

The suffix numbers (e.g., -200, -50, -15, -100) indicate the number of denoising steps used by each method.
5.4. Implementation Details
- Latent Space Operation: Following LDM [36], the DiT-SR architecture operates in the latent space. It uses a Vector-Quantized GAN (VQGAN) [7] with a downsampling factor of 4 to encode and decode images to and from the latent space.
- Training Schedule: The model is trained for 300,000 iterations.
- Batch Size: A batch size of 64 is used.
- Hardware: Training is performed on 8 NVIDIA Tesla V100 GPUs.
- Optimizer: Adam [23] is used, with a fixed initial learning rate.
- FFT Window Size: The FFT window size for AdaFM is empirically set to 8 [24, 43].
- Transformer Block Configuration (from Appendix A):
  - The number of transformer blocks is set to 6 per stage (for the base model).
  - The base channel (initial channel dimension) is configured to 160.
  - The U-shaped global architecture has 4 stages, with channel increase factors [1, 2, 2, 4], giving stage channel dimensions of [160, 320, 320, 640].
- Blind Face Restoration (Appendix E):
  - A VQGAN with a downsampling factor of 8 is used, and diffusion steps are set to 4.
  - The learning rate is warmed up over the first 5,000 iterations and then decayed with an annealing cosine schedule; training ends at 200,000 iterations.
  - A diffusion loss [16] in latent space and an LPIPS [61] loss in pixel space are adopted.
6. Results & Analysis
This section presents the experimental results, comparing DiT-SR with state-of-the-art methods and analyzing its various components through ablation studies.
6.1. Core Results Analysis
The results demonstrate that DiT-SR significantly outperforms existing training-from-scratch diffusion-based SR methods and even achieves competitive or superior performance compared to some prior-based methods, despite using substantially fewer parameters.
The following are the results from Table 1 of the original paper, showing comparisons on RealSR and RealSet65 datasets:
| Methods | #Params | RealSR | RealSet65 | ||||
| CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | ||
| GAN based Methods | |||||||
| RealSR-JPEG | 17M | 0.3611 | 36.068 | 0.1772 | 0.5278 | 50.5394 | 0.2943 |
| BSRGAN | 17M | 0.5438 | 63.5819 | 0.3685 | 0.616 | 65.5774 | 0.3897 |
| RealESRGAN | 17M | 0.4898 | 59.6766 | 0.3679 | 0.5987 | 63.2228 | 0.3871 |
| SwinIR | 12M | 0.4653 | 59.6316 | 0.3454 | 0.5778 | 63.8212 | 0.3816 |
| Prior based Methods | |||||||
| StableSR-200 | 919M | 0.5207 | 59.4264 | 0.3563 | 0.5338 | 56.9207 | 0.3387 |
| DiffBIR-50 | 1670M | 0.7142 | 66.843 | 0.4802 | 0.7398 | 69.7260 | 0.5000 |
| PASD-20 | 1469M | 0.5170 | 58.4394 | 0.3682 | 0.5731 | 61.8813 | 0.3893 |
| SeeSR-50 | 1619M | 0.6819 | 66.3461 | 0.5035 | 0.7030 | 68.9803 | 0.5084 |
| Training-from-Scratch Diff. based Methods | |||||||
| LDM-100 | 114M | 0.5969 | 55.4359 | 0.3071 | 0.5936 | 56.112 | 0.356 |
| ResShift-15 | 119M | 0.6028 | 58.8790 | 0.3891 | 0.6376 | 58.0400 | 0.4048 |
| Ours-15 | 61M | 0.7161 | 65.8334 | 0.5022 | 0.7120 | 66.7413 | 0.4821 |
Analysis of Table 1:
- Training-from-Scratch vs. Prior-based: Ours-15 (DiT-SR with 15 steps) achieves the highest CLIPIQA score (0.7161 on RealSR, 0.7120 on RealSet65) among all listed methods, and performs very strongly on MUSIQ and MANIQA. This is a crucial finding: Ours-15 is a training-from-scratch method, yet it significantly outperforms LDM-100 and ResShift-15 and even surpasses several prior-based methods such as StableSR-200, PASD-20, and SeeSR-50 in CLIPIQA (while remaining competitive with DiffBIR-50, which has vastly more parameters).
- Parameter Efficiency: Ours-15 achieves this with only 61M parameters, remarkably efficient compared to prior-based methods such as DiffBIR-50 (1670M parameters), PASD-20 (1469M), or SeeSR-50 (1619M). Even LDM-100 and ResShift-15 have roughly twice the parameters (114M and 119M, respectively) yet yield inferior results. This validates the paper's claim of superior performance with significantly fewer parameters.
- Visual Quality (Non-Reference Metrics): the strong performance on CLIPIQA, MUSIQ, and MANIQA indicates that DiT-SR produces images of high perceptual quality that align well with human judgment, a key goal for generative SR.

The following figure (Figure 1 from the original paper) provides a visual comparison of CLIPIQA scores against parameters and FLOPs.
(Translated caption: This image is a chart comparing the proposed method with recent image super-resolution methods on the RealSR dataset, plotting CLIPIQA against the number of parameters (top) and FLOPs (bottom). The chart distinguishes GAN-based, diffusion-based, and prior-based methods, showing the advantage of the proposed method in both performance and resource consumption.)
Figure 1 from the original paper visually reinforces the quantitative results. The top graph (CLIPIQA vs. Parameters) clearly shows that DiT-SR (labeled as "Ours") achieves the highest CLIPIQA score with a significantly lower parameter count than prior-based methods and better performance than other diff-based SR methods. The bottom graph (CLIPIQA vs. FLOPs) shows a similar trend, where DiT-SR offers a strong CLIPIQA score for its FLOPs count, further highlighting its efficiency.
The following are the results from Table 3 of the original paper, showing performance comparison on the synthetic LSDIR-Test dataset:
| Methods | LSDIR-Test | ||||
| PSNR↑ | LPIPS↓ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ | |
| GAN based Methods | |||||
| RealSR-JPEG | 22.16 | 0.360 | 0.546 | 59.02 | 0.342 |
| BSRGAN | 23.74 | 0.274 | 0.570 | 67.94 | 0.394 |
| RealESRGAN | 23.15 | 0.259 | 0.568 | 68.23 | 0.414 |
| SwinIR | 23.17 | 0.247 | 0.598 | 68.20 | 0.414 |
| Prior based Methods | |||||
| StableSR-200 | 22.68 | 0.267 | 0.660 | 68.91 | 0.416 |
| DiffBIR-50 | 22.84 | 0.274 | 0.709 | 70.05 | 0.455 |
| PASD-20 | 23.57 | 0.279 | 0.624 | 69.07 | 0.440 |
| SeeSR-50 | 22.90 | 0.251 | 0.718 | 72.47 | 0.559 |
| Training-from-Scratch Diff. based Methods | |||||
| LDM-100 | 23.34 | 0.255 | 0.601 | 66.84 | 0.413 |
| ResShift-15 | 23.83 | 0.247 | 0.640 | 67.74 | 0.464 |
| Ours-15 | 23.60 | 0.244 | 0.646 | 69.32 | 0.483 |
Analysis of Table 3:
- PSNR vs. Perceptual Metrics: On LSDIR-Test, ResShift-15 achieves the highest PSNR (23.83), a pixel-wise fidelity metric. However, Ours-15 achieves the best LPIPS (0.244, lower is better), CLIPIQA (0.646), MUSIQ (69.32), and MANIQA (0.483), which are perceptual quality metrics. This indicates that DiT-SR excels at generating perceptually pleasing results, even if pixel-level fidelity (as measured by PSNR) is slightly lower than ResShift-15's, a common trade-off in generative SR methods.
- Comparison with Prior-based Methods: Ours-15 remains highly competitive, surpassing StableSR-200 and PASD-20 on all perceptual metrics and staying comparable to DiffBIR-50. SeeSR-50 shows a very strong MANIQA score (0.559) but is generally weaker than Ours-15 on the other perceptual metrics. Again, DiT-SR achieves this with significantly fewer parameters.

The following figure (Figure 7 from the original paper) provides qualitative comparisons.
(Translated caption: This image is a visual comparison of super-resolution results on real-world datasets, showing the LR image and the reconstructions of multiple methods. The red box on the left marks the region of interest; the methods include BSRGAN, StableSR, DiffBIR, PASD, SeeSR, RealESRGAN, SwinIR, LDM, ResShift, and the proposed method.)
Figure 7 from the original paper qualitatively demonstrates the superior performance of DiT-SR. The image displays several examples comparing DiT-SR against other methods on both synthetic and real-world datasets. It highlights that DiT-SR produces images with sharper details and better texture reproduction, reducing artifacts compared to baselines. For instance, in the example images, DiT-SR is shown to recover finer text details and more natural textures than other methods.
6.2. Ablation Studies / Parameter Analysis
The paper conducts ablation studies to validate the effectiveness of its proposed architectural components and AdaFM.
The following are the results from Table 2 of the original paper, showing ablation study results on U-shaped DiT and time conditioning:
| Configuration | #Params | FLOPs | RealSR | RealSet65 | |||
| DiT Arch. | Time Conditioning | CLIPIQA↑ | MUSIQ↑ | CLIPIQA↑ | MUSIQ↑ | ||
| Isotropic | AdaLN | 42.38M | 122.99G | 0.655 | 64.194 | 0.664 | 64.263 |
| U-shape | AdaLN | 264.39M | 122.87G | 0.688 | 64.062 | 0.693 | 65.604 |
| Ours | AdaLN | 100.64M(-62%) | 93.11G(-24%) | 0.700 | 64.676 | 0.699 | 67.634 |
| Ours | AdaFM | 60.79M(-77%) | 93.03G(-24%) | 0.716 | 65.833 | 0.712 | 66.741 |
6.2.1. U-shaped DiT with Isotropic Design
- Isotropic (Standard DiT) with AdaLN: A reimplemented standard DiT (isotropic) achieves decent performance (e.g., CLIPIQA 0.655 on RealSR) with 42.38M parameters and 122.99G FLOPs, serving as a baseline for DiT-style architectures.
- U-shape with AdaLN: A U-shaped DiT (without DiT-SR's isotropic channel reallocation) shows improved CLIPIQA (0.688 on RealSR) at similar FLOPs (122.87G) but with a massive increase in parameters (264.39M), about six times more than the isotropic DiT. This highlights the U-shape's performance benefits but also its parameter inefficiency in a naive DiT integration.
- Ours with AdaLN: The proposed DiT-SR architecture (combining the U-shape with the isotropic design, using AdaLN for a fair comparison) achieves even better CLIPIQA (0.700 on RealSR) with significantly fewer parameters (100.64M, a 62% reduction from the U-shaped DiT baseline) and fewer FLOPs (93.11G, a 24% reduction). This clearly demonstrates the effectiveness of the isotropic design for resource reallocation: it boosts performance while being far more parameter-efficient than a straightforward U-shaped DiT.
6.2.2. Adaptive-Frequency Modulation (AdaFM)
- Ours with AdaFM vs. Ours with AdaLN: Replacing AdaLN with AdaFM in the DiT-SR architecture yields further gains (CLIPIQA rises from 0.700 to 0.716 on RealSR). Crucially, this improvement comes with a significant reduction in parameters (from 100.64M to 60.79M, a 77% reduction relative to the U-shaped DiT baseline and a 39.6% reduction relative to Ours with AdaLN) and slightly fewer FLOPs. This confirms AdaFM's effectiveness: it is a more parameter-efficient and better-performing way to inject time-step conditioning, especially for SR tasks that benefit from frequency-adaptive modulation. Figure 6 provides visual evidence of how AdaFM adaptively emphasizes low-frequency components in early denoising stages and high-frequency components in later stages.

The following are the results from Table 5 of the original paper, showing the results of compressing the U-shaped DiT on real-world datasets:
| Methods | #Params | FLOPs | RealSR | | RealSet65 | |
| | | | CLIPIQA↑ | MUSIQ↑ | CLIPIQA↑ | MUSIQ↑ |
| U-shaped DiT | 264.39M | 122.87G | 0.688 | 64.062 | 0.693 | 65.604 |
| Shallower U-DiT | 196.65M(-26%) | 96.30G(-22%) | 0.671 | 63.319 | 0.683 | 64.097 |
| Narrower U-DiT | 214.20M(-19%) | 99.56G(-19%) | 0.682 | 63.631 | 0.692 | 65.469 |
| Ours w/ AdaLN | 100.64M(-62%) | 93.11G(-24%) | 0.700 | 64.676 | 0.699 | 67.634 |
Analysis of Table 5 (Compressing U-shaped DiT):
This table explores whether a large U-shaped DiT is redundant.
- Shallower U-DiT: Reducing the number of transformer blocks from 6 to 4 per stage (a 26% parameter reduction) leads to a performance drop (e.g., CLIPIQA from 0.688 to 0.671 on RealSR).
- Narrower U-DiT: Decreasing the base channel from 160 to 144 (a 19% parameter reduction) also results in a performance drop (e.g., CLIPIQA from 0.688 to 0.682 on RealSR).

These results suggest that the baseline U-shaped DiT (d6c160) is not overly redundant and that naive compression degrades performance. In contrast, Ours w/ AdaLN achieves better performance with a much larger parameter and FLOPs reduction (62% and 24%, respectively), validating its strategic resource reallocation.
The following are the results from Table 6 of the original paper, showing the performance of the lightweight version:
| Methods | #Params | RealSR CLIPIQA↑ | RealSR MUSIQ↑ | RealSR MANIQA↑ | RealSet65 CLIPIQA↑ | RealSet65 MUSIQ↑ | RealSet65 MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LDM-100 | 114M | 0.5969 | 55.4359 | 0.3071 | 0.5936 | 56.1120 | 0.3560 |
| ResShift-15 | 119M | 0.6028 | 58.8790 | 0.3891 | 0.6376 | 58.0400 | 0.4048 |
| Ours-15 | 61M | 0.7161 | 65.8334 | 0.5022 | 0.7120 | 66.7413 | 0.4821 |
| Ours-Lite-15 | 31M | 0.6670 | 63.0544 | 0.4565 | 0.6694 | 64.3387 | 0.4420 |
| Ours-Lite-1 | 31M | 0.6993 | 63.3759 | 0.4262 | 0.7092 | 64.8329 | 0.4299 |
Analysis of Table 6 (Lightweight Version):
- Ours-Lite-15: This lightweight version (31M parameters) is created by reducing the transformer blocks per stage (from 6 to 4), shrinking the base channels (160 to 128), and removing the deepest stage. It significantly outperforms LDM-100 and ResShift-15 on all metrics, despite having only about 25% of their parameters. This highlights the substantial model capacity and efficiency of the DiT-SR design.
- Ours-Lite-1: This version uses step distillation (specifically SinSR [49]) to achieve single-step denoising from Ours-Lite-15 (a generic sketch of this pattern follows the list). It shows an increase in CLIPIQA and MUSIQ but a decrease in MANIQA. This indicates a trade-off: distillation can improve inference speed and some perceptual metrics, but might not generalize perfectly across all IQA metrics.
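For intuition only, here is a self-contained sketch of the regression-style step-distillation pattern: a one-step student is fit to the output of a multi-step teacher. The tiny conv nets and the 15-iteration loop are placeholders so the snippet runs end to end; SinSR's actual objective is more involved than this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder stand-ins for the 15-step teacher (Ours-Lite-15) and the
# one-step student (Ours-Lite-1); these are NOT the actual DiT-SR networks.
teacher = nn.Conv2d(3, 3, kernel_size=3, padding=1).eval()
student = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

lr_img = torch.rand(2, 3, 64, 64)  # dummy low-resolution batch in [0, 1]

with torch.no_grad():
    target = lr_img
    for _ in range(15):  # stand-in for the teacher's 15 sampling steps
        target = teacher(target)

pred = student(lr_img)  # the student restores in a single forward pass
loss = F.mse_loss(pred, target)
opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```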
The following are the results from Table 4 of the original paper, showing the diffusion architecture hyper-parameters:

| DiT Arch. | Time Conditioning | #Params | FLOPs | Number of Blocks | Channels | Reallocated Channel |
| --- | --- | --- | --- | --- | --- | --- |
| Isotropic | AdaLN | 42.38M | 122.99G | [6,6,6,6,6] | 160 | - |
| U-shape | AdaLN | 264.39M | 122.87G | [6,6,6,6] | [160,320,320,640] | - |
| Ours | AdaLN | 100.64M | 93.11G | [6,6,6,6] | [160,320,320,640] | 192 |
| Ours | AdaFM | 60.79M | 93.03G | [6,6,6,6] | [160,320,320,640] | 192 |
| Ours-Lite | AdaFM | 30.89M | 49.17G | [4,4,4] | [128,256,256] | 160 |
Analysis of Table 4 (Diffusion Architecture Hyper-parameters):
This table provides the detailed architectural configurations for the various DiT models used in the ablation studies and main experiments.
- Isotropic (Standard DiT): Five stages with 6 blocks each and a constant channel width of 160. This is a purely isotropic DiT without U-shaped up/downsampling.
- U-shape: Four stages with 6 blocks each; channels increase hierarchically (160, 320, 320, 640). Note the parameter count (264.39M) is much higher than the isotropic DiT's, even at similar FLOPs.
- Ours (DiT-SR): Keeps the U-shape channel progression but applies the isotropic design, running the internal transformer-block operations at a reallocated channel width of 192 (see the sketch after this list). This yields fewer parameters (100.64M) and FLOPs (93.11G) than the U-shape baseline, demonstrating its efficiency.
- Ours with AdaFM: The AdaFM module further reduces parameters (60.79M) and slightly reduces FLOPs (93.03G) while improving performance, confirming its efficiency.
- Ours-Lite: A highly compressed version with fewer blocks (4 per stage over 3 stages), smaller base channels (128), and a reallocated channel of 160. This dramatically reduces parameters (30.89M) and FLOPs (49.17G), showcasing the model's scalability for lightweight applications.
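The "Reallocated Channel" column can be read as follows: whatever a stage's hierarchical width is, each transformer block internally operates at one shared width, so per-block compute stays uniform across stages. Below is a hedged sketch of this pattern, assuming simple 1×1 projections and a placeholder body in place of the real attention/FFN; the names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn


class ReallocatedBlock(nn.Module):
    """Hedged sketch of the uniform isotropic design inside a U-shaped stage.

    Whatever the stage's hierarchical width (160/320/320/640 in Table 4),
    the block internally runs at one shared "reallocated" width (192), so
    compute per block is uniform across stages. The inner body below is a
    stand-in; the real block uses window attention and an FFN.
    """

    def __init__(self, stage_channels: int, realloc_channels: int = 192):
        super().__init__()
        self.proj_in = nn.Conv2d(stage_channels, realloc_channels, kernel_size=1)
        self.body = nn.Sequential(  # placeholder for attention + FFN
            nn.GroupNorm(8, realloc_channels),
            nn.Conv2d(realloc_channels, realloc_channels, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.proj_out = nn.Conv2d(realloc_channels, stage_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the residual connection keeps the stage width unchanged
        return x + self.proj_out(self.body(self.proj_in(x)))


# even a deep 640-channel stage pays only the 192-wide block cost internally
feat = torch.randn(1, 640, 8, 8)
print(ReallocatedBlock(640)(feat).shape)  # torch.Size([1, 640, 8, 8])
```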
6.2.3. Experiments on Blind Face Restoration (Appendix E)
The paper also evaluates DiT-SR on Blind Face Restoration, demonstrating its generalization capability beyond general SR.
The following are the results from Table 7 of the original paper, showing quantitative results on CelebA-Test:
| Methods | LPIPS↓ | IDS↓ | LMD↓ | FID↓ | CLIPIQA↑ | MUSIQ↑ | MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DFDNet | 0.739 | 86.323 | 20.784 | 76.118 | 0.619 | 51.173 | 0.433 |
| PSFRGAN | 0.475 | 74.025 | 10.168 | 60.748 | 0.630 | 69.910 | 0.477 |
| GFPGAN | 0.416 | 66.820 | 8.886 | 27.698 | 0.671 | 75.388 | 0.626 |
| VQFR | 0.411 | 65.538 | 8.910 | 25.234 | 0.685 | 73.155 | 0.568 |
| CodeFormer | 0.324 | 59.136 | 5.035 | 26.160 | 0.698 | 75.900 | 0.571 |
| DiffFace-100 | 0.338 | 63.033 | 5.301 | 23.212 | 0.527 | 66.042 | 0.475 |
| ResShift-4 | 0.309 | 59.623 | 5.056 | 17.564 | 0.613 | 73.214 | 0.541 |
| Ours-4 | 0.337 | 61.4644 | 5.235 | 19.648 | 0.725 | 75.848 | 0.634 |
Analysis of Table 7 (CelebA-Test):
- Ours-4 (DiT-SR with 4 diffusion steps) achieves the best CLIPIQA, MUSIQ, and MANIQA scores, indicating superior perceptual quality for face restoration.
- ResShift-4 shows slightly better LPIPS, IDS, LMD, and FID scores, suggesting slightly better pixel-level fidelity and identity preservation. However, Ours-4 is very competitive on these metrics as well, again demonstrating a trade-off in which DiT-SR leans towards higher perceptual quality.

The following are the results from Table 8 of the original paper, showing quantitative results on real-world datasets for blind face restoration:
| Methods | LFW CLIPIQA↑ | LFW MUSIQ↑ | LFW MANIQA↑ | WebPhoto CLIPIQA↑ | WebPhoto MUSIQ↑ | WebPhoto MANIQA↑ | WIDER CLIPIQA↑ | WIDER MUSIQ↑ | WIDER MANIQA↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DFDNet | 0.716 | 73.109 | 0.6062 | 0.654 | 69.024 | 0.550 | 0.625 | 63.210 | 0.514 |
| PSFRGAN | 0.647 | 73.602 | 0.5148 | 0.637 | 71.674 | 0.476 | 0.648 | 71.507 | 0.489 |
| GFPGAN | 0.687 | 74.836 | 0.5908 | 0.651 | 73.367 | 0.577 | 0.663 | 74.694 | 0.602 |
| VQFR | 0.710 | 74.386 | 0.5488 | 0.677 | 70.904 | 0.511 | 0.707 | 71.411 | 0.520 |
| CodeFormer | 0.689 | 75.480 | 0.5394 | 0.692 | 74.004 | 0.522 | 0.699 | 73.404 | 0.510 |
| DiffFace-100 | 0.593 | 70.362 | 0.4716 | 0.555 | 65.379 | 0.436 | 0.561 | 64.970 | 0.436 |
| ResShift-4 | 0.626 | 70.643 | 0.4893 | 0.621 | 71.007 | 0.495 | 0.629 | 71.084 | 0.494 |
| Ours-4 | 0.727 | 73.187 | 0.564 | 0.717 | 73.921 | 0.571 | 0.743 | 74.477 | 0.589 |
Analysis of Table 8 (Real-World Face Restoration):
- Ours-4 consistently achieves the highest CLIPIQA, MUSIQ, and MANIQA scores across all three real-world face datasets (LFW, WebPhoto, WIDER). This strong performance on no-reference metrics is particularly important for real-world applications where ground truth is absent, and it indicates that DiT-SR generates highly realistic and perceptually pleasing facial details even under challenging real-world degradations (a sketch of computing these metrics follows).
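For readers who want to reproduce this style of evaluation, the three no-reference metrics used throughout Tables 6-8 are available in the open-source pyiqa toolbox. The snippet below is a reasonable setup, not the authors' evaluation script; default metric configurations in pyiqa may differ from the paper's.

```python
import torch
import pyiqa  # open-source IQA toolbox: pip install pyiqa

device = "cuda" if torch.cuda.is_available() else "cpu"
# the three no-reference metrics reported in Tables 6-8
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("clipiqa", "musiq", "maniqa")}

restored = torch.rand(1, 3, 256, 256, device=device)  # stand-in SR output in [0, 1]
for name, metric in metrics.items():
    print(name, metric(restored).item())  # higher is better for all three
```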
The following figure (Figure 10 from the original paper) provides qualitative comparisons for blind face restoration.
(Image caption: predictions of the image super-resolution diffusion model at different diffusion steps. The first row shows the predicted clean image at each step; the second row shows the corresponding Fourier spectra, illustrating that the model generates low-frequency content first and high-frequency details progressively.)
Figure 10 from the original paper provides qualitative comparisons of facial super-resolution across different methods. It visually demonstrates that DiT-SR (Ours) consistently produces more realistic and detailed facial reconstructions, often recovering expressions and fine textures more accurately than other methods, especially for degraded inputs from LFW, WebPhoto, and WIDER datasets.
6.3. More Visualization Results
The following figure (Figure 8 from the original paper) presents additional visualization results on real-world datasets.

(Image caption: a comparison of multiple super-resolution methods on face images. Each row includes the low-resolution input, restorations by different algorithms (e.g., GFPGAN, VQFR, CodeFormer), and the high-resolution (HR) ground truth, visually presenting the differences in visual quality among the methods.)
Figure 8 from the original paper shows additional qualitative comparisons of super-resolution results on real-world datasets. The images highlight DiT-SR's ability to produce sharp textures and fine details, outperforming other methods in rendering realistic outputs for various natural scenes.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiT-SR, an effective diffusion transformer architecture specifically tailored for image super-resolution. The core innovation lies in its hybrid design, combining a U-shaped global architecture for multi-scale hierarchical feature extraction with a uniform isotropic design for transformer blocks across different stages. This strategic combination reallocates computational resources to critical high-resolution layers, enhancing performance and efficiency. Furthermore, DiT-SR addresses a key limitation of traditional time-step conditioning in diffusion models by proposing Adaptive Frequency Modulation (AdaFM). AdaFM efficiently and adaptively modulates frequency components based on the time step, better aligning with the frequency-dependent denoising process observed in SR tasks.
Extensive experiments demonstrate that DiT-SR sets a new state-of-the-art for training-from-scratch diffusion-based SR methods, significantly outperforming existing approaches. Remarkably, it even beats some prior-based methods (which rely on massive pre-trained Stable Diffusion models) while using only a fraction of their parameters. This success proves the superiority and efficiency of the diffusion transformer approach in image super-resolution, offering a flexible and high-performing solution without the heavy computational overhead of fine-tuning gigantic models.
7.2. Limitations & Future Work
The authors acknowledge a few limitations and suggest future research directions:
- Scalability: Image super-resolution models generally do not exhibit the same level of scalability as text-to-image models, owing to task differences and limited data. While DiT-SR is highly parameter-efficient and achieves competitive performance, it still has room to fully surpass the absolute upper-bound performance of the most powerful prior-based models (though it already beats some).
- AdaFM's Potential: The authors believe AdaFM holds significant potential to establish a new time-step conditioning paradigm for diffusion models. They suggest extending its application beyond SR to various low-level visual tasks and even to text-to-image generation, particularly in scenarios where the generation process also follows a low-frequency-to-high-frequency progression.
- Ethical Considerations: As with other content-generation methods, the authors highlight the need for cautious use of their approach to prevent potential misuse, acknowledging the broader ethical implications of powerful generative AI.
7.3. Personal Insights & Critique
This paper presents a highly insightful and effective approach to image super-resolution.
- Architectural Synergy: The core idea of combining the U-shaped architecture with isotropic transformer blocks is a clever hybrid that leverages the best of both worlds. U-Nets are proven in low-level vision thanks to multi-scale feature handling, while DiTs offer scalability and efficient token processing. The strategic reallocation of computational resources, as shown in the ablation studies, is the key enabler for this synergy, allowing better performance with fewer parameters. This moves beyond simply swapping CNNs for transformers in a U-Net and instead rethinks how transformer resources should be distributed across scales.
- Frequency-Aware Conditioning: The introduction of AdaFM is a significant contribution. It addresses a fundamental mismatch between the frequency-dependent nature of image super-resolution (and denoising in general) and the purely channel-wise modulation of AdaLN (a contrast sketch follows this list). By explicitly operating in the frequency domain, AdaFM provides a more semantically relevant and efficient way to guide the denoising process. The visual evidence in Figure 6 is compelling and clearly demonstrates its adaptive behavior; this frequency-aware conditioning paradigm could indeed have broader implications for many generative models.
- Efficiency and Flexibility: The ability to achieve state-of-the-art performance among training-from-scratch methods while significantly reducing parameter count compared to prior-based models is a major strength. This opens doors for more flexible research, architectural modification, and deployment in resource-constrained environments or on edge devices, where prior-based models are often too cumbersome. The Ours-Lite version further underscores this potential for lightweight applications.
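To see the mismatch concretely, here is a hedged sketch of conventional AdaLN-style conditioning, for contrast with the AdaFM sketch in Section 6.2.2: one scale and shift per channel, broadcast over all spatial positions and hence over all frequencies at once. Names are illustrative assumptions, not any paper's implementation.

```python
import torch
import torch.nn as nn


class AdaLNSketch(nn.Module):
    """Channel-wise time conditioning: the time embedding produces a single
    scale/shift per channel, applied identically everywhere in space, so it
    cannot selectively emphasize low- or high-frequency content."""

    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        # LayerNorm-like normalization over (C, H, W), no learned affine
        self.norm = nn.GroupNorm(1, channels, affine=False)
        self.to_scale_shift = nn.Linear(time_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=1)
        return (self.norm(x) * (1 + scale[:, :, None, None])
                + shift[:, :, None, None])
```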
Potential Areas for Improvement/Further Exploration:
- Generalizability of Isotropic Channel Allocation: While the paper demonstrates the benefit of the isotropic design's channel allocation strategy for SR, it would be interesting to see a more detailed theoretical analysis, or an empirical study of how this reallocation generalizes to other low-level vision tasks or even high-level generation tasks.
- AdaFM in Other Domains: The authors correctly identify AdaFM's potential for text-to-image generation. Exploring this in detail could be a valuable extension, especially for models that also progressively refine details.
- Adaptive Windowing for FFT: The FFT window size is empirically set to 8. Investigating whether an adaptive or learned windowing strategy could further optimize AdaFM might be beneficial.
- Comparison of Parameter Counts: While the paper highlights parameter efficiency against prior-based methods very well, a direct comparison of FLOPs or parameters for AdaFM vs. AdaLN within the context of the same full architecture (not just the module itself) could further solidify the efficiency argument. The current tables do show this for Ours with AdaLN vs. Ours with AdaFM, which is good.

Overall, DiT-SR represents a substantial advancement in diffusion-based image super-resolution, offering a robust, efficient, and perceptually superior solution built on solid architectural and conditioning innovations. Its training-from-scratch nature positions it as a highly flexible and valuable contribution to the field.