Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
TL;DR Summary
This paper introduces Radial Attention, a sparse attention mechanism with $O(n \log n)$ complexity, addressing high computational costs in long video generation. It leverages spatiotemporal energy decay to enhance efficiency and maintain performance during extended generation.
Abstract
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference. Code is released at \href{https://github.com/mit-han-lab/radial-attention}{https://github.com/mit-han-lab/radial-attention}.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
1.2. Authors
Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han.
Their affiliations include MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence. This diverse group of affiliations suggests a collaborative effort involving prominent institutions in AI research, particularly in areas like efficient deep learning, computer vision, and large model development.
1.3. Journal/Conference
The paper is a preprint published on arXiv (arXiv:2506.19852). V2 is indicated as the NeurIPS 2025 camera-ready version. NeurIPS (Conference on Neural Information Processing Systems) is one of the most prestigious and highly selective conferences in machine learning and artificial intelligence, widely recognized for presenting cutting-edge research. Its acceptance signifies high quality and significant contribution to the field.
1.4. Publication Year
2025 (first version posted to arXiv on 2025-06-24)
1.5. Abstract
The paper addresses the challenge of high computational costs in training and inference for long video generation using diffusion models, which is primarily due to the quadratic scaling of standard attention mechanisms with the number of tokens introduced by the additional temporal dimension. The authors identify a phenomenon called Spatiotemporal Energy Decay in video diffusion models, where post-softmax attention scores decrease as the spatial and temporal distance between tokens increases. Inspired by this, they propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity. This mechanism translates the observed energy decay into exponentially decaying compute density by employing a simple, static attention mask. This mask allows each token to attend to spatially nearby tokens, with the attention window size shrinking as temporal distance grows. A key feature is its ability to extend the generation length of pre-trained video diffusion models through efficient LoRA-based fine-tuning. Extensive experiments demonstrate that Radial Attention maintains video quality across various models (Wan2.1-14B, HunyuanVideo, Mochi 1), achieving up to a 1.9× speedup over original dense attention. For generating videos up to 4× longer, it reduces training costs by up to 4.4× and accelerates inference by up to 3.7× compared to direct fine-tuning and dense attention inference, respectively, with minimal tuning.
1.6. Original Source Link
https://arxiv.org/abs/2506.19852 PDF Link: https://arxiv.org/pdf/2506.19852v2.pdf Publication Status: Preprint (V2 is the NeurIPS 2025 camera-ready version).
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the prohibitively high computational cost associated with generating long videos using advanced diffusion models. While diffusion models have achieved remarkable success in high-quality video generation, the inherent nature of video, which introduces an additional temporal dimension, significantly increases the total number of tokens that need to be processed. This expansion in token count clashes directly with the quadratic scaling ($O(n^2)$) of the standard self-attention mechanism with respect to the sequence length $n$. This quadratic complexity makes both training and inference on long videos exceptionally expensive, thereby restricting the scalability and practical deployment of these powerful models.
Prior research has attempted to mitigate this challenge through various approaches:
- Sparse attention mechanisms: Methods like Sparse VideoGen (SVG) [8] dynamically classify attention heads as spatial or temporal and apply corresponding masks. However, these often struggle during training, especially for unseen longer video distributions, due to potential misclassification and error reinforcement.
- Linear attention alternatives: Approaches that replace softmax attention with linear attention [9, 10] offer faster computation ($O(n)$) but typically require substantial architectural changes and significant retraining, often failing to recover the original video quality with modest fine-tuning. They also tend to struggle with capturing local details [60].

The paper's entry point and innovative idea stem from an observation the authors term Spatiotemporal Energy Decay. They note that, analogous to physical signals or waves losing energy over distance and time, the post-softmax attention scores in video diffusion models diminish as the spatial and temporal distance between tokens increases. This phenomenon suggests that not all token interactions are equally important, and computational resources could be more efficiently allocated to interactions that contribute most significantly to the final output. This observation provides a principled motivation for designing a more efficient, yet effective, sparse attention mechanism.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of Spatiotemporal Energy Decay: The authors rigorously identify and characterize the Spatiotemporal Energy Decay phenomenon in video diffusion models, where attention scores decay exponentially with increasing spatial and temporal distance. This empirical observation forms the theoretical foundation for Radial Attention.
- Proposal of Radial Attention: They propose Radial Attention, a novel sparse attention mechanism with $O(n \log n)$ computational complexity. This mechanism is designed to directly translate the principle of energy decay into an exponentially decaying compute density. It utilizes a simple, static attention mask where each token primarily attends to spatially nearby tokens, and the attention window size progressively shrinks with increasing temporal distance. This static and structured design offers significant efficiency improvements over dense attention while being more expressive than linear attention.
- Efficient Length Extension via LoRA Fine-tuning: Radial Attention enables pre-trained video diffusion models to extend their generation length effectively. Because it prunes unimportant token relations without altering the core softmax attention mechanism, it allows for efficient adaptation using lightweight LoRA-based fine-tuning, which focuses on refining critical parameters for temporal coherence and visual fidelity. This approach is also compatible with existing style LoRAs.
- Substantial Performance Gains:
  - Default Video Length: Radial Attention maintains high video quality across leading models (Wan2.1-14B, HunyuanVideo, Mochi 1) and achieves up to a 1.9× speedup over the original dense attention.
  - Longer Video Generation: For videos up to 4× longer, it significantly reduces training costs by up to 4.4× (compared to direct fine-tuning with dense attention) and accelerates inference by up to 3.7× (compared to dense attention inference), all while preserving video quality.
- Code Release: The authors have released their code at https://github.com/mit-han-lab/radial-attention, fostering reproducibility and further research.
These findings collectively solve the problem of high computational costs in long video generation by providing an efficient, scalable, and quality-preserving attention mechanism that can be seamlessly integrated and fine-tuned with existing diffusion models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Radial Attention and its implications, a foundational understanding of Diffusion Models, Self-Attention, and LoRA is essential.
3.1.1. Diffusion Models
Diffusion Models are a class of generative models that have achieved state-of-the-art results in synthesizing high-quality data, particularly images and videos. The core idea is inspired by physics:
- Forward Diffusion Process: This process gradually adds Gaussian noise to an input data point (e.g., a clear image or video frame) over several discrete timesteps, transforming it into pure noise. If we start with a data point $x_0$, at each step $t$, a small amount of Gaussian noise is added, producing $x_t$. Eventually, at a final timestep $T$, $x_T$ becomes indistinguishable from pure noise.
- Reverse Denoising Process: This is the generative part. A neural network (often a U-Net or a Diffusion Transformer (DiT)) is trained to learn how to reverse the forward process, i.e., to predict and remove the noise added at each step. Starting from pure Gaussian noise $x_T$, the model iteratively denoises it over many steps, gradually transforming the noise back into a coherent, high-quality data point $x_0$.
- Latent Space: Many modern diffusion models, like Latent Diffusion Models (LDMs) [3], operate in a compressed latent space. Instead of denoising raw pixels, they first encode the high-resolution input (image or video) into a lower-dimensional latent representation. The diffusion process then occurs in this latent space, which is computationally more efficient. A pre-trained decoder is used to convert the final denoised latent representation back into the high-resolution output.

For videos, an additional temporal dimension is introduced, meaning the latent representation is a sequence of latent frames, significantly increasing the total number of tokens.
3.1.2. Self-Attention and Transformers
Self-Attention is a key mechanism that revolutionized neural networks, particularly with the introduction of the Transformer architecture [18].
- Purpose: Self-attention allows a model to weigh the importance of different parts of the input sequence when processing each element. For a given token (or part of the input), it determines how much attention to pay to all other tokens in the sequence. This is crucial for capturing long-range dependencies, which are vital for understanding context in language and coherence in images/videos.
- Mechanism: It computes three vectors for each token:
  - Query (Q): Represents the current token's request for information from other tokens.
  - Key (K): Represents what information other tokens can offer.
  - Value (V): Contains the actual information from other tokens that will be aggregated.

The core Attention calculation is given by:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
$
Where:
- $Q$, $K$, $V$ are the query, key, and value matrices, respectively.
- $n$ is the sequence length (number of tokens).
- $d$ is the embedding dimension of each token.
- $QK^T$ computes the similarity (dot product) between each query and all keys.
- $\sqrt{d}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with very small gradients.
- $\mathrm{softmax}$ converts the similarity scores into probability-like weights, indicating how much attention each query token pays to each key token.
- The resulting attention weights are then multiplied by the matrix $V$ to produce the final output, which is a weighted sum of the value vectors.

The computation of $QK^T$ involves multiplying an $n \times d$ matrix with a $d \times n$ matrix, resulting in an $n \times n$ matrix. This operation has quadratic time and memory complexity of $O(n^2)$ with respect to the sequence length $n$. For long videos, where $n$ can be very large (e.g., 110K tokens for a 5-second video), this quadratic scaling becomes a major computational bottleneck.
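To make the shapes and the $n \times n$ bottleneck concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no batching or multi-head logic; `attention` and `softmax` are illustrative helpers, not code from the paper):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Dense scaled dot-product attention for (n, d) matrices.

    The (n, n) score matrix QK^T is exactly what makes the cost quadratic in n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)   # row-wise attention weights
    return weights @ V                   # (n, d) weighted sum of values
```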
Diffusion Transformers (DiT) [18] replace the traditional U-Net backbone in diffusion models with Transformer blocks. They are particularly effective for video generation because 3D dense attention can jointly model spatial (within frame) and temporal (across frames) dynamics, capturing complex spatiotemporal correlations.
3.1.3. Sparse Attention
Sparse Attention is a technique designed to mitigate the $O(n^2)$ complexity of dense attention by restricting the interactions between tokens. Instead of every query token attending to all key tokens, sparse attention uses a predefined or learned mask to allow attention only between a subset of token pairs.
Formally, this is achieved by adding a sparsity mask $M$ to the attention logits:
$
\mathrm{SparseAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T + M}{\sqrt{d}}\right)V
$
Where:
- $M$ is the sparsity mask.
- Entries $M_{ij} = 0$ allow attention between query token $i$ and key token $j$.
- Entries $M_{ij} = -\infty$ (or a very large negative number) effectively block attention between query token $i$ and key token $j$, as $e^{-\infty} = 0$ after softmax.

By reducing the number of attended (unmasked) entries in the attention matrix, sparse attention can significantly lower the computational cost, often achieving sub-quadratic complexities such as $O(n \log n)$, depending on the mask pattern. However, the challenge lies in designing a mask that effectively reduces computation without sacrificing the model's ability to capture important dependencies and maintain high output quality.
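Continuing the sketch above, masking only requires adding $-\infty$ to the blocked logits before the softmax. The boolean `allowed` array below stands in for the mask $M$ (True for 0, False for $-\infty$); a real kernel would skip blocked blocks entirely rather than compute and discard them:

```python
import numpy as np

def sparse_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     allowed: np.ndarray) -> np.ndarray:
    """Masked softmax attention: allowed[i, j] is True where M_ij = 0.

    Assumes each query has at least one allowed key. Blocked logits become
    -inf, so their post-softmax weights are exactly zero.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = np.where(allowed, logits, -np.inf)   # apply the sparsity mask M
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```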
3.1.4. LoRA (Low-Rank Adaptation)
Low-Rank Adaptation (LoRA) [11] is an efficient fine-tuning technique for large pre-trained models, particularly Large Language Models (LLMs) and Diffusion Models.
- Purpose: Fine-tuning large models on new tasks or datasets often requires updating millions or billions of parameters, which is computationally expensive and memory-intensive. LoRA aims to achieve comparable performance to full fine-tuning while significantly reducing the number of trainable parameters and computational overhead.
- Mechanism: LoRA injects small, trainable low-rank matrices into the Transformer layers of a pre-trained model. Specifically, for an original weight matrix $W \in \mathbb{R}^{d \times k}$ (e.g., a Query, Key, Value, or Output projection matrix in an attention layer), LoRA represents the update $\Delta W$ with a low-rank decomposition $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ is the LoRA rank. The updated weight matrix becomes $W' = W + BA$. During fine-tuning, $W$ is frozen, and only $A$ and $B$ are trained. This dramatically reduces the number of trainable parameters from $dk$ to $r(d + k)$.
- Benefits:
  - Reduced memory footprint for training.
  - Faster training times.
  - Ability to store multiple task-specific LoRA adapters for a single base model.
  - Can often achieve competitive performance with full fine-tuning.
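A toy NumPy illustration of the $W' = W + BA$ decomposition described above; the sizes, initialization, and `lora_forward` helper are made up for the example and do not reflect any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1024, 1024, 16               # illustrative sizes; r is the LoRA rank

W = rng.normal(size=(d, k))            # frozen pre-trained projection weight
A = rng.normal(size=(r, k)) * 0.01     # LoRA down-projection (trainable)
B = np.zeros((d, r))                   # LoRA up-projection (trainable); zero init
                                       # so the adapted model starts identical to W

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Apply W' = W + B @ A without materializing the full d x k update."""
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(4, k))            # a batch of 4 token embeddings
print(lora_forward(x).shape)           # (4, d)
print("trainable params:", A.size + B.size, "vs. full fine-tuning:", W.size)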
3.1.5. Spatiotemporal Energy Decay
This is a core phenomenon identified by the authors. It describes the empirical observation that in video diffusion models, the post-softmax attention scores between tokens tend to diminish as the spatial distance (distance within a frame) or temporal distance (distance between frames) between them increases. This suggests that tokens closer in space and time have stronger, more relevant interactions, while distant tokens have weaker, less critical interactions. The decay pattern is empirically found to follow an exponential distribution. This insight is crucial because it provides a principled basis for designing sparse attention mechanisms that can prune less important connections without significantly impacting quality, thereby saving computation.
3.2. Previous Works
The paper contextualizes Radial Attention within several lines of prior research:
3.2.1. Video Diffusion Models
Early approaches [13, 14, 15, 16] for video generation typically adapted 2D U-Nets (originally for image synthesis) by adding temporal modules to handle frame sequences. However, inspired by Sora [4] and DiT [18], the community has largely shifted towards DiT as the backbone. Latte [19] proposed decoupled spatial and temporal attention. More recently, state-of-the-art models [20, 21, 5, 22, 1, 7, 6] have adopted 3D dense attention to jointly model spatial-temporal dynamics and capture long-range dependencies. While powerful, this 3D dense attention introduces the quadratic computational cost that Radial Attention seeks to alleviate.
3.2.2. Efficient Video Generation
Many general techniques for accelerating image diffusion models (e.g., timestep distillation [23, 24], caching [25, 26], quantization [27, 28, 29], distributed inference [30, 31, 32]) are applicable to video. However, for video models relying on 3D dense attention, the bottleneck often shifts to the attention layers.
- Quantized Attention: Works like SageAttention [33, 34, 35, 36] and FlashAttention-3 [37] focus on quantizing attention computations for speedup.
- Sparse Attention for LLMs: In Large Language Models (LLMs), sparse attention is widely used [38, 39, 40, 41, 42, 43, 44, 45, 46]. LongLoRA [39] combines local sparse attention patterns with shifting to obtain a global receptive field for long-context modeling. PowerAttention [45] restricts attention to power-of-two token distances, achieving $O(n \log n)$ complexity. However, these methods are generally not video-specific and ignore the inherent spatiotemporal structure of video data.
- Video-Specific Sparse Attention: STA [47] (Sliding Tile Attention) uses sliding 3D windows for local attention, improving efficiency but limiting long-range dependencies and struggling with long videos due to its fixed receptive field. SVG [8] (Sparse VideoGen) dynamically classifies each attention head as either spatial or temporal and applies corresponding sparse masks. While efficient for inference, its dynamic profiling can misclassify heads on unseen data, making training on longer videos challenging.
3.2.3. Long Video Generation
The quadratic cost of dense attention makes training and inference on long videos very expensive.
- RIFLEx [50] extends video length by modifying RoPE (Rotary Position Embedding) [51] frequencies, allowing extrapolation but often suffering from quality degradation (e.g., blurring) for longer videos.
- Other approaches generate short video segments and stitch them together [52], or use autoregressive strategies like FramePack [53].
- Linear attention models [10, 9, 54, 55, 56, 57, 58, 59] offer $O(n)$ computation and global receptive fields but often struggle to capture fine local details [60], leading to quality degradation. SANA [10] is an example of a diffusion model backbone with linear attention.
3.2.4. Attention with $O(n \log n)$ complexity
Early efforts to achieve sub-quadratic attention complexity include:
- Reformer [61]: Approximates dense attention using locality-sensitive hashing.
- H-Transformer [62]: Imposes a hierarchical structure on the attention matrix.
- Multi-resolution attention [63]: Recursively refines high-attention regions.
- Fast Multipole Attention [64]: Adapts the classical fast multipole method.
- LogSparse Transformer [65]: Restricts each token to attend to $O(\log n)$ positions per layer for time-series forecasting.

Many of these methods are hardware-unfriendly or have limited scalability.
3.3. Technological Evolution
The field of generative AI has rapidly progressed from generating static images to dynamic, high-quality videos. This evolution has been driven by improvements in diffusion models and the adoption of Transformer architectures (specifically DiT for diffusion). Initially, video models adapted 2D image models, but soon recognized the need for true 3D spatiotemporal modeling, leading to the widespread use of 3D dense attention. However, this power came at a steep computational cost, particularly for longer videos.
This computational bottleneck then spurred research into efficient attention mechanisms. Early solutions for LLMs focused on generic sparse attention patterns, but these often failed to account for the unique spatiotemporal structure of video data. Video-specific sparse methods like STA and SVG emerged, but STA had limited receptive fields, and SVG's dynamic nature posed training challenges. Meanwhile, linear attention offered scalability but often compromised quality.
Radial Attention fits into this timeline as an evolution that specifically leverages the observed Spatiotemporal Energy Decay in video data. It addresses the limitations of prior video-specific sparse methods by using a static, principled mask that is both efficient for training and inference and more expressive than linear attention, while maintaining compatibility with existing model ecosystems via LoRA.
3.4. Differentiation Analysis
Radial Attention differentiates itself from existing methods primarily by:
- Principled Sparsity based on Energy Decay: Unlike generic sparse attention patterns (LongLoRA, PowerAttention) which don't consider video's inherent structure, or STA with its fixed local windows, Radial Attention explicitly models and leverages Spatiotemporal Energy Decay. It uses this observed phenomenon to intelligently allocate compute density, focusing on important proximal interactions while sparsely sampling distant ones.
- Static vs. Dynamic Mask: Radial Attention employs a static mask, which is a significant advantage over SVG's dynamic, profiling-based approach. While SVG dynamically classifies attention heads at runtime, this introduces overhead and makes training on unseen longer video distributions unreliable. Radial Attention's static mask is predictable, hardware-friendly, and enables efficient training and adaptation to longer videos without runtime classification errors.
- Expressiveness vs. Efficiency Trade-off: Radial Attention strikes a balance between the high expressiveness but quadratic cost of dense attention and the linear cost but reduced local detail capture of linear attention (e.g., SANA). By retaining softmax attention and selectively pruning, it achieves $O(n \log n)$ complexity while preserving the fidelity of local interactions critical for video quality.
- Unification of Spatial and Temporal Attention: It unifies the decoupled spatial and temporal attention often seen in other models (and implicitly in SVG) into a single, cohesive sparsity pattern. The mask naturally captures dense spatial interactions within central bands and adapts temporal sparsity based on distance.
- Efficient Length Extrapolation with LoRA: Radial Attention is designed to be highly compatible with LoRA for fine-tuning pre-trained models to generate longer videos. This allows for significant cost reduction in extending model capabilities, outperforming direct fine-tuning with dense attention or other length extrapolation methods like RIFLEx in quality.

In essence, Radial Attention offers a theoretically grounded, practically efficient, and adaptively extendable solution for long video generation, addressing key limitations of existing sparse and linear attention mechanisms.
4. Methodology
4.1. Principles
The core principle behind Radial Attention is the phenomenon of Spatiotemporal Energy Decay observed in video diffusion models. This principle posits that the relevance of an interaction between two tokens, as measured by their post-softmax attention scores, diminishes exponentially as their spatial or temporal distance increases.
The intuition is akin to how physical signals or waves lose energy as they propagate through space and time. In the context of video, a token representing a pixel at a certain location and time is most strongly influenced by nearby pixels in the same frame (spatial locality) and by pixels at similar locations in adjacent frames (temporal coherence). Interactions with pixels far away in space or time are generally less critical for generating high-quality, coherent video.
Radial Attention translates this energy decay into compute density decay. Instead of uniformly computing all pairwise attention scores (as in dense attention), it intelligently allocates computational resources:
- High Density for Close Tokens: It prioritizes dense attention computation for tokens that are spatially and temporally close, where attention energy is concentrated.
- Sparsely Sampled for Distant Tokens: For tokens that are far apart in time, the compute density is reduced by sparsely sampling interactions, reflecting their diminished attention energy. This reduction is implemented in a structured, exponential-decay manner.

By designing a static attention mask based on this principle, Radial Attention aims to achieve sub-quadratic complexity ($O(n \log n)$) without sacrificing the essential spatiotemporal dependencies needed for high-quality video generation.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Spatiotemporal Energy Decay in Attention
The paper empirically identifies and characterizes Spatiotemporal Energy Decay. As shown in Figure 4(a), spatial attention maps focus on tokens within adjacent frames, while temporal attention maps focus on tokens at the same spatial location across different frames. Figure 4(b) illustrates the distributions of attention scores.
The authors observe that:
- In Figure 4(b1), the average attention score between tokens at the same spatial location decreases significantly as their temporal distance increases.
- In Figure 4(b2), the average attention score between tokens within the same frame decreases as their spatial distance increases.

These observations confirm that attention scores exhibit a clear decay pattern with increasing distance between query and key tokens. Regression analysis (Figure 8(b)) further suggests that this decay closely follows an exponential distribution, with an $R^2$ value over 0.985, empirically validating the choice of an exponential decay model.
To formally capture this, assume a video latent consists of $f$ frames, with $s$ tokens per frame, totaling $n = fs$ tokens. Consider a query token at the $k_0$-th spatial position of the $i_0$-th frame. The post-softmax attention score for this query, denoted by $p \in \mathbb{R}^n$, is given by the softmax of the query's scaled dot products with all keys.
The paper then states that there exist constants $C_{\mathrm{rel}} > 0$, $\alpha > 0$, and $\beta > 0$ such that for any key token at spatial position $l$ in frame $j$, the attention score satisfies:
$
p_{js+l} \leq C_{\mathrm{rel}} \, e^{-\alpha \left| j - i_0 \right| - \beta \left| l - k_0 \right|} \, p_{i_0 s + k_0}
$
Where:
- $p_{js+l}$: The attention score for the key token at spatial position $l$ in frame $j$.
- $p_{i_0 s + k_0}$: The attention score for the query token itself (which serves as a reference point in Appendix A.2).
- $|j - i_0|$: The temporal distance between the query frame $i_0$ and the key frame $j$.
- $|l - k_0|$: The spatial distance between the query token and key token within their respective frames.
- $\alpha$: The temporal decay rate. A higher $\alpha$ indicates faster decay with temporal distance.
- $\beta$: The spatial decay rate. A higher $\beta$ indicates faster decay with spatial distance.
- $C_{\mathrm{rel}}$: A positive constant that relates the attention score of a distant token to the self-attention score.

This equation formalizes the Spatiotemporal Energy Decay. When $\beta$ is high and $\alpha$ is low, it models temporal attention (strong spatial locality, slow temporal decay). Conversely, a high $\alpha$ and low $\beta$ capture spatial attention (strong temporal locality, slow spatial decay). This unified exponential decay model motivates the design of a single sparsity pattern for Radial Attention.
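To make the bound concrete, here is a small illustrative calculation; the decay rates below are assumed for the example and are not values reported in the paper:
$
\alpha = 0.5,\ \beta = 1.0,\ |j - i_0| = 2,\ |l - k_0| = 3 \;\Rightarrow\; p_{js+l} \leq C_{\mathrm{rel}}\, e^{-0.5 \cdot 2 - 1.0 \cdot 3}\, p_{i_0 s + k_0} \approx 0.018\, C_{\mathrm{rel}}\, p_{i_0 s + k_0}
$
In other words, under these assumed rates a key token two frames and three spatial positions away can carry only about 2% of the reference attention score (up to the constant $C_{\mathrm{rel}}$), which is what justifies pruning most distant interactions.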
4.2.2. Radial Attention: Convert the Energy Decay to Compute Density Decay
Radial Attention employs a static mask to dynamically adjust compute density based on spatiotemporal distance, mimicking the observed energy decay.
4.2.2.1. Temporal Density Decay
Along the temporal dimension, Radial Attention applies an exponential decay rule. The compute density between tokens in frame $i$ and frame $j$ is proportional to $2^{-\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}$.
This creates a structured pattern of diagonal bands on the attention map (as illustrated in Figure 5(a)):
- Bands: The attention map is divided into diagonal bands indexed by temporal distance.
- Central Band (Band 0): This band, centered on the main diagonal (where $|i - j| \leq 1$), retains full compute density. This means all tokens within the same frame attend to each other densely, and tokens in adjacent frames also attend densely.
- Successive Bands: Each successive band moving outward from the center (e.g., band 1, band 2, etc.) has half the compute density of the preceding one. The attention from tokens in frame $i$ to frame $j$ falls into band $\lfloor \log_2 \max(|i - j|, 1) \rfloor$.
- Diagonal Width Doubling: Except for band 1, each band also doubles the temporal width of its predecessor, distributing the reduced density across more interactions in farther frames.

This radial decay effect is visualized by progressively lighter colors in Figure 5(a), indicating reduced computation for larger temporal distances.
4.2.2.2. Spatial Density Decay
Within each frame-to-frame attention block (i.e., for interactions between a specific frame $i$ and frame $j$), Radial Attention preserves high-energy interactions concentrated at similar spatial locations. This results in diagonal-like structures within these blocks (see Figure 5(b)).
- Shrinking Diagonal Width: Due to the temporal decay discussed above, the width of these spatial diagonals shrinks as the temporal distance between frames increases. Specifically, for attention between frame $i$ and frame $j$, the diagonal width is given by $s / 2^{\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}$.
- Reduced Diagonal Frequency: When the calculated diagonal width drops below 1 (i.e., for very large temporal distances, $2^{\lfloor \log_2 \max(|i - j|, 1) \rfloor} > s$), the mechanism further reduces compute density by reducing the frequency of diagonals. Instead of keeping a diagonal in every block, it only retains the diagonal in those blocks where $|i - j| \bmod \lceil 2^{\lfloor \log_2 \max(|i - j|, 1) \rfloor} / s \rceil = 0$. This ensures the same amortized attention density decay is maintained. (A worked example of the width rule is given below.)
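As a quick illustration of the width rule, using an assumed per-frame token count of $s = 3600$ (an example value, not one taken from the paper):
$
w(|i - j|) = \frac{s}{2^{\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}}: \quad w(1) = 3600,\quad w(2) = w(3) = 1800,\quad w(8) = 450,\quad w(64) \approx 56
$
So adjacent frames attend with a full-width diagonal, while frames 64 apart keep only a narrow band of roughly 56 spatial neighbors.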
4.2.2.3. Formal Definition
The Radial Attention mask is formally defined as a 4D mask $M \in \{0, -\infty\}^{f \times s \times f \times s}$. An element $M_{i,k,j,l} = 0$ means the token at spatial position $k$ in frame $i$ is allowed to attend to the token at spatial position $l$ in frame $j$. Conversely, $M_{i,k,j,l} = -\infty$ means attention between this pair is suppressed.
The mask is constructed based on two main conditions:
- Condition for dense spatial interactions with shrinking window:
$
M_{i,k,j,l} = 0, \quad \mathrm{if}\ 2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor} \leq s \ \mathrm{and}\ |k - l| + 1 \leq \frac{s}{2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor}}
$
  - This condition applies when the temporal distance is relatively small, such that the resulting window size is at least 1 (i.e., $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} \leq s$).
  - It allows attention for tokens whose spatial distance plus one is within the calculated shrinking diagonal width. This captures the dense spatial interactions that decrease in width as temporal distance increases.
- Condition for sparse sampling of same-position tokens at large temporal distances:
$
M_{i,k,j,l} = 0, \quad \mathrm{if}\ |i - j| \bmod \left\lceil \frac{2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor}}{s} \right\rceil = 0 \ \mathrm{and}\ k = l
$
  - This condition comes into play when the temporal distance is large and the calculated diagonal width from the first condition would drop below 1 (i.e., $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} > s$).
  - Instead of a dense diagonal, it only permits attention for tokens at the exact same spatial position ($k = l$), and only if the temporal distance satisfies the modulus condition above. This samples very sparsely along the temporal dimension for tokens at the same spatial location, and the modulus term ensures an amortized density decay.
- Otherwise, attention is suppressed:
$
M_{i,k,j,l} = -\infty, \quad \mathrm{otherwise.}
$

The final attention mask $\tilde{M} \in \{0, -\infty\}^{n \times n}$ used in the attention operation (Equation 2 in the paper) is obtained by flattening the frame and spatial indices: $\tilde{M}_{is+k,\, js+l} = M_{i,k,j,l}$. For better quality, an attention sink [38, 8] is incorporated, typically by keeping dense attention to the first frame, which acts as a sink. Figure 5(c) provides a visual example of the resulting sparse mask for a 253-frame, 720p video generated by HunyuanVideo. This design preserves high-priority local interactions while sparsely sampling distant ones.
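The two conditions translate almost line-by-line into code. Below is a minimal NumPy sketch that materializes a boolean version of the mask for $f$ frames with $s$ tokens per frame (True standing for 0, False for $-\infty$); the function name is ours, the attention sink is omitted, and the released implementation instead builds hardware-friendly block-sparse masks rather than this dense boolean matrix:

```python
import math
import numpy as np

def radial_mask(f: int, s: int) -> np.ndarray:
    """Boolean (f*s, f*s) mask: True = attend (logit 0), False = suppress (-inf).

    Token (frame i, spatial position k) is flattened to index i*s + k,
    matching the flattening used for the mask M-tilde above.
    """
    n = f * s
    allowed = np.zeros((n, n), dtype=bool)
    for i in range(f):              # query frame
        for j in range(f):          # key frame
            band = int(math.floor(math.log2(max(abs(i - j), 1))))
            if 2 ** band <= s:
                # Condition 1: dense diagonal whose width shrinks with temporal distance.
                width = s // (2 ** band)          # |k - l| + 1 <= width
                for k in range(s):
                    lo = max(0, k - width + 1)
                    hi = min(s, k + width)
                    allowed[i * s + k, j * s + lo:j * s + hi] = True
            else:
                # Condition 2: keep only same-position tokens, and only for a
                # sparsely sampled subset of frame pairs.
                stride = math.ceil(2 ** band / s)
                if abs(i - j) % stride == 0:
                    idx = np.arange(s)
                    allowed[i * s + idx, j * s + idx] = True
    return allowed
```

For small inputs the expected behavior is easy to check: `radial_mask(2, 4)` is all True (both frames fall in the central band), while distant frame pairs contribute only sparse same-position diagonals.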
4.2.2.4. Relation to SVG
Radial Attention unifies the separate spatial and temporal attention mechanisms found in SVG [8] into a single, comprehensive mask.
- Spatial Attention: The central band (band 0) of the Radial Attention mask inherently captures dense spatial interactions, effectively incorporating SVG's spatial attention.
- Temporal Attention: Unlike SVG, which might allocate uniform (and thus potentially unnecessary) computation to all temporal interactions, Radial Attention explicitly models temporal decay. It reduces attention to distant frames that have low relevance and reallocates this computational budget to tokens nearer in time, leading to both improved efficiency and enhanced modeling capacity compared to SVG.
4.2.2.5. Complexity Analysis
The computational cost of masked attention is directly proportional to the number of zero (i.e., unmasked) entries in the attention mask $\tilde{M}$. The paper derives an upper bound for this number.
The total number of zeros in the mask is the sum of zeros from three regions:
- Central Band and Attention Sink: This includes frame pairs with temporal distance $|i - j| \leq 1$ and the attention sink on the first frame. The number of zeros here is at most a constant multiple of $sn$.
- Bands with Diagonal Width $\geq 1$: These are bands where $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} \leq s$. Each such band contributes zeros on the order of $sn$, and there are about $\log_2 s$ of them.
- Bands with Diagonal Width $< 1$: These are bands where $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} > s$, in which only same-position diagonal elements are sparsely sampled. Their total contribution is on the order of $sn(\log_2 n - \log_2 s)$ zeros.

Combining these, the aggregate upper bound for the number of zeros in Radial Attention's mask, for a large number of frames with fixed resolution $s$, simplifies to:
$
\#\,\mathrm{zeros\ in\ } \tilde{M} \leq 4 s n \left( \log_2 n - \log_2 s \right)
$
Where:
- $n$ is the total number of tokens.
- $s$ is the number of spatial tokens per frame (fixed resolution).
- $f = n / s$ is the number of frames.

This bound shows that the computational complexity scales as $O(n \log n)$, which is significantly more efficient than the $O(n^2)$ complexity of dense attention. Empirical results on HunyuanVideo (Figure 2) confirm this, showing a substantial reduction in attention computation for longer videos.
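A quick back-of-the-envelope check of the bound, with illustrative sizes (the values of `s` and `f` below are assumptions for the sake of the example, not settings from the paper):

```python
import math

s, f = 3600, 128               # assumed spatial tokens per frame and frame count
n = s * f                      # total tokens

radial_pairs = 4 * s * n * (math.log2(n) - math.log2(s))  # upper bound from above
dense_pairs = n ** 2                                      # dense attention pairs

print(f"n = {n:,}")
print(f"radial upper bound / dense = {radial_pairs / dense_pairs:.3f}")
```

With $s$ fixed, the ratio shrinks roughly like $\log_2(f)/f$ as the number of frames $f$ grows, which is where the practical savings for longer videos come from.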
4.2.2.6. Error Analysis
The paper also provides an error bound for the approximation introduced by Radial Attention's sparsity. Given the exact softmax attention scores $p$ and the masked attention scores $\tilde{p}$, the attention error is bounded as:
$
\left\| \tilde{p} - p \right\|_1 \le C_{\mathrm{rel}} \left[ \frac{8 e^{-\beta \left( \frac{s}{2} + 1 \right)}}{(1 - e^{-\alpha})(1 - e^{-\beta})} + 4 \, \frac{1 + e^{-\beta}}{1 - e^{-\beta}} \cdot \frac{e^{-\alpha (s + 1)}}{1 - e^{-\alpha}} \right] = O\!\left( C_{\mathrm{rel}} \, e^{-\min(\beta/2,\, \alpha)\, s} \right)
$
Where:
- $\|\tilde{p} - p\|_1$: The $\ell_1$ norm of the difference between the masked and exact attention distributions, representing the total variation error.
- $C_{\mathrm{rel}}$, $\alpha$, $\beta$, $s$: Same as defined for Spatiotemporal Energy Decay.

This bound indicates that the approximation error decreases exponentially with larger spatial decay rate $\beta$, larger temporal decay rate $\alpha$, and increasing spatial resolution $s$. This theoretical result supports the effectiveness of the mask design in Radial Attention, showing that if attention indeed decays exponentially with distance, then the sparse approximation will have a small and rapidly diminishing error. Empirically, Radial Attention achieves smaller Mean Squared Error (MSE) than SVG and STA (Section 5.3), further validating its fidelity.
4.2.3. Hardware-friendly Block Sparsity
To ensure efficient execution on modern GPU hardware, attention computations are typically performed over larger blocks (e.g., $128 \times 128$ token blocks) rather than individual tokens [68, 8, 40, 43, 44, 66]. This block-sparse approach allows for better memory access patterns and parallelism, which are crucial for leveraging high-performance attention kernels like FlashAttention [66, 67]. Radial Attention's mask design is compatible with such block-sparse implementations, enabling practical speedups.
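As a rough illustration of how a token-level mask would be coarsened for such kernels, here is a hypothetical helper (the block size, function name, and dense representation are ours, not the paper's kernel code):

```python
import numpy as np

def to_block_mask(token_mask: np.ndarray, block: int = 128) -> np.ndarray:
    """Coarsen a token-level boolean mask to a block-level mask.

    A block of queries/keys is scheduled for computation if any token pair
    inside it is allowed, which is how a block-sparse kernel would consume
    the radial mask.
    """
    n = token_mask.shape[0]
    nb = (n + block - 1) // block          # number of blocks per side (ceil)
    block_mask = np.zeros((nb, nb), dtype=bool)
    for bi in range(nb):
        for bj in range(nb):
            tile = token_mask[bi * block:(bi + 1) * block,
                              bj * block:(bj + 1) * block]
            block_mask[bi, bj] = tile.any()  # keep block if any pair is allowed
    return block_mask
```

Combined with the earlier `radial_mask` sketch, `to_block_mask(radial_mask(f, s))` would give the per-block schedule that a block-sparse attention kernel iterates over.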
4.2.4. Low-Rank Adaptation for Long Videos
Even with an efficient attention mechanism, extending pre-trained models (originally trained on short videos) to long videos presents a challenge. Direct training on long videos is computationally prohibitive, and training-free extrapolation methods (like RIFLEx [50]) often lead to quality degradation due to length distribution mismatch.
Radial Attention addresses this by:
- Reducing Training Complexity: By achieving $O(n \log n)$ complexity, Radial Attention makes fine-tuning on long videos computationally feasible.
- Preserving Critical Relations: The sparse mask of Radial Attention is designed to retain essential inter-token relations, allowing the original pre-trained weights to remain largely intact. This means only minimal fine-tuning is required.

To further minimize training overhead, the authors incorporate Low-Rank Adapters (LoRA) [11, 39] into the attention mechanism.
- LoRA Application: LoRA is applied to the query, key, value, and output projections of the attention layers. As explained in Section 3.1.4, this involves adding small, trainable low-rank matrices to these weight matrices while freezing the original pre-trained weights.
- Benefits: LoRA significantly reduces the number of trainable parameters, leading to:
  - Reduced memory footprint during training.
  - Faster fine-tuning times.
  - Improved video quality: Empirically, LoRA fine-tuning with Radial Attention not only minimizes overhead but also refines the most critical weights and attention mechanisms more effectively, often leading to better video quality than full-parameter fine-tuning.
- Compatibility: This length-extension LoRA is also compatible with existing style LoRAs, allowing users to combine long video generation with specific artistic styles.

This approach offers a scalable and cost-effective way to adapt existing high-quality video diffusion models to generate much longer sequences.
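A toy continuation of the NumPy LoRA sketch from Section 3.1.4, illustrating why two independently trained adapters (e.g., a length-extension LoRA and a style LoRA) can be merged into the same base weight; sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 1024, 1024, 16

W = rng.normal(size=(d, k))                       # frozen base projection
# Two independently trained adapters: length-extension and style.
A_len, B_len = rng.normal(size=(r, k)) * 0.01, rng.normal(size=(d, r)) * 0.01
A_sty, B_sty = rng.normal(size=(r, k)) * 0.01, rng.normal(size=(d, r)) * 0.01

# Because each adapter is just an additive low-rank update, both can be
# folded into the same weight at load time.
W_merged = W + B_len @ A_len + B_sty @ A_sty
print(W_merged.shape)                             # (1024, 1024)
```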
5. Experimental Setup
5.1. Datasets
For fine-tuning models to generate longer videos, the experiments primarily use videos from OpenVid-1M [74].
- Source and Characteristics: OpenVid-1M is a large-scale, high-quality dataset for text-to-video generation.
- Selection: The authors specifically sample 2,000 top-scoring videos (based on aesthetic and motion scores) for each extended length (e.g., 2× and 4× longer than default). This selective sampling ensures that the fine-tuning process leverages high-quality, diverse video content relevant to the target longer durations.
- Purpose: These datasets are chosen to provide sufficient high-quality examples for the LoRA-based fine-tuning process, allowing the models to learn temporal coherence and visual fidelity for extended video lengths, which they were not originally trained for.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate both the fidelity (similarity to original, perceptual quality) and consistency (for long videos) of the generated videos, as well as computational efficiency.
5.2.1. Vision Reward (↑)
- Conceptual Definition: Vision Reward [69] is a metric designed to approximate human ratings of generated videos. It quantifies how much a generated video aligns with human preferences, focusing on aspects like visual quality, aesthetic appeal, and adherence to the text prompt. A higher Vision Reward indicates better human perception.
- Mathematical Formula: The paper does not provide a specific mathematical formula for Vision Reward but refers to [69]. Vision Reward models are trained on human preference data and output a scalar score, so there isn't a single, universally defined analytical formula like PSNR; it is the output of a learned reward function.
- Symbol Explanation:
  - ↑: Indicates that a higher value of Vision Reward is better.
5.2.2. PSNR (Peak Signal-to-Noise Ratio) (↑)
- Conceptual Definition:
PSNRis a common image and video quality metric used to quantify the numerical similarity between a compressed or processed image/video and its original, uncompressed version. It is often expressed in decibels (dB). A higher PSNR value generally indicates a higher quality reconstruction, implying less distortion from the original. - Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $ Where: $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
- Symbol Explanation:
  - $MAX_I$: The maximum possible pixel value of the image/video. For 8-bit grayscale images, this is 255.
  - MSE: Mean Squared Error between the original image $I$ and the reconstructed image $K$.
  - I(i,j): The pixel value at coordinates (i,j) in the original image.
  - K(i,j): The pixel value at coordinates (i,j) in the reconstructed image.
  - M, N: The dimensions (height and width) of the image.
  - ↑: Indicates that a higher value of PSNR is better.
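A toy NumPy computation of the PSNR formula above on a pair of frames (per-frame only; the paper's evaluation pipeline and frame averaging are not reproduced here):

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally shaped frames."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames: no distortion
    return 10.0 * np.log10(max_val ** 2 / mse)

frame_a = np.random.default_rng(0).integers(0, 256, size=(720, 1280), dtype=np.uint8)
noise = np.random.default_rng(1).integers(-5, 6, size=frame_a.shape)
frame_b = np.clip(frame_a.astype(int) + noise, 0, 255)   # lightly perturbed copy
print(f"PSNR: {psnr(frame_a, frame_b):.2f} dB")
```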
5.2.3. SSIM (Structural Similarity Index Measure) (↑)
- Conceptual Definition:
SSIM[70] is a perceptual metric that quantifies the similarity between two images, considering three key components: luminance, contrast, and structure. UnlikePSNRwhich measures absolute error,SSIMaims to assess perceived quality by humans. It ranges from -1 to 1, where 1 indicates perfect structural similarity. - Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
- Symbol Explanation:
  - x, y: Two image patches being compared.
  - $\mu_x$, $\mu_y$: The means of $x$ and $y$.
  - $\sigma_x$, $\sigma_y$: The standard deviations of $x$ and $y$.
  - $\sigma_{xy}$: The covariance of $x$ and $y$.
  - $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$: Small constants to prevent division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and, by convention, $k_1 = 0.01$ and $k_2 = 0.03$.
  - ↑: Indicates that a higher value of SSIM is better.
5.2.4. LPIPS (Learned Perceptual Image Patch Similarity) (↓)
- Conceptual Definition: LPIPS [70] is a perceptual distance metric that evaluates the similarity between two images based on the activations of a pre-trained deep neural network (e.g., VGG, AlexNet). Instead of pixel-wise differences, LPIPS compares feature representations from deep layers, which correlates better with human judgment of perceptual similarity. A lower LPIPS score indicates higher perceptual similarity (less perceptual difference).
- Mathematical Formula:
$
\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(y)_{hw} \right) \right\|_2^2
$
- Symbol Explanation:
  - x, y: Two images being compared.
  - $\phi_l(\cdot)$: Feature stack from layer $l$ of a pre-trained network.
  - $w_l$: A learned scaling weight for each channel in layer $l$.
  - $\odot$: Element-wise product.
  - $H_l$, $W_l$: Height and width of the feature map at layer $l$.
  - $\|\cdot\|_2^2$: Squared $\ell_2$ norm.
  - ↓: Indicates that a lower value of LPIPS is better.
5.2.5. VBench-long [71]
- Conceptual Definition:
VBench-long [71] is a specialized benchmark suite designed to evaluate video generation models, particularly focusing on their ability to generate long, consistent, and high-quality videos. It provides several metrics tailored to address challenges unique to long video generation, such as subject consistency over time and sustained aesthetic quality.
- Specific Metrics Reported:
  - Subject Consistency (S.C.): Measures how well the main subjects/entities in the video maintain their identity and appearance throughout the entire duration.
  - Aesthetic Quality (A.Q.): Evaluates the overall visual appeal, artistic merit, and general aesthetic pleasingness of the generated video.
  - Imaging Quality (I.Q.): Assesses the low-level visual quality, such as clarity, lack of artifacts, resolution, and overall visual fidelity.
- Symbol Explanation:
↑: Indicates that a higher value for theseVBenchmetrics is generally better.
5.2.6. PFLOPs (Peta Floating Point Operations)
- Conceptual Definition:
PFLOPs refers to Peta Floating-Point Operations; one PFLOP is $10^{15}$ floating-point operations. It is a measure of computational cost or workload. In this context, it quantifies the total number of floating-point operations required for a given inference or training task, so lower PFLOPs indicate higher computational efficiency.
5.2.7. Latency (s) / Inference Time (s)
- Conceptual Definition:
Latency or Inference Time refers to the time taken (in seconds) for a model to generate a video from start to finish. Lower latency indicates faster inference.
5.2.8. Speedup (x)
- Conceptual Definition:
Speedup is a performance metric that compares the time taken by a baseline method to that of an optimized method. It is calculated as (Baseline Time / Optimized Time). A speedup of 2× means the optimized method is twice as fast as the baseline.
5.2.9. Training Time (h) / Training Speedup (x)
- Conceptual Definition:
Training Time measures the total duration (in hours) required to fine-tune a model. Training Speedup is the ratio of the baseline training time to the optimized training time.
5.3. Baselines
Radial Attention is compared against a comprehensive set of baselines representing different approaches to efficient video generation:
- Original Model (Dense Attention): The full, unmodified video diffusion model with standard dense attention. This serves as the upper bound for quality (assuming unlimited compute) and the baseline for computational cost ($O(n^2)$).
- STA [47] (Sliding Tile Attention): A video-specific sparse attention method that uses sliding 3D windows for local attention. It improves efficiency but has a limited fixed receptive field, making it struggle with long-range dependencies. The paper uses FlashAttention-3 (FA3) [37] for STA for maximum efficiency.
- PA [45] (PowerAttention): An $O(n \log n)$ sparse attention mechanism for LLMs that restricts attention to tokens at power-of-two distances. It is a general-purpose sparse attention but does not specifically account for the spatiotemporal structure of video.
- SVG [8] (Sparse VideoGen): Another video-specific sparse attention method that dynamically classifies attention heads as spatial or temporal and applies corresponding masks. It improves inference efficiency but faces challenges during training and with unseen long video distributions.
- Spatial/Temporal (Heads from SVG): These refer to the specific attention masks used in SVG's dedicated spatial and temporal heads, as described in Section 4.1. They represent specialized sparse patterns.
- LongLoRA [39]: An efficient fine-tuning method for LLMs that extends context windows using shifted local attention. It provides a global receptive field but is not inherently designed for video's spatiotemporal characteristics.
- SANA [10]: An efficient diffusion model backbone that replaces softmax attention with linear attention. While offering $O(n)$ complexity, linear attention often struggles to capture local details and requires significant architectural changes, making it difficult to adapt with simple fine-tuning for video length extension.
- RIFLEx [50]: A training-free method for video length extrapolation that modifies RoPE (Rotary Position Embedding) frequencies. It aims to extend video length without retraining but can suffer from quality degradation.
- Full (LoRA fine-tuned Dense Attention): For longer video generation experiments, this baseline represents the original model fine-tuned using LoRA but retaining dense attention. This serves as a strong quality baseline for LoRA-based fine-tuning at extended lengths, allowing direct comparison of Radial Attention's quality and efficiency.
- Harmonic Series Decay Attention (HS): (Ablation study baseline) An alternative attention mask design where the computed diagonal width is inversely proportional to its distance from the main diagonal. This is used to validate the specific design choices of Radial Attention's sparsity pattern.
5.4. Implementation Details
- Models Benchmarked: Mochi 1 [22] (10B params), HunyuanVideo [1] (13B params), Wan2.1 [7] (14B params).
- Inference Accelerators:
FlashInfer[72] for inference,Block-Sparse-Attention[73] withFlashAttention-2 (FA2)[67] backend during training. - Default-Length Inference Settings:
- HunyuanVideo: 117 frames, 768p resolution (768x1280).
- Wan2.1: 69 frames, 768p resolution.
- Warm-up: Dense attention applied during the first 12 diffusion steps for all models.
- Initial DiT Block: Dense attention kept in the first
DiTblock to maintain quality. - Hardware: Single NVIDIA H100 GPU for latency measurements.
- Longer-Video Generation Fine-tuning Settings:
- Dataset: 2k top-scoring videos from
OpenVid-1M[74] for each extended length. - Hardware: 8 H100 GPUs for training.
- Batch Size: Global batch size of 1 with sequence parallelism for HunyuanVideo and Mochi 1; global batch size of 8 for Wan2.1.
- Initial DiT Blocks: Dense attention kept in the first two
DiTblocks. - Training Steps:
- HunyuanVideo: length (2400 steps), length (1200 steps).
- Mochi 1: and length (5000 steps).
- Wan2.1-14B: length (2500 steps).
- Training Time: Approximately 16-21 hours for HunyuanVideo, 8-17 hours for Mochi 1, and 15 hours for Wan 2.1.
- LoRA Rank: 128 for all training tasks.
- Inference Hardware: Single H100 for Wan 2.1; 8 H100s for HunyuanVideo and Mochi 1.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Training-free Inference Acceleration at Default Video Length
The following are the results from Table 1 of the original paper:
| Model | Method | PSNR (↑) | SSIM (↑) | LPIPS (↓) | Vision Reward (↑) | PFLOPs | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| HunyuanVideo (117 frames) | Original | − | − | − | 0.141 | 612 | 1649 | − |
| | STA (FA3) | 26.7 | 0.866 | 0.167 | 0.132 | 331 | 719 | 2.29× |
| | PA | 22.1 | 0.764 | 0.256 | 0.140 | 339 | 1002 | 1.65× |
| | SVG | 27.2 | 0.895 | 0.114 | 0.144 | 340 | 867 | 1.90× |
| | Ours | 27.3 | 0.886 | 0.114 | 0.139 | 339 | 876 | 1.88× |
| Wan2.1-14B (69 frames) | Original | − | − | − | 0.136 | 560 | 1630 | − |
| | STA (FA3) | 22.9 | 0.830 | 0.171 | 0.132 | 322 | 812 | 2.01× |
| | PA | 22.4 | 0.790 | 0.176 | 0.126 | 324 | 978 | 1.67× |
| | SVG | 23.2 | 0.825 | 0.202 | 0.114 | 324 | 949 | 1.71× |
| | Ours | 23.9 | 0.842 | 0.163 | 0.128 | 323 | 917 | 1.77× |
Table 1 presents a quantitative comparison of Radial Attention against other sparse attention baselines (STA, PA, SVG) and the Original dense attention model when generating videos at their default lengths. The evaluation uses PSNR, SSIM, LPIPS, and Vision Reward for quality, and PFLOPs, Latency, and Speedup for efficiency.
Key Observations:
- Quality Preservation: Radial Attention (Ours) consistently performs well in terms of video quality.
  - For HunyuanVideo, it achieves the highest PSNR (27.3) and an LPIPS (0.114) matching SVG, while having competitive SSIM (0.886) and Vision Reward (0.139). This indicates that it effectively preserves the visual quality of dense attention.
  - For Wan2.1-14B, it again achieves the highest PSNR (23.9) and SSIM (0.842) and a low LPIPS (0.163), demonstrating superior quality compared to other sparse baselines.
- Outperformance over STA and PA: Radial Attention significantly outperforms STA and PA across similarity metrics (PSNR, SSIM, LPIPS) for both models.
  - PA (PowerAttention), despite having a similar $O(n \log n)$ complexity, yields considerably lower PSNR and SSIM and higher LPIPS values. This suggests that its generic power-of-two distance sparsity pattern, which ignores the inherent spatiotemporal locality of video data, is suboptimal for video generation.
  - STA (with FlashAttention-3) achieves a slightly higher speedup for HunyuanVideo (2.29× vs. 1.88×) but at the cost of noticeably degraded visual quality, as evidenced by lower PSNR and SSIM and higher LPIPS. The authors attribute STA's higher speedup to its use of FlashAttention-3, while their current implementation uses FlashAttention-2.
- Matching SVG Quality: Radial Attention matches the video fidelity of SVG while offering competitive efficiency. SVG shows a slightly higher Vision Reward for HunyuanVideo (0.144 vs. 0.139), but Radial Attention has better PSNR and SSIM.
- Efficiency: Radial Attention achieves substantial end-to-end speedups: 1.88× for HunyuanVideo and 1.77× for Wan2.1-14B on a single H100 GPU. These speedups are largely consistent with the theoretical compute budget savings (e.g., roughly 1.8× fewer PFLOPs for HunyuanVideo: 339 vs. 612).

Visually, Figure 6 further illustrates that Radial Attention mirrors the video quality of the original Wan2.1-14B model at default length, showing similar visual fidelity for generated videos of fruit and a person.
6.1.2. Long Video Generation
The following are the results from Table 2 of the original paper:
| Model | #Frames | Method | Sparsity | Training Time (h) | Training Speedup | Inference Time (s) | Inference Speedup | Vision Reward (↑) | VBench S.C. | VBench A.Q. | VBench I.Q. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HunyuanVideo | 125 (1×) | Original | 0.00% | − | − | 225 | − | 0.119 | 0.959 | 0.643 | 0.672 |
| | 253 (2×) | Original | 0.00% | − | − | 797 | 1.00× | 0.122 | 0.953 | 0.603 | 0.611 |
| | | RIFLEx | 0.00% | − | − | 797 | 1.00× | 0.128 | 0.969 | 0.622 | 0.614 |
| | | Spatial | 80.5% | 16.0 | 2.81× | 335 | 2.38× | 0.054 | 0.979 | 0.607 | 0.670 |
| | | Temporal | 80.7% | 16.2 | 2.78× | 338 | 2.36× | 0.104 | 0.963 | 0.620 | 0.658 |
| | | LongLoRA | 80.6% | 16.6 | 2.71× | 363 | 2.20× | 0.112 | 0.958 | 0.620 | 0.685 |
| | | PA [45] | 80.4% | 16.7 | 2.69× | 334 | 2.39× | 0.109 | 0.967 | 0.608 | 0.653 |
| | | SANA | − | 12.8 | 3.52× | 285 | 2.80× | -0.205 | 0.907 | 0.300 | 0.442 |
| | | Full | 0.00% | 45.0 | 1.00× | 797 | 1.00× | 0.124 | 0.955 | 0.616 | 0.648 |
| | | Ours | 80.8% | 16.2 | 2.78× | 339 | 2.35× | 0.126 | 0.968 | 0.623 | 0.663 |
| | 509 (4×) | Original | 0.00% | − | − | 2895 | 1.00× | 0.054 | 0.988 | 0.545 | 0.451 |
| | | RIFLEx | 0.00% | − | − | 2895 | 1.00× | 0.037 | 0.989 | 0.539 | 0.456 |
| | | Spatial | 88.3% | 20.7 | 4.52× | 755 | 3.83× | 0.112 | 0.922 | 0.598 | 0.664 |
| | | Temporal | 88.2% | 21.1 | 4.44× | 774 | 3.74× | 0.083 | 0.972 | 0.597 | 0.646 |
| | | LongLoRA | 88.4% | 20.9 | 4.48× | 803 | 3.61× | 0.130 | 0.936 | 0.618 | 0.689 |
| | | PA [45] | 88.2% | 21.8 | 4.29× | 766 | 3.78× | 0.128 | 0.950 | 0.590 | 0.648 |
| | | Full | 0.00% | 93.6 | 1.00× | 2895 | 1.00× | 0.133 | 0.977 | 0.590 | 0.635 |
| | | Ours | 88.3% | 21.4 | 4.37× | 781 | 3.71× | 0.134 | 0.973 | 0.623 | 0.672 |
| Mochi 1 | Original | 0.00% 0.00% | − | 112 | − 1.00× | 0.071 0.040 | 0.973 0.937 | 0.623 | 0.672 | ||
| 331 (2×) | Original Spatial | 76.1% | − 8.57 | 1.75× | 302 186 | 1.62× | 0.088 | 0.935 | 0.551 0.596 | 0.466 0.595 | |
| Temporal | 76.3% | 8.54 | 1.76× | 189 | 1.60× | 0.075 | 0.936 | 0.591 | 0.593 | ||
| Long LoRA | 76.0% | 9.07 | 1.65× | 210 | 1.44× | 0.095 | 0.950 | 0.596 | 0.630 | ||
| PA [45] | 1.76× | 183 | 1.65× | 0.101 | 0.946 | 0.610 | 0.626 | ||||
| SAA | 77.8% | 8.53 | 1.82× | -0.201 | |||||||
| Full Ours | 0.00% 76.4% | 15.0 8.43 | 1.00× 1.78× | 302 185 | 1.00× 1.63× | 0.095 0.110 | 0.923 0.951 | 0.610 0.615 | 0.594 | ||
| Original | 0.602 | ||||||||||
| Wan2.1-14B | 0.00% | 992 | 1.00× | -0.091 | 0.916 | 0.383 | 0.322 | ||||
| Spatial | 85.2% | 17.4 | 2.83× | 382 | 2.60× | 0.091 | 0.930 | 0.611 | 0.585 | ||
| Temporal | 85.4% | 17.6 | 2.80× | 393 | 2.52× | 0.028 | 0.931 | 0.556 | 0.536 | ||
| 667 (4×) | Long LoRA | 86.0% | 19.0 | 2.59× | 426 | 2.33× | 0.086 | 0.944 | 0.584 | 0.543 | |
| PA [45] | 86.5% | 17.3 | 2.84× | 381 | 2.60× | 0.107 | 0.956 | 0.633 | 0.650 | ||
| Full | 0.00% | 49.2 | 1.00× | 992 | 1.00× | 0.099 | 0.934 | 0.613 | 0.613 0.638 | ||
| 81 (1×) | Ours | 85.5% | 17.4 | 2.83× | 386 | 2.57× | 0.113 | 0.958 | 0.618 | ||
Table 2 presents results for video generation at 2×, 4×, and the original lengths, comparing Radial Attention (Ours) against various baselines (including fine-tuned dense attention) across HunyuanVideo, Mochi 1, and Wan2.1-14B. Quality is assessed using Vision Reward and VBench metrics (Subject Consistency, Aesthetic Quality, Imaging Quality), while efficiency is measured by Training Time, Training Speedup, Inference Time, and Inference Speedup.
Key Observations on Quality:
- Original Models Degradation: Without fine-tuning, the Original models exhibit significant quality degradation for longer videos, especially at 4× length extension. For example, Original HunyuanVideo at 4× length has a very low Vision Reward of 0.054. Figure 7 visually confirms that Original HunyuanVideo and RIFLEx produce blurred videos with poor visual quality at 4× length extension.
- RIFLEx Limitations: RIFLEx improves performance slightly at 2× length but shows further quality deterioration beyond that (e.g., a HunyuanVideo 4×-length Vision Reward of 0.037), indicating limited extension capability.
- Sparse Baseline Shortcomings:
  - Spatial and Temporal attention (from SVG): Suffer from limited receptive fields, leading to quality issues.
  - LongLoRA and PA: Despite having global receptive fields, they fail to effectively capture spatiotemporal correlations specific to video, resulting in degraded quality compared to Radial Attention. Notably, PA shows a large gain in Vision Reward after fine-tuning, suggesting its initial generic sparse pattern mismatches the pre-trained model's attention distribution; fine-tuning helps it adapt.
- SANA Failure: SANA (which replaces softmax attention with linear attention) performs poorly with fine-tuning-based video length extension (a HunyuanVideo 2×-length Vision Reward of -0.205), indicating it requires massive retraining rather than just fine-tuning.
- Radial Attention's Superiority: Radial Attention consistently achieves quality on par with, or even slightly better than, LoRA fine-tuned dense attention models across all tested lengths and models.
  - For HunyuanVideo at 4× length, Radial Attention yields a Vision Reward of 0.134, comparable to Full (dense attention fine-tuned) at 0.133 and significantly higher than Original (0.054) or RIFLEx (0.037). It also maintains strong VBench scores (S.C. 0.973, A.Q. 0.623, I.Q. 0.672).
  - For Mochi 1 at 2× length, Radial Attention gets a Vision Reward of 0.113, again outperforming Full (0.099) and other baselines.
  - For Wan2.1-14B at 4× length, Radial Attention achieves the highest Vision Reward (0.161) and robust VBench scores.
- Visual Evidence: Figure 7 and Appendix Figures C, D, E visually demonstrate that Radial Attention produces videos with good visual quality and consistency even at extended lengths, superior to baselines that show blurring, distortion, or temporal inconsistencies.
Key Observations on Efficiency:
- Significant Speedups: Thanks to its $O(n \log n)$ complexity, Radial Attention delivers substantial speedups.
  - For HunyuanVideo at 4× length, it achieves a Training Speedup of 4.37× (21.4 hours vs. 93.6 hours for Full) and an Inference Speedup of 3.71× (781 s vs. 2895 s for Full), a significant reduction in both training and inference costs.
  - Similar speedups are observed for Mochi 1 (a 1.78× training speedup at 2× length) and Wan2.1-14B (a 2.83× training speedup at 4× length).
- Cost Reduction: For longer videos, Radial Attention reduces training costs by up to 4.4× and accelerates inference by up to 3.7× compared to dense attention.
6.1.3. Compatibility with Existing LoRAs
The paper highlights that Radial Attention offers seamless compatibility with pre-trained task-specific LoRAs, such as those for artistic style transfer. This is demonstrated by directly merging the LoRA weights trained for Radial Attention on long videos with existing style LoRAs.
Figure F in Appendix C.3 visually illustrates this compatibility. It shows that combining the extended-length LoRA from Radial Attention with a style LoRA (e.g., for a specific art style) successfully generates longer videos while preserving the intended artistic style. The Vision Reward remains comparable to that of original-length LoRA videos. The authors note that minor style discrepancies might occur due to the relatively small dataset used for training the extended-length LoRA, suggesting that a more comprehensive dataset could further improve consistency. This feature is crucial for practical applications, allowing users to extend video length without losing the benefits of existing style customizations.
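To make the merging step concrete, here is a minimal sketch of how two independent LoRA updates (a length-extension LoRA and a style LoRA) could be folded into one frozen weight matrix. The shapes, ranks, and scaling factors below are illustrative assumptions, not the repository's actual merging code.

```python
import torch

def merge_loras(base_weight, loras, alphas):
    """Fold several independent LoRA deltas into one frozen base weight.

    base_weight: (out_features, in_features) tensor of a pre-trained linear layer.
    loras:       list of (A, B) pairs with A: (rank, in_features), B: (out_features, rank).
    alphas:      per-LoRA scaling factors.
    """
    merged = base_weight.clone()
    for (A, B), alpha in zip(loras, alphas):
        merged += alpha * (B @ A)  # each LoRA contributes an independent low-rank update
    return merged

# Hypothetical example: a length-extension LoRA (rank 8) plus a style LoRA (rank 4)
# merged into one 128x128 projection weight. Shapes are illustrative only.
d = 128
W0 = torch.randn(d, d)
length_lora = (0.01 * torch.randn(8, d), 0.01 * torch.randn(d, 8))
style_lora = (0.01 * torch.randn(4, d), 0.01 * torch.randn(d, 4))
W = merge_loras(W0, [length_lora, style_lora], alphas=[1.0, 0.8])
print(W.shape)  # torch.Size([128, 128])
```

Because each LoRA is a low-rank additive delta on the same frozen weights, the two updates can simply be summed, which is why the style customization survives the length extension.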
6.2. Ablation Studies & Parameter Analysis
6.2.1. Effectiveness of Low-Rank Adaptation
Figure 8(a) compares the Vision Reward of full fine-tuning versus LoRA fine-tuning as video length increases.
- Dense Attention: For models using dense attention, LoRA fine-tuning initially lags behind full fine-tuning at shorter length extensions but catches up to, or slightly surpasses, it at the longest (4×) extension. This suggests that for dense attention, LoRA might require more data or a higher rank to fully match full fine-tuning at shorter extensions, while its benefits become more pronounced for very long sequences.
- Radial Attention: Crucially, when combined with Radial Attention, LoRA fine-tuning consistently matches or even outperforms full fine-tuning across all tested video lengths. This finding is significant: Radial Attention not only scales better computationally but also makes the model more amenable to LoRA adaptation, potentially by focusing the LoRA updates on the most critical connections preserved by the sparse mask. This validates the design choice of combining Radial Attention with LoRA for efficient, high-quality long-video generation. A minimal sketch of such a LoRA layer is given after this list.
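As a reference for what "LoRA fine-tuning" means here, below is a minimal sketch of a LoRA-wrapped linear layer: the pre-trained weight is frozen and only a low-rank correction is trained. The class name, rank, and scaling are hypothetical choices for illustration, not the paper's training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the scaled low-rank correction B (A x)
        return self.base(x) + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(nn.Linear(64, 64), rank=8)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```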
6.2.2. Attention Error
To quantify how well Radial Attention approximates dense attention, the authors evaluate the average attention output Mean Squared Error (MSE) on the Wan2.1-14B model.
Radial Attention achieves a significantly lower attention-output MSE than both SVG [8] and STA [47]. This indicates that its static mask design is more effective at preserving the fidelity of the attention output than other sparse attention mechanisms, yielding a more accurate approximation of full attention. This empirical finding supports the theoretical error bound derived in Section 4.2.2.6. A sketch of how such a measurement can be set up is given below.
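The sketch below illustrates how such an attention-output MSE could be measured: run the same queries, keys, and values through dense attention and through attention restricted by a sparse mask, then compare the outputs. The toy shapes and random mask are assumptions for illustration; the paper evaluates real Wan2.1-14B activations with the actual Radial/SVG/STA masks.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    """Softmax attention with disallowed entries set to -inf before the softmax."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def attention_output_mse(q, k, v, sparse_mask):
    dense_mask = torch.ones_like(sparse_mask, dtype=torch.bool)
    dense_out = masked_attention(q, k, v, dense_mask)
    sparse_out = masked_attention(q, k, v, sparse_mask)
    return F.mse_loss(sparse_out, dense_out)

# Toy example with random tensors and a random ~50%-dense mask.
n, d = 256, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
mask = torch.rand(1, n, n) < 0.5
mask |= torch.eye(n, dtype=torch.bool)  # keep the diagonal so every query keeps at least one key
print(attention_output_mse(q, k, v, mask).item())
```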
6.2.3. Regression Results
The authors perform a regression analysis on the empirically observed attention decay curves (Figure 4) using the exponential function $y = \exp(-ax + b)$. As shown in Figure 8(b), the fitted curves achieve $R^2$ values of over 0.985 for both the temporal-distance and the spatial-distance decay. These high $R^2$ values strongly suggest that an exponential function is a very good model of the Spatiotemporal Energy Decay phenomenon observed in video diffusion models, and this empirical validation justifies the theoretical assumption of exponential decay (Equation 3) that underpins the design of the Radial Attention mask. A minimal curve-fitting sketch is given below.
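A minimal curve-fitting sketch in the spirit of this regression: fit $y = \exp(-ax + b)$ by ordinary least squares on $\log y$ and report $R^2$. The decay curve below is synthetic; the paper fits the measured attention scores from Figure 4.

```python
import numpy as np

def fit_exponential_decay(x, y):
    """Fit y = exp(-a * x + b) via linear least squares on log(y); returns (a, b, r2)."""
    log_y = np.log(y)
    slope, intercept = np.polyfit(x, log_y, deg=1)  # log y ~ slope * x + intercept
    pred = slope * x + intercept
    ss_res = np.sum((log_y - pred) ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    return -slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic decay curve standing in for the measured attention scores.
x = np.arange(1, 50, dtype=float)
y = np.exp(-0.08 * x + 0.3) * (1 + 0.02 * np.random.randn(x.size))
a, b, r2 = fit_exponential_decay(x, y)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.4f}")
```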
6.2.4. More Ablations on Radial Attention Design Choices
6.2.4.1. Comparison with Other Sparsity Patterns
The following are the results from Table A of the original paper:
| Model | Method | PSNR (↑) | SSIM (↑) | LPIPS (↓) | VisionReward (↑) |
| HunyuanVideo (117 frames) | HS | 27.0 | 0.881 | 0.119 | 0.136 |
| Ours | 27.3 | 0.886 | 0.114 | 0.139 |
Table A compares Radial Attention with Harmonic Series Decay Attention (HS), an alternative sparsity pattern where diagonal width is inversely proportional to distance. Radial Attention consistently outperforms HS across all quality metrics (PSNR, SSIM, LPIPS, Vision Reward). This demonstrates the effectiveness of Radial Attention's specific mask design, which incorporates both temporal and spatial decay rules, making it superior to other generic sparsity patterns for video data.
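For intuition, here is a simplified, token-level sketch of the two sparsity patterns being compared: a radial mask whose spatial window halves each time the temporal distance doubles, and a harmonic-series mask whose window shrinks as 1/distance. This is a coarse illustration under those assumptions, not the repository's block-sparse kernel masks.

```python
import torch

def radial_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Simplified radial mask: same-frame tokens attend densely; across frames the
    spatial window halves each time the temporal distance doubles."""
    s = tokens_per_frame
    mask = torch.zeros(num_frames * s, num_frames * s, dtype=torch.bool)
    pos = torch.arange(s)
    for fi in range(num_frames):
        for fj in range(num_frames):
            dist = abs(fi - fj)
            if dist == 0:
                block = torch.ones(s, s, dtype=torch.bool)
            else:
                window = max(1, s >> dist.bit_length())  # s/2, s/4, s/4, s/8, ...
                block = (pos[:, None] - pos[None, :]).abs() <= window // 2
            mask[fi * s:(fi + 1) * s, fj * s:(fj + 1) * s] = block
    return mask

def harmonic_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """'Harmonic series' alternative: window width inversely proportional to distance."""
    s = tokens_per_frame
    mask = torch.zeros(num_frames * s, num_frames * s, dtype=torch.bool)
    pos = torch.arange(s)
    for fi in range(num_frames):
        for fj in range(num_frames):
            window = max(1, s // (abs(fi - fj) + 1))
            block = (pos[:, None] - pos[None, :]).abs() <= window // 2
            mask[fi * s:(fi + 1) * s, fj * s:(fj + 1) * s] = block
    return mask

r, h = radial_mask(16, 32), harmonic_mask(16, 32)
print(f"radial density {r.float().mean():.3f}, harmonic density {h.float().mean():.3f}")
```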
6.2.4.2. Ablation on Initial Dense-Attention Steps
The following are the results from Table B of the original paper:
| Model | #Warmup Steps | PSNR (↑) | SSIM(↑) | LPIPS (↓) |
| Wan2.1-14B (69 frames) | 0 | 12.8 | 0.486 | 0.522 |
| 4 | 18.5 | 0.693 | 0.267 | |
| 8 | 21.7 | 0.778 | 0.183 | |
| 11 | 23.2 | 0.813 | 0.151 | |
| 12 (Ours) | 23.6 | 0.823 | 0.146 | |
| 13 | 23.5 | 0.819 | 0.150 |
For default-length video generation, the authors follow SVG [8] by applying full attention during a warm-up phase at the beginning of the diffusion process. Table B ablates the number of these warm-up steps for the Wan2.1-14B model. The overall computation is matched across all settings by adjusting the sparsity of the Radial Attention mask. The results clearly show that a 12-step warm-up phase achieves the best performance across all quality metrics (PSNR, SSIM, LPIPS), with a slight drop at 13 steps. This justifies the chosen warm-up strategy for default-length generation.
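A small sketch of the warm-up schedule described above: the first few denoising steps use full attention, after which the sparse mask is applied. The function and constant names are hypothetical; only the 12-step choice comes from Table B (Table C suggests 2 steps for fine-tuned longer generation).

```python
import torch
import torch.nn.functional as F

DENSE_WARMUP_STEPS = 12  # best setting in Table B for Wan2.1-14B at the default length

def attention_with_warmup(q, k, v, step: int, sparse_mask: torch.Tensor):
    """Full attention during the warm-up steps, masked attention afterwards."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    if step >= DENSE_WARMUP_STEPS:  # switch to the sparse pattern after the warm-up
        scores = scores.masked_fill(~sparse_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# e.g. inside a 50-step sampling loop:
# for t in range(50):
#     out = attention_with_warmup(q, k, v, step=t, sparse_mask=radial_mask(16, 32))
```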
The following are the results from Table C of the original paper:
| Model | #Warmup Steps | Vision Reward (↑) |
| HunyuanVideo (117 frames) | 0 | 0.154 |
| 1 | 0.160 | |
| 2 (Ours) | 0.163 | |
| 3 | 0.157 |
For longer video generation, a similar ablation on warm-up steps during inference is conducted for HunyuanVideo. Table C shows that using 2 warm-up steps yields the highest Vision Reward (0.163), with fewer or more steps leading to slightly lower scores. This suggests that a shorter warm-up is sufficient for fine-tuned longer video generation, likely because the fine-tuning process adapts the sparse attention more effectively.
6.2.4.3. Ablation on Initial Dense-Attention Layers
The following are the results from Table D of the original paper:
| Model | #Dense Layers | Vision Reward (↑) |
| HunyuanVideo (117 frames) | 0 | 0.139 |
| 1 | 0.156 | |
| 2 (Ours) | 0.163 | |
| 3 | 0.157 |
During training for long video generation, the authors keep a certain number of initial DiT blocks with dense attention to better capture global information. Table D ablates this choice for HunyuanVideo, comparing 0, 1, 2, or 3 initial dense layers while keeping the overall computation budget consistent. The results indicate that using 2 full-attention layers delivers the best Vision Reward (0.163). Using fewer (0 or 1) layers might miss critical global interactions, while using more (3) might introduce unnecessary computation without further quality benefits, or even slightly harm quality. This validates the design choice of keeping the first two DiT blocks dense.
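The corresponding layer-wise choice can be sketched as assigning a full mask to the first blocks and the radial mask to the rest; the helper below is a hypothetical illustration of that assignment, with the 2-layer choice taken from Table D.

```python
import torch

NUM_DENSE_LAYERS = 2  # Table D: two initial full-attention DiT blocks work best for HunyuanVideo

def build_layer_masks(num_layers: int, sparse_mask: torch.Tensor):
    """First blocks get an all-True (dense) mask; the rest share the sparse radial mask."""
    dense = torch.ones_like(sparse_mask)
    return [dense if i < NUM_DENSE_LAYERS else sparse_mask for i in range(num_layers)]

# e.g. masks = build_layer_masks(num_layers=40, sparse_mask=radial_mask(16, 32))
```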
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Radial Attention, an innovative sparse attention mechanism designed to overcome the prohibitive computational costs of generating long videos with diffusion models. The core of its contribution lies in identifying and leveraging the Spatiotemporal Energy Decay phenomenon, where attention scores diminish with increasing spatial and temporal distance between tokens. Radial Attention translates this decay into $O(n \log n)$ computational complexity by employing a static, principled attention mask. This mask intelligently allocates compute density, focusing on dense interactions for nearby tokens and sparsely sampling interactions for distant ones, allowing the attention window to shrink with temporal distance.

The paper demonstrates that Radial Attention effectively maintains high video quality across various leading diffusion models (Wan2.1-14B, HunyuanVideo, Mochi 1) at their default generation lengths, while achieving significant inference speedups (up to 1.9×). More importantly, it provides a scalable solution for long video generation: with minimal LoRA-based fine-tuning, it enables the generation of videos up to 4× longer, drastically reducing training costs (up to 4.4×) and accelerating inference (up to 3.7×) compared to dense attention baselines, all without sacrificing visual fidelity. Its compatibility with existing style LoRAs further enhances its practical utility.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Simplifying Assumption of Exponential Decay: The theoretical analysis and mask design rely on the assumption that attention scores exhibit a clear exponential decay with spatiotemporal distance. While this is empirically validated with high $R^2$ values, it may be a simplification of the complex, nuanced spatiotemporal dependencies in natural video data. Future work could explore more sophisticated models of these underlying structures to further improve efficiency and performance.
- Quadratic Complexity with Respect to Resolution: As shown in Equation 6, for a fixed number of frames the method's cost still scales quadratically with the number of tokens per frame (roughly $O(s^2 f \log f)$ for $f$ frames of $s$ tokens each), since spatially nearby tokens within the same frame are still attended densely. So while Radial Attention significantly addresses scaling along the temporal dimension, it still faces limitations for generating very high-resolution videos (i.e., videos with a very large per-frame token count). Future work should explore more efficient attention mechanisms and pre-training strategies, such as Native Sparse Attention (NSA) [75] and MoBA [76], to better support both long and high-resolution videos simultaneously. A rough cost-counting sketch of this scaling follows below.
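To illustrate the resolution scaling, the sketch below counts attended (query, key) pairs under the simplified radial mask assumed earlier (dense within a frame, window halving as temporal distance doubles). Doubling the per-frame token count roughly quadruples the count for a fixed number of frames, which is the quadratic-in-resolution behavior noted above.

```python
def radial_attention_cost(num_frames: int, tokens_per_frame: int) -> int:
    """Count attended (query, key) pairs under the simplified radial mask:
    dense within a frame, window halving as temporal distance doubles."""
    s, total = tokens_per_frame, 0
    for fi in range(num_frames):
        for fj in range(num_frames):
            dist = abs(fi - fj)
            window = s if dist == 0 else max(1, s >> dist.bit_length())
            total += s * window  # each of the s queries sees ~window keys in frame fj
    return total

for s in (16, 32, 64):  # doubling tokens per frame ...
    print(s, radial_attention_cost(num_frames=32, tokens_per_frame=s))
# ... roughly quadruples the pair count (quadratic in s), while doubling the number
# of frames grows it only slightly faster than linearly (the f log f factor).
```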
7.3. Personal Insights & Critique
Radial Attention presents a compelling and elegant solution to a critical bottleneck in video generation. The core insight of Spatiotemporal Energy Decay is particularly intuitive and well-supported by empirical evidence, providing a principled foundation for sparsity rather than relying on heuristic masking. Its static mask design is a significant practical advantage, making it more robust and easier to train than dynamic alternatives like SVG. The seamless integration with LoRA for length extrapolation is a highly valuable feature, democratizing the ability to extend state-of-the-art models to longer sequences without prohibitive retraining costs.
Transferability and Applications: The underlying principle of attention decay with distance (both temporal and spatial) is likely applicable beyond video generation. In general sequence modeling tasks, especially those with inherent multi-dimensional structure (e.g., medical imaging, 3D point clouds, time-series data with latent spatial dependencies), similar energy decay phenomena might exist. Radial Attention's structured sparsity pattern could be adapted to these domains to achieve similar efficiency gains. Its $O(n \log n)$ complexity positions it as a strong candidate for any sequence modeling task where quadratic dense attention is too expensive but linear attention lacks expressive power.
Potential Issues and Areas for Improvement:
- Fixed Decay Parameters: While the exponential decay model fits well, the temporal and spatial decay rates are implicitly fixed by the mask design. Could learning these decay parameters adaptively for different attention heads or layers further optimize the sparsity pattern and improve quality? Different types of content might exhibit different decay characteristics.
- Hardware-specific Optimizations: The paper mentions that STA achieved a higher speedup with FlashAttention-3. While Radial Attention is orthogonal to the FlashAttention version used, its full potential speedup may only be realized with the latest hardware-specific optimizations. Future work could involve developing FlashAttention-3 or custom kernel implementations specifically tailored to Radial Attention's block-sparse pattern.
- Generalization to Diverse Motion: The current design assumes a relatively uniform spatiotemporal energy decay. However, videos can contain highly localized, rapid movements (e.g., an object appearing quickly) that might warrant different attention patterns. Does Radial Attention adequately handle highly dynamic or unpredictable motion without losing critical information, or does the sparse sampling sometimes miss crucial but transient interactions?
- Beyond First-Order Decay: While exponential decay is a good fit, higher-order or more complex decay functions might capture nuances that $\exp(-ax + b)$ misses, potentially offering marginal quality improvements at the cost of increased mask complexity.

Broader Impact: The environmental benefits of reducing computational costs for training and inference are significant, contributing to "greener AI." This also makes advanced video generation more accessible, potentially fostering creativity in education and the arts. However, the authors rightly highlight the dual-use nature of generative AI: the increased efficiency could facilitate the creation and spread of deepfakes and misinformation. Continued research into robust detection methods and responsible deployment guidelines will be crucial as such powerful tools become more accessible.