Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
TL;DR Summary
This paper introduces Radial Attention, a sparse attention mechanism with $O(n \log n)$ complexity, addressing high computational costs in long video generation. It leverages spatiotemporal energy decay to enhance efficiency and maintain performance during extended generation.
Abstract
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference. Code is released at \href{https://github.com/mit-han-lab/radial-attention}{https://github.com/mit-han-lab/radial-attention}.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
1.2. Authors
Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han.
Their affiliations include MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence. This diverse group of affiliations suggests a collaborative effort involving prominent institutions in AI research, particularly in areas like efficient deep learning, computer vision, and large model development.
1.3. Journal/Conference
The paper is a preprint published on arXiv (arXiv:2506.19852). V2 is indicated as the NeurIPS 2025 camera-ready version. NeurIPS (Conference on Neural Information Processing Systems) is one of the most prestigious and highly selective conferences in machine learning and artificial intelligence, widely recognized for presenting cutting-edge research. Its acceptance signifies high quality and significant contribution to the field.
1.4. Publication Year
2025 (first version posted to arXiv on 2025-06-24)
1.5. Abstract
The paper addresses the challenge of high computational costs in training and inference for long video generation using diffusion models, which is primarily due to the quadratic scaling of standard attention mechanisms with the number of tokens introduced by the additional temporal dimension. The authors identify a phenomenon called Spatiotemporal Energy Decay in video diffusion models, where post-softmax attention scores decrease as the spatial and temporal distance between tokens increases. Inspired by this, they propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity. This mechanism translates the observed energy decay into exponentially decaying compute density by employing a simple, static attention mask. This mask allows each token to attend to spatially nearby tokens, with the attention window size shrinking as temporal distance grows. A key feature is its ability to extend the generation length of pre-trained video diffusion models through efficient LoRA-based fine-tuning. Extensive experiments demonstrate that Radial Attention maintains video quality across various models (Wan2.1-14B, HunyuanVideo, Mochi 1), achieving up to a 1.9× speedup over original dense attention. For generating videos up to 4× longer, it reduces training costs by up to 4.4× and accelerates inference by up to 3.7× compared to direct fine-tuning and dense attention inference, respectively, with minimal tuning.
1.6. Original Source Link
https://arxiv.org/abs/2506.19852 PDF Link: https://arxiv.org/pdf/2506.19852v2.pdf Publication Status: Preprint (V2 is the NeurIPS 2025 camera-ready version).
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the prohibitively high computational cost associated with generating long videos using advanced diffusion models. While diffusion models have achieved remarkable success in high-quality video generation, the inherent nature of video, which introduces an additional temporal dimension, significantly increases the total number of tokens that need to be processed. This expansion in token count clashes directly with the quadratic scaling ($O(n^2)$) of the standard self-attention mechanism with respect to the sequence length $n$. This quadratic complexity makes both training and inference on long videos exceptionally expensive, thereby restricting the scalability and practical deployment of these powerful models.
Prior research has attempted to mitigate this challenge through various approaches:
- Sparse attention mechanisms: Methods like Sparse VideoGen (SVG) [8] dynamically classify attention heads as spatial or temporal and apply corresponding masks. However, these often struggle during training, especially for unseen longer video distributions, due to potential misclassification and error reinforcement.
- Linear attention alternatives: Approaches that replace softmax attention with linear attention [9, 10] offer faster computation ($O(n)$) but typically require substantial architectural changes and significant retraining, often failing to recover the original video quality with modest fine-tuning. They also tend to struggle with capturing local details [60].

The paper's entry point and innovative idea stem from an observation the authors term Spatiotemporal Energy Decay. They note that, analogous to physical signals or waves losing energy over distance and time, the post-softmax attention scores in video diffusion models diminish as the spatial and temporal distance between tokens increases. This phenomenon suggests that not all token interactions are equally important, and computational resources could be more efficiently allocated to interactions that contribute most significantly to the final output. This observation provides a principled motivation for designing a more efficient, yet effective, sparse attention mechanism.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of Spatiotemporal Energy Decay: The authors rigorously identify and characterize the Spatiotemporal Energy Decay phenomenon in video diffusion models, where attention scores decay exponentially with increasing spatial and temporal distance. This empirical observation forms the theoretical foundation for Radial Attention.
- Proposal of Radial Attention: They propose Radial Attention, a novel sparse attention mechanism with $O(n \log n)$ computational complexity. This mechanism is designed to directly translate the principle of energy decay into an exponentially decaying compute density. It utilizes a simple, static attention mask where each token primarily attends to spatially nearby tokens, and the attention window size progressively shrinks with increasing temporal distance. This static and structured design offers significant efficiency improvements over dense attention while being more expressive than linear attention.
- Efficient Length Extension via LoRA Fine-tuning: Radial Attention enables pre-trained video diffusion models to extend their generation length effectively. Because it prunes unimportant token relations without altering the core softmax attention mechanism, it allows for efficient adaptation using lightweight LoRA-based fine-tuning, which focuses on refining critical parameters for temporal coherence and visual fidelity. This approach is also compatible with existing style LoRAs.
- Substantial Performance Gains:
  - Default Video Length: Radial Attention maintains high video quality across leading models (Wan2.1-14B, HunyuanVideo, Mochi 1) and achieves up to a 1.9× speedup over the original dense attention.
  - Longer Video Generation: For videos up to 4× longer, it significantly reduces training costs by up to 4.4× (compared to direct fine-tuning with dense attention) and accelerates inference by up to 3.7× (compared to dense attention inference), all while preserving video quality.
- Code Release: The authors have released their code at https://github.com/mit-han-lab/radial-attention, fostering reproducibility and further research.
These findings collectively solve the problem of high computational costs in long video generation by providing an efficient, scalable, and quality-preserving attention mechanism that can be seamlessly integrated and fine-tuned with existing diffusion models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Radial Attention and its implications, a foundational understanding of Diffusion Models, Self-Attention, and LoRA is essential.
3.1.1. Diffusion Models
Diffusion Models are a class of generative models that have achieved state-of-the-art results in synthesizing high-quality data, particularly images and videos. The core idea is inspired by physics:
- Forward Diffusion Process: This process gradually adds Gaussian noise to an input data point (e.g., a clear image or video frame) over several discrete timesteps, transforming it into pure noise. If we start with a data point $x_0$, at each step $t$, a small amount of Gaussian noise is added, producing $x_t$. Eventually, at a final timestep $T$, $x_T$ becomes indistinguishable from pure noise.
- Reverse Denoising Process: This is the generative part. A neural network (often a U-Net or a Diffusion Transformer (DiT)) is trained to learn how to reverse the forward process, i.e., to predict and remove the noise added at each step. Starting from pure Gaussian noise $x_T$, the model iteratively denoises it over many steps, gradually transforming the noise back into a coherent, high-quality data point $x_0$.
- Latent Space: Many modern diffusion models, like Latent Diffusion Models (LDMs) [3], operate in a compressed latent space. Instead of denoising raw pixels, they first encode the high-resolution input (image or video) into a lower-dimensional latent representation. The diffusion process then occurs in this latent space, which is computationally more efficient. A pre-trained decoder is used to convert the final denoised latent representation back into the high-resolution output.

For videos, an additional temporal dimension is introduced, meaning the latent representation is a sequence of latent frames, significantly increasing the total number of tokens.
3.1.2. Self-Attention and Transformers
Self-Attention is a key mechanism that revolutionized neural networks, particularly with the introduction of the Transformer architecture [18].
- Purpose: Self-attention allows a model to weigh the importance of different parts of the input sequence when processing each element. For a given token (or part of the input), it determines how much attention to pay to all other tokens in the sequence. This is crucial for capturing long-range dependencies, which are vital for understanding context in language and coherence in images/videos.
- Mechanism: It computes three vectors for each token:
  - Query (Q): Represents the current token's request for information from other tokens.
  - Key (K): Represents what information other tokens can offer.
  - Value (V): Contains the actual information from other tokens that will be aggregated.

The core Attention calculation is given by:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
$
Where:
- $Q$, $K$, $V$ are the query, key, and value matrices, respectively.
- $n$ is the sequence length (number of tokens).
- $d$ is the embedding dimension of each token.
- $QK^T$ computes the similarity (dot product) between each query and all keys.
- $\sqrt{d}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with very small gradients.
- $\mathrm{softmax}$ converts the similarity scores into probability-like weights, indicating how much attention each query token pays to each key token.
- The resulting attention weights are then multiplied by the matrix $V$ to produce the final output, which is a weighted sum of the value vectors.

The computation of $QK^T$ involves multiplying an $n \times d$ matrix with a $d \times n$ matrix, resulting in an $n \times n$ matrix. This operation has quadratic time and memory complexity of $O(n^2)$ with respect to the sequence length $n$. For long videos, where $n$ can be very large (e.g., 110K tokens for a 5-second video), this quadratic scaling becomes a major computational bottleneck.
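To make the shapes and the $n \times n$ bottleneck concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (no batching or multi-head logic; `attention` and `softmax` are illustrative helpers, not code from the paper):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Dense scaled dot-product attention for (n, d) matrices.

    The (n, n) score matrix QK^T is exactly what makes the cost quadratic in n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)   # row-wise attention weights
    return weights @ V                   # (n, d) weighted sum of values
```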
Diffusion Transformers (DiT) [18] replace the traditional U-Net backbone in diffusion models with Transformer blocks. They are particularly effective for video generation because 3D dense attention can jointly model spatial (within frame) and temporal (across frames) dynamics, capturing complex spatiotemporal correlations.
3.1.3. Sparse Attention
Sparse Attention is a technique designed to mitigate the $O(n^2)$ complexity of dense attention by restricting the interactions between tokens. Instead of every query token attending to all key tokens, sparse attention uses a predefined or learned mask to allow attention only between a subset of token pairs.
Formally, this is achieved by adding a sparsity mask $M$ to the attention logits:
$
\mathrm{SparseAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T + M}{\sqrt{d}}\right)V
$
Where:
- $M$ is the sparsity mask.
- Entries $M_{ij} = 0$ allow attention between query token $i$ and key token $j$.
- Entries $M_{ij} = -\infty$ (or a very large negative number) effectively block attention between query token $i$ and key token $j$, as $e^{-\infty} = 0$ after softmax.

By reducing the number of attended (unmasked) entries in the attention matrix, sparse attention can significantly lower the computational cost, often achieving sub-quadratic complexities such as $O(n \log n)$, depending on the mask pattern. However, the challenge lies in designing a mask that effectively reduces computation without sacrificing the model's ability to capture important dependencies and maintain high output quality.
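Continuing the sketch above, masking only requires adding $-\infty$ to the blocked logits before the softmax. The boolean `allowed` array below stands in for the mask $M$ (True for 0, False for $-\infty$); a real kernel would skip blocked blocks entirely rather than compute and discard them:

```python
import numpy as np

def sparse_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     allowed: np.ndarray) -> np.ndarray:
    """Masked softmax attention: allowed[i, j] is True where M_ij = 0.

    Assumes each query has at least one allowed key. Blocked logits become
    -inf, so their post-softmax weights are exactly zero.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = np.where(allowed, logits, -np.inf)   # apply the sparsity mask M
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```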
3.1.4. LoRA (Low-Rank Adaptation)
Low-Rank Adaptation (LoRA) [11] is an efficient fine-tuning technique for large pre-trained models, particularly Large Language Models (LLMs) and Diffusion Models.
- Purpose: Fine-tuning large models on new tasks or datasets often requires updating millions or billions of parameters, which is computationally expensive and memory-intensive. LoRA aims to achieve comparable performance to full fine-tuning while significantly reducing the number of trainable parameters and computational overhead.
- Mechanism: LoRA injects small, trainable low-rank matrices into the Transformer layers of a pre-trained model. Specifically, for an original weight matrix $W \in \mathbb{R}^{d \times k}$ (e.g., a Query, Key, Value, or Output projection matrix in an attention layer), LoRA represents the update $\Delta W$ with a low-rank decomposition $BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ is the LoRA rank. The updated weight matrix becomes $W' = W + BA$. During fine-tuning, $W$ is frozen, and only $A$ and $B$ are trained. This dramatically reduces the number of trainable parameters from $dk$ to $r(d + k)$.
- Benefits:
  - Reduced memory footprint for training.
  - Faster training times.
  - Ability to store multiple task-specific LoRA adapters for a single base model.
  - Can often achieve competitive performance with full fine-tuning.
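A toy NumPy illustration of the $W' = W + BA$ decomposition described above; the sizes, initialization, and `lora_forward` helper are made up for the example and do not reflect any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1024, 1024, 16               # illustrative sizes; r is the LoRA rank

W = rng.normal(size=(d, k))            # frozen pre-trained projection weight
A = rng.normal(size=(r, k)) * 0.01     # LoRA down-projection (trainable)
B = np.zeros((d, r))                   # LoRA up-projection (trainable); zero init
                                       # so the adapted model starts identical to W

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Apply W' = W + B @ A without materializing the full d x k update."""
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(4, k))            # a batch of 4 token embeddings
print(lora_forward(x).shape)           # (4, d)
print("trainable params:", A.size + B.size, "vs. full fine-tuning:", W.size)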
3.1.5. Spatiotemporal Energy Decay
This is a core phenomenon identified by the authors. It describes the empirical observation that in video diffusion models, the post-softmax attention scores between tokens tend to diminish as the spatial distance (distance within a frame) or temporal distance (distance between frames) between them increases. This suggests that tokens closer in space and time have stronger, more relevant interactions, while distant tokens have weaker, less critical interactions. The decay pattern is empirically found to follow an exponential distribution. This insight is crucial because it provides a principled basis for designing sparse attention mechanisms that can prune less important connections without significantly impacting quality, thereby saving computation.
3.2. Previous Works
The paper contextualizes Radial Attention within several lines of prior research:
3.2.1. Video Diffusion Models
Early approaches [13, 14, 15, 16] for video generation typically adapted 2D U-Nets (originally for image synthesis) by adding temporal modules to handle frame sequences. However, inspired by Sora [4] and DiT [18], the community has largely shifted towards DiT as the backbone. Latte [19] proposed decoupled spatial and temporal attention. More recently, state-of-the-art models [20, 21, 5, 22, 1, 7, 6] have adopted 3D dense attention to jointly model spatial-temporal dynamics and capture long-range dependencies. While powerful, this 3D dense attention introduces the quadratic computational cost that Radial Attention seeks to alleviate.
3.2.2. Efficient Video Generation
Many general techniques for accelerating image diffusion models (e.g., timestep distillation [23, 24], caching [25, 26], quantization [27, 28, 29], distributed inference [30, 31, 32]) are applicable to video. However, for video models relying on 3D dense attention, the bottleneck often shifts to the attention layers.
- Quantized Attention: Works like SageAttention [33, 34, 35, 36] and FlashAttention-3 [37] focus on quantizing attention computations for speedup.
- Sparse Attention for LLMs: In Large Language Models (LLMs), sparse attention is widely used [38, 39, 40, 41, 42, 43, 44, 45, 46]. LongLoRA [39] combines local sparse attention patterns with shifting to obtain a global receptive field for long-context modeling. PowerAttention [45] restricts attention to power-of-two token distances, achieving $O(n \log n)$ complexity. However, these methods are generally not video-specific and ignore the inherent spatiotemporal structure of video data.
- Video-Specific Sparse Attention: STA [47] (Sliding Tile Attention) uses sliding 3D windows for local attention, improving efficiency but limiting long-range dependencies and struggling with long videos due to its fixed receptive field. SVG [8] (Sparse VideoGen) dynamically classifies each attention head as either spatial or temporal and applies corresponding sparse masks. While efficient for inference, its dynamic profiling can misclassify heads on unseen data, making training on longer videos challenging.
3.2.3. Long Video Generation
The quadratic cost of dense attention makes training and inference on long videos very expensive.
- RIFLEx [50] extends video length by modifying RoPE (Rotary Position Embedding) [51] frequencies, allowing extrapolation but often suffering from quality degradation (e.g., blurring) for longer videos.
- Other approaches generate short video segments and stitch them together [52], or use autoregressive strategies like FramePack [53].
- Linear attention models [10, 9, 54, 55, 56, 57, 58, 59] offer $O(n)$ computation and global receptive fields but often struggle to capture fine local details [60], leading to quality degradation. SANA [10] is an example of a diffusion model backbone with linear attention.
3.2.4. Attention with $O(n \log n)$ complexity
Early efforts to achieve sub-quadratic attention complexity include:
- Reformer [61]: Approximates dense attention using locality-sensitive hashing.
- H-Transformer [62]: Imposes a hierarchical structure on the attention matrix.
- Multi-resolution attention [63]: Recursively refines high-attention regions.
- Fast Multipole Attention [64]: Adapts the classical fast multipole method.
- LogSparse Transformer [65]: Restricts each token to attend to $O(\log n)$ positions per layer for time-series forecasting.

Many of these methods are hardware-unfriendly or have limited scalability.
3.3. Technological Evolution
The field of generative AI has rapidly progressed from generating static images to dynamic, high-quality videos. This evolution has been driven by improvements in diffusion models and the adoption of Transformer architectures (specifically DiT for diffusion). Initially, video models adapted 2D image models, but soon recognized the need for true 3D spatiotemporal modeling, leading to the widespread use of 3D dense attention. However, this power came at a steep computational cost, particularly for longer videos.
This computational bottleneck then spurred research into efficient attention mechanisms. Early solutions for LLMs focused on generic sparse attention patterns, but these often failed to account for the unique spatiotemporal structure of video data. Video-specific sparse methods like STA and SVG emerged, but STA had limited receptive fields, and SVG's dynamic nature posed training challenges. Meanwhile, linear attention offered scalability but often compromised quality.
Radial Attention fits into this timeline as an evolution that specifically leverages the observed Spatiotemporal Energy Decay in video data. It addresses the limitations of prior video-specific sparse methods by using a static, principled mask that is both efficient for training and inference and more expressive than linear attention, while maintaining compatibility with existing model ecosystems via LoRA.
3.4. Differentiation Analysis
Radial Attention differentiates itself from existing methods primarily by:
- Principled Sparsity based on Energy Decay: Unlike generic sparse attention patterns (LongLoRA, PowerAttention) which don't consider video's inherent structure, or STA with its fixed local windows, Radial Attention explicitly models and leverages Spatiotemporal Energy Decay. It uses this observed phenomenon to intelligently allocate compute density, focusing on important proximal interactions while sparsely sampling distant ones.
- Static vs. Dynamic Mask: Radial Attention employs a static mask, which is a significant advantage over SVG's dynamic, profiling-based approach. While SVG dynamically classifies attention heads at runtime, this introduces overhead and makes training on unseen longer video distributions unreliable. Radial Attention's static mask is predictable, hardware-friendly, and enables efficient training and adaptation to longer videos without runtime classification errors.
- Expressiveness vs. Efficiency Trade-off: Radial Attention strikes a balance between the high expressiveness but quadratic cost of dense attention and the linear cost but reduced local detail capture of linear attention (e.g., SANA). By retaining softmax attention and selectively pruning, it achieves $O(n \log n)$ complexity while preserving the fidelity of local interactions critical for video quality.
- Unification of Spatial and Temporal Attention: It unifies the decoupled spatial and temporal attention often seen in other models (and implicitly in SVG) into a single, cohesive sparsity pattern. The mask naturally captures dense spatial interactions within central bands and adapts temporal sparsity based on distance.
- Efficient Length Extrapolation with LoRA: Radial Attention is designed to be highly compatible with LoRA for fine-tuning pre-trained models to generate longer videos. This allows for significant cost reduction in extending model capabilities, outperforming direct fine-tuning with dense attention or other length extrapolation methods like RIFLEx in quality.

In essence, Radial Attention offers a theoretically grounded, practically efficient, and adaptively extendable solution for long video generation, addressing key limitations of existing sparse and linear attention mechanisms.
4. Methodology
4.1. Principles
The core principle behind Radial Attention is the phenomenon of Spatiotemporal Energy Decay observed in video diffusion models. This principle posits that the relevance of an interaction between two tokens, as measured by their post-softmax attention scores, diminishes exponentially as their spatial or temporal distance increases.
The intuition is akin to how physical signals or waves lose energy as they propagate through space and time. In the context of video, a token representing a pixel at a certain location and time is most strongly influenced by nearby pixels in the same frame (spatial locality) and by pixels at similar locations in adjacent frames (temporal coherence). Interactions with pixels far away in space or time are generally less critical for generating high-quality, coherent video.
Radial Attention translates this energy decay into compute density decay. Instead of uniformly computing all pairwise attention scores (as in dense attention), it intelligently allocates computational resources:
- High Density for Close Tokens: It prioritizes dense attention computation for tokens that are spatially and temporally close, where attention energy is concentrated.
- Sparsely Sampled for Distant Tokens: For tokens that are far apart in time, the compute density is reduced by sparsely sampling interactions, reflecting their diminished attention energy. This reduction is implemented in a structured, exponential-decay manner.

By designing a static attention mask based on this principle, Radial Attention aims to achieve sub-quadratic complexity ($O(n \log n)$) without sacrificing the essential spatiotemporal dependencies needed for high-quality video generation.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Spatiotemporal Energy Decay in Attention
The paper empirically identifies and characterizes Spatiotemporal Energy Decay. As shown in Figure 4(a), spatial attention maps focus on tokens within adjacent frames, while temporal attention maps focus on tokens at the same spatial location across different frames. Figure 4(b) illustrates the distributions of attention scores.
The authors observe that:
- In Figure 4(b1), the average attention score between tokens at the same spatial location decreases significantly as their temporal distance increases.
- In Figure 4(b2), the average attention score between tokens within the same frame decreases as their spatial distance increases.

These observations confirm that attention scores exhibit a clear decay pattern with increasing distance between query and key tokens. Regression analysis (Figure 8(b)) further suggests that this decay closely follows an exponential distribution, with an $R^2$ value over 0.985, empirically validating the choice of an exponential decay model.
To formally capture this, assume a video latent consists of $f$ frames, with $s$ tokens per frame, totaling $n = fs$ tokens. Consider a query token at the $k_0$-th spatial position of the $i_0$-th frame. The post-softmax attention score for this query, denoted by $p \in \mathbb{R}^n$, is given by the softmax of the query's scaled dot products with all keys.
The paper then states that there exist constants $C_{\mathrm{rel}} > 0$, $\alpha > 0$, and $\beta > 0$ such that for any key token at spatial position $l$ in frame $j$, the attention score satisfies:
$
p_{js+l} \leq C_{\mathrm{rel}} \, e^{-\alpha \left| j - i_0 \right| - \beta \left| l - k_0 \right|} \, p_{i_0 s + k_0}
$
Where:
- $p_{js+l}$: The attention score for the key token at spatial position $l$ in frame $j$.
- $p_{i_0 s + k_0}$: The attention score for the query token itself (which serves as a reference point in Appendix A.2).
- $|j - i_0|$: The temporal distance between the query frame $i_0$ and the key frame $j$.
- $|l - k_0|$: The spatial distance between the query token and key token within their respective frames.
- $\alpha$: The temporal decay rate. A higher $\alpha$ indicates faster decay with temporal distance.
- $\beta$: The spatial decay rate. A higher $\beta$ indicates faster decay with spatial distance.
- $C_{\mathrm{rel}}$: A positive constant that relates the attention score of a distant token to the self-attention score.

This equation formalizes the Spatiotemporal Energy Decay. When $\beta$ is high and $\alpha$ is low, it models temporal attention (strong spatial locality, slow temporal decay). Conversely, a high $\alpha$ and low $\beta$ capture spatial attention (strong temporal locality, slow spatial decay). This unified exponential decay model motivates the design of a single sparsity pattern for Radial Attention.
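To make the bound concrete, here is a small illustrative calculation; the decay rates below are assumed for the example and are not values reported in the paper:
$
\alpha = 0.5,\ \beta = 1.0,\ |j - i_0| = 2,\ |l - k_0| = 3 \;\Rightarrow\; p_{js+l} \leq C_{\mathrm{rel}}\, e^{-0.5 \cdot 2 - 1.0 \cdot 3}\, p_{i_0 s + k_0} \approx 0.018\, C_{\mathrm{rel}}\, p_{i_0 s + k_0}
$
In other words, under these assumed rates a key token two frames and three spatial positions away can carry only about 2% of the reference attention score (up to the constant $C_{\mathrm{rel}}$), which is what justifies pruning most distant interactions.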
4.2.2. Radial Attention: Convert the Energy Decay to Compute Density Decay
Radial Attention employs a static mask to dynamically adjust compute density based on spatiotemporal distance, mimicking the observed energy decay.
4.2.2.1. Temporal Density Decay
Along the temporal dimension, Radial Attention applies an exponential decay rule. The compute density between tokens in frame $i$ and frame $j$ is proportional to $2^{-\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}$.
This creates a structured pattern of diagonal bands on the attention map (as illustrated in Figure 5(a)):
- Bands: The attention map is divided into diagonal bands indexed by temporal distance.
- Central Band (Band 0): This band, centered on the main diagonal (where $|i - j| \leq 1$), retains full compute density. This means all tokens within the same frame attend to each other densely, and tokens in adjacent frames also attend densely.
- Successive Bands: Each successive band moving outward from the center (e.g., band 1, band 2, etc.) has half the compute density of the preceding one. The attention from tokens in frame $i$ to frame $j$ falls into band $\lfloor \log_2 \max(|i - j|, 1) \rfloor$.
- Diagonal Width Doubling: Except for band 1, each band also doubles the temporal width of its predecessor, distributing the reduced density across more interactions in farther frames.

This radial decay effect is visualized by progressively lighter colors in Figure 5(a), indicating reduced computation for larger temporal distances.
4.2.2.2. Spatial Density Decay
Within each frame-to-frame attention block (i.e., for interactions between a specific frame $i$ and frame $j$), Radial Attention preserves high-energy interactions concentrated at similar spatial locations. This results in diagonal-like structures within these blocks (see Figure 5(b)).
- Shrinking Diagonal Width: Due to the temporal decay discussed above, the width of these spatial diagonals shrinks as the temporal distance between frames increases. Specifically, for attention between frame $i$ and frame $j$, the diagonal width is given by $s / 2^{\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}$.
- Reduced Diagonal Frequency: When the calculated diagonal width drops below 1 (i.e., for very large temporal distances, $2^{\lfloor \log_2 \max(|i - j|, 1) \rfloor} > s$), the mechanism further reduces compute density by reducing the frequency of diagonals. Instead of keeping a diagonal in every block, it only retains the diagonal in those blocks where $|i - j| \bmod \lceil 2^{\lfloor \log_2 \max(|i - j|, 1) \rfloor} / s \rceil = 0$. This ensures the same amortized attention density decay is maintained. (A worked example of the width rule is given below.)
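As a quick illustration of the width rule, using an assumed per-frame token count of $s = 3600$ (an example value, not one taken from the paper):
$
w(|i - j|) = \frac{s}{2^{\lfloor \log_2 \max(|i - j|,\, 1) \rfloor}}: \quad w(1) = 3600,\quad w(2) = w(3) = 1800,\quad w(8) = 450,\quad w(64) \approx 56
$
So adjacent frames attend with a full-width diagonal, while frames 64 apart keep only a narrow band of roughly 56 spatial neighbors.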
4.2.2.3. Formal Definition
The Radial Attention mask is formally defined as a 4D mask $M \in \{0, -\infty\}^{f \times s \times f \times s}$. An element $M_{i,k,j,l} = 0$ means the token at spatial position $k$ in frame $i$ is allowed to attend to the token at spatial position $l$ in frame $j$. Conversely, $M_{i,k,j,l} = -\infty$ means attention between this pair is suppressed.
The mask is constructed based on two main conditions:
- Condition for dense spatial interactions with shrinking window:
$
M_{i,k,j,l} = 0, \quad \mathrm{if}\ 2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor} \leq s \ \mathrm{and}\ |k - l| + 1 \leq \frac{s}{2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor}}
$
  - This condition applies when the temporal distance is relatively small, such that the resulting window size is at least 1 (i.e., $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} \leq s$).
  - It allows attention for tokens whose spatial distance plus one is within the calculated shrinking diagonal width. This captures the dense spatial interactions that decrease in width as temporal distance increases.
- Condition for sparse sampling of same-position tokens at large temporal distances:
$
M_{i,k,j,l} = 0, \quad \mathrm{if}\ |i - j| \bmod \left\lceil \frac{2^{\lfloor \log_2 \max(|i-j|,\, 1) \rfloor}}{s} \right\rceil = 0 \ \mathrm{and}\ k = l
$
  - This condition comes into play when the temporal distance is large and the calculated diagonal width from the first condition would drop below 1 (i.e., $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} > s$).
  - Instead of a dense diagonal, it only permits attention for tokens at the exact same spatial position ($k = l$), and only if the temporal distance satisfies the modulus condition above. This samples very sparsely along the temporal dimension for tokens at the same spatial location, and the modulus term ensures an amortized density decay.
- Otherwise, attention is suppressed:
$
M_{i,k,j,l} = -\infty, \quad \mathrm{otherwise.}
$

The final attention mask $\tilde{M} \in \{0, -\infty\}^{n \times n}$ used in the attention operation (Equation 2 in the paper) is obtained by flattening the frame and spatial indices: $\tilde{M}_{is+k,\, js+l} = M_{i,k,j,l}$. For better quality, an attention sink [38, 8] is incorporated, typically by keeping dense attention to the first frame, which acts as a sink. Figure 5(c) provides a visual example of the resulting sparse mask for a 253-frame, 720p video generated by HunyuanVideo. This design preserves high-priority local interactions while sparsely sampling distant ones.
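The two conditions translate almost line-by-line into code. Below is a minimal NumPy sketch that materializes a boolean version of the mask for $f$ frames with $s$ tokens per frame (True standing for 0, False for $-\infty$); the function name is ours, the attention sink is omitted, and the released implementation instead builds hardware-friendly block-sparse masks rather than this dense boolean matrix:

```python
import math
import numpy as np

def radial_mask(f: int, s: int) -> np.ndarray:
    """Boolean (f*s, f*s) mask: True = attend (logit 0), False = suppress (-inf).

    Token (frame i, spatial position k) is flattened to index i*s + k,
    matching the flattening used for the mask M-tilde above.
    """
    n = f * s
    allowed = np.zeros((n, n), dtype=bool)
    for i in range(f):              # query frame
        for j in range(f):          # key frame
            band = int(math.floor(math.log2(max(abs(i - j), 1))))
            if 2 ** band <= s:
                # Condition 1: dense diagonal whose width shrinks with temporal distance.
                width = s // (2 ** band)          # |k - l| + 1 <= width
                for k in range(s):
                    lo = max(0, k - width + 1)
                    hi = min(s, k + width)
                    allowed[i * s + k, j * s + lo:j * s + hi] = True
            else:
                # Condition 2: keep only same-position tokens, and only for a
                # sparsely sampled subset of frame pairs.
                stride = math.ceil(2 ** band / s)
                if abs(i - j) % stride == 0:
                    idx = np.arange(s)
                    allowed[i * s + idx, j * s + idx] = True
    return allowed
```

For small inputs the expected behavior is easy to check: `radial_mask(2, 4)` is all True (both frames fall in the central band), while distant frame pairs contribute only sparse same-position diagonals.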
4.2.2.4. Relation to SVG
Radial Attention unifies the separate spatial and temporal attention mechanisms found in SVG [8] into a single, comprehensive mask.
- Spatial Attention: The central band (band 0) of the Radial Attention mask inherently captures dense spatial interactions, effectively incorporating SVG's spatial attention.
- Temporal Attention: Unlike SVG, which might allocate uniform (and thus potentially unnecessary) computation to all temporal interactions, Radial Attention explicitly models temporal decay. It reduces attention to distant frames that have low relevance and reallocates this computational budget to tokens nearer in time, leading to both improved efficiency and enhanced modeling capacity compared to SVG.
4.2.2.5. Complexity Analysis
The computational cost of masked attention is directly proportional to the number of zero (i.e., unmasked) entries in the attention mask $\tilde{M}$. The paper derives an upper bound for this number.
The total number of zeros in the mask is the sum of zeros from three regions:
- Central Band and Attention Sink: This includes frame pairs with temporal distance $|i - j| \leq 1$ and the attention sink on the first frame. The number of zeros here is at most a constant multiple of $sn$.
- Bands with Diagonal Width $\geq 1$: These are bands where $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} \leq s$. Each such band contributes zeros on the order of $sn$, and there are about $\log_2 s$ of them.
- Bands with Diagonal Width $< 1$: These are bands where $2^{\lfloor \log_2 \max(|i-j|, 1) \rfloor} > s$, in which only same-position diagonal elements are sparsely sampled. Their total contribution is on the order of $sn(\log_2 n - \log_2 s)$ zeros.

Combining these, the aggregate upper bound for the number of zeros in Radial Attention's mask, for a large number of frames with fixed resolution $s$, simplifies to:
$
\#\,\mathrm{zeros\ in\ } \tilde{M} \leq 4 s n \left( \log_2 n - \log_2 s \right)
$
Where:
- $n$ is the total number of tokens.
- $s$ is the number of spatial tokens per frame (fixed resolution).
- $f = n / s$ is the number of frames.

This bound shows that the computational complexity scales as $O(n \log n)$, which is significantly more efficient than the $O(n^2)$ complexity of dense attention. Empirical results on HunyuanVideo (Figure 2) confirm this, showing a substantial reduction in attention computation for longer videos.
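A quick back-of-the-envelope check of the bound, with illustrative sizes (the values of `s` and `f` below are assumptions for the sake of the example, not settings from the paper):

```python
import math

s, f = 3600, 128               # assumed spatial tokens per frame and frame count
n = s * f                      # total tokens

radial_pairs = 4 * s * n * (math.log2(n) - math.log2(s))  # upper bound from above
dense_pairs = n ** 2                                      # dense attention pairs

print(f"n = {n:,}")
print(f"radial upper bound / dense = {radial_pairs / dense_pairs:.3f}")
```

With $s$ fixed, the ratio shrinks roughly like $\log_2(f)/f$ as the number of frames $f$ grows, which is where the practical savings for longer videos come from.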
4.2.2.6. Error Analysis
The paper also provides an error bound for the approximation introduced by Radial Attention's sparsity. Given the exact softmax attention scores $p$ and the masked attention scores $\tilde{p}$, the attention error is bounded as:
$
\left\| \tilde{p} - p \right\|_1 \le C_{\mathrm{rel}} \left[ \frac{8 e^{-\beta \left( \frac{s}{2} + 1 \right)}}{(1 - e^{-\alpha})(1 - e^{-\beta})} + 4 \, \frac{1 + e^{-\beta}}{1 - e^{-\beta}} \cdot \frac{e^{-\alpha (s + 1)}}{1 - e^{-\alpha}} \right] = O\!\left( C_{\mathrm{rel}} \, e^{-\min(\beta/2,\, \alpha)\, s} \right)
$
Where:
- $\|\tilde{p} - p\|_1$: The $\ell_1$ norm of the difference between the masked and exact attention distributions, representing the total variation error.
- $C_{\mathrm{rel}}$, $\alpha$, $\beta$, $s$: Same as defined for Spatiotemporal Energy Decay.

This bound indicates that the approximation error decreases exponentially with larger spatial decay rate $\beta$, larger temporal decay rate $\alpha$, and increasing spatial resolution $s$. This theoretical result supports the effectiveness of the mask design in Radial Attention, showing that if attention indeed decays exponentially with distance, then the sparse approximation will have a small and rapidly diminishing error. Empirically, Radial Attention achieves smaller Mean Squared Error (MSE) than SVG and STA (Section 5.3), further validating its fidelity.
4.2.3. Hardware-friendly Block Sparsity
To ensure efficient execution on modern GPU hardware, attention computations are typically performed over larger blocks (e.g., $128 \times 128$ token blocks) rather than individual tokens [68, 8, 40, 43, 44, 66]. This block-sparse approach allows for better memory access patterns and parallelism, which are crucial for leveraging high-performance attention kernels like FlashAttention [66, 67]. Radial Attention's mask design is compatible with such block-sparse implementations, enabling practical speedups.
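As a rough illustration of how a token-level mask would be coarsened for such kernels, here is a hypothetical helper (the block size, function name, and dense representation are ours, not the paper's kernel code):

```python
import numpy as np

def to_block_mask(token_mask: np.ndarray, block: int = 128) -> np.ndarray:
    """Coarsen a token-level boolean mask to a block-level mask.

    A block of queries/keys is scheduled for computation if any token pair
    inside it is allowed, which is how a block-sparse kernel would consume
    the radial mask.
    """
    n = token_mask.shape[0]
    nb = (n + block - 1) // block          # number of blocks per side (ceil)
    block_mask = np.zeros((nb, nb), dtype=bool)
    for bi in range(nb):
        for bj in range(nb):
            tile = token_mask[bi * block:(bi + 1) * block,
                              bj * block:(bj + 1) * block]
            block_mask[bi, bj] = tile.any()  # keep block if any pair is allowed
    return block_mask
```

Combined with the earlier `radial_mask` sketch, `to_block_mask(radial_mask(f, s))` would give the per-block schedule that a block-sparse attention kernel iterates over.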
4.2.4. Low-Rank Adaptation for Long Videos
Even with an efficient attention mechanism, extending pre-trained models (originally trained on short videos) to long videos presents a challenge. Direct training on long videos is computationally prohibitive, and training-free extrapolation methods (like RIFLEx [50]) often lead to quality degradation due to length distribution mismatch.
Radial Attention addresses this by:
- Reducing Training Complexity: By achieving $O(n \log n)$ complexity, Radial Attention makes fine-tuning on long videos computationally feasible.
- Preserving Critical Relations: The sparse mask of Radial Attention is designed to retain essential inter-token relations, allowing the original pre-trained weights to remain largely intact. This means only minimal fine-tuning is required.

To further minimize training overhead, the authors incorporate Low-Rank Adapters (LoRA) [11, 39] into the attention mechanism.
- LoRA Application: LoRA is applied to the query, key, value, and output projections of the attention layers. As explained in Section 3.1.4, this involves adding small, trainable low-rank matrices to these weight matrices while freezing the original pre-trained weights.
- Benefits: LoRA significantly reduces the number of trainable parameters, leading to:
  - Reduced memory footprint during training.
  - Faster fine-tuning times.
  - Improved video quality: Empirically, LoRA fine-tuning with Radial Attention not only minimizes overhead but also refines the most critical weights and attention mechanisms more effectively, often leading to better video quality than full-parameter fine-tuning.
- Compatibility: This length-extension LoRA is also compatible with existing style LoRAs, allowing users to combine long video generation with specific artistic styles.

This approach offers a scalable and cost-effective way to adapt existing high-quality video diffusion models to generate much longer sequences.
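A toy continuation of the NumPy LoRA sketch from Section 3.1.4, illustrating why two independently trained adapters (e.g., a length-extension LoRA and a style LoRA) can be merged into the same base weight; sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 1024, 1024, 16

W = rng.normal(size=(d, k))                       # frozen base projection
# Two independently trained adapters: length-extension and style.
A_len, B_len = rng.normal(size=(r, k)) * 0.01, rng.normal(size=(d, r)) * 0.01
A_sty, B_sty = rng.normal(size=(r, k)) * 0.01, rng.normal(size=(d, r)) * 0.01

# Because each adapter is just an additive low-rank update, both can be
# folded into the same weight at load time.
W_merged = W + B_len @ A_len + B_sty @ A_sty
print(W_merged.shape)                             # (1024, 1024)
```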
5. Experimental Setup
5.1. Datasets
For fine-tuning models to generate longer videos, the experiments primarily use videos from OpenVid-1M [74].
- Source and Characteristics: OpenVid-1M is a large-scale, high-quality dataset for text-to-video generation.
- Selection: The authors specifically sample 2,000 top-scoring videos (based on aesthetic and motion scores) for each extended length (e.g., 2× and 4× longer than default). This selective sampling ensures that the fine-tuning process leverages high-quality, diverse video content relevant to the target longer durations.
- Purpose: These datasets are chosen to provide sufficient high-quality examples for the LoRA-based fine-tuning process, allowing the models to learn temporal coherence and visual fidelity for extended video lengths, which they were not originally trained for.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate both the fidelity (similarity to original, perceptual quality) and consistency (for long videos) of the generated videos, as well as computational efficiency.
5.2.1. Vision Reward (↑)
- Conceptual Definition: Vision Reward [69] is a metric designed to approximate human ratings of generated videos. It quantifies how much a generated video aligns with human preferences, focusing on aspects like visual quality, aesthetic appeal, and adherence to the text prompt. A higher Vision Reward indicates better human perception.
- Mathematical Formula: The paper does not provide a specific mathematical formula for Vision Reward but refers to [69]. Vision Reward models are trained on human preference data and output a scalar score, so there isn't a single, universally defined analytical formula like PSNR; it is the output of a learned reward function.
- Symbol Explanation:
  - ↑: Indicates that a higher value of Vision Reward is better.
5.2.2. PSNR (Peak Signal-to-Noise Ratio) (↑)
- Conceptual Definition:
PSNRis a common image and video quality metric used to quantify the numerical similarity between a compressed or processed image/video and its original, uncompressed version. It is often expressed in decibels (dB). A higher PSNR value generally indicates a higher quality reconstruction, implying less distortion from the original. - Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $ Where: $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
- Symbol Explanation:
  - $MAX_I$: The maximum possible pixel value of the image/video. For 8-bit grayscale images, this is 255.
  - MSE: Mean Squared Error between the original image $I$ and the reconstructed image $K$.
  - I(i,j): The pixel value at coordinates (i,j) in the original image.
  - K(i,j): The pixel value at coordinates (i,j) in the reconstructed image.
  - M, N: The dimensions (height and width) of the image.
  - ↑: Indicates that a higher value of PSNR is better.
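A toy NumPy computation of the PSNR formula above on a pair of frames (per-frame only; the paper's evaluation pipeline and frame averaging are not reproduced here):

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally shaped frames."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames: no distortion
    return 10.0 * np.log10(max_val ** 2 / mse)

frame_a = np.random.default_rng(0).integers(0, 256, size=(720, 1280), dtype=np.uint8)
noise = np.random.default_rng(1).integers(-5, 6, size=frame_a.shape)
frame_b = np.clip(frame_a.astype(int) + noise, 0, 255)   # lightly perturbed copy
print(f"PSNR: {psnr(frame_a, frame_b):.2f} dB")
```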
5.2.3. SSIM (Structural Similarity Index Measure) (↑)
- Conceptual Definition:
SSIM[70] is a perceptual metric that quantifies the similarity between two images, considering three key components: luminance, contrast, and structure. UnlikePSNRwhich measures absolute error,SSIMaims to assess perceived quality by humans. It ranges from -1 to 1, where 1 indicates perfect structural similarity. - Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
- Symbol Explanation:
  - x, y: Two image patches being compared.
  - $\mu_x$, $\mu_y$: The means of $x$ and $y$.
  - $\sigma_x$, $\sigma_y$: The standard deviations of $x$ and $y$.
  - $\sigma_{xy}$: The covariance of $x$ and $y$.
  - $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$: Small constants to prevent division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and, by convention, $k_1 = 0.01$ and $k_2 = 0.03$.
  - ↑: Indicates that a higher value of SSIM is better.
5.2.4. LPIPS (Learned Perceptual Image Patch Similarity) (↓)
- Conceptual Definition: LPIPS [70] is a perceptual distance metric that evaluates the similarity between two images based on the activations of a pre-trained deep neural network (e.g., VGG, AlexNet). Instead of pixel-wise differences, LPIPS compares feature representations from deep layers, which correlates better with human judgment of perceptual similarity. A lower LPIPS score indicates higher perceptual similarity (less perceptual difference).
- Mathematical Formula:
$
\mathrm{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(y)_{hw} \right) \right\|_2^2
$
- Symbol Explanation:
  - x, y: Two images being compared.
  - $\phi_l(\cdot)$: Feature stack from layer $l$ of a pre-trained network.
  - $w_l$: A learned scaling weight for each channel in layer $l$.
  - $\odot$: Element-wise product.
  - $H_l$, $W_l$: Height and width of the feature map at layer $l$.
  - $\|\cdot\|_2^2$: Squared $\ell_2$ norm.
  - ↓: Indicates that a lower value of LPIPS is better.
5.2.5. VBench-long [71]
- Conceptual Definition:
VBench-long [71] is a specialized benchmark suite designed to evaluate video generation models, particularly focusing on their ability to generate long, consistent, and high-quality videos. It provides several metrics tailored to address challenges unique to long video generation, such as subject consistency over time and sustained aesthetic quality.
- Specific Metrics Reported:
  - Subject Consistency (S.C.): Measures how well the main subjects/entities in the video maintain their identity and appearance throughout the entire duration.
  - Aesthetic Quality (A.Q.): Evaluates the overall visual appeal, artistic merit, and general aesthetic pleasingness of the generated video.
  - Imaging Quality (I.Q.): Assesses the low-level visual quality, such as clarity, lack of artifacts, resolution, and overall visual fidelity.
- Symbol Explanation:
↑: Indicates that a higher value for theseVBenchmetrics is generally better.
5.2.6. PFLOPs (Peta Floating Point Operations)
- Conceptual Definition:
PFLOPs refers to Peta Floating-Point Operations; one PFLOP is $10^{15}$ floating-point operations. It is a measure of computational cost or workload. In this context, it quantifies the total number of floating-point operations required for a given inference or training task, so lower PFLOPs indicate higher computational efficiency.
5.2.7. Latency (s) / Inference Time (s)
- Conceptual Definition:
Latency or Inference Time refers to the time taken (in seconds) for a model to generate a video from start to finish. Lower latency indicates faster inference.
5.2.8. Speedup (x)
- Conceptual Definition:
Speedup is a performance metric that compares the time taken by a baseline method to that of an optimized method. It is calculated as (Baseline Time / Optimized Time). A speedup of 2× means the optimized method is twice as fast as the baseline.
5.2.9. Training Time (h) / Training Speedup (x)
- Conceptual Definition:
Training Time measures the total duration (in hours) required to fine-tune a model. Training Speedup is the ratio of the baseline training time to the optimized training time.
5.3. Baselines
Radial Attention is compared against a comprehensive set of baselines representing different approaches to efficient video generation:
- Original Model (Dense Attention): The full, unmodified video diffusion model with standard dense attention. This serves as the upper bound for quality (assuming unlimited compute) and the baseline for computational cost ($O(n^2)$).
- STA [47] (Sliding Tile Attention): A video-specific sparse attention method that uses sliding 3D windows for local attention. It improves efficiency but has a limited fixed receptive field, making it struggle with long-range dependencies. The paper uses FlashAttention-3 (FA3) [37] for STA for maximum efficiency.
- PA [45] (PowerAttention): An $O(n \log n)$ sparse attention mechanism for LLMs that restricts attention to tokens at power-of-two distances. It is a general-purpose sparse attention but does not specifically account for the spatiotemporal structure of video.
- SVG [8] (Sparse VideoGen): Another video-specific sparse attention method that dynamically classifies attention heads as spatial or temporal and applies corresponding masks. It improves inference efficiency but faces challenges during training and with unseen long video distributions.
- Spatial/Temporal (Heads from SVG): These refer to the specific attention masks used in SVG's dedicated spatial and temporal heads, as described in Section 4.1. They represent specialized sparse patterns.
- LongLoRA [39]: An efficient fine-tuning method for LLMs that extends context windows using shifted local attention. It provides a global receptive field but is not inherently designed for video's spatiotemporal characteristics.
- SANA [10]: An efficient diffusion model backbone that replaces softmax attention with linear attention. While offering $O(n)$ complexity, linear attention often struggles to capture local details and requires significant architectural changes, making it difficult to adapt with simple fine-tuning for video length extension.
- RIFLEx [50]: A training-free method for video length extrapolation that modifies RoPE (Rotary Position Embedding) frequencies. It aims to extend video length without retraining but can suffer from quality degradation.
- Full (LoRA fine-tuned Dense Attention): For longer video generation experiments, this baseline represents the original model fine-tuned using LoRA but retaining dense attention. This serves as a strong quality baseline for LoRA-based fine-tuning at extended lengths, allowing direct comparison of Radial Attention's quality and efficiency.
- Harmonic Series Decay Attention (HS): (Ablation study baseline) An alternative attention mask design where the computed diagonal width is inversely proportional to its distance from the main diagonal. This is used to validate the specific design choices of Radial Attention's sparsity pattern.
5.4. Implementation Details
- Models Benchmarked: Mochi 1 [22] (10B params), HunyuanVideo [1] (13B params), Wan2.1 [7] (14B params).
- Inference Accelerators:
FlashInfer[72] for inference,Block-Sparse-Attention[73] withFlashAttention-2 (FA2)[67] backend during training. - Default-Length Inference Settings:
- HunyuanVideo: 117 frames, 768p resolution (768x1280).
- Wan2.1: 69 frames, 768p resolution.
- Warm-up: Dense attention applied during the first 12 diffusion steps for all models.
- Initial DiT Block: Dense attention kept in the first
DiTblock to maintain quality. - Hardware: Single NVIDIA H100 GPU for latency measurements.
- Longer-Video Generation Fine-tuning Settings:
- Dataset: 2k top-scoring videos from
OpenVid-1M[74] for each extended length. - Hardware: 8 H100 GPUs for training.
- Batch Size: Global batch size of 1 with sequence parallelism for HunyuanVideo and Mochi 1; global batch size of 8 for Wan2.1.
- Initial DiT Blocks: Dense attention kept in the first two
DiTblocks. - Training Steps:
- HunyuanVideo: length (2400 steps), length (1200 steps).
- Mochi 1: and length (5000 steps).
- Wan2.1-14B: length (2500 steps).
- Training Time: Approximately 16-21 hours for HunyuanVideo, 8-17 hours for Mochi 1, and 15 hours for Wan 2.1.
- LoRA Rank: 128 for all training tasks.
- Inference Hardware: Single H100 for Wan 2.1; 8 H100s for HunyuanVideo and Mochi 1.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Training-free Inference Acceleration at Default Video Length
The following are the results from Table 1 of the original paper:
| Model | Method | PSNR (↑) | SSIM (↑) | LPIPS (↓) | Vision Reward (↑) | PFLOPs | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| HunyuanVideo (117 frames) | Original | − | − | − | 0.141 | 612 | 1649 | − |
| | STA (FA3) | 26.7 | 0.866 | 0.167 | 0.132 | 331 | 719 | 2.29× |
| | PA | 22.1 | 0.764 | 0.256 | 0.140 | 339 | 1002 | 1.65× |
| | SVG | 27.2 | 0.895 | 0.114 | 0.144 | 340 | 867 | 1.90× |
| | Ours | 27.3 | 0.886 | 0.114 | 0.139 | 339 | 876 | 1.88× |
| Wan2.1-14B (69 frames) | Original | − | − | − | 0.136 | 560 | 1630 | − |
| | STA (FA3) | 22.9 | 0.830 | 0.171 | 0.132 | 322 | 812 | 2.01× |
| | PA | 22.4 | 0.790 | 0.176 | 0.126 | 324 | 978 | 1.67× |
| | SVG | 23.2 | 0.825 | 0.202 | 0.114 | 324 | 949 | 1.71× |
| | Ours | 23.9 | 0.842 | 0.163 | 0.128 | 323 | 917 | 1.77× |
Table 1 presents a quantitative comparison of Radial Attention against other sparse attention baselines (STA, PA, SVG) and the Original dense attention model when generating videos at their default lengths. The evaluation uses PSNR, SSIM, LPIPS, and Vision Reward for quality, and PFLOPs, Latency, and Speedup for efficiency.
Key Observations:
- Quality Preservation: Radial Attention (Ours) consistently performs well in terms of video quality.
  - For HunyuanVideo, it achieves the highest PSNR (27.3) and an LPIPS (0.114) matching SVG, while having competitive SSIM (0.886) and Vision Reward (0.139). This indicates that it effectively preserves the visual quality of dense attention.
  - For Wan2.1-14B, it again achieves the highest PSNR (23.9) and SSIM (0.842) and a low LPIPS (0.163), demonstrating superior quality compared to other sparse baselines.
- Outperformance over STA and PA: Radial Attention significantly outperforms STA and PA across similarity metrics (PSNR, SSIM, LPIPS) for both models.
  - PA (PowerAttention), despite having a similar $O(n \log n)$ complexity, yields considerably lower PSNR and SSIM and higher LPIPS values. This suggests that its generic power-of-two distance sparsity pattern, which ignores the inherent spatiotemporal locality of video data, is suboptimal for video generation.
  - STA (with FlashAttention-3) achieves a slightly higher speedup for HunyuanVideo (2.29× vs. 1.88×) but at the cost of noticeably degraded visual quality, as evidenced by lower PSNR and SSIM and higher LPIPS. The authors attribute STA's higher speedup to its use of FlashAttention-3, while their current implementation uses FlashAttention-2.
- Matching SVG Quality: Radial Attention matches the video fidelity of SVG while offering competitive efficiency. SVG shows a slightly higher Vision Reward for HunyuanVideo (0.144 vs. 0.139), but Radial Attention has better PSNR and SSIM.
- Efficiency: Radial Attention achieves substantial end-to-end speedups: 1.88× for HunyuanVideo and 1.77× for Wan2.1-14B on a single H100 GPU. These speedups are largely consistent with the theoretical compute budget savings (e.g., roughly 1.8× fewer PFLOPs for HunyuanVideo: 339 vs. 612).

Visually, Figure 6 further illustrates that Radial Attention mirrors the video quality of the original Wan2.1-14B model at default length, showing similar visual fidelity for generated videos of fruit and a person.
6.1.2. Long Video Generation
The following are the results from Table 2 of the original paper:
| Model | #Frames | Method | Sparsity | Training Time (h) | Training Speedup | Inference Time (s) | Inference Speedup | Vision Reward (↑) | VBench S.C. | VBench A.Q. | VBench I.Q. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HunyuanVideo | 125 (1×) | Original | 0.00% | − | − | 225 | − | 0.119 | 0.959 | 0.643 | 0.672 |
| | 253 (2×) | Original | 0.00% | − | − | 797 | 1.00× | 0.122 | 0.953 | 0.603 | 0.611 |
| | | RIFLEx | 0.00% | − | − | 797 | 1.00× | 0.128 | 0.969 | 0.622 | 0.614 |
| | | Spatial | 80.5% | 16.0 | 2.81× | 335 | 2.38× | 0.054 | 0.979 | 0.607 | 0.670 |
| | | Temporal | 80.7% | 16.2 | 2.78× | 338 | 2.36× | 0.104 | 0.963 | 0.620 | 0.658 |
| | | LongLoRA | 80.6% | 16.6 | 2.71× | 363 | 2.20× | 0.112 | 0.958 | 0.620 | 0.685 |
| | | PA [45] | 80.4% | 16.7 | 2.69× | 334 | 2.39× | 0.109 | 0.967 | 0.608 | 0.653 |
| | | SANA | − | 12.8 | 3.52× | 285 | 2.80× | -0.205 | 0.907 | 0.300 | 0.442 |
| | | Full | 0.00% | 45.0 | 1.00× | 797 | 1.00× | 0.124 | 0.955 | 0.616 | 0.648 |
| | | Ours | 80.8% | 16.2 | 2.78× | 339 | 2.35× | 0.126 | 0.968 | 0.623 | 0.663 |
| | 509 (4×) | Original | 0.00% | − | − | 2895 | 1.00× | 0.054 | 0.988 | 0.545 | 0.451 |
| | | RIFLEx | 0.00% | − | − | 2895 | 1.00× | 0.037 | 0.989 | 0.539 | 0.456 |
| | | Spatial | 88.3% | 20.7 | 4.52× | 755 | 3.83× | 0.112 | 0.922 | 0.598 | 0.664 |
| | | Temporal | 88.2% | 21.1 | 4.44× | 774 | 3.74× | 0.083 | 0.972 | 0.597 | 0.646 |
| | | LongLoRA | 88.4% | 20.9 | 4.48× | 803 | 3.61× | 0.130 | 0.936 | 0.618 | 0.689 |
| | | PA [45] | 88.2% | 21.8 | 4.29× | 766 | 3.78× | 0.128 | 0.950 | 0.590 | 0.648 |
| | | Full | 0.00% | 93.6 | 1.00× | 2895 | 1.00× | 0.133 | 0.977 | 0.590 | 0.635 |
| | | Ours | 88.3% | 21.4 | 4.37× | 781 | 3.71× | 0.134 | 0.973 | 0.623 | 0.672 |
| Mochi 1 | Original | 0.00% 0.00% | − | 112 | − 1.00× | 0.071 0.040 | 0.973 0.937 | 0.623 | 0.672 | ||
| 331 (2×) | Original Spatial | 76.1% | − 8.57 | 1.75× | 302 186 | 1.62× | 0.088 | 0.935 | 0.551 0.596 | 0.466 0.595 | |
| Temporal | 76.3% | 8.54 | 1.76× | 189 | 1.60× | 0.075 | 0.936 | 0.591 | 0.593 | ||
| Long LoRA | 76.0% | 9.07 | 1.65× | 210 | 1.44× | 0.095 | 0.950 | 0.596 | 0.630 | ||
| PA [45] | 1.76× | 183 | 1.65× | 0.101 | 0.946 | 0.610 | 0.626 | ||||
| SAA | 77.8% | 8.53 | 1.82× | -0.201 | |||||||
| Full Ours | 0.00% 76.4% | 15.0 8.43 | 1.00× 1.78× | 302 185 | 1.00× 1.63× | 0.095 0.110 | 0.923 0.951 | 0.610 0.615 | 0.594 | ||
| Original | 0.602 | ||||||||||
| Wan2.1-14B | 0.00% | 992 | 1.00× | -0.091 | 0.916 | 0.383 | 0.322 | ||||
| Spatial | 85.2% | 17.4 | 2.83× | 382 | 2.60× | 0.091 | 0.930 | 0.611 | 0.585 | ||
| Temporal | 85.4% | 17.6 | 2.80× | 393 | 2.52× | 0.028 | 0.931 | 0.556 | 0.536 | ||
| 667 (4×) | Long LoRA | 86.0% | 19.0 | 2.59× | 426 | 2.33× | 0.086 | 0.944 | 0.584 | 0.543 | |
| PA [45] | 86.5% | 17.3 | 2.84× | 381 | 2.60× | 0.107 | 0.956 | 0.633 | 0.650 | ||
| Full | 0.00% | 49.2 | 1.00× | 992 | 1.00× | 0.099 | 0.934 | 0.613 | 0.613 0.638 | ||
| 81 (1×) | Ours | 85.5% | 17.4 | 2.83× | 386 | 2.57× | 0.113 | 0.958 | 0.618 | ||
Table 2 presents results for video generation at 2×, 4×, and the original lengths, comparing Radial Attention (Ours) against various baselines (including fine-tuned dense attention) across HunyuanVideo, Mochi 1, and Wan2.1-14B. Quality is assessed using Vision Reward and VBench metrics (Subject Consistency, Aesthetic Quality, Imaging Quality), while efficiency is measured by Training Time, Training Speedup, Inference Time, and Inference Speedup.
Key Observations on Quality:
- Original Models Degradation: Without fine-tuning, the Original models exhibit significant quality degradation for longer videos, especially at 4× length extension. For example, Original HunyuanVideo at 4× length has a very low Vision Reward of 0.054. Figure 7 visually confirms that Original HunyuanVideo and RIFLEx produce blurred videos with poor visual quality at 4× length extension.
- RIFLEx Limitations: RIFLEx improves performance slightly at 2× length but shows further quality deterioration beyond that (e.g., a HunyuanVideo 4×-length Vision Reward of 0.037), indicating limited extension capability.
- Sparse Baseline Shortcomings:
  - Spatial and Temporal attention (from SVG): Suffer from limited receptive fields, leading to quality issues.
  - LongLoRA and PA: Despite having global receptive fields, they fail to effectively capture spatiotemporal correlations specific to video, resulting in degraded quality compared to Radial Attention. Notably, PA shows a large gain in Vision Reward after fine-tuning, suggesting its initial generic sparse pattern mismatches the pre-trained model's attention distribution; fine-tuning helps it adapt.
- SANA Failure: SANA (which replaces softmax attention with linear attention) performs poorly with fine-tuning-based video length extension (a HunyuanVideo 2×-length Vision Reward of -0.205), indicating it requires massive retraining rather than just fine-tuning.
- Radial Attention's Superiority: Radial Attention consistently achieves quality on par with, or even slightly better than, LoRA fine-tuned dense attention models across all tested lengths and models.
  - For HunyuanVideo at 4× length, Radial Attention yields a Vision Reward of 0.134, comparable to Full (dense attention fine-tuned) at 0.133 and significantly higher than Original (0.054) or RIFLEx (0.037). It also maintains strong VBench scores (S.C. 0.973, A.Q. 0.623, I.Q. 0.672).
  - For Mochi 1 at 2× length, Radial Attention gets a Vision Reward of 0.113, again outperforming Full (0.099) and other baselines.
  - For Wan2.1-14B at 4× length, Radial Attention achieves the highest Vision Reward (0.161) and robust VBench scores.
- Visual Evidence: Figure 7 and Appendix Figures C, D, E visually demonstrate that Radial Attention produces videos with good visual quality and consistency even at extended lengths, superior to baselines that show blurring, distortion, or temporal inconsistencies.
Key Observations on Efficiency:
- Significant Speedups: Thanks to its $O(n \log n)$ complexity, Radial Attention delivers substantial speedups.
  - For HunyuanVideo at 4× length, it achieves a Training Speedup of 4.37× (21.4 hours vs. 93.6 hours for Full) and an Inference Speedup of 3.71× (781 s vs. 2895 s for Full), a significant reduction in both training and inference costs.
  - Similar speedups are observed for Mochi 1 (a 1.78× training speedup at 2× length) and Wan2.1-14B (a 2.83× training speedup at 4× length).
- Cost Reduction: For longer videos, Radial Attention reduces training costs by up to 4.4× and accelerates inference by up to 3.7× compared to dense attention.
6.1.3. Compatibility with Existing LoRAs
The paper highlights that Radial Attention offers seamless compatibility with pre-trained task-specific LoRAs, such as those for artistic style transfer. This is demonstrated by directly merging the LoRA weights trained for Radial Attention on long videos with existing style LoRAs.
Figure F in Appendix C.3 visually illustrates this compatibility. It shows that combining the extended-length LoRA from Radial Attention with a style LoRA (e.g., for a specific art style) successfully generates longer videos while preserving the intended artistic style. The Vision Reward remains comparable to that of original-length LoRA videos. The authors note that minor style discrepancies might occur due to the relatively small dataset used for training the extended-length LoRA, suggesting that a more comprehensive dataset could further improve consistency. This feature is crucial for practical applications, allowing users to extend video length without losing the benefits of existing style customizations.
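To make the merging step concrete, here is a minimal sketch of how two independent LoRA updates (a length-extension LoRA and a style LoRA) could be folded into one frozen weight matrix. The shapes, ranks, and scaling factors below are illustrative assumptions, not the repository's actual merging code.

```python
import torch

def merge_loras(base_weight, loras, alphas):
    """Fold several independent LoRA deltas into one frozen base weight.

    base_weight: (out_features, in_features) tensor of a pre-trained linear layer.
    loras:       list of (A, B) pairs with A: (rank, in_features), B: (out_features, rank).
    alphas:      per-LoRA scaling factors.
    """
    merged = base_weight.clone()
    for (A, B), alpha in zip(loras, alphas):
        merged += alpha * (B @ A)  # each LoRA contributes an independent low-rank update
    return merged

# Hypothetical example: a length-extension LoRA (rank 8) plus a style LoRA (rank 4)
# merged into one 128x128 projection weight. Shapes are illustrative only.
d = 128
W0 = torch.randn(d, d)
length_lora = (0.01 * torch.randn(8, d), 0.01 * torch.randn(d, 8))
style_lora = (0.01 * torch.randn(4, d), 0.01 * torch.randn(d, 4))
W = merge_loras(W0, [length_lora, style_lora], alphas=[1.0, 0.8])
print(W.shape)  # torch.Size([128, 128])
```

Because each LoRA is a low-rank additive delta on the same frozen weights, the two updates can simply be summed, which is why the style customization survives the length extension.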
6.2. Ablation Studies & Parameter Analysis
6.2.1. Effectiveness of Low-Rank Adaptation
Figure 8(a) compares the Vision Reward of full fine-tuning versus LoRA fine-tuning as video length increases.
- Dense Attention: For models using dense attention, LoRA fine-tuning initially lags behind full fine-tuning at shorter length extensions but catches up to, or slightly surpasses, it at the longest (4×) extension. This suggests that for dense attention, LoRA might require more data or a higher rank to fully match full fine-tuning at shorter extensions, while its benefits become more pronounced for very long sequences.
- Radial Attention: Crucially, when combined with Radial Attention, LoRA fine-tuning consistently matches or even outperforms full fine-tuning across all tested video lengths. This finding is significant: Radial Attention not only scales better computationally but also makes the model more amenable to LoRA adaptation, potentially by focusing the LoRA updates on the most critical connections preserved by the sparse mask. This validates the design choice of combining Radial Attention with LoRA for efficient, high-quality long-video generation. A minimal sketch of such a LoRA layer is given after this list.
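As a reference for what "LoRA fine-tuning" means here, below is a minimal sketch of a LoRA-wrapped linear layer: the pre-trained weight is frozen and only a low-rank correction is trained. The class name, rank, and scaling are hypothetical choices for illustration, not the paper's training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the scaled low-rank correction B (A x)
        return self.base(x) + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(nn.Linear(64, 64), rank=8)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```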
6.2.2. Attention Error
To quantify how well Radial Attention approximates dense attention, the authors evaluate the average attention output Mean Squared Error (MSE) on the Wan2.1-14B model.
Radial Attention achieves a significantly lower attention-output MSE than both SVG [8] and STA [47]. This indicates that its static mask design is more effective at preserving the fidelity of the attention output than other sparse attention mechanisms, yielding a more accurate approximation of full attention. This empirical finding supports the theoretical error bound derived in Section 4.2.2.6. A sketch of how such a measurement can be set up is given below.
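The sketch below illustrates how such an attention-output MSE could be measured: run the same queries, keys, and values through dense attention and through attention restricted by a sparse mask, then compare the outputs. The toy shapes and random mask are assumptions for illustration; the paper evaluates real Wan2.1-14B activations with the actual Radial/SVG/STA masks.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    """Softmax attention with disallowed entries set to -inf before the softmax."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def attention_output_mse(q, k, v, sparse_mask):
    dense_mask = torch.ones_like(sparse_mask, dtype=torch.bool)
    dense_out = masked_attention(q, k, v, dense_mask)
    sparse_out = masked_attention(q, k, v, sparse_mask)
    return F.mse_loss(sparse_out, dense_out)

# Toy example with random tensors and a random ~50%-dense mask.
n, d = 256, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
mask = torch.rand(1, n, n) < 0.5
mask |= torch.eye(n, dtype=torch.bool)  # keep the diagonal so every query keeps at least one key
print(attention_output_mse(q, k, v, mask).item())
```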
6.2.3. Regression Results
The authors perform a regression analysis on the empirically observed attention decay curves (Figure 4) using the exponential function $y = \exp(-ax + b)$. As shown in Figure 8(b), the fitted curves achieve $R^2$ values of over 0.985 for both the temporal-distance and the spatial-distance decay. These high $R^2$ values strongly suggest that an exponential function is a very good model of the Spatiotemporal Energy Decay phenomenon observed in video diffusion models, and this empirical validation justifies the theoretical assumption of exponential decay (Equation 3) that underpins the design of the Radial Attention mask. A minimal curve-fitting sketch is given below.
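A minimal curve-fitting sketch in the spirit of this regression: fit $y = \exp(-ax + b)$ by ordinary least squares on $\log y$ and report $R^2$. The decay curve below is synthetic; the paper fits the measured attention scores from Figure 4.

```python
import numpy as np

def fit_exponential_decay(x, y):
    """Fit y = exp(-a * x + b) via linear least squares on log(y); returns (a, b, r2)."""
    log_y = np.log(y)
    slope, intercept = np.polyfit(x, log_y, deg=1)  # log y ~ slope * x + intercept
    pred = slope * x + intercept
    ss_res = np.sum((log_y - pred) ** 2)
    ss_tot = np.sum((log_y - log_y.mean()) ** 2)
    return -slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic decay curve standing in for the measured attention scores.
x = np.arange(1, 50, dtype=float)
y = np.exp(-0.08 * x + 0.3) * (1 + 0.02 * np.random.randn(x.size))
a, b, r2 = fit_exponential_decay(x, y)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.4f}")
```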
6.2.4. More Ablations on Radial Attention Design Choices
6.2.4.1. Comparison with Other Sparsity Patterns
The following are the results from Table A of the original paper:
| Model | Method | PSNR (↑) | SSIM (↑) | LPIPS (↓) | VisionReward (↑) |
| HunyuanVideo (117 frames) | HS | 27.0 | 0.881 | 0.119 | 0.136 |
| Ours | 27.3 | 0.886 | 0.114 | 0.139 |
Table A compares Radial Attention with Harmonic Series Decay Attention (HS), an alternative sparsity pattern where diagonal width is inversely proportional to distance. Radial Attention consistently outperforms HS across all quality metrics (PSNR, SSIM, LPIPS, Vision Reward). This demonstrates the effectiveness of Radial Attention's specific mask design, which incorporates both temporal and spatial decay rules, making it superior to other generic sparsity patterns for video data.
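For intuition, here is a simplified, token-level sketch of the two sparsity patterns being compared: a radial mask whose spatial window halves each time the temporal distance doubles, and a harmonic-series mask whose window shrinks as 1/distance. This is a coarse illustration under those assumptions, not the repository's block-sparse kernel masks.

```python
import torch

def radial_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Simplified radial mask: same-frame tokens attend densely; across frames the
    spatial window halves each time the temporal distance doubles."""
    s = tokens_per_frame
    mask = torch.zeros(num_frames * s, num_frames * s, dtype=torch.bool)
    pos = torch.arange(s)
    for fi in range(num_frames):
        for fj in range(num_frames):
            dist = abs(fi - fj)
            if dist == 0:
                block = torch.ones(s, s, dtype=torch.bool)
            else:
                window = max(1, s >> dist.bit_length())  # s/2, s/4, s/4, s/8, ...
                block = (pos[:, None] - pos[None, :]).abs() <= window // 2
            mask[fi * s:(fi + 1) * s, fj * s:(fj + 1) * s] = block
    return mask

def harmonic_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """'Harmonic series' alternative: window width inversely proportional to distance."""
    s = tokens_per_frame
    mask = torch.zeros(num_frames * s, num_frames * s, dtype=torch.bool)
    pos = torch.arange(s)
    for fi in range(num_frames):
        for fj in range(num_frames):
            window = max(1, s // (abs(fi - fj) + 1))
            block = (pos[:, None] - pos[None, :]).abs() <= window // 2
            mask[fi * s:(fi + 1) * s, fj * s:(fj + 1) * s] = block
    return mask

r, h = radial_mask(16, 32), harmonic_mask(16, 32)
print(f"radial density {r.float().mean():.3f}, harmonic density {h.float().mean():.3f}")
```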
6.2.4.2. Ablation on Initial Dense-Attention Steps
The following are the results from Table B of the original paper:
| Model | #Warmup Steps | PSNR (↑) | SSIM(↑) | LPIPS (↓) |
| Wan2.1-14B (69 frames) | 0 | 12.8 | 0.486 | 0.522 |
| 4 | 18.5 | 0.693 | 0.267 | |
| 8 | 21.7 | 0.778 | 0.183 | |
| 11 | 23.2 | 0.813 | 0.151 | |
| 12 (Ours) | 23.6 | 0.823 | 0.146 | |
| 13 | 23.5 | 0.819 | 0.150 |
For default-length video generation, the authors follow SVG [8] by applying full attention during a warm-up phase at the beginning of the diffusion process. Table B ablates the number of these warm-up steps for the Wan2.1-14B model. The overall computation is matched across all settings by adjusting the sparsity of the Radial Attention mask. The results clearly show that a 12-step warm-up phase achieves the best performance across all quality metrics (PSNR, SSIM, LPIPS), with a slight drop at 13 steps. This justifies the chosen warm-up strategy for default-length generation.
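A small sketch of the warm-up schedule described above: the first few denoising steps use full attention, after which the sparse mask is applied. The function and constant names are hypothetical; only the 12-step choice comes from Table B (Table C suggests 2 steps for fine-tuned longer generation).

```python
import torch
import torch.nn.functional as F

DENSE_WARMUP_STEPS = 12  # best setting in Table B for Wan2.1-14B at the default length

def attention_with_warmup(q, k, v, step: int, sparse_mask: torch.Tensor):
    """Full attention during the warm-up steps, masked attention afterwards."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    if step >= DENSE_WARMUP_STEPS:  # switch to the sparse pattern after the warm-up
        scores = scores.masked_fill(~sparse_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# e.g. inside a 50-step sampling loop:
# for t in range(50):
#     out = attention_with_warmup(q, k, v, step=t, sparse_mask=radial_mask(16, 32))
```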
The following are the results from Table C of the original paper:
| Model | #Warmup Steps | Vision Reward (↑) |
| HunyuanVideo (117 frames) | 0 | 0.154 |
| 1 | 0.160 | |
| 2 (Ours) | 0.163 | |
| 3 | 0.157 |
For longer video generation, a similar ablation on warm-up steps during inference is conducted for HunyuanVideo. Table C shows that using 2 warm-up steps yields the highest Vision Reward (0.163), with fewer or more steps leading to slightly lower scores. This suggests that a shorter warm-up is sufficient for fine-tuned longer video generation, likely because the fine-tuning process adapts the sparse attention more effectively.
6.2.4.3. Ablation on Initial Dense-Attention Layers
The following are the results from Table D of the original paper:
| Model | #Dense Layers | Vision Reward (↑) |
| HunyuanVideo (117 frames) | 0 | 0.139 |
| 1 | 0.156 | |
| 2 (Ours) | 0.163 | |
| 3 | 0.157 |
During training for long video generation, the authors keep a certain number of initial DiT blocks with dense attention to better capture global information. Table D ablates this choice for HunyuanVideo, comparing 0, 1, 2, or 3 initial dense layers while keeping the overall computation budget consistent. The results indicate that using 2 full-attention layers delivers the best Vision Reward (0.163). Using fewer (0 or 1) layers might miss critical global interactions, while using more (3) might introduce unnecessary computation without further quality benefits, or even slightly harm quality. This validates the design choice of keeping the first two DiT blocks dense.
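The corresponding layer-wise choice can be sketched as assigning a full mask to the first blocks and the radial mask to the rest; the helper below is a hypothetical illustration of that assignment, with the 2-layer choice taken from Table D.

```python
import torch

NUM_DENSE_LAYERS = 2  # Table D: two initial full-attention DiT blocks work best for HunyuanVideo

def build_layer_masks(num_layers: int, sparse_mask: torch.Tensor):
    """First blocks get an all-True (dense) mask; the rest share the sparse radial mask."""
    dense = torch.ones_like(sparse_mask)
    return [dense if i < NUM_DENSE_LAYERS else sparse_mask for i in range(num_layers)]

# e.g. masks = build_layer_masks(num_layers=40, sparse_mask=radial_mask(16, 32))
```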
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Radial Attention, an innovative sparse attention mechanism designed to overcome the prohibitive computational costs of generating long videos with diffusion models. The core of its contribution lies in identifying and leveraging the Spatiotemporal Energy Decay phenomenon, where attention scores diminish with increasing spatial and temporal distance between tokens. Radial Attention translates this decay into $O(n \log n)$ computational complexity by employing a static, principled attention mask. This mask intelligently allocates compute density, focusing on dense interactions for nearby tokens and sparsely sampling interactions for distant ones, allowing the attention window to shrink with temporal distance.

The paper demonstrates that Radial Attention effectively maintains high video quality across various leading diffusion models (Wan2.1-14B, HunyuanVideo, Mochi 1) at their default generation lengths, while achieving significant inference speedups (up to 1.9×). More importantly, it provides a scalable solution for long video generation: with minimal LoRA-based fine-tuning, it enables the generation of videos up to 4× longer, drastically reducing training costs (up to 4.4×) and accelerating inference (up to 3.7×) compared to dense attention baselines, all without sacrificing visual fidelity. Its compatibility with existing style LoRAs further enhances its practical utility.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Simplifying Assumption of Exponential Decay: The theoretical analysis and mask design rely on the assumption that attention scores exhibit a clear exponential decay with spatiotemporal distance. While this is empirically validated with high $R^2$ values, it may be a simplification of the complex, nuanced spatiotemporal dependencies in natural video data. Future work could explore more sophisticated models of these underlying structures to further improve efficiency and performance.
- Quadratic Complexity with Respect to Resolution: As shown in Equation 6, for a fixed number of frames the method's cost still scales quadratically with the number of tokens per frame (roughly $O(s^2 f \log f)$ for $f$ frames of $s$ tokens each), since spatially nearby tokens within the same frame are still attended densely. So while Radial Attention significantly addresses scaling along the temporal dimension, it still faces limitations for generating very high-resolution videos (i.e., videos with a very large per-frame token count). Future work should explore more efficient attention mechanisms and pre-training strategies, such as Native Sparse Attention (NSA) [75] and MoBA [76], to better support both long and high-resolution videos simultaneously. A rough cost-counting sketch of this scaling follows below.
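To illustrate the resolution scaling, the sketch below counts attended (query, key) pairs under the simplified radial mask assumed earlier (dense within a frame, window halving as temporal distance doubles). Doubling the per-frame token count roughly quadruples the count for a fixed number of frames, which is the quadratic-in-resolution behavior noted above.

```python
def radial_attention_cost(num_frames: int, tokens_per_frame: int) -> int:
    """Count attended (query, key) pairs under the simplified radial mask:
    dense within a frame, window halving as temporal distance doubles."""
    s, total = tokens_per_frame, 0
    for fi in range(num_frames):
        for fj in range(num_frames):
            dist = abs(fi - fj)
            window = s if dist == 0 else max(1, s >> dist.bit_length())
            total += s * window  # each of the s queries sees ~window keys in frame fj
    return total

for s in (16, 32, 64):  # doubling tokens per frame ...
    print(s, radial_attention_cost(num_frames=32, tokens_per_frame=s))
# ... roughly quadruples the pair count (quadratic in s), while doubling the number
# of frames grows it only slightly faster than linearly (the f log f factor).
```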
7.3. Personal Insights & Critique
Radial Attention presents a compelling and elegant solution to a critical bottleneck in video generation. The core insight of Spatiotemporal Energy Decay is particularly intuitive and well-supported by empirical evidence, providing a principled foundation for sparsity rather than relying on heuristic masking. Its static mask design is a significant practical advantage, making it more robust and easier to train than dynamic alternatives like SVG. The seamless integration with LoRA for length extrapolation is a highly valuable feature, democratizing the ability to extend state-of-the-art models to longer sequences without prohibitive retraining costs.
Transferability and Applications: The underlying principle of attention decay with distance (both temporal and spatial) is likely applicable beyond video generation. In general sequence modeling tasks, especially those with inherent multi-dimensional structure (e.g., medical imaging, 3D point clouds, time-series data with latent spatial dependencies), similar energy decay phenomena might exist. Radial Attention's structured sparsity pattern could be adapted to these domains to achieve similar efficiency gains. Its $O(n \log n)$ complexity positions it as a strong candidate for any sequence modeling task where quadratic dense attention is too expensive but linear attention lacks expressive power.
Potential Issues and Areas for Improvement:
- Fixed Decay Parameters: While the exponential decay model fits well, the temporal and spatial decay rates are implicitly fixed by the mask design. Could learning these decay parameters adaptively for different attention heads or layers further optimize the sparsity pattern and improve quality? Different types of content might exhibit different decay characteristics.
- Hardware-specific Optimizations: The paper mentions that STA achieved a higher speedup with FlashAttention-3. While Radial Attention is orthogonal to the FlashAttention version used, its full potential speedup may only be realized with the latest hardware-specific optimizations. Future work could involve developing FlashAttention-3 or custom kernel implementations specifically tailored to Radial Attention's block-sparse pattern.
- Generalization to Diverse Motion: The current design assumes a relatively uniform spatiotemporal energy decay. However, videos can contain highly localized, rapid movements (e.g., an object appearing quickly) that might warrant different attention patterns. Does Radial Attention adequately handle highly dynamic or unpredictable motion without losing critical information, or does the sparse sampling sometimes miss crucial but transient interactions?
- Beyond First-Order Decay: While exponential decay is a good fit, higher-order or more complex decay functions might capture nuances that $\exp(-ax + b)$ misses, potentially offering marginal quality improvements at the cost of increased mask complexity.

Broader Impact: The environmental benefits of reducing computational costs for training and inference are significant, contributing to "greener AI." This also makes advanced video generation more accessible, potentially fostering creativity in education and the arts. However, the authors rightly highlight the dual-use nature of generative AI: the increased efficiency could facilitate the creation and spread of deepfakes and misinformation. Continued research into robust detection methods and responsible deployment guidelines will be crucial as such powerful tools become more accessible.