ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ReHyAt introduces a Recurrent Hybrid Attention mechanism for video diffusion transformers that reduces attention complexity to linear, enhancing scalability for long sequences. It achieves efficient distillation from existing models at significantly lower training costs, while maintaining competitive video quality.

Abstract

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.

In-depth Reading

1. Bibliographic Information

1.1. Title

ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

1.2. Authors

Mohsen Ghafoorian, Amirhossein Habibian. Their affiliations are listed as Qualcomm AI Research.

1.3. Journal/Conference

The paper is published at (UTC): 2026-01-07T19:26:30.000Z, and the provided links are to arXiv. This indicates it is currently a preprint, which means it has been submitted to arXiv for open access but may not yet have undergone peer review or been formally published in a journal or conference proceedings. arXiv is a reputable platform for sharing research rapidly within the scientific community.

1.4. Publication Year

2026 (based on the provided publication UTC date: 2026-01-07T19:26:30.000Z)

1.5. Abstract

The paper introduces ReHyAt, a novel Recurrent Hybrid Attention mechanism designed for video diffusion transformers. This mechanism addresses the scalability limitations of existing transformer-based video diffusion models, which suffer from quadratic attention complexity. ReHyAt combines the high fidelity of softmax attention with the computational efficiency of linear attention, enabling a chunk-wise recurrent reformulation that ensures constant memory usage. Unlike other linear attention methods, ReHyAt's hybrid design facilitates efficient distillation from pre-existing softmax-based models, drastically reducing training costs to approximately 160 GPU hours (two orders of magnitude less than comparable models like SANA Video) while maintaining competitive video quality. The proposed lightweight distillation and fine-tuning pipeline offers a practical approach for future state-of-the-art bidirectional softmax-based models. Experimental results on VBench and VBench-2.0 benchmarks, along with a human preference study, confirm that ReHyAt achieves state-of-the-art video quality, transitions from quadratic to linear attention cost, and unlocks practical scalability for long-duration and on-device video generation.

https://arxiv.org/abs/2601.04342 This is an arXiv preprint.

https://arxiv.org/pdf/2601.04342v1.pdf

2. Executive Summary

2.1. Background & Motivation

The ambition in generative video is shifting towards creating long, coherent sequences. While diffusion models have become the go-to for this, the underlying architecture choice is crucial for scalability. Recent state-of-the-art video generation models leverage transformer-based architectures, specifically Diffusion Transformers (DiTs), which process video as a sequence of spatiotemporal patches and provide global context.

However, this progress comes at a significant computational cost: the self-attention mechanism in transformers scales quadratically with the sequence length ($\mathcal{O}(N^2)$ in time and memory). For video, the sequence length $N$ is the product of temporal length and spatial patch count, meaning even moderate video resolutions and durations result in very large token counts. This quadratic complexity makes training and inference prohibitively expensive for longer videos (beyond ~10 seconds) and limits deployment on resource-constrained edge devices like mobile phones.

Linear attention offers a solution by reducing complexity to linear ($\mathcal{O}(N)$) and enabling constant memory usage when reformulated as a Recurrent Neural Network (RNN). However, purely linear attention often sacrifices the expressiveness and fidelity of softmax attention, leading to reduced quality unless extensively retrained. Existing hybrid approaches that combine both types of attention often remain quadratic and cannot be reformulated as RNNs, thus failing to solve the scalability issue for long videos. Furthermore, retraining large, high-quality softmax-based video diffusion models from scratch with efficient attention mechanisms is impractically expensive.

The paper's entry point is to find a way to distill the quality of these compute-heavy, softmax-based models into a recurrent, efficient form without prohibitive retraining costs, thereby unlocking practical scalability for long-duration and on-device video generation.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  • ReHyAt Mechanism: Introduction of ReHyAt, a novel temporally chunked hybrid attention mechanism that integrates local softmax attention with global linear attention. This design ensures high-fidelity modeling of critical dependencies within and across adjacent frames while significantly reducing overall computational complexity to linear time.

  • Recurrent Reformulation: Derivation of a chunk-wise recurrent formulation for ReHyAt. This crucial step enables the generation of arbitrarily long videos with constant memory usage and highly efficient inference, making it suitable for on-device execution.

  • Efficient Training Pipeline: Demonstration of a two-stage training pipeline (attention distillation followed by lightweight fine-tuning) that allows for transforming a state-of-the-art bidirectional softmax attention video diffusion model into a chunk-wise recurrent model within approximately 160 GPU hours. This represents a significant reduction in training cost (two orders of magnitude more efficient than concurrent methods) with only a negligible impact on quality.

    Key findings show that ReHyAt achieves state-of-the-art video quality (competitive with Wan2.1 1.3B and outperforming many other efficient methods on VBench and VBench-2.0), drastically reduces attention cost from quadratic to linear, and offers superior on-device performance (e.g., ~16× faster than Flash Attention and ~11× more memory efficient at 121 frames on Snapdragon8-Gen4).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Diffusion Models

Diffusion models are a class of generative models that learn to reverse a gradual "noising" process. They start with random noise and progressively denoise it over several steps to generate high-quality data (like images or videos) that resemble the training data. They are known for their stability and ability to generate diverse and high-fidelity content.

Transformers

Transformers are neural network architectures introduced in "Attention Is All You Need" (Vaswani et al., 2017) that have revolutionized natural language processing and increasingly, computer vision. Their core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence to each other, irrespective of their distance. This global context processing capability makes them powerful for sequential data.

Self-Attention Mechanism

At the heart of a transformer is the self-attention mechanism. For a given input sequence of tokens, it calculates a weighted sum of all other tokens for each token, where the weights (attention scores) determine how much focus each token should place on others. This allows the model to capture long-range dependencies. The self-attention calculation involves three learnable linear projections: Query (Q), Key (K), and Value (V) matrices. The attention scores are computed by taking the dot product of Query and Key vectors, scaled by $\sqrt{d_k}$ (where $d_k$ is the dimension of the Key vectors) to prevent large dot products from pushing the softmax function into regions with tiny gradients. These scores are then passed through a softmax function to obtain probability distributions, which are then multiplied by the Value vectors to produce the final output.
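For reference, here is a minimal single-head PyTorch sketch of scaled dot-product self-attention; the flat (N, D) layout, the absence of multi-head logic and masking, and the projection shapes are simplifications for illustration, not the paper's implementation.

```python
import torch

def softmax_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (N, D) token features; w_q, w_k, w_v: (D, D) learnable projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 1) / q.shape[-1] ** 0.5   # (N, N) similarity logits
    weights = torch.softmax(scores, dim=-1)               # row-normalized attention
    return weights @ v                                    # (N, D) weighted sum of values
```

The $N \times N$ `scores` matrix in this sketch is exactly the object that makes standard attention quadratic in the sequence length.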

Video Diffusion Transformers (DiTs)

Diffusion Transformers (DiTs) adapt the transformer architecture for generative tasks like video generation. Instead of processing individual pixels or raw frames, videos are first broken down into a sequence of spatiotemporal patches. Each patch then becomes a token in the transformer's input sequence. This allows the transformer to model both spatial relationships within frames and temporal relationships across frames. DiTs have shown state-of-the-art performance in video generation due to their ability to capture complex dependencies across long video sequences.

Quadratic Attention Complexity

The standard self-attention mechanism scales quadratically with the sequence length $N$. This means that if the sequence length doubles, the computational cost increases fourfold.

  • Time Complexity: $\mathcal{O}(N^2 d)$, where $N$ is the number of tokens and $d$ is the hidden dimension. This comes from computing the $N \times N$ attention matrix ($QK^\top$).
  • Memory Complexity: $\mathcal{O}(N^2)$, specifically for storing the attention matrix. For video generation, $N$ is temporal length $\times$ spatial patch count. As video duration and resolution increase, $N$ grows very rapidly, making the quadratic complexity a severe bottleneck for both training and inference, leading to high computational costs, long processing times, and out-of-memory errors on typical hardware (a rough cost sketch follows this list).
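To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. The latent shape matches the 21×30×52 latent size quoted later for a roughly 5-second clip; the hidden dimension and the FLOP-counting convention are illustrative assumptions.

```python
# Rough illustration of how self-attention cost scales with token count.
T, H, W = 21, 30, 52            # temporal slices and spatial patch grid (illustrative)
D = 1536                        # hidden dimension (illustrative)
N = T * H * W                   # total number of tokens

score_matrix_entries = N * N    # entries of the N x N attention matrix
attn_flops = 4 * N * N * D      # ~QK^T and softmax(.)V matmuls (about 2*N^2*D each)

print(f"tokens: {N:,}")                                   # 32,760
print(f"score-matrix entries: {score_matrix_entries:,}")  # roughly 1.07e9
print(f"approx. attention FLOPs per layer: {attn_flops:.2e}")
```

Doubling the number of frames roughly doubles $N$ and therefore quadruples both quantities, which is the bottleneck ReHyAt targets.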

Linear Attention

Linear attention aims to reduce the quadratic complexity of self-attention to linear complexity, $\mathcal{O}(N)$. It achieves this by reformulating the softmax operation using a kernel trick. Instead of computing explicit pairwise attention scores, it applies a non-negative feature map (kernel function) $\phi(\cdot)$ to the Query and Key vectors. This allows the aggregation of Key and Value information to be computed before multiplication with the Query, effectively decoupling the $N \times N$ interaction into $N \times d$ operations, thus achieving linear complexity. This also enables constant memory usage when reformulated as an RNN.

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data. Unlike standard feedforward networks, RNNs have connections that feed activations from previous time steps back into the current time step, allowing them to maintain an internal "state" or "memory" that captures information about past inputs. This recurrent nature makes them suitable for tasks where the order of information matters. A key property of RNNs is that they can process sequences of arbitrary length while maintaining constant memory usage per time step during inference, as they only need to store the current state and the parameters, not the entire sequence history.

3.2. Previous Works

Standard Self-Attention

As described above, standard self-attention, introduced by Vaswani et al. [34], forms the basis of transformer architectures. For an input sequence of tokens $x \in \mathbb{R}^{N \times D}$, where $N$ is the sequence length and $D$ is the hidden dimension, queries $q$, keys $k$, and values $v$ are derived as linear projections: $ q = x w_q, \quad k = x w_k, \quad v = x w_v $ with learnable weights $w_q, w_k, w_v \in \mathbb{R}^{D \times D}$. The output of the standard attention mechanism is: $ A_l(x) = y = \operatorname{softmax}\left(\frac{qk^\top}{\sqrt{D}}\right)v $ The exponential kernel $e^{q_i k_j^\top}$ within the softmax function gives it high expressiveness.

Linear Attention (Katharopoulos et al.)

Katharopoulos et al. [20] proposed linear attention as a way to reduce the quadratic complexity. They apply the kernel trick to replace the softmax function with a non-negative feature map $\phi(\cdot)$. The original work proposed $\phi(x) = 1 + \mathrm{elu}(x)$, where elu is the exponential linear unit activation function. This allows the attention computation to be rewritten in a way that avoids explicit $N \times N$ matrix operations, achieving linear complexity. The linear attention output $y_i$ for token $i$ is given by: $ y_i = \frac{\phi(q_i) \sum_{j=1}^N \phi(k_j)v_j^\top}{\phi(q_i) \sum_{j=1}^N \phi(k_j)} $ The terms $\sum_{j=1}^N \phi(k_j)v_j^\top$ and $\sum_{j=1}^N \phi(k_j)$ can be precomputed and cached, enabling linear time complexity. However, the paper notes that the simple $\phi(x) = 1 + \mathrm{elu}(x)$ mapping introduces a significant gap in expressiveness compared to the exponential kernel of softmax, often leading to degraded performance or requiring extensive retraining.

SANA-Video

SANA-Video [4] is a concurrent work that also explores efficient attention for video diffusion models by incorporating linear attention. While it achieves efficiency, it is a purely linear attention approach, and the authors note that it requires substantial retraining from scratch, incurring a very high training cost (e.g., 12 days on 64 H100 GPUs), which is two orders of magnitude more expensive than ReHyAt.

Attention Surgery

Attention Surgery [11] proposed a temporally uniform hybrid attention method for video diffusion. It combines linear and softmax attention to improve quality over purely linear methods. However, a key limitation highlighted by ReHyAt is that Attention Surgery retains quadratic complexity and cannot be reformulated as an RNN, meaning it doesn't solve the memory and compute scalability issues for arbitrarily long videos.

Wan2.1

Wan2.1 1.3B [35] is a state-of-the-art video generative model that ReHyAt uses as its teacher model for distillation. It is a bidirectional full softmax attention model, representing the high-quality but computationally expensive systems that ReHyAt aims to make efficient.

Other Efficient Attention Mechanisms

The paper mentions various works aiming to reduce quadratic complexity:

  • EfficientViT [2], PADRe [23], Performer [5], Linformer [38] for vision and general transformers.
  • SANA [25], LinGen [36], Grafting [3] for image generation.
  • Language modeling focused [29, 37, 43, 46, 48].
  • Linear recurrent models (SSM, RWKV) [9, 10, 37, 45, 53]: These are alternatives to self-attention for long sequences. However, they have architectural differences that make distilling DiT weights into them costly. ReHyAt preserves the original block structure.

Other Video Diffusion Models

Large-scale systems like CogVideoX [44], Open-Sora Plan [26], PyramidalFlow [16], LTX-video [12] have advanced video generation quality but at high compute costs. Lightweight designs for mobile/PC (Mobile Video Diffusion [42], MoViE [18], SnapGen-V [41], AMD-HummingBird [15], On-device Sora [21], MobileVDiT [40], NeoDragon [19]) often lack DiT-based architecture or still rely on quadratic attention.

Video Diffusion Models with Efficient Attention

Prior work has explored efficiency through:

  • Token merging [1, 8, 17]
  • Token downsampling [6, 31]
  • Attention tiling [8, 50]
  • Sparsity [24, 49]

The paper differentiates ReHyAt from tiling and sparsity-based methods by noting that ReHyAt attends to the full token set, combining linear attention for long-range and softmax attention for local interactions. M4V [13] accelerates video DiTs by distilling into Mamba blocks, but ReHyAt claims to outperform M4V in quality and efficiency with a simpler block structure and lightweight training.

3.3. Technological Evolution

The field of generative video has evolved from U-Net based diffusion models, which struggled with long temporal structures and scalability to higher resolutions/durations, to Diffusion Transformers (DiTs). DiTs became the prevailing trend, enabling state-of-the-art results but introducing the quadratic complexity problem. This led to a pursuit of efficient attention mechanisms, either through linear attention or hybrid approaches. Initially, these efficient methods often required training from scratch, which is prohibitively expensive for large models. ReHyAt represents a further evolution by enabling the distillation of high-quality, pre-trained softmax-based DiTs into an efficient, recurrent hybrid attention form, thus leveraging existing SOTA models while solving their scalability issues. This allows for practical long-duration and on-device video generation.

3.4. Differentiation Analysis

The core differences and innovations of ReHyAt compared to main methods in related work are:

  • Hybrid Design for Expressiveness: Unlike purely linear attention methods (e.g., SANA-Video), ReHyAt employs a hybrid approach. It retains softmax attention for crucial local dependencies (within and across adjacent frames) while using linear attention for global, long-range dependencies. This combines the expressiveness of softmax with the efficiency of linear attention, mitigating the quality degradation often seen in purely linear models.

  • Recurrent Reformulation for Scalability: Crucially, ReHyAt's hybrid design is specifically engineered to be reformulated into a chunk-wise RNN. This property, unlike Attention Surgery which remains quadratic, allows constant memory usage and linear time complexity irrespective of video duration, making arbitrarily long video generation feasible and enabling on-device execution.

  • Efficient Distillation Pipeline: ReHyAt introduces a two-stage training pipeline (attention distillation from a softmax teacher followed by lightweight fine-tuning). This is a significant departure from methods that require training from scratch (like SANA-Video), which can be two orders of magnitude more expensive. This distillation approach allows ReHyAt to leverage the high quality of existing, compute-heavy bidirectional full softmax attention models (e.g., Wan2.1) with a minimal training budget (~160 GPU hours).

  • Temporally Non-Uniform Hybrid Arrangement: Unlike Attention Surgery's temporally uniform hybrid attention, ReHyAt uses a temporally non-uniform arrangement. This allows for differential treatment of token dependencies, providing a better inductive bias for video generation by prioritizing high-fidelity softmax interactions for temporally proximate tokens and efficient linear attention for distant ones.

    In essence, ReHyAt is a recurrent, hybrid, and distillable attention mechanism that efficiently scales video diffusion transformers to long durations and edge devices by combining the strengths of different attention types and leveraging existing high-quality models.

4. Methodology

4.1. Principles

The core idea behind ReHyAt is to circumvent the quadratic complexity of standard self-attention in video diffusion transformers while preserving high generation quality. This is achieved by combining the expressiveness of softmax attention for critical local dependencies with the efficiency of linear attention for global, long-range interactions. A crucial innovation is to structure this hybrid mechanism in a temporally chunked manner that allows for a recurrent reformulation, enabling constant memory usage and linear time complexity for arbitrarily long video sequences. Furthermore, ReHyAt leverages an efficient two-stage training pipeline (distillation and fine-tuning) to adapt existing high-quality, but costly, softmax-based models rather than training from scratch, significantly reducing computational overhead.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Preliminaries: Linear Attention

Let's begin by defining the standard transformer block and attention mechanism, and then understand how linear attention modifies it.

Transformer Block: A transformer block in the $l$-th layer, denoted as $T_l(x)$, processes a sequence of $N$ tokens, where each token is a $D$-dimensional feature vector, $x \in \mathbb{R}^{N \times D}$. The block is typically formulated as: $ T_l(x) = f_l\left(A_l(x) + x\right) $ Here, $f_l(\cdot)$ represents a token-wise transformation, often implemented as a feed-forward network, and $A_l(\cdot)$ is the self-attention operator responsible for interactions among tokens.

Standard Self-Attention: The standard self-attention mechanism computes its output $y$ as: $ A_l(x) = y = \operatorname{softmax}\left(\frac{qk^\top}{\sqrt{D}}\right)v $ where $q$, $k$, and $v$ are the Queries, Keys, and Values, respectively. These are derived from the input $x$ through linear projections: $ q = xw_q, \quad k = xw_k, \quad v = xw_v $ with learnable weight matrices $w_q, w_k, w_v \in \mathbb{R}^{D \times D}$. The term $\sqrt{D}$ is a scaling factor that prevents large dot products from saturating the softmax function.

To understand the core issue, consider the output $y_i$ for an individual token $i$. The softmax attention for token $i$ can be expressed as a weighted sum of the value vectors $v_j$ from all tokens $j$: $ y_i = \frac{\sum_{j=1}^N \mathrm{sim}(q_i, k_j) v_j}{\sum_{j=1}^N \mathrm{sim}(q_i, k_j)} $ In standard softmax attention, the similarity function $\mathrm{sim}(q_i, k_j)$ is an exponential kernel: $\mathrm{sim}(q_i, k_j) = e^{q_i k_j^\top}$. The computation of $q k^\top$ (an $N \times N$ matrix) and its subsequent multiplication with $v$ leads to the quadratic time and memory complexity.

Linear Attention Transformation: Linear attention applies the kernel trick to generalize the similarity function. Instead of the exponential kernel, it uses a non-negative feature map $\phi(\cdot)$ such that $\mathrm{sim}(q_i, k_j) = \phi(q_i) \phi(k_j)^\top$. This allows for a rearrangement of the summation: $ y_i = \frac{\phi(q_i) \sum_{j=1}^N \phi(k_j) v_j^\top}{\phi(q_i) \sum_{j=1}^N \phi(k_j)} $ Here, $\phi(q_i)$ and $\phi(k_j)$ are feature maps applied to the queries and keys, respectively. Let's break down the terms in this equation:

  • $\phi(q_i)$: The feature map applied to the query vector of token $i$.

  • $\sum_{j=1}^N \phi(k_j)v_j^\top$: This is a sum over all key-value pairs. Crucially, this sum does not depend on $i$. It can be precomputed once for all tokens in the sequence. Let's call this sum $S_v = \sum_{j=1}^N \phi(k_j)v_j^\top$.

  • $\sum_{j=1}^N \phi(k_j)$: This is a sum over all mapped key vectors. It also does not depend on $i$ and can be precomputed. Let's call this sum $S_k = \sum_{j=1}^N \phi(k_j)$.

    With these precomputations, the linear attention output for token $i$ becomes $y_i = \frac{\phi(q_i) S_v}{\phi(q_i) S_k}$. This reformulation changes the complexity from quadratic to linear because the expensive pairwise $N \times N$ interaction is replaced by aggregations that scale linearly with $N$. However, the choice of $\phi(\cdot)$ is critical; a simple mapping like $1 + \mathrm{elu}(x)$ often leads to reduced expressiveness compared to softmax. A minimal code sketch of this computation follows below.
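To make the reformulation concrete, here is a minimal single-head PyTorch sketch of kernelized linear attention with the elu+1 feature map from Katharopoulos et al.; the flat (N, D) layout and the small epsilon added for numerical safety are illustrative choices rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, phi=lambda x: F.elu(x) + 1):
    """Kernelized linear attention (sketch). q, k, v: (N, D) tensors."""
    q_p, k_p = phi(q), phi(k)                    # feature-mapped queries/keys, (N, D')
    S_v = k_p.transpose(0, 1) @ v                # (D', D): sum_j phi(k_j) v_j^T
    S_k = k_p.sum(dim=0)                         # (D',):   sum_j phi(k_j)
    num = q_p @ S_v                              # (N, D) numerator
    den = (q_p @ S_k).unsqueeze(-1) + 1e-6       # (N, 1) normalizer (eps for stability)
    return num / den
```

Because `S_v` and `S_k` are computed once and reused for every query, the cost grows linearly with $N$ instead of quadratically.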

4.2.2. Hybrid Attention Formulation

ReHyAt's core innovation is its hybrid attention mechanism, which combines softmax-based and kernelized linear attention formulations. It addresses the expressiveness gap of purely linear attention while maintaining efficiency.

For an input latent $x \in \mathbb{R}^{N \times D}$, the $N$ tokens are flattened from a spatiotemporal tensor of shape (T, H, W, D), where $N = T \times H \times W$. The method introduces the concept of temporal chunks. A chunk consists of $T_c$ temporal slices from the latent, represented as $X_t \in \mathbb{R}^{N' \times \bar{D}}$, where $N' = T_c \times H \times W$ is the number of tokens in a chunk. The total number of chunks is $T' = N / N'$.
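As a concrete arithmetic illustration (assuming the first dimension of the $21 \times 30 \times 52$ latent quoted later in Figure 4 is temporal): $N = 21 \cdot 30 \cdot 52 = 32{,}760$ tokens in total; with chunk size $T_c = 3$, each chunk contains $N' = 3 \cdot 30 \cdot 52 = 4{,}680$ tokens, giving $T' = N / N' = 7$ chunks.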

The hybrid attention output $\hat{y}_t \in \mathbb{R}^{N' \times D}$ for tokens in chunk $t$ is a weighted sum of softmax and linear attention contributions, normalized jointly: $ \hat{y}_t = \frac{a_t^\mathrm{S} + a_t^\mathrm{L}}{n_t^\mathrm{S} + n_t^\mathrm{L}} $ where:

  • $a_t^\mathrm{S}$ is the softmax attention component for chunk $t$.

  • $a_t^\mathrm{L}$ is the linear attention component for chunk $t$.

  • $n_t^\mathrm{S}$ is the normalizer for softmax attention.

  • $n_t^\mathrm{L}$ is the normalizer for linear attention.

    Let $Q_t \in \mathbb{R}^{N' \times D}$ be the queries for chunk $t$. The components are formulated as follows: $ a_t^\mathrm{S} = \sum_{j \in \mathcal{T}_t^\mathrm{S}} \exp(Q_t k_j^\intercal / \sqrt{D} - c_t)\, v_j $ $ a_t^\mathrm{L} = \phi_q(Q_t) \left( \sum_{j \in \mathcal{T}_t^\mathrm{L}} \phi_k(k_j)\, v_j^\intercal \right) $ $ n_t^\mathrm{S} = \sum_{j \in \mathcal{T}_t^\mathrm{S}} \exp(Q_t k_j^\intercal / \sqrt{D} - c_t) $ $ n_t^\mathrm{L} = \phi_q(Q_t) \left( \sum_{j \in \mathcal{T}_t^\mathrm{L}} \phi_k(k_j) \right) $ Here:

  • $c_t$: A stabilizing constant, typically the maximum exponent in the softmax computation, used for numerical stability.

  • $\phi_q(\cdot)$ and $\phi_k(\cdot)$: Kernel feature maps for queries and keys, respectively, used in the linear component. They map $D$-dimensional vectors to $D'$-dimensional vectors.

  • $\mathcal{T}$ represents the set of all $N$ tokens: $\mathcal{T} = \{1, 2, \ldots, N\}$.

  • $\mathcal{T}_t^\mathrm{S}$: The set of tokens attended to by softmax attention for chunk $t$.

  • $\mathcal{T}_t^\mathrm{L}$: The set of tokens attended to by linear attention for chunk $t$.

Partitioning of Tokens (Without Overlap): Initially, the tokens are partitioned such that: $ \mathcal{T}_t^\mathrm{S} = \{ j \ | \ t N' \leq j < (t+1)N' \} $ $ \mathcal{T}_t^\mathrm{L} = \mathcal{T} - \mathcal{T}_t^\mathrm{S} $ This means that tokens within the current chunk $t$ attend to each other using softmax attention, while they attend to all other tokens (outside the current chunk) using linear attention. This is illustrated in the top part of Figure 2.

Overlapping Chunks for Temporal Coherence: To mitigate potential episodic incoherence (e.g., motion or appearance inconsistencies) between frames at chunk transitions, overlapping chunks are introduced for softmax attention. This allows more accurate, softmax-attention-based message passing across chunk boundaries. For generating the attention output for a chunk $t$, the keys and values for softmax attention are extended by $T_o$ (the overlap size) temporal slices. The reformulated token sets are: $ \mathcal{T}_t^\mathrm{S} = \{ j \ | \ \max(tN' - T_o HW, 0) \leq j < (t+1)N' \} $ $ \mathcal{T}_t^\mathrm{L} = \mathcal{T} - \mathcal{T}_t^\mathrm{S} $ Here, $\max(tN' - T_o HW, 0)$ ensures that the starting index for softmax attention does not go below the first token index (0). This effectively expands the window of softmax attention to include $T_o$ frames from the previous chunk, as depicted in the bottom part of Figure 2. A code sketch of this chunk-level computation is given after the figure.

The following figure (Figure 2 from the original paper) illustrates the chunking and overlapping strategies:

Figure 2. Overview of the temporally chunked hybrid attention arrangement without (top) and with chunk overlap (bottom). Blue: softmax attention; gray dashed: causal linear attention; red dashed: non-causal linear attention.
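The following single-head PyTorch sketch assembles the hybrid output for one chunk from the pieces above, with the softmax window and linear token sets passed in as index tensors; the function name, flat tensor layout, and the epsilon added to the joint normalizer are illustrative assumptions, not the paper's implementation.

```python
import torch

def hybrid_chunk_attention(Q_t, k, v, softmax_idx, linear_idx, phi_q, phi_k):
    """Hybrid (softmax + linear) attention output for one temporal chunk (sketch).

    Q_t: (N', D) queries of the current chunk; k, v: (N, D) full-sequence keys/values.
    softmax_idx / linear_idx: index tensors selecting T_t^S and T_t^L.
    phi_q, phi_k: learnable kernel feature maps returning (., D') features.
    """
    D = Q_t.shape[-1]
    # local softmax attention over the (possibly overlapping) window T_t^S
    k_s, v_s = k[softmax_idx], v[softmax_idx]
    logits = Q_t @ k_s.transpose(0, 1) / D ** 0.5
    c_t = logits.max(dim=-1, keepdim=True).values          # stabilizing constant
    w = torch.exp(logits - c_t)
    a_s, n_s = w @ v_s, w.sum(dim=-1, keepdim=True)
    # global linear attention over the remaining tokens T_t^L
    k_l, v_l = k[linear_idx], v[linear_idx]
    q_p = phi_q(Q_t)
    a_l = q_p @ (phi_k(k_l).transpose(0, 1) @ v_l)
    n_l = (q_p @ phi_k(k_l).sum(dim=0)).unsqueeze(-1)
    # joint normalization of both contributions
    return (a_s + a_l) / (n_s + n_l + 1e-6)
```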

Characterization of $\phi$ (Kernel Feature Maps): To enhance the expressiveness of linear attention (which simple ELU-based mappings lack), ReHyAt employs distinct learnable feature maps $\phi_q, \phi_k: \mathbb{R}^D \to \mathbb{R}^{D'}$. These maps aim to approximate the large dynamic range of the exponential kernel ($e^{q_i k_j^\top}$) more accurately. Each $\phi(\cdot)$ is defined by:

  1. A lightweight per-head embedding network: Implemented as grouped $1 \times 1$ convolutions with non-linear activations. This produces an intermediate representation.
  2. Polynomial Expansion: This intermediate representation is split into $P$ equal parts. Each part is then raised to a different polynomial degree (from 1 to $P$) and concatenated along the feature dimension. Formally, for an input $x \in \mathbb{R}^D$: $ \phi(x) = [(\psi_1(x))^1, (\psi_2(x))^2, \ldots, (\psi_P(x))^P]^\top \in \mathbb{R}^{D'} $ where $\psi_i(\cdot)$ is the $i$-th learnable embedding slice produced by the shared embedding network. This polynomial expansion allows $\phi_q(q_i)\phi_k(k_j)^\top$ to better approximate the complex interactions of the exponential kernel. A minimal module sketch follows this list.
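A minimal PyTorch sketch of such a feature map is shown below; the use of plain linear layers (instead of grouped 1×1 convolutions), the GELU activation, and the layer sizes are illustrative substitutions, and any non-negativity constraint on the output is omitted here.

```python
import torch
import torch.nn as nn

class PolyFeatureMap(nn.Module):
    """Learnable kernel feature map phi (sketch): a small embedding network followed
    by a degree-wise polynomial expansion, loosely following the 2-layer MLP with
    degree-2 polynomial features described in the paper."""

    def __init__(self, dim, hidden, degree=2):
        super().__init__()
        assert hidden % degree == 0
        self.degree = degree
        self.embed = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                   nn.Linear(hidden, hidden))

    def forward(self, x):                        # x: (..., dim)
        h = self.embed(x)                        # shared embedding, (..., hidden)
        parts = h.chunk(self.degree, dim=-1)     # split into P equal slices psi_i(x)
        feats = [p ** (i + 1) for i, p in enumerate(parts)]  # raise slice i to degree i+1
        return torch.cat(feats, dim=-1)          # concatenate -> (..., D')
```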

4.2.3. Recurrent HyAt

A key advantage of linear attention, particularly when made causal, is its ability to be reformulated as an RNN. ReHyAt extends this to its hybrid attention mechanism.

Causal Reformulation: For the hybrid attention to be reformulated as an RNN, the linear attention component must be causal. This means that tokens can only attend to previous tokens in the sequence, not future ones. The original definition of $\mathcal{T}_t^\mathrm{L}$ (which included all tokens outside the current softmax window, both past and future) is modified. The new causal partitioning for tokens is: $ \mathcal{T}_t^\mathrm{L} = \{ j \ | \ j < \max(tN' - T_o HW, 0) \} $ $ \mathcal{T}_t^\mathrm{S} = \{ j \ | \ \max(tN' - T_o HW, 0) \le j < (t+1)N' \} $ In this causal setup, $\mathcal{T}_t^\mathrm{L}$ now only includes tokens strictly before the current softmax attention window (which itself includes overlaps with the preceding chunk). The softmax attention within the chunk ($\mathcal{T}_t^\mathrm{S}$) does not need to be causal because the model generates a full chunk's latents at once. The non-causal linear attention (attending to future tokens) is explicitly removed to enable the RNN formulation.

Chunk-wise RNN Formulation: Thanks to the temporal decoupling of $\mathcal{T}_t^\mathrm{S}$ and $\mathcal{T}_t^\mathrm{L}$ and the causality, ReHyAt can be defined as a chunk-wise RNN. This means the model generates latents for $T_c$ temporal slices (a chunk) at a time, updating its internal state.

Let $s_t \in \mathbb{R}^{D' \times D}$ and $z_t \in \mathbb{R}^{D' \times 1}$ represent the state variables for the linear attention component and its normalizer, respectively, for the $t$-th chunk. These state variables accumulate information from previous chunks. The RNN formulation is as follows. Initial state: $ s_0 = 0, \quad z_0 = 0 $ Output for chunk $t$: $ y_t = \frac{a_t^\mathrm{S} + \phi_q(Q_t) s_t}{n_t^\mathrm{S} + \phi_q(Q_t) z_t} $ State update for the next chunk $t+1$: $ s_{t+1} = s_t + \sum_{j \in \mathcal{T}_t^\mathrm{S}} \phi_k(k_j) v_j^\top $ $ z_{t+1} = z_t + \sum_{j \in \mathcal{T}_t^\mathrm{S}} \phi_k(k_j) $ Let's explain the terms:

  • $s_t, z_t$: The accumulated state from all previous chunks. $s_t$ accumulates the sum of $\phi_k(k_j)v_j^\top$ from prior tokens, and $z_t$ accumulates the sum of $\phi_k(k_j)$ from prior tokens.
  • $\phi_q(Q_t) s_t$: The contribution of past linear attention states to the current chunk's values.
  • $\phi_q(Q_t) z_t$: The contribution of past linear attention states to the current chunk's normalizer.
  • $a_t^\mathrm{S}, n_t^\mathrm{S}$: The softmax attention and its normalizer, calculated only over the current softmax attention window $\mathcal{T}_t^\mathrm{S}$ (which includes overlaps with the preceding chunk).
  • The state update equations for $s_{t+1}$ and $z_{t+1}$ show that the current chunk's contribution (from its softmax window) is added to the accumulated state to form the state for the next chunk. This chunk-wise accumulation is what maintains the "memory" of the RNN. A minimal recurrent inference sketch follows this list.
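The following PyTorch sketch runs this recurrence over a sequence of chunks; for simplicity it assumes non-overlapping chunks ($T_o{=}0$), a single head, and precomputed per-chunk query/key/value tensors, all of which are illustrative assumptions rather than the paper's implementation.

```python
import torch

def rehyat_chunkwise(q_chunks, k_chunks, v_chunks, phi_q, phi_k, d_prime):
    """Chunk-wise recurrent hybrid attention (sketch, T_o = 0).

    Each list element covers one chunk's tokens; the linear state (s, z)
    summarizes all tokens from earlier chunks in constant memory.
    """
    D = q_chunks[0].shape[-1]
    s = torch.zeros(d_prime, D)        # s_t: accumulated phi_k(k) v^T
    z = torch.zeros(d_prime)           # z_t: accumulated phi_k(k)
    outputs = []
    for Q_t, K_t, V_t in zip(q_chunks, k_chunks, v_chunks):
        # local softmax attention inside the current chunk
        logits = Q_t @ K_t.transpose(0, 1) / D ** 0.5
        w = torch.exp(logits - logits.max(dim=-1, keepdim=True).values)
        a_s, n_s = w @ V_t, w.sum(dim=-1, keepdim=True)
        # global contribution from the recurrent linear-attention state
        q_p = phi_q(Q_t)
        y_t = (a_s + q_p @ s) / (n_s + (q_p @ z).unsqueeze(-1) + 1e-6)
        outputs.append(y_t)
        # fold the current chunk into the state for the next chunk
        k_p = phi_k(K_t)
        s = s + k_p.transpose(0, 1) @ V_t
        z = z + k_p.sum(dim=0)
    return outputs
```

Only `s`, `z`, and the current chunk's tensors need to be resident at any time, which is what gives the constant peak memory discussed below.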

Key Properties of Recurrent HyAt:

  1. Non-causal Softmax within Chunk: The softmax attention within each chunk ($\mathcal{T}_t^\mathrm{S}$) does not need to be causal because the entire chunk's latents are generated simultaneously.
  2. Training Flexibility: The model can be trained in its causal non-recurrent form (i.e., processing the full sequence but only attending causally with linear attention) and then rearranged into the RNN form during sampling (inference) without requiring a separate RNN training phase.
  3. Scalability: This RNN formulation ensures that the computational complexity remains $\mathcal{O}(N)$ with respect to the total length of the generated video, and critically, the memory complexity remains constant irrespective of the video duration, unlocking true scalability for long videos.

4.2.4. Two-stage Training

Given the immense computational cost of training state-of-the-art video diffusion models from scratch, ReHyAt proposes an efficient two-stage training process centered on distillation from existing high-quality models. This strategy significantly reduces the required compute.

4.2.4.1. Attention Distillation

The first stage focuses on distilling a pre-trained bidirectional full softmax teacher model (e.g., Wan2.1) into a causal hybrid attention student model.

  • Independent Block Training: Each transformer block of the student model is trained independently.
  • Learnable Parameters: During this stage, all parameters of the student model are frozen except for the $\phi_q$ and $\phi_k$ feature maps within each block. The goal is to learn these $\phi$ parameters such that the linear attention component effectively approximates the dependencies captured by the softmax attention of the teacher.
  • Data-Agnostic Distillation: This distillation setup does not require any prompt/video pairs for training. Instead, the student model is trained to match the activations (outputs of the attention mechanism) of the teacher model.
  • Distillation Objective: The objective formalizes this by minimizing the difference between the teacher's attention output and the student's hybrid attention output: $ \phi_l = \phi_l - \eta \nabla_{\phi_l} \left( \mathbb{E}_{\substack{\epsilon \in \mathcal{N} \\ p \in \mathcal{P}}} \big| y^{(l, \epsilon, p, i)} - \hat{y}^{(l, \epsilon, p, i)} \big| \right) $ Where:
  • $\phi_l$: The learnable parameters $(\phi_q, \phi_k)$ for the $l$-th block of the student model.
  • $\eta$: The learning rate.
  • $\nabla_{\phi_l}$: The gradient with respect to $\phi_l$.
  • $\mathbb{E}_{\substack{\epsilon \in \mathcal{N} \\ p \in \mathcal{P}}}$: Expectation taken over noise samples $\epsilon$ from a distribution $\mathcal{N}$ and textual prompts $p$ from a distribution $\mathcal{P}$.
  • $y^{(l, \epsilon, p, i)}$: The output of the bidirectional softmax teacher model for block $l$, given prompt $p$, noise $\epsilon$, and denoising step $i$.
  • $\hat{y}^{(l, \epsilon, p, i)}$: The output of the ReHyAt student model for block $l$ under the same conditions. This objective ensures that the student model learns to mimic the teacher's intricate attention patterns, especially the long-range dependencies, through its hybrid mechanism. A minimal training-loop sketch follows this list.
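A minimal training-loop sketch of this per-block distillation is shown below; `block_inputs` is a hypothetical stand-in for the pipeline that collects the teacher block's inputs over random prompts, noise samples, and denoising steps, and the loop structure is an illustrative assumption rather than the paper's code.

```python
import torch

def distill_block_phi(student_block, teacher_block, phi_params, block_inputs,
                      num_steps=1000, lr=1e-3):
    """Per-block attention distillation (sketch): only phi_q / phi_k are optimized.

    block_inputs: iterable yielding hidden states fed to this block (hypothetical
    stand-in for the activation-collection pipeline); all other weights stay frozen.
    """
    opt = torch.optim.AdamW(phi_params, lr=lr)
    for _, h in zip(range(num_steps), block_inputs):
        with torch.no_grad():
            y_teacher = teacher_block(h)              # bidirectional softmax output
        y_student = student_block(h)                  # causal hybrid (ReHyAt) output
        loss = (y_teacher - y_student).abs().mean()   # L1 match on block activations
        opt.zero_grad()
        loss.backward()
        opt.step()
    return phi_params
```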

4.2.4.2. Lightweight Fine-tuning

After the distillation stage, the $\phi_q$ and $\phi_k$ parameters for each block are learned, allowing the student model to perform recurrent hybrid attention. However, because blocks were trained in isolation during distillation, the overall generation quality, and particularly the temporal smoothness between chunks, might not be perfect.

  • Full Model Fine-tuning: In this stage, the entire DiT model (including the newly learned $\phi$ parameters and all other weights) is fine-tuned.

  • Modest Data and Iterations: This fine-tuning is performed on a modest set of prompt/video pairs for a small number of iterations (e.g., 1,000 iterations).

  • Objective: The fine-tuning optimizes the standard flow-matching objective [27] used to train diffusion models. This global fine-tuning helps restore lost generation quality, especially ensuring smooth transitions and overall coherence that might have been compromised during the isolated block distillation. (A minimal sketch of this objective appears at the end of this subsection.)

    This two-stage approach allows ReHyAt to achieve competitive quality within a remarkably low computational budget (less than 200 GPU-hours), making the transformation of costly SOTA models into efficient RNNs practical.
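To make the fine-tuning objective concrete, here is a minimal rectified-flow-style flow-matching loss in PyTorch; the linear interpolation path, the velocity parameterization, and the model's call signature are illustrative assumptions about how the base model is trained, not details taken from the paper.

```python
import torch

def flow_matching_loss(model, x0, cond):
    """Flow-matching objective (sketch): regress the velocity along a noise-to-data path.

    x0: (B, ...) clean latents; cond: conditioning (e.g., text embeddings).
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)  # per-sample time
    x_t = (1 - t) * x0 + t * noise          # point on the straight path between data and noise
    target = noise - x0                     # constant velocity of that path
    pred = model(x_t, t.flatten(), cond)    # model predicts the velocity field
    return ((pred - target) ** 2).mean()
```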

5. Experimental Setup

5.1. Datasets

The paper describes using two different datasets for fine-tuning, depending on the resolution:

  • Low-resolution models: A 350K subset of a video dataset from Open-Sora Plan [26]. The characteristics and domain are not explicitly detailed beyond being a video dataset.
  • High-resolution models: 22K synthetic video samples generated by Wan2.1 14B (a larger version of the teacher model), with prompts drawn from the same source used for the low-resolution dataset. This indicates a reliance on synthetic data for higher resolution fine-tuning.

5.2. Evaluation Metrics

The evaluation of ReHyAt's generation quality and computational efficiency relies on a combination of quantitative benchmarks, human preference studies, and direct system performance measurements.

Generation Quality Metrics (VBench & VBench-2.0): The paper evaluates video generation quality using two prominent benchmarks: VBench [14] and VBench-2.0 [51]. These benchmarks are designed to comprehensively assess various aspects of video generative models. While the paper doesn't provide the exact formulas for each sub-metric within VBench, they typically involve automated evaluation tools that measure:

  • Total Score: An overall aggregate score reflecting general video quality.
  • Quality Score: Assesses visual fidelity, sharpness, realism, and aesthetic appeal.
  • Semantic Score: Measures how well the generated video aligns with the semantic content of the input prompt (e.g., correct objects, actions, scene).
  • Human Fidelity Score (VBench-2.0): Likely measures how human-like or realistic the generated elements (especially human subjects) appear.
  • Creativity Score (VBench-2.0): Might assess the diversity and originality of the generated content, given a prompt.
  • Controllability Score (VBench-2.0): Measures how well the model follows specific instructions or constraints in the prompt.
  • Commonsense Score (VBench-2.0): Evaluates if the generated video adheres to common-sense physical laws and contextual plausibility.
  • Physics Score (VBench-2.0): Specifically evaluates the adherence to physical laws (e.g., gravity, collisions).
  • Subject Consistency, Temporal Flickering, Temporal Style, Camera Motion (VBench): These metrics evaluate temporal coherence, stability of objects/subjects, and consistency of visual style and camera movement over time.

Human Preference Study: This is a qualitative evaluation method where human participants judge the quality of generated videos.

  • Methodology: A blinded human preference study was conducted. 50 random prompts from VBench were used. Participants were shown paired videos (one from ReHyAt, one from Wan2.1 1.3B) and asked to choose their preference or indicate no significant difference. The order of videos in pairs was randomized to avoid bias. A total of 500 paired comparisons were collected.
  • Purpose: To assess visual qualities (aesthetics, realism) and prompt alignment from a human perspective, which can capture nuances not fully captured by automated metrics.

Compute Complexity Assessment:

  • FLOPs (Floating Point Operations): This metric quantifies the total number of arithmetic operations (additions, multiplications, etc.) performed by a model. It's a standard, hardware-agnostic measure of computational cost. Higher FLOPs indicate more computation. The DeepSpeed library was used for measurement.
  • Latency (ms): This measures the time taken for a model to generate a video (or a specific component like a DiT block). It's a direct measure of inference speed. Measurements were performed on-device (Snapdragon8-Gen4 SoC) using Qualcomm AI Runtime (QNN).
  • Memory Read/Write (GB): This metric quantifies the total amount of data moved between memory and the processor. It correlates directly with power consumption and can also impact latency. Higher memory traffic typically means more power usage and slower performance. Measurements were also performed on-device. The paper emphasizes that for ReHyAt, peak memory usage remains constant even if total read/write grows linearly with video duration, which is crucial for on-device deployment.

5.3. Baselines

The ReHyAt model is primarily compared against its teacher model and other state-of-the-art and efficient video generation models:

  • Wan2.1 1.3B: This is the main bidirectional full softmax attention teacher model from which ReHyAt is distilled. It represents a state-of-the-art but computationally expensive baseline. The paper also uses Wan2.1 1.3B* to denote its own reproduction of results.
  • CogVideoX (2B, 5B, 1.5 5B) [44]: Large-scale transformer-based video diffusion models.
  • Open-Sora Plan V1.3 [26], Open-Sora V1.2 [52]: Other prominent open-source video generation models.
  • LTX-Video [12], PyramidalFlow [16], Neodragon [19]: Other recent large-scale or efficient video diffusion models.
  • Mobile Video DiT - Mobile/Server [40], Hummingbird [15], SnapGenV [41]: Models specifically designed for efficiency or mobile deployment.
  • Efficient VDiT [8], M4V [13], STA [50], VSA [49]: Other efficient video diffusion models that employ techniques like token merging, tiling, or sparsity.
  • SANA-Video [4]: A concurrent linear-only attention video diffusion model, highlighted for its high training cost.
  • Attention Surgery (15×R2) [11]: A temporally uniform hybrid attention method that is quadratic in complexity and non-recurrent.
  • HedgeHog Linear Attention: A specific linear attention variant with learnable $\phi$ mentioned in mobile latency comparisons.
  • Uniform Hybrid - R8: Another variant of hybrid attention, likely from Attention Surgery, used for mobile latency comparison.

5.4. Training Specification

Model Hyperparameters:

  • Number of Converted Blocks: Experiments were conducted by converting different numbers of transformer blocks in Wan2.1 1.3B to the ReHyAt mechanism: 15, 20, and 25 blocks out of a total of 30 blocks.
  • Chunk Sizes ($T_c$): The temporal chunk size for hybrid attention was varied: $T_c \in \{1, 2, 3, 5, 7\}$.
  • Overlap Sizes ($T_o$): The temporal overlap size for softmax attention was explored: $T_o \in \{0, 1, 2, 3\}$.
  • $\phi$ Characterization: The kernel feature maps $\phi_q$ and $\phi_k$ were parameterized using a lightweight 2-layer MLP with degree-2 polynomial features. This configuration adds approximately 2.4M parameters per converted block.

Pretraining (Distillation Stage) Details:

  • Parameter Optimization: Only $\phi_k$ and $\phi_q$ for each block are learnable; all other model parameters remain frozen.
  • Optimizer: AdamW optimizer.
  • Batch Size: 1.
  • Learning Rate: $10^{-3}$.
  • Objective: Value distillation objective (Equation 19 in the paper, which is the loss function for $\phi_l$).
  • Teacher Activation Generation: Teacher activations were obtained by sampling with 50 denoising steps and a guidance scale of 5, using the Euler Ancestral Discrete Scheduler to integrate the reverse diffusion process.

Finetuning Stage Details:

  • Parameter Optimization: All parameters of the hybrid DiT model are updated, including the $\phi$ transformations and the feed-forward MLP layers.
  • Optimizer: AdamW optimizer.
  • Batch Size: 16.
  • Learning Rate: $10^{-5}$.
  • Precision: bf16 mixed-precision.
  • Training Duration: 1,000 iterations.
  • Objective: Standard flow-matching objective [27] (the reported hyperparameters for both training stages are collected in the config sketch below).
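For convenience, the reported hyperparameters of both stages are gathered below as a plain config dict; the field names are our own and purely illustrative.

```python
# Hedged summary of the training hyperparameters reported above (field names are ours).
REHYAT_TRAINING_CONFIG = {
    "distillation": {
        "trainable_params": ["phi_q", "phi_k"],   # per converted block; rest frozen
        "optimizer": "AdamW",
        "batch_size": 1,
        "learning_rate": 1e-3,
        "teacher_sampling": {"denoising_steps": 50, "guidance_scale": 5,
                             "scheduler": "EulerAncestralDiscrete"},
    },
    "finetuning": {
        "trainable_params": "all",
        "optimizer": "AdamW",
        "batch_size": 16,
        "learning_rate": 1e-5,
        "precision": "bf16",
        "iterations": 1000,
        "objective": "flow matching",
    },
}
```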

Sampling Details (for VBench Evaluation):

  • Prompts: Wan Enhanced prompts (a specific set of prompts used with the Wan models).
  • Denoising Iterations: 50.
  • Classifier Guidance Scale: 6.
  • Noise Scheduler: UniPCMultistep noise scheduler.
  • Flow Shift: 8.

6. Results & Analysis

6.1. Core Results Analysis

VBench SOTA Comparison

The paper compares ReHyAt (distilled from Wan2.1 1.3B) against various state-of-the-art and efficient video diffusion models. ReHyAt demonstrates competitive performance, notably achieving higher total scores than some larger models (e.g., Open-Sora Plan V1.3, CogVideoX 5B). Its total score of 83.79 is slightly higher than the original Wan2.1 1.3B (83.31) and outperforms all other efficient models listed, including SANA-Video and Attention Surgery. This indicates that ReHyAt successfully maintains high video generation quality while enabling efficiency.

The following are the results from [Table 1] of the original paper:

| Model | Total↑ | Quality↑ | Semantic↑ |
| --- | --- | --- | --- |
| Models with 2B-5B parameters | | | |
| Open-Sora Plan V1.3 [26] | 77.23 | 80.14 | 65.62 |
| CogVideoX 5B [44] | 81.91 | 83.05 | 77.33 |
| CogVideoX1.5 5B [44] | 82.01 | 82.72 | 79.17 |
| Models up to 2B parameters | | | |
| Open-Sora V1.2 [52] | 79.76 | 81.35 | 73.39 |
| LTX-Video [12] | 80.00 | 82.30 | 70.79 |
| SnapGenV [41] | 81.14 | 83.47 | 71.84 |
| Hummingbird 16frame [15] | 81.35 | 83.73 | 71.84 |
| Mobile Video DiT - Mobile [40] | 81.45 | 83.12 | 74.76 |
| Mobile Video DiT - Server [40] | 83.09 | 84.65 | 76.86 |
| CogVideoX 2B [44] | 81.55 | 82.48 | 77.81 |
| PyramidalFlow [16] | 81.72 | 84.74 | 69.62 |
| Neodragon [19] | 81.61 | 83.68 | 73.36 |
| Wan2.1 1.3B [35] | 83.31 | 85.23 | 75.65 |
| Wan2.1 1.3B* [35] | 83.10 | 85.10 | 75.12 |
| Linear/Hybrid Models | | | |
| Efficient VDiT [8] | 76.14 | - | - |
| M4V [13] | 81.91 | 83.36 | 76.10 |
| STA [50] | 83.00 | 85.37 | 73.52 |
| VSA [49] | 82.77 | 83.60 | 79.47 |
| SANA-Video [4] | 83.71 | 84.35 | 81.35 |
| Attention Surgery (15×R2) [11] | 83.21 | 85.19 | 75.25 |
| Wan2.1 1.3B* + ReHyAt (15×Tc=3, To=1) | 83.79 | 84.57 | 80.70 |

Table 1. Comparisons with SOTA efficient video diffusion models. Wan2.1* is our best reproduction using our evaluation pipeline.

A critical aspect is the training cost. ReHyAt achieves its results with approximately 160 GPU hours, which is two orders of magnitude more efficient than SANA-Video's reported cost. This highlights ReHyAt's effectiveness in leveraging existing models through distillation.

VBench-2.0 Comparison

On the newer VBench-2.0 benchmark, ReHyAt (15×Tc=3 and 15×Tc=5 variants) remains competitive with larger models like CogVideoX-1.5 5B and the original Wan2.1 1.3B. Its total scores are essentially on par with Wan2.1 1.3B (56.1 and 56.3 vs. 56.0), with small drops in some dimensions (e.g., Controllability, Physics) and gains in others such as Human Fidelity and Creativity. The results suggest that the efficiency gains do not come at a prohibitive cost to advanced evaluation metrics.

The following are the results from [Table 2] of the original paper:

| Model (VBench-2.0) | Total↑ | Hum. Fid.↑ | Creativity | Control. | Com.sense↑ | Physics↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Wan2.1 1.3B | 56.0 | 80.7 | 48.7 | 34.0 | 63.4 | 53.8 |
| CogVideoX-1.5 5B | 53.4 | 72.1 | 43.7 | 29.6 | 63.2 | 48.2 |
| Attn. Surgery 15×R2 | 55.1 | 78.9 | 47.5 | 33.4 | 63.1 | 52.8 |
| ReHyAt 15×Tc=3 | 56.1 | 81.9 | 55.1 | 30.8 | 62.7 | 50.0 |
| ReHyAt 15×Tc=5 | 56.3 | 79.8 | 55.7 | 31.9 | 64.2 | 49.7 |

Table 2. Quantitative comparison on VBench-2.0 benchmark

Human Preference Evaluation

The human visual preference study provides a crucial qualitative assessment. Comparing ReHyAt (15×Tc=3) with Wan2.1 1.3B, the results show no significant difference in human preference (27.6% for ReHyAt, 29.0% for Wan2.1, 43.5% no preference). This strongly suggests that from a human perception standpoint, the efficiency gains of ReHyAt do not compromise the subjective quality of the generated videos. It even slightly outperforms Wan2.1 in "Temporal Style" and "Color", while Wan2.1 performs better in "Subject Consistency".

The following are the results from [Table 3] of the original paper:

| Prompt Dimension | Ours (%) | No preference (%) | Wan2.1 (%) |
| --- | --- | --- | --- |
| Color | 43.3 | 46.7 | 10.0 |
| Human Action | 21.7 | 41.7 | 36.7 |
| Object Class | 25.0 | 45.0 | 30.0 |
| Overall Consistency | 27.1 | 47.1 | 25.9 |
| Scene | 40.0 | 60.0 | 0.0 |
| Spatial Relationship | 20.0 | 70.0 | 10.0 |
| Subject Consistency | 21.7 | 28.3 | 50.0 |
| Temporal Flickering | 24.0 | 54.0 | 22.0 |
| Temporal Style | 43.3 | 30.0 | 26.7 |
| Total | 27.6 | 43.5 | 29.0 |

Table 3. Results of the method-blinded human visual preference study over 500 paired video comparisons. Rows correspond to subsets filtered by different VBench prompt dimensions.

Qualitative samples in Figure 3 visually support this, showing comparable fidelity to Wan2.1. The following figure (Figure 3 from the original paper) provides qualitative comparisons:

Figure 3. Qualitative comparison of Wan2.1 1.3B (top) to ReHyAt 15×Tc=3 (bottom) for two sample VBench prompts, "A cat …" and "A dog drinking water."

Sampling Compute Burden

FLOPs Analysis: ReHyAt significantly reduces computational cost. Figure 4 shows that ReHyAt offers up to 4× FLOPs savings compared to FlashAttention (used in Wan2.1) and is approximately 2× more efficient than Attention Surgery (R=2 variant) while maintaining quality. This demonstrates the effectiveness of the hybrid and chunk-wise design. The following figure (Figure 4 from the original paper) compares attention compute:

Figure 4. Comparison of attention compute (FLOPs) on a 21×30×52 latent size (5 seconds). The plot shows TFLOPs versus chunk size Tc for ReHyAt at different To values, compared against softmax and uniform hybrid baselines.

Scaling with Video Duration: Figure 1 (top) and Figure 6 vividly illustrate ReHyAt's superior scaling behavior. While FlashAttention's FLOPs and latency grow quadratically with video duration, ReHyAt exhibits a near-linear increase. This is the direct benefit of its recurrent formulation, enabling generation of arbitrarily long videos without prohibitive compute spikes. The following figure (Figure 1 from the original paper) shows compute complexity increase with video duration:

Figure 1. A comparison of our proposed Recurrent Hybrid Attention model with Wan2.1 bidirectional full softmax attention. Top: compute complexity increase with video duration growth (left: FLOPs, right: phone latency). Bottom: comparing our hybrid model (20× ReHyAt blocks) with the original Wan2.1 1.3B, qualitatively and quantitatively. Prompt: "A cat drinking water."

The following figure (Figure 6 from the original paper) shows compute complexity growth comparisons:

Figure 6. Compute complexity growth comparisons w.r.t. video length versus Wan2.1 flash attention and attention surgery, in FLOPs (top) and latency (bottom).

On-Mobile Latency: Table 4 presents crucial results for on-device deployment. ReHyAt is the only method that can extend to more than 10 seconds of video generation without Out-Of-Memory (OOM) errors, demonstrating its practical scalability for edge devices. At 121 frames (approx. 7.5 s), ReHyAt is approximately 16× faster than FlashAttention. This is a massive improvement for practical applications.

The following are the results from [Table 4] of the original paper:

| Attention Block | 81 frames | 101 frames | 121 frames | 141 frames | 161 frames |
| --- | --- | --- | --- | --- | --- |
| Softmax Flash Attention | 281 | 2964 | 4809 | OOM | OOM |
| HedgeHog Linear Attention | 360 | 455 | 469 | 542 | OOM |
| Uniform Hybrid - R8 | 464 | 625 | 818 | 1215 | OOM |
| ReHyAt - Tc=3 (ours) | 192 | 247 | 302 | 329 | 384 |

Table 4. On-mobile (Snapdragon8-Gen4) latency (ms) vs. number of frames at 320×480 resolution

Memory Read/Write Load: Table 5 shows ReHyAt's significant memory efficiency. At 121 frames, ReHyAt is approximately 11× more efficient in total memory read/write compared to FlashAttention. Critically, its peak memory usage remains constant regardless of video duration, a direct consequence of the RNN reformulation. This is vital for mobile devices where memory is a tightly constrained resource.

The following are the results from [Table 5] of the original paper:

| Attention Block | 81 frames (W / R) | 101 frames (W / R) | 121 frames (W / R) | 141 frames (W / R) | 161 frames (W / R) |
| --- | --- | --- | --- | --- | --- |
| Softmax Flash Attention | 5.1 / 6.0 | 12.9 / 16.4 | 22.7 / 53.6 | OOM | OOM |
| HedgeHog Linear Attention | 5.7 / 8.1 | 7.0 / 10.1 | 6.9 / 11.3 | 8.0 / 13.2 | OOM |
| Uniform Hybrid - R8 | 6.3 / 10.1 | 5.2 / 10.9 | 6.4 / 13.2 | 7.8 / 35.2 | OOM |
| ReHyAt - Tc=3 (ours) | 1.7 / 2.8 | 2.2 / 3.6 | 2.7 / 4.4 | 3.0 / 4.8 | 3.5 / 5.6 |

Table 5. Comparison of total memory write (W) / read (R) in GB for Wan2.1 DiT blocks with various attention mechanisms on Snapdragon8-Gen4

6.2. Ablation Studies / Parameter Analysis

Number of ReHyAt Blocks and Chunk Size $T_c$

Figure 5 shows the trade-off between computational cost (FLOPs) and VBench score for different ReHyAt configurations (number of converted blocks and $T_c$). It confirms that increasing the number of ReHyAt blocks (converting more of the original 30 blocks) reduces FLOPs, while a higher $T_c$ generally increases quality at a higher computational cost.

The following figure (Figure 5 from the original paper) compares FLOPs and VBench score:

Figure 5. The total DiT FLOPs percentages versus the VBench score of the original Wan2.1 1.3B model compared to various hybrid configurations, for 320×480 (top) and 480×832 (bottom) resolutions.

Table 6 specifically examines the impact of $T_c$ on quality. The following are the results from [Table 6] of the original paper:

| Chunk size Tc | Block TFLOPs↓ | Total↑ | Quality↑ | Semantic↑ |
| --- | --- | --- | --- | --- |
| 1 | 3.87 | 80.97 | 82.37 | 75.39 |
| 2 | 4.04 | 82.08 | 83.86 | 74.99 |
| 3 | 4.30 | 82.17 | 83.72 | 75.96 |
| 5 | 4.82 | 82.48 | 84.12 | 75.93 |

Table 6. Impact of $T_c$ on ReHyAt hybrid model quality. All the models have 25× converted ReHyAt blocks with $T_o{=}1$.

The results show that:

  • Increasing $T_c$ generally improves the VBench Total and Quality scores.
  • The most significant improvement occurs when moving from $T_c{=}1$ to $T_c{=}2$. This is attributed to the extension of softmax attention from purely spatial to spatiotemporal, allowing more accurate local dependency modeling across a few frames.
  • Further increases in $T_c$ (to 3 and 5) yield diminishing returns in total score, suggesting an optimal balance around $T_c{=}2$ or $T_c{=}3$ for the given TFLOPs.

Overlap Size $T_o$

Table 7 investigates the effect of chunk overlap ($T_o$) on generation quality. The following are the results from [Table 7] of the original paper:

| Chunk overlap To | Total↑ | Quality↑ | Semantic↑ | Subj. Cons.↑ |
| --- | --- | --- | --- | --- |
| 0 | 81.56 | 83.23 | 74.90 | 90.90 |
| 1 | 82.17 | 83.72 | 75.96 | 92.05 |
| 2 | 82.17 | 83.84 | 75.50 | 92.13 |
| 3 | 82.19 | 83.86 | 75.51 | 92.24 |

Table 7. Impact of $T_o$ on ReHyAt hybrid model quality as measured on VBench. All the models have 25× converted ReHyAt blocks with $T_c{=}3$.

Key findings:

  • Introducing any overlap ($T_o{=}1$ from $T_o{=}0$) results in a notable jump in overall model quality (Total score from 81.56 to 82.17). This confirms the importance of overlapping chunks for maintaining temporal coherence and accurate message passing between chunks.
  • Beyond $T_o{=}1$, the total score saturates, with only mild gradual improvements.
  • However, Subject Consistency (Subj. Cons.) shows a continuous, albeit small, improvement as $T_o$ increases (from 90.90 at $T_o{=}0$ to 92.24 at $T_o{=}3$). This highlights that overlap is particularly beneficial for maintaining the identity and consistent appearance of subjects across chunk transitions, thus reducing temporal incoherencies.

Causality

Table 8 presents the impact of causality on ReHyAt's quality and compute. The following are the results from [Table 8] of the original paper:

| Causal | Block TFLOPs↓ | Total↑ | Quality↑ | Semantic↑ |
| --- | --- | --- | --- | --- |
| ✗ | 4.17 | 82.27 | 83.84 | 75.99 |
| ✓ | 4.04 | 82.35 | 83.97 | 75.87 |

Table 8. Impact of causality on ReHyAt hybrid model quality as measured on VBench on the 15×, $T_c{=}3$, $T_o{=}0$ configuration

The table compares a non-causal version (✗) with a causal version (✓).

  • Quality: The causal reformulation does not deteriorate the quality of the model, as indicated by very similar VBench scores (82.35 for causal vs. 82.27 for non-causal).
  • Compute: The saving in TFLOPs from removing the forward-looking linear attention in the causal version is not substantial (4.04 vs 4.17).
  • Primary Advantage: The main benefit of enforcing causality is not a dramatic compute reduction in this specific comparison, but rather that it enables the RNN reformulation. This, in turn, leads to the crucial advantage of lower and constant peak memory usage, which is indispensable for on-device generation of longer videos.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces ReHyAt, a recurrent hybrid attention mechanism tailored for video diffusion transformers. By strategically combining local softmax attention for high-fidelity dependencies and global linear attention for efficiency, ReHyAt achieves linear time complexity and constant memory usage with respect to video duration. This makes it uniquely suitable for generating arbitrarily long videos and for on-device deployment. A key contribution is its two-stage distillation and lightweight fine-tuning pipeline, which allows ReHyAt to be built upon existing state-of-the-art softmax-based models (like Wan2.1) with an exceptionally low training cost (~160 GPU hours), two orders of magnitude more efficient than training from scratch. Experimental results on VBench, VBench-2.0, and human preference studies confirm that ReHyAt delivers competitive, near state-of-the-art video quality, while dramatically improving scalability and efficiency.

7.2. Limitations & Future Work

The authors acknowledge a current limitation:

  • Temporal Incoherence: While ReHyAt performs strongly overall, "a small fraction of videos—especially with the most efficient variants—still show some temporal incoherence". This indicates that despite the overlapping chunks and hybrid design, maintaining perfect consistency across very long generated sequences remains a challenge, highlighting an area for future improvement.

    Future work directions implied by the paper include:

  • Further refining mechanisms to reduce temporal incoherence, particularly for the most efficient ReHyAt configurations.

  • Exploring the application of ReHyAt to an even wider range of state-of-the-art bidirectional softmax-based models, leveraging its generalizable distillation recipe.

  • Pushing the boundaries of on-device video generation capabilities, potentially enabling more complex and longer video content on mobile platforms.

7.3. Personal Insights & Critique

This paper presents a highly practical and impactful innovation in the field of video generation. The core insight—that one can effectively distill the quality of expensive softmax-based transformers into an efficient, recurrent hybrid form—is very clever. Instead of trying to build a new architecture from the ground up that matches softmax performance, ReHyAt leverages the immense computational investment already made in training powerful DiTs. This distillation pipeline is arguably the most significant contribution, offering a "recipe" that can adapt to future advancements in bidirectional full softmax attention models, ensuring long-term relevance.

The hybrid attention design itself (local softmax, global linear) is a well-balanced compromise. The polynomial expansion for $\phi$ is an elegant way to regain some of the expressiveness lost by simpler linear attention kernels. The rigorous demonstration of constant memory and linear compute with video length, especially on mobile hardware, directly addresses the critical bottleneck for practical applications like on-device video generation.

One potential area for deeper exploration could be the theoretical limits of the polynomial kernel in approximating the exponential kernel, and whether specific types of video content (e.g., highly dynamic scenes, complex object interactions) push these limits more than others, leading to the observed "temporal incoherence." While the ablation studies are comprehensive, further investigation into dynamic adaptation of $T_c$ and $T_o$ based on video content or specific layers might yield even better performance. For instance, some layers might benefit from larger softmax windows or more overlap, while others might tolerate smaller ones.

The paper's focus on benchmarking on mobile hardware is commendable, demonstrating a tangible path towards deployment. The human preference study adds a crucial layer of validation that often gets overlooked in favor of purely quantitative metrics. Overall, ReHyAt provides a powerful, pragmatic solution to a pressing problem in generative AI, making high-quality video generation more accessible and scalable.
