From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
TL;DR Summary
The paper introduces a novel autoregressive video diffusion model that transforms a slow bidirectional model into a fast one using distribution matching distillation. With teacher-trajectory initialization and asymmetric distillation, it achieves a total score of 84.27 on VBench-Long, surpassing prior video generation models while streaming high-quality video at 9.4 FPS on a single GPU.
Abstract
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
The title clearly states the paper's core objective: to transform slow video generation models that process all frames at once (bidirectional) into fast models that generate frames sequentially (autoregressive), using the framework of diffusion models.
1.2. Authors
Tianwei Yin¹, Qiang Zhang², Richard Zhang², William T. Freeman¹, Frédo Durand¹, Eli Shechtman², Xun Huang²
The authors are affiliated with:
- ¹ MIT (Massachusetts Institute of Technology): A world-renowned university for computer science and artificial intelligence research.
- ² Adobe Research: The research arm of Adobe, a company at the forefront of creative software, with significant contributions to generative AI, computer vision, and graphics.
The authors are established researchers in the fields of computer vision, computational photography, and generative models, which lends significant credibility to the work.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server for academic papers. This means it has not yet undergone formal peer review for a conference or journal. Given the publication date and the high quality of the work, it is likely intended for a top-tier computer vision or machine learning conference such as CVPR, ICCV, or NeurIPS in 2025.
1.4. Publication Year
2024
1.5. Abstract
The abstract summarizes the paper's key contributions. Current high-quality video diffusion models are slow and non-interactive because they use bidirectional attention, requiring the entire video sequence to generate a single frame. The authors tackle this by converting a pretrained bidirectional model into a fast autoregressive model that generates frames sequentially. To achieve high speed, they extend Distribution Matching Distillation (DMD) to video, reducing a 50-step diffusion process to just 4 steps. Key to their success are two novel techniques: a student model initialization scheme based on the teacher's ODE trajectories and an asymmetric distillation strategy where a causal student learns from a bidirectional teacher. This approach successfully mitigates error accumulation, enabling the generation of long videos. Their model, CausVid, achieves a state-of-the-art score on the VBench-Long benchmark and generates high-quality video at 9.4 FPS, enabling applications like streaming video editing and dynamic prompting.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2412.07772
- PDF Link: https://arxiv.org/pdf/2412.07772v4.pdf
- Publication Status: This is a preprint and has not yet been officially published in a peer-reviewed venue.
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
State-of-the-art video generation models, while producing visually stunning results, are fundamentally slow. This is because many of them are bidirectional, meaning to generate any given frame, the model must look at all other frames in the video, including those that come after it (the "future"). This design creates two major bottlenecks:
- High Latency: Users must wait for the entire video to be processed before seeing even the first frame.
- Scalability Issues: The computational and memory costs grow quadratically with the number of frames, making it prohibitively expensive to generate long videos.
These limitations make such models unsuitable for interactive applications like live video editing, streaming, or real-time game rendering, where frames must be generated on-the-fly.
2.1.2. Existing Gaps
Autoregressive models are a natural alternative. They generate video frame-by-frame, conditioning each new frame only on the ones that came before it. This approach allows for streaming and interactive control. However, existing autoregressive video models face their own significant challenges:
- Error Accumulation: Because each frame is generated based on the previous one, small errors can compound over time, leading to a noticeable degradation in quality, especially in long videos. The video might "drift" from the original prompt or develop visual artifacts.
- Subpar Quality: Historically, autoregressive models have not matched the visual quality of their bidirectional counterparts.
- Speed Limitations: Even though they have lower latency, many are still not fast enough for truly interactive frame rates.
2.1.3. Innovative Idea
The paper introduces a novel approach to get the best of both worlds: the high quality of bidirectional models and the speed and interactivity of autoregressive models. The central idea is to distill the knowledge from a powerful, pretrained bidirectional teacher model into a lightweight, autoregressive student model.
The key innovation lies in the asymmetric distillation process. Instead of training a causal student from a weaker causal teacher (which would inherit its flaws), they use the superior bidirectional teacher to guide the causal student. This unique setup allows the student to learn high-quality generation while its causal structure enables fast, sequential inference. This process is shown to be surprisingly effective at preventing the typical error accumulation seen in autoregressive models.
2.2. Main Contributions / Findings
The paper makes several key contributions to the field of video generation:
- Asymmetric Bidirectional-to-Causal Distillation: The authors propose a novel distillation strategy where a causal student model is trained under the supervision of a more powerful bidirectional teacher. This is the paper's most significant contribution, as it effectively mitigates quality degradation and error accumulation in the autoregressive student.
- Extension of DMD to Video: They successfully adapt Distribution Matching Distillation (DMD), a few-step distillation method originally for images, to the video domain. This allows them to distill a slow, 50-step diffusion model into an extremely fast 4-step generator.
- ODE-based Student Initialization: To ensure the distillation process is stable, they introduce an efficient method to initialize the student model. This involves pre-training the student to mimic the denoising trajectories generated by the teacher model's Ordinary Differential Equation (ODE) solver.
- State-of-the-Art Performance and Speed: Their resulting model, CausVid, achieves the top score (84.27) on the VBench-Long benchmark, outperforming all previously evaluated models. Crucially, it generates video at 9.4 FPS on a single GPU, a speed-up of over 160x compared to its teacher model.
- Enabling Interactive Applications: Thanks to its autoregressive nature and speed, CausVid supports a range of real-time applications in a zero-shot manner, including streaming video-to-video translation, image-to-video generation, and dynamic prompting (changing the text prompt mid-generation).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, it's essential to grasp the following concepts:
3.1.1. Diffusion Models
Diffusion models are a class of generative models that create data (like images or videos) by reversing a noise-adding process. The process has two parts:
- Forward Process: This is a fixed process where we start with a clean data sample $x_0$ (e.g., a video frame) and gradually add a small amount of Gaussian noise over a series of timesteps. At any timestep $t$, the noisy sample $x_t$ is a combination of the original sample and noise. This is defined by the formula: $x_t = \alpha_t x_0 + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$. Here, $\alpha_t$ and $\sigma_t$ are schedule parameters that control the signal-to-noise ratio, and $\epsilon$ is random noise. As $t$ approaches the final timestep $T$, $x_t$ becomes almost pure noise.
- Reverse Process: This is where the learning happens. A neural network, often called a denoiser, is trained to reverse the noising process. At each step $t$, it takes the noisy sample $x_t$ and predicts the noise $\epsilon$ that was added. The training objective is typically a simple mean squared error loss: $\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left\| \epsilon_\theta(x_t, t) - \epsilon \right\|_2^2$, where $\epsilon_\theta(x_t, t)$ is the network's prediction. By repeatedly applying this denoiser, we can start from pure noise and gradually generate a clean sample $x_0$. A key concept is the score function, the gradient of the log-probability of the data distribution, which is directly related to the predicted noise: $s_\theta(x_t, t) = \nabla_{x_t} \log p(x_t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}$. (A small code sketch of these two pieces follows below.)
3.1.2. Latent Diffusion Models (LDMs)
Training diffusion models directly on high-resolution videos is computationally very expensive. Latent Diffusion Models (LDMs) solve this by working in a compressed, lower-dimensional latent space.
- An encoder (part of a Variational Autoencoder, or VAE) first compresses the high-resolution video into a small latent representation.
- The diffusion process (both forward and reverse) occurs entirely within this efficient latent space.
- After the denoising is complete, a decoder (from the VAE) reconstructs the final video from the generated latent representation. This paper uses a 3D VAE that compresses chunks of video frames.
3.1.3. Transformers and Attention Mechanism
Transformers are a neural network architecture that has become dominant in many fields. Their power comes from the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing a specific part.
- For each input element (e.g., a patch of a video frame), we create three vectors: a Query (Q), a Key (K), and a Value (V).
- The attention score is calculated by taking the dot product of the Query of the current element with the Keys of all other elements. This determines "how much attention" the current element should pay to every other element.
- These scores are scaled, passed through a softmax function to create weights, and then used to compute a weighted sum of all the Value vectors. The formula is: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the Key vectors (see the sketch below).
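As a quick illustration, here is a minimal sketch of scaled dot-product attention in PyTorch, with an optional mask argument that the causal variants discussed next rely on. Shapes and the "1 = allowed, 0 = blocked" mask convention are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, dim). mask: (seq_len, seq_len), 1 = allowed, 0 = blocked."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5              # (batch, seq, seq) logits
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = F.softmax(scores, dim=-1)                        # weights sum to 1 per query
    return weights @ v                                         # weighted sum of values
```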
3.1.4. Bidirectional vs. Autoregressive Attention
- Bidirectional Attention: In a standard Transformer, attention is bidirectional. When processing any element in a sequence, it can "attend to" (i.e., gather information from) all other elements, both before and after it. This is powerful for understanding context but requires the entire sequence to be available at once.
- Autoregressive (or Causal) Attention: In an autoregressive model, attention is masked. When processing an element at position $i$, it can only attend to elements at positions $j \le i$. It is "causal" because it cannot see the future. This allows for sequential generation, which is the foundation of models like GPT and the CausVid model in this paper.
3.1.5. Knowledge Distillation
Knowledge distillation is a technique for model compression. A large, powerful but slow "teacher" model is used to train a smaller, faster "student" model. The student learns to mimic the outputs or internal representations of the teacher, effectively transferring the teacher's "knowledge" into a more efficient architecture. This paper uses a specific form of distillation called Distribution Matching Distillation.
3.2. Previous Works
The paper builds on several lines of research:
- Autoregressive Video Generation: Early methods used regression or GANs. More recent approaches, inspired by Large Language Models (LLMs), tokenize video frames and generate them token-by-token. Diffusion-based autoregressive methods like Diffusion Forcing [8] have shown promise by training models to denoise future frames conditioned on past frames, or by denoising a whole sequence where different frames have different noise levels. This paper's method is related but introduces distillation for superior efficiency and quality.
- Long Video Generation: Generating long, coherent videos is a major challenge. Some methods generate overlapping short clips and stitch them together. Others use a hierarchical approach, generating keyframes first and then interpolating between them. Autoregressive models are naturally suited for long video generation, but as noted, they struggle with error accumulation. This paper's key contribution is mitigating this error accumulation, enabling high-quality long video synthesis.
- Diffusion Distillation: To combat the slow sampling of diffusion models, various distillation techniques have been proposed. Progressive Distillation [69] halves the number of steps at each distillation stage. Consistency Distillation [76] trains a model to map any point on a denoising trajectory directly to the clean image. Distribution Matching Distillation (DMD) [100], which this paper uses, is a powerful technique that minimizes the divergence between the student's output distribution and the real data distribution (represented by the teacher). Its core idea is captured by its loss gradient, which pushes the student's output distribution to match the teacher's: $\nabla_\phi \mathcal{L}_{\mathrm{DMD}} \approx -\mathbb{E}_t \left[ \int \left( s_{\mathrm{data}}(\cdot) - s_{\mathrm{gen},\xi}(\cdot) \right) \frac{dG_\phi(\epsilon)}{d\phi} \, d\epsilon \right]$. DMD is also flexible, allowing the student and teacher to have different architectures, a property this paper exploits for its asymmetric distillation.
3.3. Technological Evolution
The field of video generation has evolved from generating short, often blurry clips to high-fidelity, minute-long videos. This progress was largely driven by scaling up datasets and models, particularly bidirectional diffusion transformers. However, this came at the cost of extreme computational requirements and latency. The current frontier is to maintain this high quality while drastically improving speed and enabling interactivity. This paper marks a significant step in this direction, showing that autoregressive models, when trained correctly via distillation, can achieve the quality of top bidirectional models while being orders of magnitude faster.
3.4. Differentiation Analysis
The core innovation of this paper compared to prior work is the asymmetric distillation strategy.
- Previous distillation works for video models typically maintained the same causality: a non-causal teacher was distilled into a non-causal student.
- Previous autoregressive video models were either trained from scratch (suffering from lower quality) or fine-tuned from a pretrained model, but often inherited or developed issues with error accumulation.
- This paper is the first to propose distilling a bidirectional (non-causal) teacher into a causal student. This unique setup allows the student to benefit from the superior, globally-aware knowledge of the bidirectional teacher, while its own causal structure makes it fast and interactive. This specific combination is what effectively suppresses error accumulation and closes the quality gap between autoregressive and bidirectional models.
4. Methodology
4.1. Principles
The core principle of CausVid is to create a fast and high-quality autoregressive video generator by "distilling" the knowledge from a slow but powerful pretrained bidirectional video diffusion model. The methodology is designed to solve two main challenges:
- Bridging the Architectural Gap: How to transfer knowledge effectively from a bidirectional teacher to a causal student without performance loss.
- Ensuring Stability and Quality: How to train this distilled model to be both fast (few steps) and robust against error accumulation during long video generation.
The solution involves a two-stage training process: an ODE-based initialization to align the student with the teacher, followed by an asymmetric distribution matching distillation to fine-tune for quality and speed.
4.2. Core Methodology In-depth
The overall method is visualized in Figure 6 of the paper.
Figure description: a schematic of the training pipeline. The top half depicts the student initialization stage, in which data samples and ODE trajectories generated by the teacher model are noised and regressed; the bottom half shows how asymmetric distillation with Distribution Matching Distillation (DMD), driven by text prompts, converts a many-step bidirectional video diffusion model into a 4-step causal generator.
4.2.1. Autoregressive Architecture
The model first uses a 3D VAE to compress video into a latent space. The core generator, CausVid, is a Diffusion Transformer (DiT) that operates on these latents. The key architectural modification to enable autoregressive generation is the block-wise causal attention mask.
- Concept: The sequence of latent frames is divided into chunks. Within each chunk, attention is bidirectional, allowing the model to capture local temporal dependencies fully. However, across chunks, attention is strictly causal: a frame in chunk $c$ can only attend to frames within its own chunk and frames in all preceding chunks ($\le c$), but not to any future chunks ($> c$).
- Formal Definition: The attention mask that enforces this is defined as:
$
M_{i,j} = \begin{cases} 1, & \text{if } \left\lfloor \frac{j}{k} \right\rfloor \le \left\lfloor \frac{i}{k} \right\rfloor, \\ 0, & \text{otherwise.} \end{cases}
$
  - $i$ and $j$: Indices of the frames in the sequence ($i$ is the query position, $j$ the key position).
  - $k$: The size of each chunk.
  - $\lfloor \cdot \rfloor$: The floor function, which rounds down to the nearest integer; $\lfloor i/k \rfloor$ gives the index of the chunk that frame $i$ belongs to.
  - Explanation: This formula states that an attention connection from frame $i$ to frame $j$ is allowed ($M_{i,j} = 1$) only if the chunk index of frame $j$ is less than or equal to the chunk index of frame $i$.

This design combines the benefits of local coherence (from bidirectional attention within chunks) and causal generation (for streaming and unlimited length). A minimal sketch of this mask in code follows below.
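To make the mask concrete, here is a small sketch (an illustration, not the paper's code) that builds the block-wise causal mask $M$ for `num_frames` latent frames with chunk size `k`:

```python
import torch

def blockwise_causal_mask(num_frames: int, k: int) -> torch.Tensor:
    """M[i, j] = 1 if floor(j / k) <= floor(i / k), else 0.

    Frames attend bidirectionally within their own chunk and causally
    to all earlier chunks, but never to future chunks.
    """
    idx = torch.arange(num_frames)
    chunk_of = idx // k                          # chunk index of each frame
    # Compare the query's chunk index (rows, i) against the key's (columns, j).
    mask = (chunk_of.unsqueeze(1) >= chunk_of.unsqueeze(0)).to(torch.int64)
    return mask

# Example: 6 frames, chunks of 2 -> frames 0-1, 2-3, 4-5 form blocks.
print(blockwise_causal_mask(6, 2))
```

This mask can be passed directly to the attention sketch from Section 3.1.3.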
4.2.2. Stage 1: Student Initialization via ODE Trajectory Regression
Directly training a causal student from a bidirectional teacher with a complex distillation loss like DMD can be unstable. To mitigate this, the authors first pre-train the student model to align it with the teacher's behavior.
- Procedure:
  - Generate a Dataset: A small dataset of ODE solution pairs is created using the pretrained bidirectional teacher. For a set of random noise inputs, an ODE solver (like DDIM) is used to compute the entire denoising path, yielding the full trajectory $\{x_t\}$ for all timesteps from $T$ down to 0, and the final clean output $x_0$.
  - Initialize Student: The student generator (which has the causal architecture) is initialized with the weights of the bidirectional teacher model.
  - Regression Training: The student is then trained on this trajectory dataset. Its task is to take noisy frames from the trajectory and predict the corresponding final clean frames. This is trained with a simple L2 regression loss:
$
\mathcal{L}_{\mathrm{init}} = \mathbb{E}_{x, t^i} \left\| G_\phi\left( \{ x_{t^i}^i \}_{i=1}^N, \{ t^i \}_{i=1}^N \right) - \{ x_0^i \}_{i=1}^N \right\|^2
$
    - $\{ x_{t^i}^i \}_{i=1}^N$: The noisy input video chunks at different timesteps $t^i$.
    - $G_\phi$: The causal student generator.
    - $\{ x_0^i \}_{i=1}^N$: The ground-truth clean video chunks from the ODE trajectory.

This step effectively provides the student with a strong "head start" by teaching it the basic denoising function of the teacher before moving to the more complex distillation. A small sketch of the regression step is given below.
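Below is a minimal sketch of one initialization step as described above, under assumed interfaces: a hypothetical `student` (the causal generator $G_\phi$) that maps noisy chunks plus per-chunk timesteps to clean chunks, and precomputed teacher ODE trajectories. It is a sketch of the objective, not the authors' training code.

```python
import torch

def ode_init_step(student, optimizer, noisy_chunks, timesteps, clean_chunks):
    """One regression step: L_init = || G_phi({x_t^i}, {t^i}) - {x_0^i} ||^2.

    noisy_chunks:  (batch, num_chunks, frames, C, H, W) samples from teacher ODE trajectories
    timesteps:     (batch, num_chunks) per-chunk noise levels t^i
    clean_chunks:  (batch, num_chunks, frames, C, H, W) teacher ODE outputs x_0^i
    """
    pred_clean = student(noisy_chunks, timesteps)        # causal student prediction
    loss = ((pred_clean - clean_chunks) ** 2).mean()     # simple L2 regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```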
4.2.3. Stage 2: Asymmetric Distillation with DMD
After initialization, the core distillation training begins. This process uses the Distribution Matching Distillation (DMD) framework in a novel, asymmetric way.
- The Three Key Players:
  - Teacher Score Function ($s_{\mathrm{data}}$): This is the original, pretrained bidirectional diffusion model. It is frozen during training and acts as the "ground truth" for the data distribution.
  - Student Generator ($G_\phi$): This is the causal model being trained. It is a few-step generator (e.g., 4 steps) that takes a noisy input and aims to produce a clean output.
  - Generator's Score Function ($s_{\mathrm{gen},\xi}$): This is another diffusion model, initialized from the teacher, which is trained online to learn the distribution of the student's own outputs.
- The Training Loop (Algorithm 1):
  - A video from the real dataset is sampled.
  - Noisy inputs are created for the student generator.
  - The causal student generator predicts the clean video $\hat{x}_0$.
  - A random timestep $t$ is sampled, and noise is added to the student's prediction to get $\hat{x}_t$.
  - The DMD loss is computed. The gradient update for the student is approximately:
$
\nabla_\phi \mathcal{L}_{\mathrm{DMD}} \approx -\mathbb{E}_t \left[ \left( s_{\mathrm{data}}(\hat{x}_t, t) - s_{\mathrm{gen},\xi}(\hat{x}_t, t) \right) \frac{dG_\phi}{d\phi} \right]
$
    - Intuition: The term $s_{\mathrm{data}}(\hat{x}_t, t) - s_{\mathrm{gen},\xi}(\hat{x}_t, t)$ represents the "distributional error": the difference between the score of the real data distribution (from the bidirectional teacher) and the score of the student's current output distribution. The gradient pushes the generator in a direction that makes this difference smaller, forcing the student's output distribution to match the teacher's.
  - Simultaneously, the generator's score function $s_{\mathrm{gen},\xi}$ is updated using a standard denoising loss on the student's outputs $\hat{x}_0$. This keeps $s_{\mathrm{gen},\xi}$ an accurate model of the student's current capabilities.

The full process is described in Algorithm 1 of the paper; a simplified sketch of one training iteration is given below.

-
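The following is a heavily simplified sketch of one asymmetric DMD iteration as described above. All interfaces (`student`, `teacher.score`, `gen_score.score`, `gen_score.predict_eps`, `sched.add_noise`) are hypothetical stand-ins; the paper's Algorithm 1 also handles chunk-wise timesteps, guidance, and other details omitted here.

```python
import torch

def dmd_iteration(student, teacher, gen_score, opt_student, opt_gen, video, sched):
    """One asymmetric-DMD step (illustrative sketch, not the authors' code)."""
    # 1) Noise the real video and let the causal few-step student predict the clean video.
    t_in = torch.rand(video.shape[0], device=video.device)
    noisy_in, _ = sched.add_noise(video, t_in)
    x0_hat = student(noisy_in, t_in)

    # 2) Re-noise the student's prediction at a random timestep t.
    t = torch.rand(video.shape[0], device=video.device)
    x_t, _ = sched.add_noise(x0_hat, t)

    # 3) DMD update: push the student's output distribution toward the teacher's.
    with torch.no_grad():
        score_diff = teacher.score(x_t, t) - gen_score.score(x_t, t)   # s_data - s_gen
    dmd_loss = -(x0_hat * score_diff).mean()   # surrogate whose gradient matches -E[(s_data - s_gen) dG/dphi]
    opt_student.zero_grad(); dmd_loss.backward(); opt_student.step()

    # 4) Keep gen_score tracking the student's current output distribution
    #    with a standard denoising loss on the (detached) student outputs.
    x_t2, eps = sched.add_noise(x0_hat.detach(), t)
    gen_loss = ((gen_score.predict_eps(x_t2, t) - eps) ** 2).mean()
    opt_gen.zero_grad(); gen_loss.backward(); opt_gen.step()
    return dmd_loss.item(), gen_loss.item()
```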
4.2.4. Efficient Inference with KV Caching
During inference, the autoregressive structure is exploited for maximum efficiency using Key-Value (KV) caching.
- Concept: In a Transformer, when generating a new token (or video chunk), the Key (K) and Value (V) matrices for all previous tokens are required for the attention calculation. Instead of recomputing these for every new step, they can be cached and reused.
- Inference Procedure (Algorithm 2):
  - The process starts with an empty KV cache.
  - For the first chunk, start from random noise and denoise it over a few steps (e.g., 4). Since there is no past, the cache is not used.
  - Once the clean chunk is generated, its K and V pairs are computed in a final forward pass and stored in the cache.
  - For the next chunk, the denoising process begins. At each attention layer, the model uses the K and V pairs from its own chunk's tokens and concatenates them with the cached K and V pairs from chunk 1.
  - This allows the model to "see" the past without recomputing it.
  - Once chunk 2 is denoised, its K and V pairs are appended to the cache.
  - This process repeats, sliding the window forward one chunk at a time, allowing for the generation of infinitely long videos with constant memory cost per step.

The inference procedure is detailed in Algorithm 2 of the paper; a minimal sketch of the cached attention pattern is given below.

-
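The following is a minimal, single-head sketch of the caching pattern described above (hypothetical `wq/wk/wv` projections, no multi-head or rotary details). It shows how each new chunk attends over cached keys and values from earlier chunks instead of recomputing them; it is an illustration, not the paper's implementation.

```python
import torch

class CachedCausalAttention(torch.nn.Module):
    """Single-head attention over chunks with a growing KV cache (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.wq = torch.nn.Linear(dim, dim)
        self.wk = torch.nn.Linear(dim, dim)
        self.wv = torch.nn.Linear(dim, dim)
        self.cache_k, self.cache_v = None, None   # keys/values of all finished chunks

    def forward(self, chunk_tokens: torch.Tensor) -> torch.Tensor:
        # chunk_tokens: (batch, chunk_len, dim) -- tokens of the chunk being denoised.
        q, k, v = self.wq(chunk_tokens), self.wk(chunk_tokens), self.wv(chunk_tokens)
        if self.cache_k is not None:
            # Attend over cached past chunks plus the current chunk (bidirectional within it).
            k = torch.cat([self.cache_k, k], dim=1)
            v = torch.cat([self.cache_v, v], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def append_to_cache(self, clean_chunk_tokens: torch.Tensor) -> None:
        # Called once per finished chunk, after the final denoising step.
        k, v = self.wk(clean_chunk_tokens), self.wv(clean_chunk_tokens)
        self.cache_k = k if self.cache_k is None else torch.cat([self.cache_k, k], dim=1)
        self.cache_v = v if self.cache_v is None else torch.cat([self.cache_v, v], dim=1)
```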
5. Experimental Setup
5.1. Datasets
- Training Data: The model was trained on a mixed dataset of images and videos, following the strategy of CogVideoX [96]. The video data consists of approximately 400,000 single-shot videos from an internal, copyrighted dataset. All data was filtered based on safety and aesthetic scores.
- Data Preprocessing: Videos were resized and cropped to a fixed resolution and standardized to 12 FPS. Training was performed on 10-second video clips.
- Example Data: An example of an input to the model would be a text prompt like "A photorealistic video of a yellow sports car driving down a road, with trees in the background." paired with a 10-second video clip matching that description.

The use of a large-scale, high-quality internal dataset is a key factor in the model's strong performance, although it impacts the reproducibility of the training process.
5.2. Evaluation Metrics
The primary evaluation benchmark used is VBench [26], a comprehensive suite for assessing video generation models across 16 different metrics. The paper focuses on three high-level categories from the VBench competition suite:
- Temporal Quality: This category assesses the quality of motion and temporal consistency in the generated video. It is crucial for evaluating how well a model handles dynamics and avoids artifacts over time. Key sub-metrics include:
  - Conceptual Definition: Measures aspects like smoothness of motion, absence of flickering between frames (Temporal Flickering), and overall dynamic realism. A high score indicates coherent and believable motion.
- Frame Quality: This evaluates the visual quality of individual frames, independent of motion.
  - Conceptual Definition: Assesses properties like image clarity, realism, aesthetic appeal (Aesthetic Quality), and absence of artifacts (Imaging Quality). It answers the question: "Does each frame look like a good picture?"
- Text Alignment: This measures how well the generated video content matches the meaning of the input text prompt.
  - Conceptual Definition: This is typically measured using vision-language models like CLIP. The model calculates a similarity score (CLIP-Score) between the text prompt and the generated video frames. A high score means the video is a faithful representation of the prompt.
  - Mathematical Formula (for CLIP Score): While the VBench suite is complex, the core of text alignment relies on CLIP similarity. The formula is:
$
\text{CLIP-Score}(T, V) = \mathbb{E}_{f \in V} \left[ \cos\left( E_T(T), E_I(f) \right) \right]
$
  - Symbol Explanation:
    - $T$: The input text prompt.
    - $V$: The set of frames in the generated video.
    - $E_T$: The CLIP text encoder, which maps the text prompt to an embedding vector.
    - $E_I$: The CLIP image encoder, which maps a video frame to an embedding vector.
    - $\cos(\cdot, \cdot)$: The cosine similarity between the two embedding vectors. The final score is the average similarity across all frames.

A minimal sketch of this computation is shown below.
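As an illustration of the CLIP-Score idea (not VBench's exact implementation), here is a short sketch using the Hugging Face `transformers` CLIP model; frame extraction and sampling from the video are left as assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_score(prompt: str, frames) -> float:
    """Average cosine similarity between a text prompt and a list of PIL video frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity of the prompt embedding against every frame embedding, then averaged.
    sims = torch.nn.functional.cosine_similarity(text_emb, image_emb)
    return sims.mean().item()
```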
5.3. Baselines
The paper compares CausVid against a strong set of state-of-the-art video generation models:
- For Short Video Generation:
  - CogVideoX [96]: A powerful bidirectional transformer-based model, similar in architecture to the teacher model used in this paper.
  - OpenSORA [109]: An open-source replication of OpenAI's Sora, based on a diffusion transformer architecture.
  - Pyramid Flow [28]: An efficient flow-matching-based model.
  - MovieGen [61]: A large-scale video generation model from Meta.
- For Long Video Generation:
  - Gen-L-Video [82], FreeNoise [63]: Methods focused on extending short video models to generate longer videos.
  - StreamingT2V [22], FIFO-Diffusion [33]: Autoregressive methods designed for streaming or infinite video generation.
  - Pyramid Flow [28]: Also evaluated in a long-video context.

These baselines are representative as they cover the main architectural paradigms (diffusion transformers, flow matching) and generation strategies (bidirectional, autoregressive, clip-stitching).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly validate the effectiveness of the CausVid model in achieving both high quality and high speed.
6.1.1. Text-to-Short-Video Generation
The following are the results from Table 1 of the original paper:
| Method | Length (s) | Temporal Quality | Frame Quality | Text Alignment |
|---|---|---|---|---|
| CogVideoX-5B | 6 | 89.9 | 59.8 | 29.1 |
| OpenSORA | 8 | 88.4 | 52.0 | 28.4 |
| Pyramid Flow | 10 | 89.6 | 55.9 | 27.1 |
| MovieGen | 10 | 91.5 | 61.1 | 28.8 |
| CausVid (Ours) | 10 | 94.7 | 64.4 | 30.1 |
Analysis: CausVid outperforms all state-of-the-art baselines across all three primary metrics. The most significant lead is in Temporal Quality (94.7), which is particularly noteworthy for an autoregressive model. This result suggests that the asymmetric distillation from a bidirectional teacher was highly effective at teaching the model to maintain temporal consistency and avoid motion-related artifacts, a common weakness of autoregressive approaches.
6.1.2. Text-to-Long-Video Generation
The following are the results from Table 2 of the original paper:
| Method | Temporal Quality | Frame Quality | Text Alignment |
|---|---|---|---|
| Gen-L-Video | 86.7 | 52.3 | 28.7 |
| FreeNoise | 86.2 | 54.8 | 28.7 |
| StreamingT2V | 89.2 | 46.1 | 27.2 |
| FIFO-Diffusion | 93.1 | 57.9 | 29.9 |
| Pyramid Flow | 89.0 | 48.3 | 24.4 |
| CausVid (Ours) | 94.9 | 63.4 | 28.9 |
Analysis: When generating longer videos (30 seconds), CausVid continues to lead in Temporal Quality and Frame Quality. This demonstrates its robustness against the error accumulation problem. While some autoregressive methods like StreamingT2V and Pyramid Flow see a drop in quality, CausVid maintains high performance, rivaling and even surpassing methods specifically designed for long video generation.
6.1.3. Efficiency Comparison
The following are the results from Table 3 of the original paper:
| Method | Latency (s) | Throughput (FPS) |
|---|---|---|
| CogVideoX-5B | 208.6 | 0.6 |
| Pyramid Flow | 6.7 | 2.5 |
| Bidirectional Teacher | 219.2 | 0.6 |
| CausVid (Ours) | 1.3 | 9.4 |
Analysis: This table highlights the dramatic efficiency gains. CausVid reduces the latency for generating a 10-second video from ~219 seconds (for its teacher model) to just 1.3 seconds—a 168x speedup. Its throughput of 9.4 FPS is interactive and far exceeds all other methods. This confirms the success of combining few-step distillation (DMD) with an efficient autoregressive architecture (KV caching).
6.1.4. Visualizing Error Accumulation
The chart below (Figure 8 from the paper) plots image quality over a 30-second generation period.
Figure description: a plot of average imaging quality of generated video over a 30-second window for different models. CausVid and FIFO-Diffusion maintain imaging quality most effectively; the causal teacher's score jumps abruptly around the 20-second mark due to a sliding-window switch, giving a temporary quality improvement.
Analysis: This graph provides crucial evidence for the effectiveness of the proposed method.
- CausVid (blue line) and FIFO-Diffusion (a strong baseline) maintain consistently high image quality over the full 30 seconds.
- Causal Teacher (orange line), which is a version of the bidirectional model simply fine-tuned with a causal mask, shows significant quality degradation over time. This proves that naive adaptation is not sufficient and leads to severe error accumulation.
- The student distilled from this weak causal teacher (green line, see Table 4) also suffers. This directly supports the paper's central claim: the asymmetric distillation from a strong bidirectional teacher is essential to prevent the quality decay that plagues autoregressive models.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 4 of the original paper, which dissects the contribution of each component of the proposed method.
| Model (ablation) | ODE Init. | Teacher | Causal Generator? | # Fwd Pass | Temporal Quality | Frame Quality | Text Alignment |
|---|---|---|---|---|---|---|---|
| Many-step, Bidirectional | – | – | × | 100 | 94.6 | 62.7 | 29.6 |
| Many-step, Causal | – | – | ✓ | 100 | 92.4 | 60.1 | 28.5 |
| Few-step | × | Bidirectional | ✓ | 4 | 93.4 | 60.6 | 29.4 |
| Few-step | ✓ | None | ✓ | 4 | 92.9 | 48.1 | 25.3 |
| Few-step | ✓ | Causal | ✓ | 4 | 91.9 | 61.7 | 28.2 |
| Few-step (full CausVid) | ✓ | Bidirectional | ✓ | 4 | 94.7 | 64.4 | 30.1 |
Analysis:
- Many-Step Model Comparison: The first two rows show that simply converting the bidirectional teacher to a causal model (Causal) by fine-tuning leads to a drop in performance across all metrics compared to the original Bidirectional teacher. This highlights the inherent difficulty of training high-quality causal models.
- Importance of Teacher Choice: Comparing the few-step student distilled from a Causal teacher to the one distilled from a Bidirectional teacher (the last two rows) shows a dramatic improvement in all scores when using the bidirectional teacher. This is the central finding of the paper: asymmetric distillation is superior, and the student learns better from a stronger, non-causal teacher.
- Importance of ODE Initialization: Comparing the full model (with ODE Init.) to the few-step variant trained without it shows a clear benefit from the initialization step, especially in Frame Quality and Text Alignment. This confirms that the initialization scheme stabilizes training and improves final performance.
- Final Model Performance: The last row represents the full CausVid model. Remarkably, it not only matches but exceeds the original 100-step bidirectional teacher in Frame Quality (64.4 vs. 62.7) and Text Alignment (30.1 vs. 29.6), while being ~168x faster. This is an exceptionally strong result, demonstrating that the distillation process not only accelerated the model but also improved certain aspects of its output.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully addresses the critical challenge of speed and interactivity in high-quality video generation. By proposing a novel asymmetric distillation framework, the authors demonstrate how to convert a slow, powerful bidirectional diffusion transformer into a fast, autoregressive student model, CausVid. This approach, which involves distilling a non-causal teacher into a causal student, is shown to be highly effective at mitigating the error accumulation problem that has long plagued autoregressive models.
Combined with an ODE-based initialization scheme and the extension of Distribution Matching Distillation (DMD) to video, their method achieves state-of-the-art results on the VBench-Long benchmark. The resulting model is over 160 times faster than its teacher, enabling interactive frame rates (9.4 FPS) and unlocking new applications like streaming video editing and dynamic prompting, all while matching or even exceeding the quality of top-tier bidirectional models.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and suggest future research directions:
- Error Accumulation in Extreme Cases: While significantly reduced, quality degradation can still occur in extremely long videos (e.g., over 10 minutes). More robust strategies are needed to completely eliminate this drift.
- VAE Latency Bottleneck: The current model's latency is constrained by its VAE, which processes video in multi-frame chunks. Adopting a frame-wise VAE could potentially reduce latency even further, pushing the model closer to true real-time performance.
- Reduced Output Diversity: The DMD objective, which is based on reverse KL divergence, is known to sometimes reduce the diversity of generated samples compared to the teacher model. Future work could explore alternative distillation objectives like EM-Distillation or Score Implicit Matching that may better preserve output diversity.
- Path to Real-Time: While 9.4 FPS is fast, achieving true real-time performance (>24 FPS) will likely require standard engineering optimizations like model compilation, quantization, and parallelization.
7.3. Personal Insights & Critique
This paper presents a clever and highly effective solution to a practical and important problem. Its contributions are both empirical and conceptual.
- Personal Insights:
- The concept of asymmetric distillation is the most inspiring takeaway. It's a powerful idea that a student model with architectural constraints (like causality) can learn more effectively from a teacher without those constraints. This paradigm could be highly valuable in other domains beyond video, such as in robotics (distilling a slow, non-causal planner into a fast, real-time policy) or language modeling (distilling a bidirectional model into an autoregressive one for faster inference).
- The work is a testament to the power of distillation not just for acceleration, but also for regularization and quality improvement. The fact that the student model outperformed its much larger teacher on some metrics suggests that distillation can help a model focus on the most salient features of the data distribution.
- Critique and Potential Issues:
  - Reproducibility: The reliance on a large, internal, and proprietary dataset is a significant weakness from an academic standpoint. While this is common in industrial research labs, it prevents the broader community from reproducing the training process and building directly upon the work.
  - Diversity Trade-off: The paper mentions reduced output diversity as a limitation but does not provide a quantitative analysis. For creative applications, a lack of diversity can be a major drawback, as users often want to explore a wide range of outputs. It would be valuable to see an evaluation of this trade-off.
  - "Competes with Bidirectional Diffusion": This is a strong claim. While the VBench scores are excellent, the ablation study (Table 4) shows the final model's edge over the original bidirectional teacher in Temporal Quality is only marginal (94.7 vs. 94.6). It is clearly better on Frame Quality and Text Alignment, but whether it fully "competes" on every axis is nuanced. However, given the massive speedup, any small trade-off is more than justified.

Overall, this paper represents a significant advancement in generative AI. It elegantly bridges the gap between quality and speed in video generation, paving the way for a new class of interactive and creative tools. Its core methodological innovations are likely to have a lasting impact on the field.