Text-to-Video Generation
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Published: 10/31/2025
Tags: Distribution Matching Distillation, Multi-Step Distillation Framework, Enhancing Generative Model Capacity, Mixture-of-Experts Architecture, Text-to-Video Generation
The Phased DMD method enhances model capacity in complex generative tasks by improving on traditional one-step models. It uses score matching within SNR subintervals, reducing learning difficulty while addressing instability and efficiency issues in multi-step distillation.
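The core idea of "score matching within SNR subintervals" can be illustrated with a toy sketch: partition the diffusion timestep schedule into a few contiguous phases so each distillation stage only has to match scores inside its own subinterval. The helper below is a hypothetical simplification (the paper partitions by signal-to-noise ratio, not raw timestep index), assuming only numpy.

```python
import numpy as np

def snr_subintervals(timesteps, num_phases):
    """Split a descending timestep schedule into contiguous subintervals,
    one per distillation phase. Hypothetical helper: Phased DMD partitions
    by SNR; raw timestep indices stand in for SNR here for simplicity."""
    return np.array_split(timesteps, num_phases)

# toy schedule: 1000 diffusion steps distilled into 4 phases
timesteps = np.arange(999, -1, -1)
phases = snr_subintervals(timesteps, 4)
# each phase's student step only matches scores inside its subinterval,
# narrowing the distribution each stage must learn
```

Restricting each phase to one subinterval is what lowers the per-stage learning difficulty relative to distilling the full noise range at once.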
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Published: 4/3/2024
Tags: Video Generation Control, Camera Trajectory Parameterization, Diffusion Model Camera Control, Text-to-Video Generation, Controllable Video Generation
This paper presents CameraCtrl, a method for precise camera pose control in video generation. By utilizing an effective camera trajectory parameterization and a plug-and-play control module, CameraCtrl enhances user controllability and creative expression without affecting other…
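One way to parameterize a camera trajectory as a dense input a control module can consume is the Plücker embedding of per-pixel rays, which CameraCtrl adopts: each pixel's ray (origin o, direction d) becomes the 6-channel pair (o × d, d). A minimal sketch assuming numpy; shapes and the helper name are illustrative, not CameraCtrl's actual API.

```python
import numpy as np

def plucker_embedding(origin, directions):
    """Per-pixel Plücker ray embedding (o x d, d): encodes a camera pose
    as a dense 6-channel map. `origin` is the camera center (3,),
    `directions` holds per-pixel ray directions (H, W, 3)."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(origin, d.shape), d)
    return np.concatenate([moment, d], axis=-1)  # (H, W, 6)

# toy 2x2 image whose rays all point along +z from an offset camera center
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))
emb = plucker_embedding(np.array([1.0, 0.0, 0.0]), dirs)
```

Because the embedding has the same spatial layout as the video frames, it can be fed to a control branch alongside the latents, one 6-channel map per frame of the trajectory.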
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
Published: 12/2/2024
Tags: Customized Motion Transfer, Multimodal Large Language Model, video diffusion models, Motion Modeling, Text-to-Video Generation
MoTrans introduces a customized motion transfer method using a multimodal large language model recaptioner and an appearance injection module, effectively transferring specific human-centric motions from reference videos to new contexts, outperforming existing techniques.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Published: 8/12/2024
Tags: Text-to-Video Generation, Diffusion Models, Diffusion Transformer, 3D Variational Autoencoder, Video Generation Quality Improvement
CogVideoX is a large-scale text-to-video model using a diffusion transformer that generates 10-second videos at 16 fps and 768×1360 resolution. It addresses coherence and semantic alignment issues with methods like a 3D VAE and expert transformers, achieving significant quality improvements.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Published: 11/26/2023
Tags: Diffusion Models, Video Generation Models, Text-to-Video Generation, High-Quality Video Fine-Tuning, Video Dataset Curation
The paper presents Stable Video Diffusion (SVD), a model for high-resolution text-to-video and image-to-video generation. It evaluates a three-stage training process and highlights the importance of well-curated datasets for high-quality video generation, demonstrating strong performance.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Published: 3/22/2024
Tags: Text-to-Video Generation, Long Video Generation, Autoregressive Video Generation, Conditional Attention Mechanism, Video Enhancement Application
This paper presents StreamingT2V, an autoregressive method for long video generation, addressing limitations of existing text-to-video models. It utilizes a Conditional Attention Module and an Appearance Preservation Module for smooth transitions and scene feature retention, alongside…
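The autoregressive scheme can be sketched as a loop that conditions each new chunk on the last few frames generated so far, which is roughly the role the Conditional Attention Module plays for smooth transitions. Everything here is a simplified stand-in, assuming a hypothetical `generate_chunk` model call; frame counts are arbitrary.

```python
def generate_long_video(generate_chunk, num_chunks, overlap=8, chunk_len=16):
    """Autoregressive long-video sketch in the spirit of StreamingT2V:
    each new chunk is conditioned on the trailing `overlap` frames of the
    video so far, then only its novel frames are appended."""
    video = generate_chunk(condition=None, length=chunk_len)
    for _ in range(num_chunks - 1):
        condition = video[-overlap:]
        new_chunk = generate_chunk(condition=condition, length=chunk_len)
        video.extend(new_chunk[overlap:])  # drop the re-generated overlap
    return video

# stub "model" that labels frames with their global index, so the
# stitching logic can be checked without a real generator
def stub_model(condition, length):
    if condition is None:
        return list(range(length))
    nxt = condition[-1] + 1
    return list(condition) + list(range(nxt, nxt + length - len(condition)))

video = generate_long_video(stub_model, num_chunks=3)
```

Conditioning on the overlap rather than generating each chunk independently is what keeps chunk boundaries from producing abrupt scene changes.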
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Published: 10/6/2022
Tags: Text-to-Video Generation, Long Video Generation, Variable Length Video Generation, Joint Training on Image-Text Pairs, Temporal Encoding with Transformer
Phenaki is a model designed for generating variable-length videos from text prompts. It introduces a novel video representation method using causal attention and a bidirectional masked transformer, overcoming challenges of computational cost and data scarcity while improving temporal coherence.