CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Published: 03/31/2026

Analysis

~21 min read · 28,797 chars

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The paper's title is CutClaw: Agentic Hours-Long Video Editing via Music Synchronization. Its central topic is the development of an autonomous multi-agent framework powered by Multimodal Language Models (MLLMs) that automates the editing of hours-long untrimmed raw video footage into high-quality short videos, with strict alignment to input music rhythm, user instructions, and professional aesthetic standards.

1.2. Authors

The authors and their affiliations are:

  • Shifang Zhao: Beijing Jiaotong University; GVC Lab (Great Bay University)
  • Yihan Hu: GVC Lab (Great Bay University)
  • Ying Shan: ARC Lab (Tencent)
  • Yunchao Wei (corresponding author): Beijing Jiaotong University
  • Xiaodong Cun: GVC Lab (Great Bay University)

The research team has expertise in computer vision, multimodal AI, and video processing, with affiliations spanning academia and industrial research labs.

1.3. Journal/Conference

As of this writing, the paper is available as a preprint on arXiv, a widely used open-access preprint server for computer science, physics, and related fields. It has not yet undergone peer review or been accepted for publication in a formal conference or journal.

1.4. Publication Year

The paper was published on arXiv on 31 March 2026 (UTC), so its publication year is 2026.

1.5. Abstract

The paper addresses the problem that manual video editing is extremely time-consuming and repetitive for content creators. It introduces CutClaw, a multi-agent MLLM framework designed to edit hours-long raw footage into meaningful short videos with synchronized music, faithful instruction following, and aesthetic appeal. The core workflow includes: (1) hierarchical multimodal decomposition of raw video and audio to capture both fine-grained details and global structure, overcoming MLLM context length limits; (2) a Playwriter Agent that orchestrates the narrative flow, anchoring visual scenes to musical shifts to ensure narrative consistency; (3) collaborative optimization by Editor and Reviewer Agents to select fine-grained visual content against aesthetic and semantic criteria. Extensive experiments show CutClaw significantly outperforms state-of-the-art baselines across all key quality metrics.

2. Executive Summary

2.1. Background & Motivation

Core Problem

The paper aims to solve the challenge of automated professional-level audio-driven video editing for hours-long untrimmed footage. Manual editing of long footage to produce music-synced short videos requires hundreds of hours of labor, relying on human expertise in narrative construction, aesthetic judgment, and audio-visual synchronization.

Importance and Gaps in Prior Research

Automated video editing is a high-demand task for social media content creation, film post-production, and vlog editing, but existing automated methods have critical flaws:

  1. Template-based methods: Force clips into rigid pre-defined slots, lack audio-visual synchronization and semantic awareness, and produce repetitive outputs with no narrative progression.
  2. Highlight detection methods: Optimize for local visual salience but are audio-agnostic; they treat clips in isolation and fail to build globally coherent narratives.
  3. Text-based editing methods: Align visuals to transcripts/instructions but ignore musical structure, disrupting rhythmic and affective alignment.

All existing methods fail to balance the dual constraints of global narrative coherence and fine-grained audio-visual harmony required for professional-quality edits. The paper identifies three specific technical challenges for this task:

  1. Context length limitation: Hours of dense visual information exceed the context window of current MLLMs.
  2. Context-grounded storytelling: Reconciling user instructions with the intrinsic semantics of raw video and audio to build a coherent narrative without decoupling from source material.
  3. Fine-grained cross-modal alignment: Synchronizing visual cuts to musical shifts while maintaining narrative logic, aesthetic quality, and instruction fidelity.

Innovative Entry Point

The paper proposes a multi-agent framework that mimics the professional human post-production workflow, using a coarse-to-fine hierarchy to break the large, intractable search space of possible edits into manageable specialized subtasks handled by dedicated agents.

2.2. Main Contributions / Findings

The paper's three core contributions are:

  1. It formally defines the novel task of audio-driven long-form video editing as a joint optimization problem that simultaneously satisfies instruction-driven storytelling and fine-grained rhythmic harmony.
  2. It introduces CutClaw, an MLLM-powered multi-agent framework that handles hours-long footage via bottom-up multimodal deconstruction, a music-anchored Playwriter agent for narrative planning, and collaborative Editor/Reviewer agents for precise segment selection.
  3. Extensive quantitative experiments and user studies demonstrate that CutClaw significantly outperforms state-of-the-art baselines in visual quality, instruction following, and audio-visual harmony, with editing naturalness close to professional human editors.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, beginners need to grasp the following core concepts:

  • Multimodal Language Model (MLLM): A type of large language model extended to process and understand multiple input modalities (text, images, video, audio) instead of only text. MLLMs are used in this work to analyze video content, music structure, and user instructions.
  • Shot Boundary Detection: A preprocessing step for video analysis that identifies timestamps where one camera shot ends and another begins (cuts, fades, dissolves), splitting raw video into atomic visual units for further processing.
  • Video Temporal Grounding (VTG): The task of locating the exact start and end timestamps of a segment in raw video that matches a given text query or requirement.
  • Highlight Detection: The task of identifying the most salient, interesting, or important segments in a video, typically used to create short summaries.
  • Multi-agent System: A system where multiple independent AI agents with specialized roles collaborate to complete a complex task. Each agent handles a specific subtask, communicates with other agents, and contributes to the final output.
  • Downbeats: The first, most perceptually salient beat of each bar in a piece of music, commonly used as natural cut points in music video editing.

3.2. Previous Works

The paper categorizes prior research into three core domains:

  1. AI-assisted Video Editing: Early works like Write-A-Video (2019) and ESA (2025) formulated editing as an energy minimization problem to align shots to thematic cues. Recent generative methods use high-level text instructions or subtitles to assemble clips, but rely on pre-segmented clips, require explicit scripts, and completely ignore music rhythm. The representative baseline in this category is NarratoAI (2025), an open-source subtitle-driven editing framework that generates clips based on text instructions but fails on VLOG footage without dense speech/subtitles.
  2. Video Temporal Grounding and Highlight Detection: Conventional VTG methods use pre-trained feature encoders to localize segments matching text queries, while recent works like Time-R1 (2025) use MLLMs to improve instruction understanding. Highlight detection methods evolved from visual saliency scoring to text prompt-guided selection, but both lines of research struggle to model long-term context of hours-long footage, and cannot precisely control segment duration to match music rhythm. Representative baselines here include Time-R1 (SOTA temporal grounding) and UVCOM (2024, SOTA unified VTG/highlight detection framework).
  3. Agents for Video Generation and Editing: Recent multi-agent frameworks like EditDuet (2025) support non-linear video editing, but are limited by MLLM context windows when processing long footage, and fail to achieve precise audio-visual synchronization due to coarse LLM planning.

3.3. Technological Evolution

The evolution of automated video editing follows this timeline:

  1. Manual professional editing tools (Adobe Premiere Pro, DaVinci Resolve, CapCut)
  2. Template-based automated editing for simple use cases
  3. AI-powered highlight detection and temporal grounding for clip selection
  4. Text-driven generative editing for narrative alignment
  5. This work's multi-agent music-synced editing for long-form footage, which fills the gap of combining long context handling, multi-agent collaboration, and precise audio-visual synchronization for professional-level edits.

3.4. Differentiation Analysis

CutClaw has three core differences from prior methods:

  1. It directly processes hours-long untrimmed raw footage, with no requirement for pre-segmented clips, manual scripts, or dense subtitles.
  2. It explicitly uses music structure as the invariant temporal anchor for editing, ensuring precise rhythmic alignment that all prior methods ignore.
  3. Its specialized multi-agent workflow (Playwriter, Editor, Reviewer) mimics professional post-production, balancing global narrative coherence and local aesthetic/alignment constraints better than single-model or single-agent systems.

4. Methodology

4.1. Principles

The core idea of CutClaw is to mimic the professional human video editing workflow, breaking the complex, computationally intractable task of editing hours of footage into specialized, manageable subtasks handled by dedicated agents, using a coarse-to-fine hierarchy. The theoretical intuition is that decomposing the huge search space of possible edits into structured steps reduces computational complexity, while specialized agents can optimize for different constraints (narrative, sync, aesthetic) far better than a single generalist model.

4.2. Core Methodology In-depth

4.2.1. Problem Formulation

Given three inputs: raw video footage $\nu$, a background music track $\mathcal{M}$, and user text instructions $\mathcal{L}$, the goal is to generate an edited video composed of a sequence of selected clips $\mathcal{E} = (c_1, c_2, \ldots, c_N)$, where each clip $c_i = (t_i^{in}, t_i^{out})$ is a continuous segment extracted from the raw video $\nu$. The optimal edited timeline $\mathcal{E}^*$ is found by maximizing the following joint objective function:

$$\mathcal{E}^* = \arg\max_{\mathcal{E}} \left( \lambda_v \mathcal{Q}_{\mathrm{vis}}(\mathcal{E}) + \lambda_n \mathcal{Q}_{\mathrm{narr}}(\mathcal{E}) + \lambda_c \mathcal{Q}_{\mathrm{cond}}(\mathcal{E}, \mathcal{L}) + \lambda_s \mathcal{Q}_{\mathrm{sync}}(\mathcal{E}, \mathcal{M}) \right)$$

Symbol explanation:

  • $\lambda_v, \lambda_n, \lambda_c, \lambda_s$: Hyperparameter weights that balance the relative importance of each objective term.

  • $\mathcal{Q}_{\mathrm{vis}}(\mathcal{E})$: Visual Quality score, measuring aesthetic appeal, protagonist prominence, and absence of visual degradation (blurriness, bad lighting, etc.) in the edited timeline.

  • $\mathcal{Q}_{\mathrm{narr}}(\mathcal{E})$: Narrative Flow score, measuring logical coherence and story progression between adjacent clips.

  • $\mathcal{Q}_{\mathrm{cond}}(\mathcal{E}, \mathcal{L})$: Semantic Alignment score, measuring how well the selected content matches the user's text instructions $\mathcal{L}$.

  • $\mathcal{Q}_{\mathrm{sync}}(\mathcal{E}, \mathcal{M})$: Rhythmic Alignment score, measuring how precisely visual cut points align with musical beats, shifts, and structure in the input music $\mathcal{M}$.

Instead of brute-force searching the enormous space of possible timelines, CutClaw uses a four-stage hierarchical workflow to approximate the optimal solution, as illustrated in Figure 2 from the original paper:

Fig. 2: The whole workflow of CutClaw. The multimodal footage is first deconstructed; then the shot plan is generated by the Playwriter, scene retrieval and editing are performed by the Editor, and quality validation by the Reviewer. (The figure illustrates the deconstruction and editing of multimodal footage, including scene-level captioning, plot construction, music structure parsing, and quality review, carried out collaboratively by the Playwriter, Editor, and Reviewer.)

The four stages are: (1) Bottom-Up Multimodal Footage Deconstruction, (2) Playwriter: Music-Anchored Script Synthesis, (3) Editor: Top-Down Hierarchical Visual Grounding, (4) Reviewer: Multi-Criteria Validity Gate.
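The joint objective above can be sketched as a simple weighted scoring function over candidate timelines. The weights and per-term scores below are illustrative placeholders; in CutClaw each quality term is judged by MLLM-based agents, and the search is hierarchical rather than an exhaustive comparison:

```python
# Sketch of the joint editing objective: a weighted sum of the four quality
# terms Q_vis, Q_narr, Q_cond, Q_sync for a candidate timeline E.
# All numbers here are illustrative placeholders.

def joint_objective(scores, weights):
    """scores:  dict with 'vis', 'narr', 'cond', 'sync' quality terms in [0, 1].
       weights: dict with the matching lambda weights."""
    return sum(weights[k] * scores[k] for k in ("vis", "narr", "cond", "sync"))

weights = {"vis": 0.3, "narr": 0.3, "cond": 0.2, "sync": 0.2}
cand_a = {"vis": 0.8, "narr": 0.6, "cond": 0.9, "sync": 0.5}  # pretty but off-beat
cand_b = {"vis": 0.7, "narr": 0.8, "cond": 0.6, "sync": 0.9}  # coherent and on-beat
best = max([cand_a, cand_b], key=lambda s: joint_objective(s, weights))
print(best is cand_b)  # -> True: the better-synchronized timeline wins here
```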


4.2.2. Bottom-Up Multimodal Footage Deconstruction

This stage discretizes continuous raw video and audio into structured semantic units to reduce the search space and overcome MLLM context length limitations. It has two submodules:

4.2.2.1. Video Shots Aggregation

This submodule splits raw video into atomic shots, then aggregates semantically similar shots into coherent scenes, as shown in the left panel of Figure 3:

Fig. 3: Left: Video Shots Aggregation. We first perform shot detection on the entire video and generate a caption for each shot for detailed understanding; an LLM then aggregates similar content into a scene-level description. Right: The workflow of the Playwriter, which generates the whole storyline from the inputs and produces a detailed shot plan for each music segment. (The figure shows shot-structure parsing and semantic scene aggregation on the left, and music-structure-aligned shot plans and scene descriptions on the right; these steps keep the video content consistent with and narratively aligned to the music.)

Steps:

  1. Shot Parsing: Use shot boundary detection (PySceneDetect) to split raw video $\nu$ into atomic shots $S$, where each shot is a continuous visual unit bounded by camera cuts. For each shot $s_i$, use an MLLM to extract semantic attributes $\mathcal{A}(s_i)$ covering cinematography, character dynamics, and environment.
  2. Scene Aggregation: Group adjacent shots into contiguous, spatio-temporally coherent scenes by calculating the transition similarity between consecutive shots: $$\mathrm{Sim}(s_i, s_{i+1}) = \alpha^\top \mathbf{v}_{i,i+1}$$ Symbol explanation:
    • $\mathbf{v}_{i,i+1}$: Attribute-wise similarity vector between shots $s_i$ and $s_{i+1}$, derived from LLM embeddings of their semantic attributes.
    • $\alpha$: Weight vector balancing the importance of different attribute types (e.g., character similarity, location similarity, action similarity).
    A scene boundary is created whenever the similarity score drops below a predefined threshold $\tau$, splitting the full video into a set of scenes $\mathcal{Z}$.
  3. Character-Aware Grounding: Use Automatic Speech Recognition (ASR) to analyze video dialogue and infer character identities $\mathcal{H}$ (names, roles). Inject these identities into the MLLM during scene analysis to generate grounded scene descriptions $\mathcal{D}(z_j)$ that use specific character names instead of generic terms (e.g., "Joker" instead of "a man"), enabling reliable cross-scene character tracking.
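The scene-aggregation rule in step 2 can be sketched in a few lines. The attribute-similarity vectors, weight values, and threshold below are illustrative stand-ins; in the paper, the similarity vectors come from LLM embeddings of the shots' semantic attributes:

```python
# Minimal sketch of scene aggregation via Sim(s_i, s_{i+1}) = alpha^T v_{i,i+1}.
# All numeric values here are illustrative, not from the paper.

def aggregate_scenes(sim_vectors, alpha, tau):
    """Group consecutive shots into scenes; a boundary is placed wherever
    the weighted transition similarity drops below threshold tau.

    sim_vectors: attribute-similarity vectors, one per consecutive shot
                 pair (s_i, s_{i+1}); length = num_shots - 1.
    alpha:       weight vector over attribute types (character, location, ...).
    """
    scenes, current = [], [0]          # start the first scene with shot 0
    for i, v in enumerate(sim_vectors):
        sim = sum(a * x for a, x in zip(alpha, v))  # alpha^T v_{i,i+1}
        if sim < tau:                  # similarity drop => new scene boundary
            scenes.append(current)
            current = []
        current.append(i + 1)
    scenes.append(current)
    return scenes

# Toy run: 4 shots, with a similarity drop between shots 1 and 2.
alpha = [0.5, 0.3, 0.2]               # character / location / action weights
sims = [[0.9, 0.8, 0.9], [0.2, 0.1, 0.3], [0.85, 0.9, 0.8]]
print(aggregate_scenes(sims, alpha, tau=0.5))  # -> [[0, 1], [2, 3]]
```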

4.2.2.2. Structural Audio Parsing

This submodule converts continuous music waveform into discrete cut points and structural units to enable precise rhythmic alignment:

  1. Hierarchical Keypoint Detection: Extract three types of perceptually salient sound keypoints on the music time axis:
    • $\mathcal{K}_{db}$: Downbeats (bar-level rhythmic accents)
    • $\mathcal{K}_{pc}$: Pitch Changes (melodic transitions)
    • $\mathcal{K}_{se}$: Spectral Energy Changes (timbral/volume transitions)
    Combine all keypoints into a candidate pool $\mathcal{K}_0 = \mathcal{K}_{db} \cup \mathcal{K}_{pc} \cup \mathcal{K}_{se}$, then apply temporal filtering $\Phi(\cdot)$ (peak de-duplication, removing keypoints that are too close together) to obtain robust cut boundaries $\mathcal{K} = \Phi(\mathcal{K}_0)$.
  2. Structure-Guided Refinement: Use an MLLM to split the music into high-level structural units $\mathcal{U} = \{u_j\}_{j=1}^M$ (e.g., verse, chorus, bridge). For each structural unit $u_j$, score each contained keypoint $t \in \mathcal{K} \cap u_j$ to retain only the most significant boundaries: $$\mathrm{score}(t) = \beta^\top \mathbf{i}(t), \quad \text{where } \mathbf{i}(t) = [\mathrm{int}_{db}(t), \mathrm{int}_{pc}(t), \mathrm{int}_{se}(t)]^\top$$ Symbol explanation:
    • $\mathbf{i}(t)$: Intensity vector of the three keypoint types at time $t$.
    • $\mathrm{int}_{db}(t)$: Perceptual intensity of the downbeat at time $t$.
    • $\mathrm{int}_{pc}(t)$: Magnitude of the pitch change at time $t$.
    • $\mathrm{int}_{se}(t)$: Magnitude of the spectral energy change at time $t$.
    • $\beta$: Weight vector balancing the importance of each keypoint type.
    Finally, generate structure-aligned captions for each music unit, describing local rhythm, emotion, and energy to guide visual matching.
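The keypoint pooling, temporal filtering, and scoring steps above can be sketched as follows. The timestamps, intensity values, weights, and minimum-gap setting are illustrative assumptions, not values from the paper:

```python
# Sketch of the audio keypoint pipeline: pool downbeat / pitch-change /
# spectral-energy keypoints, de-duplicate points that are too close (Phi),
# then score each survivor with score(t) = beta^T i(t).

def filter_keypoints(times, min_gap=0.25):
    """Phi: keep a keypoint only if it lies at least min_gap seconds after
    the previously kept one (simple peak de-duplication)."""
    kept = []
    for t in sorted(times):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept

def score(intensities, beta):
    """score(t) = beta^T i(t), with i(t) = [int_db, int_pc, int_se]."""
    return sum(b * x for b, x in zip(beta, intensities))

# Illustrative keypoint pools (seconds); near-duplicates sit ~0.02 s apart.
K_db, K_pc, K_se = [0.0, 2.0, 4.0], [1.98, 3.1], [0.05, 3.12]
K0 = set(K_db) | set(K_pc) | set(K_se)   # candidate pool K_0
K = filter_keypoints(K0, min_gap=0.25)
print(K)  # -> [0.0, 1.98, 3.1, 4.0]: each cluster collapses to one point

beta = [0.5, 0.3, 0.2]                    # illustrative per-type weights
print(score([1.0, 0.4, 0.7], beta))       # weighted intensity of one keypoint
```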

4.2.3. Playwriter: Music-Anchored Script Synthesis

This agent uses the music structural units as an invariant temporal anchor to create the global narrative plan, aligning user instructions, video scenes, and music structure, as shown in the right panel of Figure 3. It follows two strict execution rules to ensure validity:

  1. Disjoint Resource Allocation: Video scenes allocated to different music units cannot overlap: $\mathcal{Z}_{u_j} \cap \mathcal{Z}_{u_k} = \emptyset$ for all $j \neq k$, where $\mathcal{Z}_{u_j}$ is the subset of video scenes allocated to music unit $u_j$. This prevents reuse of source material across different narrative blocks.

  2. Structural Temporal Anchoring: The total planned duration for each music unit $u_j$ exactly matches the unit's actual duration: $\sum_{p \in P_j} \mathrm{Duration}(p) = |u_j|$, where $P_j$ is the shot plan for unit $u_j$. This ensures the final edit is perfectly aligned to the music length.

The Playwriter workflow has two steps:

  1. Structural Scene Allocation: Map each musical structural unit $u_j$ to a subset of video scenes $\mathcal{Z}_{u_j}$ conditioned on user instructions $\mathcal{L}$: $$\mathcal{Z}_{u_j} = \Phi_{\mathrm{macro}}(u_j, \mathcal{L} \mid \mathcal{Z})$$ where $\Phi_{\mathrm{macro}}$ is the LLM-based global planning function. If the generated allocation has overlapping scenes, it is rejected and regenerated with negative constraints to avoid reuse.

  2. Keypoint-Aligned Shot Planning: Refine the scene allocation into executable shot specifications for each fine-grained music segment $k_i$ in unit $u_j$. Each specification $p_i = (\tau_i, z_{id}, d_i)$ defines constraints for the downstream Editor:

    • $\tau_i$: Target duration of the shot, derived directly from the length of music segment $k_i$ to ensure rhythmic alignment.

    • $z_{id}$: Index of the source scene from $\mathcal{Z}_{u_j}$ that the shot must be selected from, pruning the search space for the Editor.

    • $d_i$: Semantic visual description (required plot, emotion, character) guiding content matching within the selected scene.
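The Playwriter's two hard validity rules can be expressed as a small checker. The data structures below (dicts keyed by music-unit id) are an assumed representation for illustration, not the paper's actual interface:

```python
# Sketch of the Playwriter's two hard validity checks: disjoint scene
# allocation across music units, and per-unit planned duration exactly
# matching the unit's length |u_j|.

def validate_plan(allocations, shot_plans, unit_durations, tol=1e-6):
    """allocations:    {unit_id: set of scene ids Z_{u_j}}
       shot_plans:     {unit_id: list of planned shot durations (seconds)}
       unit_durations: {unit_id: |u_j| in seconds}
    Returns a list of violated constraints (empty list => plan is valid)."""
    errors = []
    # Rule 1: Disjoint Resource Allocation (no scene reused across units).
    seen = {}
    for uid, scenes in allocations.items():
        for z in scenes:
            if z in seen:
                errors.append(f"scene {z} reused in units {seen[z]} and {uid}")
            seen[z] = uid
    # Rule 2: Structural Temporal Anchoring (durations must sum to |u_j|).
    for uid, durations in shot_plans.items():
        if abs(sum(durations) - unit_durations[uid]) > tol:
            errors.append(f"unit {uid}: planned {sum(durations)}s != {unit_durations[uid]}s")
    return errors

alloc = {"verse": {0, 1}, "chorus": {2}}
plans = {"verse": [2.0, 3.0], "chorus": [4.0]}
units = {"verse": 5.0, "chorus": 4.0}
print(validate_plan(alloc, plans, units))  # -> [] (valid plan)
```

In the full system, a non-empty error list would trigger regeneration of the allocation with negative constraints, as described in step 1 above.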


4.2.4. Editor: Top-Down Hierarchical Visual Grounding

This ReAct (Reasoning + Acting) agent uses the shot plan from the Playwriter to find the exact timestamps of clips that maximize visual quality and semantic alignment, as shown in Figure 4:

Fig. 4: Editor and Reviewer perform segment selection and validation. SNR stands for Semantic Neighborhood Retrieval, and FGST stands for Fine-Grained Shot Trimming. (The figure shows the Editor and Reviewer workflow in CutClaw, including the SNR and FGST steps, along with the review checks on shot length, aesthetic score, and protagonist ratio.)

The Editor has three core actions:

  1. Semantic Neighborhood Retrieval (SNR): Initialize the local search space $\Omega_i$ as all shots in the assigned scene $z_{id}$. If no suitable clips are found in this primary space, expand the search to adjacent scenes to avoid retrieval dead-ends: $$\Omega_i' = \Omega_i \cup \{ s \mid s \in \mathrm{Neighbor}(z_{id}, \Delta) \}$$ where $\Delta$ is the maximum number of adjacent scenes to include in the expanded search space.

  2. Fine-Grained Shot Trimming (FGST): For each candidate shot $s \in \Omega_i$, find the sub-segment $c_i \subset s$ of exact duration $\tau_i$ that maximizes the local quality score: $$c_i^* = \underset{c \subset s, |c| = \tau_i}{\arg\max} \left( \alpha \cdot S_{\mathrm{aes}}(c) + \beta \cdot R_{\mathrm{prot}}(c \mid \mathcal{H}) \right)$$ Symbol explanation:

    • $S_{\mathrm{aes}}(c)$: Aesthetic score of the clip $c$, contributing to the $\mathcal{Q}_{\mathrm{vis}}$ objective.
    • $R_{\mathrm{prot}}(c \mid \mathcal{H})$: Protagonist Presence Ratio of clip $c$, calculated by cross-referencing frame content with the character identity set $\mathcal{H}$ defined in the deconstruction stage, contributing to the $\mathcal{Q}_{\mathrm{cond}}$ objective.
    • $\alpha, \beta$: Weight parameters balancing the relative importance of aesthetic quality and protagonist presence.
    If the resulting clip has a suboptimal score, the Editor shifts the temporal window based on VLM feedback until a high-quality clip is found.
  3. Commit: Submit the trimmed candidate $c_i$ to the Reviewer agent. If approved, the clip is added to the final timeline; if rejected, the Editor backtracks to explore alternative intervals in the search space $\Omega_i$.
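At its core, FGST reduces to a sliding-window argmax over candidate sub-segments. The sketch below uses illustrative per-frame scores in place of the MLLM-derived aesthetic and protagonist-presence scores, and frame counts in place of durations:

```python
# Sketch of Fine-Grained Shot Trimming: slide a window of exact length tau
# over a candidate shot and keep the start that maximizes
# alpha * S_aes + beta * R_prot. Scores and weights are illustrative.

def trim_shot(aes, prot, tau, alpha=0.6, beta=0.4):
    """aes, prot: per-frame aesthetic / protagonist-presence scores.
    tau: window length in frames. Returns (best_start, best_score)."""
    best_start, best_score = 0, float("-inf")
    for start in range(len(aes) - tau + 1):
        window_aes = sum(aes[start:start + tau]) / tau    # mean S_aes
        window_prot = sum(prot[start:start + tau]) / tau  # mean R_prot
        s = alpha * window_aes + beta * window_prot
        if s > best_score:
            best_start, best_score = start, s
    return best_start, best_score

aes = [0.2, 0.9, 0.9, 0.3, 0.1]   # frame-level aesthetic scores
prot = [0.0, 1.0, 1.0, 1.0, 0.0]  # protagonist visible in frames 1-3
start, best = trim_shot(aes, prot, tau=2)
print(start)  # -> 1 (the window covering the two high-quality frames)
```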


4.2.5. Reviewer: Multi-Criteria Validity Gate

This agent acts as a quality control gate, auditing every candidate clip from the Editor to ensure all constraints are met. It performs three checks:

  1. Semantic Identity Verification: Validates that the clip's visual subject aligns with the target character in $\mathcal{H}$, filtering out clips where the protagonist is occluded, unrecognizable, or only a background extra, to ensure narrative consistency.
  2. Temporal and Structural Integrity: Verifies two hard constraints: (a) Non-overlap: the clip does not overlap with any previously selected clips in the timeline. (b) Duration Fidelity: the clip's duration exactly matches the target $\tau_i$, and cut points align with the corresponding music keypoints. Any violation triggers immediate rejection.
  3. Perceptual Quality Assurance: Rejects clips with visual quality degradation (blurriness, poor lighting, shaky footage) to ensure broadcast-level viewing standards. If a clip is rejected, the Reviewer sends structured feedback to the Editor to guide selection of alternative clips.
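The Temporal and Structural Integrity check (non-overlap plus duration fidelity) can be sketched as follows; the interval representation and tolerance are assumptions for illustration:

```python
# Sketch of the Reviewer's Temporal and Structural Integrity gate:
# reject a candidate if its duration misses tau_i or if it overlaps
# any clip already committed to the timeline.

def integrity_check(candidate, timeline, target_tau, tol=1e-3):
    """candidate: (t_in, t_out) of the proposed clip in source time.
    timeline:  list of already-accepted (t_in, t_out) source intervals.
    Returns (accepted, reason)."""
    t_in, t_out = candidate
    # Duration Fidelity: |c| must equal tau_i (within tolerance).
    if abs((t_out - t_in) - target_tau) > tol:
        return False, "duration mismatch"
    # Non-overlap: reject if the clip intersects any committed interval.
    for a, b in timeline:
        if t_in < b and a < t_out:
            return False, f"overlaps committed clip ({a}, {b})"
    return True, "ok"

timeline = [(10.0, 12.5)]
print(integrity_check((12.5, 15.0), timeline, target_tau=2.5))  # -> (True, 'ok')
print(integrity_check((11.0, 13.5), timeline, target_tau=2.5))  # overlap -> rejected
```

A rejection here would be returned to the Editor as structured feedback, triggering the backtracking step described in the Editor's Commit action.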

An example of the full end-to-end workflow for a single shot using footage from Interstellar is shown in Figure 5:

Fig. 5: A sample execution of single-shot cutting, using footage from the movie "Interstellar" and the music "Moon." Actions performed by the Playwriter, Editor, and Reviewer are color-coded in blue, yellow, and green, respectively. The orange background traces the execution path leading to the final clip selection. (The figure shows the iterative drafting steps, including the Reviewer's and Editor's checks on shot length, aesthetic score, and protagonist ratio for each draft shot, with the time ranges and verdicts that lead to the final selection.)

5. Experimental Setup

5.1. Datasets

The authors constructed a custom benchmark specifically for agentic long-form video editing:

  • Source footage: 10 distinct source pairs, including 5 feature-length films and 5 long-duration VLOGs, with raw footage length ranging from 1 to 3 hours, totaling ~24 hours of footage. This covers both professionally cinematographed scripted content and unscripted naturalistic content.
  • Music inputs: 10 segmented music tracks spanning Pop, Jazz, OST, Rock, and R&B genres, with target edit durations ranging from 20 seconds to 1 minute.
  • Instruction types: Two distinct instruction categories to test different capabilities:
    1. Character-Centric (Obj) Instructions: Require the edit to focus exclusively on a single protagonist, testing the system's ability to maintain identity consistency.
    2. Narrative-Centric (Nar) Instructions: Require inclusion of multiple characters or complex interactions to convey a cohesive story, testing narrative construction capabilities.

Total evaluation cases: 20 (10 source pairs × 2 instruction types). This dataset is chosen for its diversity, ensuring robust evaluation across different content types, music genres, and editing requirements.

5.2. Evaluation Metrics

The paper uses four evaluation metrics, defined below:

1. Visual Quality

  1. Conceptual Definition: Measures the aesthetic appeal of the edited video, including absence of quality degradation, good composition, proper lighting, and consistent protagonist prominence.
  2. Quantitative Scoring:
    • Automated evaluation: Scored out of 100 by GPT-5.2, based on aesthetic integrity assessment.
    • User study: Percentage of users voting the method as best in this category.

2. Instruction Follow

  1. Conceptual Definition: Measures how well the edited video adheres to the user's text instructions, including correct character focus, required narrative content, and specified elements.
  2. Quantitative Scoring:
    • Automated evaluation: Scored out of 100 by GPT-5.2, split into Object (character-centric) and Narrative sub-metrics.
    • User study: Percentage of users voting the method as best in this category.

3. Audio-Visual (AV) Harmony

  1. Conceptual Definition: Measures how precisely visual cut points align with musical beats and shifts, ensuring natural rhythmic synchronization.
  2. Mathematical Formula (standard audio-visual alignment error): $$\Delta t = \frac{1}{N} \sum_{i=1}^{N} |t_{cut,i} - t_{music,i}|$$
  3. Symbol Explanation:
    • $\Delta t$: Average temporal offset between visual cuts and corresponding music keypoints; lower is better.
    • $N$: Total number of cuts in the edited video.
    • $t_{cut,i}$: Timestamp of the $i$-th visual cut.
    • $t_{music,i}$: Timestamp of the music keypoint nearest to the $i$-th cut.
    The paper rewards alignments where $\Delta t \leq 0.1$ s (perceptually unnoticeable to human viewers). Automated evaluation scores are out of 100, based on the proportion of cuts meeting this threshold. User study scores are the percentage of votes for best synchronization.
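A minimal sketch of this metric, computing the mean cut-to-keypoint offset and the fraction of cuts within the 0.1 s threshold; the cut and keypoint timestamps below are illustrative:

```python
# Sketch of the AV Harmony metric: for each visual cut, measure the offset
# to its nearest music keypoint, then report the mean offset and the
# proportion of cuts inside the perceptual threshold.

def av_harmony(cut_times, keypoint_times, threshold=0.1):
    """cut_times:      timestamps of visual cuts (seconds).
    keypoint_times: timestamps of music keypoints (seconds).
    Returns (mean offset Delta_t, fraction of cuts within threshold)."""
    offsets = [min(abs(t - k) for k in keypoint_times) for t in cut_times]
    mean_dt = sum(offsets) / len(offsets)
    hit_rate = sum(d <= threshold for d in offsets) / len(offsets)
    return mean_dt, hit_rate

cuts = [1.02, 2.5, 4.0]           # illustrative cut timestamps
keys = [1.0, 2.0, 4.0, 6.0]       # illustrative music keypoints
dt, rate = av_harmony(cuts, keys)
print(round(dt, 3), rate)         # mean offset, and share of cuts <= 0.1 s off
```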

4. Human-Likeness (User Study Only)

  1. Conceptual Definition: Measures how natural the editing pacing and narrative logic are, compared to work produced by professional human editors.
  2. Scoring: Percentage of users voting the method as most human-like.

5.3. Baselines

The paper compares CutClaw against three representative state-of-the-art baselines covering all existing editing paradigms:

  1. NarratoAI (2025): Mainstream open-source subtitle-driven editing framework, representative of text-driven editing paradigms. It uses full video subtitles to generate clips based on text instructions, but cannot handle VLOG footage without dense speech/subtitles.
  2. UVCOM (2024): State-of-the-art unified framework for moment retrieval and highlight detection, representative of highlight detection paradigms. Adapted for long-form footage by first segmenting the source video, selecting the top 5 highest confidence clips, then trimming to match target duration.
  3. Time-R1 (2025): State-of-the-art temporal video grounding model, representative of VTG paradigms. Adapted for long-form footage using the same segmentation and trimming workflow as UVCOM.

These baselines are chosen because they represent the best-performing existing methods across all automated editing paradigms, enabling a fair and comprehensive comparison.

6. Results & Analysis

6.1. Core Results Analysis

Quantitative Results

The following are the results from Table 1 of the original paper:

Method Visual Quality Instruction Follow AV Harmony
Film Vlog Avg. Obj Nar Avg. Film Vlog Avg.
NarratoAI 75.7 - 75.7 56.0 72.0 64.0 84.9 - 84.9
UVCOM 71.2 73.6 72.4 60.8 64.5 62.6 78.9 79.7 79.3
Time-R1 73.3 72.6 72.9 51.9 71.0 61.5 77.0 75.8 76.4
CutClaw 79.2 76.0 77.6 66.6 73.4 70.0 85.7 87.3 86.5

CutClaw outperforms all baselines across every metric:

  • Visual Quality: Average 77.6, 1.9 points above the second-best NarratoAI (75.7), leading on both film content (79.2 vs. 75.7) and VLOG content (76.0 vs. 73.6 for UVCOM), demonstrating superior aesthetic performance across scripted and unscripted footage.
  • Instruction Follow: Average 70.0, 6 points higher than NarratoAI, with particularly strong performance on character-centric (Obj) instructions (66.6 vs. 60.8 for second-best UVCOM), confirming precise visual content localization and identity consistency.
  • AV Harmony: Average 86.5, 1.6 points above the second-best NarratoAI (84.9), and 7.6 points higher than UVCOM on VLOG content (87.3 vs. 79.7), validating precise rhythmic alignment to music.

User Study Results

The following are the results from Table 3 of the original paper:

Method Visual Quality Instruction Follow Audio-Visual Harmony Human-Like
Nar Obj Film Vlog Avg Nar Obj Film Vlog Avg Nar Obj Film Vlog Avg Nar Obj Film Vlog Avg
NarratoAI* 11.6% 11.2% 27.3% - 11.4% 14.8% 10.8% 28.7% - 12.8% 11.2% 12.4% 25.3% - 11.8% 12.0% 8.4% 23.3% - 10.2%
UVCOM 18.0% 16.8% 7.3% 23.6% 17.4% 18.8% 13.2% 7.3% 22.0% 16.0% 17.2% 13.2% 6.0% 22.8% 15.2% 18.8% 15.6% 9.3% 23.6% 17.2%
Time-R1 25.2% 17.6% 18.0% 22.4% 21.4% 22.4% 19.6% 16.7% 24.0% 21.0% 23.2% 16.8% 17.3% 18.4% 20.0% 24.8% 22.8% 16.7% 25.6% 23.8%
CutClaw 45.2% 54.4% 47.3% 54.0% 49.8% 44.0% 56.4% 47.3% 53.6% 50.2% 48.4% 57.6% 51.3% 57.6% 53.0% 44.4% 53.2% 50.7% 50.8% 48.8%

*Note: NarratoAI cannot handle VLOGs, as they lack dense subtitles.

CutClaw outperforms all baselines by a large margin across all user study metrics:

  • Average 49.8% of votes for Visual Quality, more than double the second-best Time-R1 (21.4%).
  • Average 53.0% of votes for Audio-Visual Harmony, more than double Time-R1 (20.0%).
  • Average 48.8% of votes for Human-Likeness, meaning nearly half of users judged CutClaw's edits the most natural and closest to professional human editors' work among the compared methods.

Qualitative Results

Qualitative comparison with baselines is shown in Figure 6:

Fig. 6: Qualitative comparison between CutClaw and baseline methods. The two cases use full-length footage from the films "Paprika" and "La La Land", paired with the musical tracks "Luv(sic) Pt.2" and "Norman F**king Rockwell", respectively. Shot boundary detection is performed using PySceneDetect [6]. (The figure shows how object-driven and narrative-driven instructions shape CutClaw's output: input instructions on the left, selected shots and background music on the right, with temporal synchronization and semantic consistency filtering for high-quality clips across different emotional registers.)

Baseline methods show consistent flaws: NarratoAI loosely follows instructions but suffers from severe visual degradation; UVCOM and Time-R1 maintain visual quality but lack logical narrative connections between shots and fail to align with music structure.

6.2. Ablation Studies

The authors conducted ablation studies to validate the effectiveness of individual CutClaw components. The following are the results from Table 2 of the original paper:

Method Visual Quality Instruction Follow AV Harmony
Film Vlog Avg. Obj Nar Avg. Film Vlog Avg.
w/o Audio 77.3 73.8 75.5 63.4 74.3 68.9 78.0 76.5 77.2
w/o Editor 78.6 75.4 77.0 59.7 71.5 65.6 84.8 86.0 85.4
w/o Reviewer 78.1 74.0 76.0 66.6 72.9 69.8 85.1 89.4 87.2
CutClaw 79.2 76.0 77.6 66.6 73.4 70.0 85.7 87.3 86.5

Ablation analysis:

  1. w/o Audio: Music structural analysis is replaced with fixed-length segmentation. AV Harmony drops sharply from 86.5 to 77.2, confirming that music structure anchoring is critical for rhythmic alignment.
  2. w/o Editor: The Editor agent is replaced with a random clip selector. Average Instruction Follow score drops from 70.0 to 65.6, showing the Editor's hierarchical grounding is essential for narrative coherence and semantic accuracy.
  3. w/o Reviewer: The Reviewer quality gate is removed. Visual Quality drops from 77.6 to 76.0, as low-quality clips and transition mismatches are no longer filtered out, confirming the Reviewer's role in maintaining aesthetic quality. The slightly higher AV Harmony score for the w/o Reviewer variant (87.2 vs 86.5 for full CutClaw) is a minor tradeoff: the Reviewer occasionally rejects perfectly aligned clips for aesthetic reasons, which is acceptable for overall final quality.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces CutClaw, an autonomous multi-agent MLLM framework that automates the editing of hours-long untrimmed raw footage into high-quality short videos with precise music synchronization, faithful instruction following, and professional aesthetic appeal. It addresses three core challenges of long-form audio-driven editing: (1) context length limitation via hierarchical multimodal decomposition of video and audio into structured semantic units; (2) context-grounded storytelling via a Playwriter agent that anchors narrative planning to music structure; (3) fine-grained cross-modal alignment via collaborative Editor and Reviewer agents that select and validate clips against strict quality criteria. Extensive quantitative experiments and user studies demonstrate that CutClaw significantly outperforms state-of-the-art baselines across all key metrics, with editing naturalness approaching that of professional human editors.

7.2. Limitations & Future Work

The authors identify two key limitations of the current framework:

  1. Lack of advanced visual hooks: The system only selects existing clips from raw footage, and does not support adding generated visual effects, transitions, text overlays, or monologue highlights that enhance content engagement. Future work will integrate generative video models to synthesize these expressive elements.
  2. High inference latency: The multi-stage pipeline processing hours of footage has high computational latency, making it unsuitable for real-time interactive feedback. Future work will optimize pipeline speed and implement coarse-to-fine processing strategies to enable real-time user interaction.

7.3. Personal Insights & Critique

CutClaw represents a significant advance towards fully automated professional video editing, with enormous practical value for content creators, social media influencers, film studios, and marketing teams, reducing editing time from hours to minutes. The multi-agent workflow mimicking human post-production is a highly promising paradigm for complex multimodal tasks, as it breaks large intractable problems into specialized subtasks that individual agents can handle far better than a single generalist model. Potential improvements to the framework include:

  1. Adding support for iterative user feedback during editing, allowing users to adjust the narrative plan, clip selection, or style mid-process.
  2. Extending support for dynamic music playlists, multiple music tracks, and adaptive music changes, rather than only a single fixed input music track.
  3. Optimizing the pipeline for consumer edge hardware, to enable deployment on personal devices instead of requiring high-performance computing resources.

One unverified assumption in the current work is that disjoint scene allocation is always optimal for narrative coherence: in some creative use cases (e.g., flashbacks, parallel narrative cuts), controlled reuse of scenes could improve storytelling quality, so adding optional support for targeted scene reuse could expand the framework's creative capabilities.