Paper status: completed

See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

Published: 12/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ViVLA is a general robotic manipulation model that learns new tasks from a single expert video demonstration. By processing the video alongside robot observations, it distills expertise for improved performance in unseen tasks, showing significant experimental gains.

Abstract

Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring novel skills by simply observing others performing them once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, effectively distilling fine-grained manipulation knowledge from expert behavior and transferring it seamlessly to the agent. To enhance the performance of ViVLA, we develop a scalable expert-agent pair data generation pipeline capable of synthesizing paired trajectories from easily accessible human videos, further augmented by curated pairs from publicly available datasets. This pipeline produces a total of 892,911 expert-agent samples for training ViVLA. Experimental results demonstrate that our ViVLA is able to acquire novel manipulation skills from only a single expert demonstration video at test time. Our approach achieves over 30% improvement on unseen LIBERO tasks and maintains above 35% gains with cross-embodiment videos. Real-world experiments demonstrate effective learning from human videos, yielding more than 38% improvement on unseen tasks.

In-depth Reading

1. Bibliographic Information

1.1. Title

See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

Central topic: A new vision-language-action (VLA) model, called ViVLA, that enables robots to learn manipulation skills for unseen tasks at test time by watching a single video demonstration, including cross-embodiment videos (e.g., human demonstrations).

1.2. Authors

  • Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Yinan Deng, Luojie Yang, Yi Yang, Yufeng Yue (Beijing Institute of Technology)

  • Weixin Mao, Minzhao Zhu, Zhanqi Zhang, Hua Chen (LimX Dynamics)

  • Te Cui (likely Beijing Institute of Technology; the paper shows a typesetting artifact with the name)

    Research backgrounds and affiliations: The author list spans academic and industrial labs focused on robotics, computer vision, and foundation models. Beijing Institute of Technology and LimX Dynamics have active groups in embodied AI and robot manipulation.

1.3. Journal/Conference

  • Venue: arXiv preprint (identifier: arXiv:2512.07582)
  • Reputation: arXiv is a widely used preprint server; while not peer-reviewed, it disseminates cutting-edge research quickly. Subsequent journal or conference publication status is not indicated in the provided content.

1.4. Publication Year

2025 (Published at UTC: 2025-12-08T14:25:30.000Z)

1.5. Abstract

Objective: Build a general-purpose manipulation policy that can learn new tasks from a single expert demonstration video at test-time.

Core methodology:

  • A VLA model, ViVLA, that jointly processes an expert demonstration video and the robot’s current observations to predict both the demonstrated latent action sequence and the robot’s subsequent actions.

  • A latent action learning framework with action-centric cycle consistency (A3C) to create a unified latent action space across embodiments (human vs robot), using a tokenizer trained on both expert videos and agent trajectories.

  • A parallel decoding strategy to avoid shortcut learning present in autoregressive action prediction and reduce inference latency.

  • A video-driven expert-agent paired data generation pipeline that converts human demonstration videos into 4D robot scenes via 3D Gaussian splatting, creating human-robot pairs; combined with curated pairs from public datasets.

    Main results:

  • Over 30% improvement on unseen LIBERO tasks.

  • Above 35% gains with cross-embodiment expert videos.

  • More than 38% improvement on real-world unseen tasks learned from human videos.

    Conclusions: ViVLA learns fine-grained manipulation from single demonstrations and transfers knowledge across embodiments without additional fine-tuning at test-time.

2. Executive Summary

2.1. Background & Motivation

Core problem: Current vision-language-action (VLA) models and robot policies struggle to generalize to unseen manipulation tasks outside their training distribution. Humans, however, can learn a novel skill by observing a single demonstration (“one-shot”).

Importance:

  • For practical robotics, eliminating task-specific data collection and fine-tuning is critical.

  • Efficient cross-embodiment learning (e.g., from human videos) enables intuitive robot instruction via demonstrations rather than code or complex programming.

    Challenges in prior work:

  • VLMs excel at semantic video understanding but lack fine-grained manipulation action recognition.

  • Video datasets lack action labels; embodiments differ (human vs robot), and action spaces are mismatched.

  • Autoregressive action modeling can cause shortcut learning (conditioning on ground-truth past actions) and slow inference.

  • Scarcity of expert-agent pair data to train generalizable policies.

    Paper’s entry point / innovative idea:

  • Train a VLA that, at test time, grounds robot actions in an expert video via a unified discrete latent action space learned from both robot and expert videos with cycle consistency.

  • Decode actions in parallel to prevent shortcuts and reduce latency.

  • Generate large-scale human-robot paired data via 3D Gaussian splatting, plus curated pairs from public datasets.

2.2. Main Contributions / Findings

  • A new paradigm (ViVLA) that learns novel manipulation tasks from one-shot videos at test time without further training:
    1. Latent action learning with action-centric cycle consistency (A3C) unifies action representations across embodiments, enabling transfer from expert videos (including humans) to robots.
    2. Parallel decoding of action sequences avoids autoregressive leakage, reduces latency, and enforces grounding in visual understanding.
    3. Temporal-spatial masking on expert videos improves training efficiency and forces robust video understanding from partial observations.
    4. A scalable video-driven expert-agent pair data generation pipeline reconstructs 4D robot scenes from human videos, yielding 89,736 human-robot pairs, augmented to a total of 892,911 expert-agent samples.
  • Empirical gains:
    • Over 30% improvement on unseen LIBERO tasks versus baselines.

    • Above 35% gains with cross-robot expert videos (UR demonstrations guiding a Franka agent).

    • More than 38% improvement on real-world unseen tasks learned from human videos.

      The following figure (Figure 1 from the original paper) shows the overall concept and performance overview:

      Figure 1 (schematic): ViVLA's training and testing process for one-shot task learning, combining expert demonstration videos with the robot's visual observations, together with the expert-agent sample generation pipeline and performance comparisons.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language Models (VLMs): Models that jointly process images/videos and text. They learn semantic understanding of visual scenes with language grounding.
  • Vision-Language-Action (VLA): Extends VLMs to output robot actions. VLAs are trained on robot trajectory datasets, sometimes quantizing continuous actions into discrete tokens.
  • Latent action: A discrete representation (token sequence) that captures transition dynamics between observations (e.g., what manipulation happens between frame t and t+H).
  • Vector-Quantized Variational Autoencoder (VQ-VAE): Learns a discrete codebook and quantizes latent vectors to nearest codebook entries. Useful to discretize actions for LM-style training.
  • Cycle consistency: A self-supervised constraint where information encoded in one step should be recoverable after transformations. Here, latent actions when decoded to frames and re-encoded should match the original sampled codebook indices.
  • Spatiotemporal transformer: A transformer architecture with interleaved spatial and temporal self-attention layers to model video dynamics.
  • Parallel decoding: Generating multiple tokens in one forward pass rather than autoregressive step-by-step, reducing latency and preventing reliance on ground-truth history tokens.
  • 3D Gaussian splatting: A method to reconstruct and render 3D scenes with Gaussian primitives, enabling novel view synthesis and 4D scene rendering.
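
For readers unfamiliar with VQ-VAE-style quantization, the sketch below (a minimal illustration, not the authors' implementation; all sizes are arbitrary) shows the nearest-codebook lookup with a straight-through gradient estimator that underlies discrete latent actions:

```python
import torch

def vector_quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Map continuous embeddings to their nearest codebook entries.

    z_e:      (num_tokens, dim) continuous latent embeddings.
    codebook: (K, dim) learnable discrete codebook.
    Returns quantized embeddings and their codebook indices.
    """
    # Squared Euclidean distance between every token and every codebook entry.
    dists = torch.cdist(z_e, codebook) ** 2                  # (num_tokens, K)
    indices = dists.argmin(dim=-1)                           # nearest entry per token
    z_q = codebook[indices]                                  # (num_tokens, dim)
    # Straight-through estimator: gradients flow to z_e as if quantization were identity.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices

# Example: 8 latent-action tokens, 32-dim embeddings, codebook of size 16.
z_q, idx = vector_quantize(torch.randn(8, 32), torch.randn(16, 32))
print(z_q.shape, idx.shape)  # torch.Size([8, 32]) torch.Size([8])
```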

3.2. Previous Works

  • RT-2 (Zitkovich et al., 2023): Discretized continuous actions into 256 bins per dimension; co-trained web-scale language models on robot trajectories to transfer semantic knowledge to control.
  • OpenVLA (Kim et al., 2024): VLA trained on the Open X-Embodiment (OXE) dataset spanning many robot embodiments with discretized actions.
  • Continuous action generation via action experts:
    • $\pi_0$ and $\pi_{0.5}$: Integrate specialized modules (e.g., flow matching) to produce continuous actions within VLA pipelines.
  • One-shot imitation learning (OSIL):
    • Duan et al., 2017: Original OSIL formulation with attention-based models for block stacking.
    • T-OSVI, AWDA: Transformers and waypoint-based approaches for one-shot video imitation.
    • OSVI-WM: World-model-guided trajectory generation from videos.
  • Cross-embodiment learning:
    • CrossFormer, Octo: Transformer-based policies robust across varied observation/action spaces.
    • Flow representations (ATM, TAPIR/COTRACKER): Predict trajectories of points for cross-domain transfer.
    • Latent action methods:
      • LAPA, UniVLA: Quantized latent actions from videos; UniVLA proposes task-centric latent actions decomposed into task-relevant/irrelevant components.
      • Genie: Generative interactive environments used as a latent action learning baseline by many VLAs.
  • Data augmentation:
    • MimicGen: Synthesizes demonstrations by rearranging object-centric segments from few human demonstrations.
    • 2D inpainting and view augmentation: Mirage, VISTA, Rovi-aug.
    • 3D scene editing: RoboSplat.

3.3. Technological Evolution

  • Early robot policies relied on task-specific datasets and hand-designed modules.
  • VLAs brought foundation model priors (LLMs/VLMs) but still struggled with unseen task generalization.
  • OSIL approaches introduced conditioning on demonstration content.
  • Recent works moved toward latent actions and cross-embodiment robustness.
  • This paper advances latent action learning with cycle consistency across expert and agent data, integrates parallel decoding to avoid autoregressive pitfalls, and scales with a novel human-video-driven data pairing pipeline.

3.4. Differentiation Analysis

Core differences and innovations:

  • Unified latent action space across embodiments via an action-centric cycle consistency objective and a local-global discriminator, directly regularizing semantics and preventing leakage.
  • Parallel decoding of action sequences eliminates shortcut learning and reduces latency, bolstering reliance on video understanding.
  • Temporal-spatial masking makes training efficient and robust by forcing reasoning on partial videos.
  • A practical human-video-to-robot-pairs pipeline via 3D Gaussian splatting expands training data drastically, strengthening generalization and real-world transfer.

4. Methodology

4.1. Principles

ViVLA’s core idea: During inference, given one expert video and the current robot observation plus language instruction, the model predicts:

  1. the latent action sequence exhibited in the expert video, and
  2. the robot’s subsequent latent actions and continuous control actions to execute the task.

Foundational intuitions:

  • Discrete latent actions learned from both expert and agent videos (with cycle consistency) bridge embodiment gaps.

  • Predicting latent actions first forces fine-grained action recognition from videos, not just semantic captioning.

  • Parallel decoding prevents exploitation of past ground-truth actions and speeds inference.

  • Training with temporal-spatial masking encourages robust spatiotemporal understanding from partial evidence.

    The following figure (Figure 5 from the original paper) shows the ViVLA architecture overview:

    Figure 5 (schematic): overview of the ViVLA architecture and workflow. The model consumes the expert demonstration video and the robot's visual observations, extracts latent actions, and generates concrete actions through the decoder, enabling manipulation skill learning from a single demonstration video.

4.2. Latent Action Learning with Action-Centric Cycle Consistency (A3C)

This stage learns latent actions jointly from expert videos and agent trajectories to produce a unified discrete codebook, regularized by cycle consistency and a discriminator.

4.2.1. Encoder-Quantizer-Decoder for Latent Actions

Input frames: For brevity, $I$ denotes visual frames from either expert videos $v$ or agent observations $o$. The method uses pairs $(I_t, I_{t+H})$ separated by a temporal window $H$ to capture transition dynamics.

Feature extraction and ST-transformer encoding: $ \begin{array}{rl} & z_t^{e} = \mathrm{ST\text{-}Transformer}([f_t, \bar{f}_{t+H}, z]), \quad z_t^{e} \in \mathbb{R}^{l_z \times c_z}, \\ & z_t^{q} = \mathrm{VQ}(z_t^{e}), \quad z_t^{q} \in \mathbb{R}^{l_z \times c_z}. \end{array} $

  • $f_t$: DINOv2 patch embeddings of $I_t$ (current frame).

  • $\bar{f}_{t+H}$: DINOv2 patch embeddings of $I_{t+H}$ (future frame).

  • $z$: learnable latent action tokens injected into the transformer input.

  • $\mathrm{ST\text{-}Transformer}$: a spatiotemporal transformer with interleaved spatial and temporal attention.

  • $z_t^{e}$: continuous latent action embeddings before quantization (length $l_z$, channel dimension $c_z$).

  • $\mathrm{VQ}(\cdot)$: vector quantization mapping embeddings to codebook entries.

  • $z_t^{q}$: quantized latent action tokens.

    Tokenization and reconstruction summary: $ \begin{array}{rl} & z_t^{q} = \mathscr{E}(I_t, I_{t+H}), \quad z_t^{q} \in \mathbb{R}^{l_z \times c_z}, \\ & \hat{I}_{t+H} = \mathscr{D}(I_t, z_t^{q}), \quad \hat{I}_{t+H} \in \mathbb{R}^{w \times h \times c_o}. \end{array} $

  • $\mathscr{E}$: encoder that outputs quantized latent actions from frame pairs.

  • $\mathscr{D}$: decoder (spatial transformer) reconstructing future frames from current frames and latent actions.

  • $\hat{I}_{t+H}$: reconstructed future frame.

  • $w, h, c_o$: image width, height, and number of channels.

    Symbol explanations:

  • $l_z$: number of tokens representing one latent action.

  • $c_z$: token embedding dimensionality.

  • Codebook size $K$: number of discrete latent entries used in quantization.

  • DINOv2: pre-trained vision backbone used to extract patch-wise embeddings.
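
To make the encoder-quantizer-decoder loop concrete, here is a minimal PyTorch sketch. Plain transformer encoders stand in for DINOv2 features, the ST-Transformer, and the spatial-transformer decoder, and all dimensions (`dim`, `l_z`, `K`, `n_patches`) are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class LatentActionTokenizer(nn.Module):
    """Minimal sketch of the encoder-quantizer-decoder loop described above."""

    def __init__(self, dim=256, l_z=8, K=64):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(l_z, dim))   # learnable z tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.codebook = nn.Embedding(K, dim)                        # VQ codebook
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.to_pixels = nn.Linear(dim, 3)                          # per-patch RGB head

    def encode(self, f_t, f_tH):
        # f_t, f_tH: (B, n_patches, dim) patch features of frames t and t+H.
        q = self.latent_queries.expand(f_t.shape[0], -1, -1)
        h = self.encoder(torch.cat([f_t, f_tH, q], dim=1))
        z_e = h[:, -q.shape[1]:]                                    # keep the latent slots
        # Nearest-codebook quantization (straight-through estimator omitted for brevity).
        dists = ((z_e.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        idx = dists.argmin(-1)
        return self.codebook(idx), idx

    def decode(self, f_t, z_q):
        # Reconstruct the future frame's patches from current-frame features + latent action.
        h = self.decoder(torch.cat([f_t, z_q], dim=1))
        return self.to_pixels(h[:, : f_t.shape[1]])                 # (B, n_patches, 3)

# Illustrative usage with random patch features.
tok = LatentActionTokenizer()
z_q, idx = tok.encode(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
print(z_q.shape, tok.decode(torch.randn(2, 196, 256), z_q).shape)
```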

    The following figure (Figure 4 from the original paper) shows the latent action learning pipeline:

    Figure 4 (schematic): the latent action learning workflow. The left side shows how input frames pass through the encoder and decoder and how the latent action buffer is populated; the right side shows the DINOv2-based encoder/decoder and the spatiotemporal transformer that recovers targets from the quantized latent actions.

4.2.2. Action-Centric Cycle Consistency

Goal: Ensure that latent actions represent semantically consistent transitions that generalize across embodiments.

Procedure: $ \begin{array}{rl} & \hat{I}_{g} = \mathcal{D}(I_c, z_s^{q}), \quad z_s^{q} \sim \mathcal{Z}, \\ & \hat{z}_s^{q} = \mathcal{E}(I_c, \hat{I}_{g}), \quad \hat{z}_s^{q} \in \mathbb{R}^{l_z \times c_z}. \end{array} $

  • $I_c$: a current observation frame sampled from the dataset (expert or agent).

  • $z_s^{q}$: a latent action sampled from the latent action buffer $\mathcal{Z}$ (collected over the previous $B$ batches).

  • $\hat{I}_g$: the generated subsequent frame obtained by decoding $I_c$ with $z_s^{q}$.

  • $\hat{z}_s^{q}$: the latent action re-encoded from $(I_c, \hat{I}_g)$.

    Consistency loss (computed via pre-quantization embeddings and codebook distances): $ \mathcal{L}_{C} = - \sum_{k=1}^{K} y_{k} \log \left( \frac{\exp(- d(\hat{z}_s^{e}, e_{k})/\tau)}{\sum_{j=1}^{K} \exp(- d(\hat{z}_s^{e}, e_{j})/\tau)} \right). $

  • $\hat{z}_s^{e}$: pre-quantization embedding of the re-encoded latent action.

  • $e_k$: $k$-th codebook vector.

  • $d(\cdot, \cdot)$: distance metric (e.g., squared Euclidean distance).

  • $\tau$: temperature parameter controlling the softness of the distribution.

  • $y_k$: one-hot target over codebook indices, set to the index of the sampled $z_s^{q}$.

    Cross-embodiment unification: Sample $z_s^{q}$ from embodiment $E_i$ and apply it to frames from embodiment $E_j$, then enforce $\hat{z}_s^{q} \approx z_s^{q}$ via $\mathcal{L}_C$, compelling a unified, embodiment-agnostic latent action space.
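
A minimal sketch of the consistency loss $\mathcal{L}_C$ is shown below; the temperature value, tensor shapes, and use of squared Euclidean distance as $d(\cdot, \cdot)$ are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(z_hat_e, codebook, target_idx, tau=0.1):
    """Sketch of the A3C consistency loss L_C.

    z_hat_e:    (B, l_z, dim) pre-quantization embeddings of the re-encoded action.
    codebook:   (K, dim) latent-action codebook.
    target_idx: (B, l_z) codebook indices of the originally sampled action z_s^q.
    Distances to every codebook entry are turned into a softmax (temperature tau)
    and trained with cross-entropy toward the sampled indices.
    """
    d = ((z_hat_e.unsqueeze(-2) - codebook) ** 2).sum(-1)        # (B, l_z, K)
    logits = -d / tau                                            # soft assignment over codes
    return F.cross_entropy(logits.flatten(0, 1), target_idx.flatten())

# Illustrative usage with random tensors.
B, l_z, dim, K = 4, 8, 32, 64
loss = cycle_consistency_loss(torch.randn(B, l_z, dim), torch.randn(K, dim),
                              torch.randint(0, K, (B, l_z)))
print(loss.item())
```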

4.2.3. Local-Global Discriminator to Prevent Leakage and Distribution Shift

Problem: Generated frames $\hat{I}_g$ may differ in distribution from real frames, and the decoder might trivially "leak" latent codes into the generated frames, enabling shortcut re-encoding.

Discriminator architecture: $ \begin{array}{rl} & \mathcal{F}_{l} = \mathrm{Spatial\text{-}Transformer}(I), \quad \sigma_{l} = \mathrm{MLP}(\mathcal{F}_{l}), \\ & \mathcal{F}_{g} = \mathrm{GlobalPool}(\mathrm{Conv2D}(\mathcal{F}_{l})), \quad \sigma_{g} = \mathrm{MLP}(\mathcal{F}_{g}). \end{array} $

  • $\mathcal{F}_{l}$: local patch features from a spatial transformer over the input frame $I$.

  • $\sigma_l$: local (per-patch) logits.

  • $\mathcal{F}_{g}$: global features obtained via Conv2D and global pooling.

  • $\sigma_g$: global logits.

    Adversarial losses (applied at both local and global levels): $ \begin{array}{rl} & \mathcal{L}_{GAN}^{\Psi} = - \log(\Psi(o)) - (1 - \log(\Psi(\mathcal{D}(o, z)))), \\ & \mathcal{L}_{GAN}^{\mathcal{D}} = 1 - \log(\Psi(\mathcal{D}(o, z))). \end{array} $

  • $\Psi(\cdot)$: discriminator output (probability/logit) predicting "real" (dataset) vs. "generated".

  • $o$: dataset frame (real sample).

  • $\mathcal{D}(o, z)$: decoder-generated frame produced from $o$ and latent $z$.

    Note: the functional form above differs from the standard GAN cross-entropy losses; it is reproduced here as presented in the paper. Conceptually, the discriminator penalizes deviation from the real-frame distribution, while the decoder aims to make generated frames indistinguishable from real ones and to avoid trivially embedding codebook indices in its outputs.

Symbol explanations:

  • $\Psi$: discriminator network with both patch-wise and global outputs.

  • $\mathcal{D}$: frame decoder conditioned on the latent action and the current frame.

  • $o$, $z$: real frame and latent action token sequence.
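
The following sketch illustrates one plausible local-global discriminator layout consistent with the description above; the layer choices and sizes are assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class LocalGlobalDiscriminator(nn.Module):
    """Sketch of a discriminator with per-patch (local) and pooled (global) logits."""

    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        # A plain transformer encoder stands in for the spatial transformer.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        self.local_head = nn.Linear(dim, 1)          # one real/fake logit per patch
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.global_head = nn.Linear(dim, 1)         # one logit per image

    def forward(self, patch_tokens):
        # patch_tokens: (B, H*W, dim) patch features of a real or generated frame.
        B, N, C = patch_tokens.shape
        side = int(N ** 0.5)
        feats = self.spatial(patch_tokens)                           # local features F_l
        local_logits = self.local_head(feats).squeeze(-1)            # (B, N)
        grid = feats.transpose(1, 2).reshape(B, C, side, side)       # back to a 2D grid
        pooled = self.conv(grid).mean(dim=(2, 3))                    # global features F_g
        global_logits = self.global_head(pooled).squeeze(-1)         # (B,)
        return local_logits, global_logits

# Illustrative usage on a 14x14 patch grid.
local, glob = LocalGlobalDiscriminator()(torch.randn(2, 196, 256))
print(local.shape, glob.shape)  # torch.Size([2, 196]) torch.Size([2])
```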

    The following figure (Figure 3 from the original paper) illustrates why cycle consistency and unified latent actions matter: semantic consistency and embodiment unification:

    Figure 3: qualitative comparisons of current-frame and action generation for the reference, Genie, and the proposed model. Panel (a) focuses on semantic consistency, showing how the models differ in the actions they generate; panel (b) illustrates embodiment unification, highlighting the better alignment achieved by the proposed model.

4.3. ViVLA Training for One-Shot Task Learning

This stage trains the full VLA to predict latent actions and robot actions conditioned on masked expert videos, agent observations, and language instructions.

4.3.1. Temporal-Spatial Masking Strategy

  • Temporal masking: Keep only selected frames within a window (as in Qwen2.5-VL input scheme) but preserve absolute time encoding for original timesteps to retain temporal order information.

  • Spatial masking: After vision encoding into patch tokens, apply spatial dropout/masking, passing only unmasked tokens to the language model (LM).

  • Purpose: Reduce redundancy and computational load; force action prediction from partially observed videos, enhancing spatiotemporal reasoning.

    Encoded streams:

  • $h_v$: tokens from the masked expert video.

  • $h_o$: tokens from agent observations (images).

  • $h_{\ell}$: tokens from language instructions.
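
A minimal sketch of the spatial-masking step is given below; the keep ratio, uniform sampling, and tensor shapes are illustrative assumptions:

```python
import torch

def spatial_mask(patch_tokens: torch.Tensor, keep_ratio: float = 0.5):
    """Illustrative spatial masking: keep a random subset of patch tokens per frame.

    patch_tokens: (B, N, dim) vision-encoder outputs for one (sub-sampled) frame.
    Only the kept tokens are passed on to the language model.
    """
    B, N, _ = patch_tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Independently sample which patch indices survive for each sample in the batch.
    keep_idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]          # (B, n_keep)
    keep_idx, _ = keep_idx.sort(dim=1)                              # preserve spatial order
    return torch.gather(patch_tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))

# Temporal masking can be sketched analogously: sub-sample frame indices from the
# expert video while keeping each frame's original timestep for position encoding.
print(spatial_mask(torch.randn(2, 196, 256)).shape)   # torch.Size([2, 98, 256])
```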

4.3.2. Parallel Decoding of Latent Actions and Robot Actions

Mechanism: Extend LM’s vocabulary with special START tokens indicating the number of query tokens to decode in parallel, plus query tokens for latent actions (LACT) and robot actions (ACT).

Parallel decoding equations: $ \begin{array}{c} s_t = \mathbf{LM}(h_{\leq t}), \quad h_{\leq t} = [h_v, h_o, h_{\ell}, h_p], \\ \{\hat{z}_{t+i}\}_{i=0}^{n-1} = \mathbf{LM}([h_{\leq t}, s_t, \mathbb{LACT}^{n}]). \end{array} $

  • $h_{\leq t}$: all current context tokens (video, observation, language, and past LM outputs $h_p$).

  • $s_t \in \{\texttt{START\_LACT}_n \mid n \in \{1, 2, \ldots, \mathrm{MAX\_N}\}\}$: a special token signaling that $n$ latent action tokens will be decoded in parallel.

  • $\mathbb{LACT}^{n}$: the $n$ latent action query tokens passed to the LM for simultaneous decoding.

  • $\hat{z}_{t+i}$: predicted latent action tokens.

    Robot actions follow the same mechanism, using $\texttt{START\_ACT}_n$ tokens and ACT query tokens.

Advantages:

  • Prevents information leakage from preceding ground-truth actions seen during training (a core issue in autoregressive next-token prediction).
  • Single forward pass for multiple tokens reduces inference latency.
  • Encourages grounding on visual and instruction content.
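
The sketch below illustrates the parallel-decoding idea with a toy transformer standing in for the LM; the token embeddings, query tensors, and output head are assumptions made purely for illustration:

```python
import torch
import torch.nn as nn

def decode_latent_actions_parallel(lm, context, start_embed, lact_queries, head):
    """Sketch of parallel decoding: one forward pass predicts all n latent-action tokens.

    lm:           stand-in language-model backbone mapping (B, T, dim) -> (B, T, dim).
    context:      (B, T, dim) embeddings of the [video, observation, language] tokens.
    start_embed:  (B, 1, dim) embedding of the START_LACT_n token.
    lact_queries: (B, n, dim) learnable query tokens, one per latent action to predict.
    head:         linear map from the LM hidden size to the K codebook indices.
    """
    seq = torch.cat([context, start_embed, lact_queries], dim=1)
    hidden = lm(seq)                                  # single forward pass
    query_hidden = hidden[:, -lact_queries.shape[1]:]  # outputs at the query positions
    logits = head(query_hidden)                       # (B, n, K)
    return logits.argmax(-1)                          # predicted codebook indices

# Illustrative usage with a tiny transformer as the "LM".
dim, K, n = 256, 64, 8
lm = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 8, batch_first=True), 2)
ids = decode_latent_actions_parallel(
    lm, torch.randn(1, 32, dim), torch.randn(1, 1, dim), torch.randn(1, n, dim),
    nn.Linear(dim, K))
print(ids.shape)  # torch.Size([1, 8])
```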

4.3.3. Latent Action Targets and Losses

Targets: $ \begin{array}{r} z_{v, t}^{q} = \mathcal{E}(v_t, v_{t+H}), \\ z_{o, t}^{q} = \mathcal{E}(o_t, o_{t+H}). \end{array} $

  • $z_{v, t}^{q}$: latent actions from expert video frame pairs.

  • $z_{o, t}^{q}$: latent actions from agent observation frame pairs.

    Mapping: Each latent action token corresponds to a codebook index in a vocabulary of size $K$ via the tokens {LACT_1, LACT_2, …, LACT_K}.

Loss: $ \mathcal{L}_{z} = \mathbb{E}_{z}\left[ - \sum_{i=1}^{N_z} \log P(\hat{z}_i^{q} = z_i^{q}) \right], $

  • $N_z$: total length of the latent action token sequence.

  • $\hat{z}_i^{q}$: the LM's predicted token.

  • $z_i^{q}$: the ground-truth codebook index token.

  • $P(\hat{z}_i^{q} = z_i^{q})$: probability assigned by the LM to the correct index.

    Robot action decoding: The action decoder aggregates final-layer LM embeddings of ACT tokens via attention and pooling, then maps to continuous robot actions via an MLP. Trained with L1 loss between predicted and ground-truth actions, jointly with Lz\mathcal{L}_{z}.
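
A minimal sketch of the combined objective (cross-entropy over latent-action tokens plus L1 on continuous actions) is shown below; the equal weighting of the two terms and all tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def vivla_losses(latent_logits, latent_targets, pred_actions, gt_actions):
    """Sketch of the joint training objective described above.

    latent_logits:  (B, N_z, K) LM logits over the K latent-action codebook tokens.
    latent_targets: (B, N_z) ground-truth codebook indices from the tokenizer.
    pred_actions:   (B, n_act, act_dim) continuous actions from the action decoder.
    gt_actions:     (B, n_act, act_dim) ground-truth robot actions.
    """
    loss_z = F.cross_entropy(latent_logits.flatten(0, 1), latent_targets.flatten())
    loss_a = F.l1_loss(pred_actions, gt_actions)
    return loss_z + loss_a   # relative weighting is an assumption

# Illustrative shapes.
B, N_z, K, n_act, act_dim = 2, 16, 64, 8, 7
print(vivla_losses(torch.randn(B, N_z, K), torch.randint(0, K, (B, N_z)),
                   torch.randn(B, n_act, act_dim), torch.randn(B, n_act, act_dim)).item())
```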

Auxiliary tasks:

  • Temporal localization: Insert agent observation images into the expert video sequence; train the model to identify their temporal positions to strengthen cross-modal temporal grounding.

  • Video exclusion during training: Randomly drop expert video, requiring the model to predict based on observations + instructions only, preserving robustness on seen tasks without demonstrations.

    Post-training:

  • Adapt to target robot embodiment by fully fine-tuning the action decoder and fine-tuning the VLM backbone via LoRA, keeping adaptation efficient while preserving generalization.

4.3.4. Architecture Backbone

ViVLA builds on Qwen2.5-VL:

  • Vision Transformer with window attention.
  • Vision-language merger (MLP) compressing visual features.
  • Large language model (Qwen2.5) multi-modal reasoning.

4.4. Video-Driven Expert-Agent Data Generation

Pipeline goals: Construct human-robot paired training samples by rendering robot executions aligned to human video demonstrations in reconstructed 4D scenes.

The following figure (Figure 6 from the original paper) shows the data generation pipeline:

该 Figure 6 (schematic): grasp optimization and 4D scene reconstruction from human videos. The upper part compares human demonstration videos with the rendered results; the lower part shows how novel views are synthesized, appearance is augmented, and visual attributes are edited to generate robot demonstrations.

4.4.1. Interaction Grounding

  • Hand tracking: HaMeR predicts hand shape/pose and mesh; ICP aligns the hand mesh to the segmented hand point cloud to obtain precise hand pose trajectories $\boldsymbol{\xi}_H$.

  • Object pose estimation: a pose tracker such as FoundationPose provides object pose trajectories $\boldsymbol{\xi}_{\Omega}$ (the corresponding passage in the source is partially corrupted, but conceptually the output is a sequence of object poses).

    Gripper pose computation from the hand (Figure 7):

  • Grasp point: midpoint between thumb and index fingertip.

  • Frame axes:

    • $X$: normal to the plane spanned by the tracked points of the thumb and index finger.
    • $Y$: vector from the grasp point to the index fingertip.
    • $Z$: cross product, $\vec{z} = \vec{x} \times \vec{y}$.

    Symbols:

  • $\vec{x}, \vec{y}$: unit vectors defining the local gripper frame axes.

  • $\vec{z}$: third axis obtained via the right-handed cross product.
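
The following sketch shows one way to assemble the 6D gripper pose from tracked fingertips, following the axis definitions above; the SVD-based plane-normal estimate and the orthogonalization of the $Y$-axis are implementation assumptions:

```python
import numpy as np

def gripper_pose_from_hand(thumb_tip, index_tip, thumb_pts, index_pts):
    """Sketch of the 6D gripper pose construction described above.

    thumb_tip, index_tip: (3,) fingertip positions.
    thumb_pts, index_pts: (N, 3) tracked points on each digit.
    Returns a 4x4 homogeneous transform of the gripper frame.
    """
    grasp_point = 0.5 * (thumb_tip + index_tip)                  # midpoint of the fingertips
    pts = np.vstack([thumb_pts, index_pts])
    # X: normal of the best-fit plane through all tracked digit points (via SVD).
    _, _, vt = np.linalg.svd(pts - pts.mean(axis=0))
    x = vt[-1] / np.linalg.norm(vt[-1])
    # Y: from the grasp point toward the index fingertip, made orthogonal to X.
    y = index_tip - grasp_point
    y = y - np.dot(y, x) * x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                                           # right-handed third axis
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)
    T[:3, 3] = grasp_point
    return T

# Toy example with synthetic keypoints.
T = gripper_pose_from_hand(np.array([0.0, 0.05, 0.0]), np.array([0.0, -0.05, 0.0]),
                           np.random.rand(5, 3) * 0.01 + [0, 0.05, 0],
                           np.random.rand(5, 3) * 0.01 + [0, -0.05, 0])
print(np.round(T, 3))
```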

    The following figure (Figure 7 from the original paper) depicts the 6D gripper pose construction:

    Fig. 7: Calculation of the 6D gripper pose from the estimated hand pose. The grasp point is computed as the midpoint between the thumb and index fingertips. The coordinate frame is defined with the $X$-axis normal to the plane spanned by all tracked points on both digits, the $Y$-axis pointing from the grasp point to the index fingertip, and the $Z$-axis obtained through $\vec{z} = \vec{x} \times \vec{y}$.

4.4.2. Video Parsing into Subtasks via Contact Events

Compute distances between the hand and object point clouds $\mathcal{P}$; detect contact initiation and end: $ \begin{array}{rl} & d = \mathrm{dist}(\mathcal{P}), \quad t_b = \{ t \mid d^{t-1} > \epsilon \wedge d^{t} < \epsilon \}, \\ & t_e = \{ t \mid d^{t-1} < \epsilon \wedge d^{t} > \epsilon \}. \end{array} $

  • $\mathrm{dist}(\mathcal{P})$: distance between the two point clouds.

  • $\epsilon$: contact threshold.

  • $t_b$: contact begin times.

  • $t_e$: contact end times.
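
A minimal sketch of this contact-event detection is given below; the brute-force point-cloud distance and the threshold value are illustrative assumptions:

```python
import numpy as np

def contact_events(hand_pts_seq, obj_pts_seq, eps=0.01):
    """Sketch of contact-based sub-task segmentation.

    hand_pts_seq, obj_pts_seq: lists of (N, 3) point clouds per timestep.
    eps: contact distance threshold (value is illustrative).
    Returns timesteps where hand-object contact begins (t_b) and ends (t_e).
    """
    def min_dist(a, b):
        # Minimum pairwise distance between two point clouds (brute force).
        return np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))

    d = np.array([min_dist(h, o) for h, o in zip(hand_pts_seq, obj_pts_seq)])
    t_b = [t for t in range(1, len(d)) if d[t - 1] > eps and d[t] < eps]
    t_e = [t for t in range(1, len(d)) if d[t - 1] < eps and d[t] > eps]
    return t_b, t_e

# Toy example: the hand approaches, touches, then releases the object.
hand = [np.zeros((4, 3)) + [0.1 - 0.02 * t, 0, 0] for t in range(10)]
obj = [np.zeros((4, 3)) for _ in range(10)]
print(contact_events(hand, obj))   # ([5], [6])
```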

    Classify clips:

  • Grasping phase: object stationary; robot reaches and grasps.

  • Manipulation phase: robot moves object and interacts with other objects.

4.4.3. End-Effector Pose Optimization

  • At grasp time $t_g$ (end of the grasping phase), sample $N$ candidate 6D grasps near the initially grounded pose; filter for feasible grasps (no collision; object inside the grasp region).
  • Compute stability via the normalized perpendicular distance from the gripper plane to the object's center of gravity (COG); select the highest-scoring pose and express it relative to the object frame.
  • In the manipulation phase, assume the end-effector-object relative pose is invariant; compose the object motion with the optimized grasp pose to obtain optimized end-effector trajectories $\hat{\xi}_E$.
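
The sketch below illustrates the stability-scoring idea for candidate grasps (collision and reachability filtering omitted); the normalization constant and exact scoring form are assumptions, not the paper's precise criterion:

```python
import numpy as np

def grasp_stability_score(grasp_center, grasp_normal, object_cog, max_dist=0.05):
    """Illustrative stability score for a candidate grasp.

    Scores how close the object's center of gravity lies to the gripper's closing
    plane (defined by its center and normal); 1.0 = COG on the plane, 0.0 = far away.
    """
    n = grasp_normal / np.linalg.norm(grasp_normal)
    perp = abs(np.dot(object_cog - grasp_center, n))
    return max(0.0, 1.0 - perp / max_dist)

# Pick the best of N candidate grasps.
candidates = [(np.array([0.0, 0.0, z]), np.array([0.0, 0.0, 1.0])) for z in (0.00, 0.02, 0.04)]
cog = np.array([0.0, 0.0, 0.01])
best = max(candidates, key=lambda c: grasp_stability_score(c[0], c[1], cog))
print(best[0])   # candidate whose closing plane passes nearest the COG
```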

4.4.4. 4D Scene Reconstruction via 3D Gaussian Splatting

  • Objects: Gaussian models $\mathcal{G}_{\Omega}$ converted from meshes, animated with the grounded trajectories.

  • Robot: Gaussian model $\mathcal{G}_R$ from the URDF/reconstruction; joint positions come from motion planning aligned to $\hat{\xi}_E$.

  • Render high-fidelity 4D scenes aligned to human video.

    Demonstration augmentations to diversify data (Novel view synthesis; textures; lighting):

  • Camera view: Render from diverse viewpoints via 3D Gaussian splatting.

  • Scene appearance: Vary table/background textures in Gaussian planes.

  • Lighting: Random scaling, offsets, noise to diffuse color attributes per Gaussian.
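
As a concrete illustration of the lighting augmentation, the sketch below perturbs per-Gaussian diffuse colors with random scaling, offsets, and noise; the parameter ranges are assumptions:

```python
import numpy as np

def augment_gaussian_colors(diffuse_rgb, rng=np.random):
    """Illustrative lighting augmentation for 3D Gaussian diffuse color attributes.

    diffuse_rgb: (N, 3) per-Gaussian diffuse colors in [0, 1].
    Applies random global scaling, a random color offset, and per-Gaussian noise,
    then clips back to the valid range.
    """
    scale = rng.uniform(0.8, 1.2)                          # global brightness change
    offset = rng.uniform(-0.05, 0.05, size=(1, 3))         # slight color cast
    noise = rng.normal(0.0, 0.01, size=diffuse_rgb.shape)  # per-Gaussian perturbation
    return np.clip(diffuse_rgb * scale + offset + noise, 0.0, 1.0)

colors = np.random.rand(1000, 3)
print(augment_gaussian_colors(colors).shape)   # (1000, 3)
```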

    The following figure (Figure 8 from the original paper) shows the object set used:

    Fig. 8: The set of objects used for the data generation pipeline, comprising a variety of everyday items including tableware, toys, and snacks.

The following figure (Figure 2 from the original paper) illustrates training data generation toward 892K samples:

Figure 2 (schematic): the training data generation flow toward 892K expert-agent samples. Expert videos and observations are processed by a large vision-language model to extract latent actions, and a latent action tokenizer (encoder/decoder) is used for predicting and supervising manipulation actions.

5. Experimental Setup

5.1. Datasets

Training corpus categories:

  • Robotic manipulation data: Selected subset of Open X-Embodiment (OXE) focused on single-arm end-effector control tasks.

  • Human video data: Ego-centric recordings from Ego4D and EgoDex.

  • Synthetically generated human-robot pairs: Human2Robot dataset via the pipeline in Section 4.4.

    The following are the results from Table I of the original paper:

    | Dataset | Num. of Traj. |
    | --- | --- |
    | Fractal [79] | 87,212 |
    | Bridge [80] | 60,064 |
    | Droid [81] | 49,933 |
    | Language Table [82] | 442,226 |
    | BC-Z [27] | 43,264 |
    | FMB Dataset [83] | 4,592 |
    | Ego4D [81] | 18,368 |
    | EgoDex [84] | 97,516 |
    | Human2Robot | 89,736 |
    | Overall | 892,911 |

Rationale: Large-scale, diverse datasets covering multiple embodiments and tasks improve generalization. Human2Robot pairs specifically bridge human-to-robot transfer.

5.2. Evaluation Metrics

Primary metric: Success rate (SR) in task execution.

  • Conceptual definition: The fraction of trials in which the robot successfully completes the task as judged by criteria (e.g., human evaluation for real-world tasks).
  • Mathematical formula: $ \mathrm{SR} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}}. $
  • Symbol explanation:
    • $N_{\mathrm{success}}$: number of successful trials.

    • $N_{\mathrm{total}}$: total number of trials attempted.

      Secondary analyses: Seen vs unseen task splits on LIBERO; cross-embodiment robustness; robustness to environment variations.

5.3. Baselines

  • Diffusion Policy (DP): Visuomotor policy via conditional denoising diffusion; stable training for high-dimensional multimodal actions.

  • AWDA: One-shot imitation via attributed waypoints and demonstration augmentation; executes via motor primitives.

  • OpenVLA: Prismatic7B-based VLA trained on OXE and post-trained on LIBERO.

  • UniVLA: Task-centric latent actions trained on action-less datasets; cross-embodiment focus.

    Baselines are representative of diffusion-based policies, waypoint-based OSIL, and state-of-the-art VLAs focusing on discrete/latent action training.

Protocols:

  • LIBERO evaluation: 4 suites (Spatial, Object, Goal, Long), each with 10 tasks; 8 seen tasks for training, 2 unseen for testing.

  • Cross-robot setting: UR arm provides expert videos; Franka arm is agent; model post-trained on UR–Franka pairs; inference from UR video to Franka actions.

  • Real-world setting: Franka Emika 7-DoF arm; 12 tasks (6 seen, 6 unseen); 10 trials per task; human evaluation success.

    The following figure (Figure 10 from the original paper) shows the real-world experimental setup:

    Fig. 10: Real-world experiment setup, showing the robot arm and its workspace with the surrounding environment from side and top views.

6. Results & Analysis

6.1. Core Results Analysis

  • Unseen task learning (same embodiment, LIBERO): ViVLA outperforms DP, AWDA, OpenVLA, and UniVLA on unseen tasks with >30% improvement. Baselines degrade markedly on unseen tasks despite training on OXE or LIBERO.

  • Cross-robot (UR→Franka): ViVLA maintains strong performance; unseen task success remains high with only marginal drop relative to same embodiment, attributed to unified latent action space via A3C.

  • Human videos to real-world: ViVLA achieves high success on seen tasks (validates Human2Robot fidelity) and strong performance on unseen real-world tasks learned from one human video, >38% improvement.

    The following figure (Figure 9 from the original paper) presents example qualitative results and tabular performance comparisons:

    Fig. 9: Example qualitative results for real-world manipulation tasks, together with tables comparing methods on seen and unseen LIBERO tasks and example operation sequences such as flipping a block and wiping a tray.

6.2. Data Presentation (Tables)

The key results tables are provided as image figures in the original paper (e.g., Figure 9); they are embedded above rather than transcribed here.

6.3. Robustness and Environment Generalization

  • Object count and spatial arrangement variations: ViVLA maintains high success across tasks despite differences between expert video scenes and agent execution scenes.

  • Environment variations (camera, lighting, scene): Lighting impacts minimally; camera perspective and scene settings affect performance more, but overall robustness remains strong.

    As can be seen from the results in Figure 12 (from the original paper), success rates under environmental variants remain high:

    Figure 12 (schematic): the effect of environmental variants on success rate, covering camera settings, lighting conditions, and scene changes. Panel (a) lists the four environment variants; panel (b) compares success rates across the variants.

6.4. Ablation Studies / Parameter Analysis

The authors conduct comprehensive ablations on:

  • Latent action prediction tasks.

  • Discriminator and latent action buffer within A3C.

  • Temporal-spatial masking.

  • Parallel vs autoregressive modeling.

  • Language/video exclusion.

    The following are the results from Table V of the original paper:

    | Methods | Spatial Seen | Spatial Unseen | Object Seen | Object Unseen | Goal Seen | Goal Unseen | Long Seen | Long Unseen |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | (a) Latent action prediction | | | | | | | | |
    | W/o prediction | 0.91 | 0.48 | 0.89 | 0.55 | 0.81 | 0.47 | 0.76 | 0.33 |
    | Genie | 0.93 | 0.65 | 0.91 | 0.62 | 0.87 | 0.53 | 0.82 | 0.41 |
    | A3C | 0.95 | 0.71 | 0.98 | 0.73 | 0.95 | 0.58 | 0.90 | 0.51 |
    | (b) Latent action learning framework designs | | | | | | | | |
    | W/o discriminator | 0.89 | 0.54 | 0.91 | 0.59 | 0.86 | 0.45 | 0.81 | 0.38 |
    | Local discriminator | 0.93 | 0.68 | 0.96 | 0.69 | 0.94 | 0.55 | 0.87 | 0.46 |
    | W/o latent action buffer | 0.92 | 0.66 | 0.97 | 0.70 | 0.91 | 0.50 | 0.86 | 0.46 |
    | A3C | 0.95 | 0.71 | 0.98 | 0.73 | 0.95 | 0.58 | 0.90 | 0.51 |
    | (c) Temporal-spatial masking strategy | | | | | | | | |
    | W/o masking | 0.94 | 0.64 | 0.98 | 0.67 | 0.94 | 0.54 | 0.91 | 0.46 |
    | Spatial masking | 0.95 | 0.69 | 0.97 | 0.72 | 0.95 | 0.56 | 0.90 | 0.48 |
    | Temporal-spatial masking | 0.95 | 0.71 | 0.98 | 0.73 | 0.95 | 0.58 | 0.90 | 0.51 |
    | (d) Parallel modeling | | | | | | | | |
    | Auto-regressive | 0.92 | 0.63 | 0.96 | 0.67 | 0.93 | 0.52 | 0.87 | 0.35 |
    | Parallel modeling | 0.95 | 0.71 | 0.98 | 0.73 | 0.95 | 0.58 | 0.90 | 0.51 |

Interpretation:

  • Removing latent action prediction harms unseen task performance: latent prediction pretraining is critical.

  • A3C outperforms Genie, validating cycle consistency’s role in semantic, unified latent actions.

  • Discriminator is essential to prevent leakage and distribution mismatch; local-global variant best.

  • Latent action buffer stabilizes training; removing it leads to collapse.

  • Temporal-spatial masking improves unseen performance by encouraging robust video understanding.

  • Parallel modeling beats autoregressive, especially on long-horizon unseen tasks, avoiding compounding errors and shortcut learning.

    Language/video exclusion ablation (Table VI):

    | Methods | Close basket | Stir in tray | Stack block | Beat the drum | Push the toy | Overall |
    | --- | --- | --- | --- | --- | --- | --- |
    | Ours w/o lang | 1.0 | 0.9 | 0.7 | 0.6 | 0.6 | 0.76 |
    | Ours w/o video | 0.9 | 0.9 | 0.3 | 0.1 | 0.0 | 0.44 |
    | Ours | 1.0 | 0.9 | 0.8 | 0.6 | 0.7 | 0.80 |

Interpretation:

  • Without language, performance degrades slightly: ViVLA can infer task semantics from demonstrations alone (strong video understanding).
  • Without videos, unseen tasks suffer drastically: videos are essential for learning novel behaviors; seen tasks remain decent due to internalized skills.

7. Conclusion & Reflections

7.1. Conclusion Summary

  • ViVLA enables one-shot learning of unseen manipulation tasks directly from expert videos at test time, without additional fine-tuning.
  • A3C latent action learning with cycle consistency yields a unified latent action space bridging human and robot embodiments.
  • Parallel decoding and temporal-spatial masking ensure robust video understanding and efficient inference.
  • A scalable 4D Gaussian splatting pipeline generates human-robot paired data, expanding to 892,911 samples with curated public pairs.
  • Empirically, ViVLA achieves >30% improvement on unseen LIBERO tasks, >35% gains with cross-embodiment videos, and >38% on real-world unseen tasks from human videos.

7.2. Limitations & Future Work

Authors’ noted limitations:

  • Failures often arise from grasping/placement inaccuracies due to occlusions and static camera viewpoints with low task-relevant pixel density.
  • Proposed addition of a wrist-mounted camera to increase detailed interaction visibility and task-relevant information.
  • Error recovery can be improved by augmenting data with controlled trajectory perturbations and recovery sequences.
  • Scaling to internet-scale human videos requires robust filtering, task identification, and quality assessment pipelines.

7.3. Personal Insights & Critique

  • Innovations likely generalize beyond manipulation to other embodied tasks that require fine-grained action recognition (e.g., navigation with interactive components), provided latent actions can be defined.

  • The A3C approach is particularly compelling: enforcing codebook-level cycle consistency across embodiments could benefit multi-agent imitation, teleoperation, or skill libraries from diverse sources.

  • The GAN losses are unconventional; a deeper analysis of stability, discriminator calibration, and the exact leakage mitigation mechanisms would be valuable.

  • Parallel decoding is a strong practical idea to fight autoregressive shortcut behavior; future work could explore hybrid schedules that switch between parallel and limited autoregressive decoding for longer horizons.

  • The Human2Robot pipeline is a promising direction; integrating physics-consistency checks (e.g., grasp stability simulators) and richer tactile cues could further improve transfer fidelity.

  • An open question is how instruction conditioning interacts with latent actions for compositional tasks; future research might decompose high-level goals into latent sub-skill sequences and plan over them.

  • Robustness to camera/view/scene variations is good but still impacted; multi-camera fusion and active viewpoint control during execution could be explored to mitigate failures due to occlusions or perspective shifts.

    The following figure (Figure 11 from the original paper) provides additional scenes and task results across varied setups:

    Figure 11 (schematic): demonstration scenes for additional tasks and the corresponding results. The accompanying table lists the performance of different methods on tasks including "Close basket", "Stir in tray", "Stack block", "Beat the drum", and "Push the toy", comparing the proposed method (overall score 0.80) against the alternatives.
