Tag Filter: Vision-Language-Action Model
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Published:6/2/2025
Vision-Language-Action Model, Efficient Robotics Model, Compact Model Design, Consumer-Grade Hardware Deployment
SmolVLA is a compact and efficient vision-language-action model that achieves competitive performance at reduced computational cost, enabling deployment on consumer-grade hardware and promoting broader participation in robotics research through community-driven dataset pretraining.
CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human
Published:9/18/2025
Vision-Language-Action Model, Self-Reflective Framework, Diffusion-Based Action Generation, Mixture-of-Experts Design, Action Grounding and Reflection Tuning
CollabVLA transforms standard visuomotor policies into collaborative assistants by integrating reflective reasoning with diffusion-based action generation, addressing key limitations such as overfitting and limited interpretability while enabling self-reflection and human guidance.
Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
Published:5/21/2025
Vision-Language-Action Model, Cross-Task Zero-Shot Generalization, Robotic Manipulation Benchmark, LLM Task Prediction, AGNOSTOS Benchmark
This study introduces the AGNOSTOS benchmark for evaluating cross-task zero-shot generalization in Vision-Language-Action models and proposes the X-ICM method, which enhances action sequence prediction for unseen tasks, showing significant performance improvements.
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
Published:10/22/2025
Vision-Language-Action Model, Natural Interaction Framework, Parallel Behavioral Models, Humanoid Robot Interaction, Real-Time User Interruption Handling
VITA-E introduces a dual-model framework that enables simultaneous seeing, hearing, speaking, and acting, overcoming limitations of existing VLA models. The design supports efficient multitasking while handling real-time interruptions, improving interaction fluidity and responsiveness.
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Published:7/23/2025
Vision-Language-Action Model, Multimodal Reasoning and Action Generation, Vision-Language-Action Instruction Tuning, Mixture-of-Experts Adaptation Training, Robotic Operation and High-Level Instruction Understanding
InstructVLA is a novel Vision-Language-Action model that introduces Vision-Language-Action Instruction Tuning (VLA-IT) to mitigate catastrophic forgetting in robotic tasks. It achieves significant performance improvements while maintaining robust multimodal understanding and precise manipulation.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Published:11/29/2024
Vision-Language-Action Model, Robotic Manipulation Task Execution, Diffusion Action Transformer, Large Vision-Language Model, Robotic Adaptation and Generalization
CogACT is a novel Vision-Language-Action model that enhances robotic manipulation by decoupling cognition and action. Its componentized architecture uses a powerful Vision-Language Model for comprehension and a Diffusion Action Module for precise control, significantly outperforming existing models.
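To make the "cognition vs. action" split described above concrete, here is a minimal, purely illustrative Python sketch: a stubbed vision-language backbone produces a single cognition feature, and a separate action head maps it to a short chunk of low-level actions. All class names, dimensions, and the linear action head are assumptions for illustration; CogACT's actual VLM and diffusion-based action module are not reproduced here.

```python
# Hypothetical sketch of a cognition/action split: a stand-in vision-language
# backbone yields one "cognition feature", and a separate action head turns it
# into a chunk of low-level actions. Names and sizes are illustrative only.
import numpy as np

class DummyVLMBackbone:
    """Stand-in for a large vision-language model (image + instruction -> feature)."""
    def __init__(self, feature_dim: int = 512, seed: int = 0):
        self.feature_dim = feature_dim
        self.rng = np.random.default_rng(seed)

    def cognition_feature(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would fuse vision and language; here we just pool and perturb.
        token = (hash(instruction) % 1000) / 1000.0
        pooled = image.mean() + token
        return self.rng.standard_normal(self.feature_dim) * 0.01 + pooled

class DummyActionHead:
    """Stand-in for the action module: cognition feature -> action chunk.
    (The diffusion process is reduced to one deterministic mapping for brevity.)"""
    def __init__(self, feature_dim: int = 512, action_dim: int = 7, horizon: int = 16):
        rng = np.random.default_rng(1)
        self.w = rng.standard_normal((feature_dim, action_dim * horizon)) * 0.01
        self.action_dim, self.horizon = action_dim, horizon

    def predict_chunk(self, feature: np.ndarray) -> np.ndarray:
        flat = feature @ self.w                      # (action_dim * horizon,)
        return flat.reshape(self.horizon, self.action_dim)

backbone, head = DummyVLMBackbone(), DummyActionHead()
image = np.zeros((224, 224, 3))                      # placeholder camera frame
chunk = head.predict_chunk(backbone.cognition_feature(image, "pick up the red block"))
print(chunk.shape)                                   # (16, 7): 16 steps of 7-DoF actions
```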
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Published:2/27/2025
Vision-Language-Action Model, Multi-Task Robotic Manipulation, Open-Ended Instruction Following, Complex Instruction Processing, Robotic Feedback Mechanism
The system uses hierarchical vision-language models to handle complex instructions and feedback, reasoning about the next steps in task execution, and is demonstrated across multiple robotic platforms on tasks such as cleaning and making sandwiches.
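A hedged sketch of the kind of hierarchical loop described above: a slow high-level policy turns an open-ended instruction into language subgoals, and a fast low-level policy turns the current subgoal plus an observation into motor commands. Both policies are stubs; the names, the comma-splitting "planner", and the step counts are made up for illustration and are not Hi Robot's implementation.

```python
# Toy hierarchical control loop: slow subgoal selection, fast action generation.
class HighLevelPolicy:
    """Toy 'reasoning': split the open-ended instruction into sequential subgoals."""
    def plan(self, instruction: str) -> list[str]:
        return [step.strip() for step in instruction.split(",") if step.strip()]

class LowLevelPolicy:
    """Toy visuomotor policy: a subgoal plus an observation becomes a motor command."""
    steps_per_subgoal = 3   # pretend each subgoal takes a few control steps

    def act(self, subgoal: str, observation: dict) -> str:
        return f"motor command for '{subgoal}' at t={observation['t']}"

def run_episode(instruction: str) -> None:
    high, low = HighLevelPolicy(), LowLevelPolicy()
    for subgoal in high.plan(instruction):           # slow loop: subgoal selection
        for t in range(low.steps_per_subgoal):       # fast loop: action generation
            print(low.act(subgoal, {"t": t}))

run_episode("pick up the bread, add cheese, close the sandwich")
```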
Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning
Published:8/14/2025
Large Model Empowered Embodied AI, Embodied Learning Methodologies, Autonomous Decision-Making Mechanisms, Vision-Language-Action Model, Imitation Learning and Reinforcement Learning
This survey discusses large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. It explores hierarchical and end-to-end decision paradigms, highlighting how large models enhance decision-making processes and Vision-Language-Action models.
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Published:12/2/2025
Vision-Language-Action Model, Robot Manipulation Failure Diagnosis, ViFailback Dataset, Visual Question Answering Tasks, Robotic Manipulation Correction Guidance
The paper introduces the ViFailback framework for diagnosing and correcting robotic manipulation failures using explicit visual symbols. A large dataset with 58,126 VQA pairs and 5,202 trajectories is released to validate the ViFailback-8B model's effectiveness in real-world failure recovery.
REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
Published:12/23/2025
Vision-Language-Action Model, Robotic Manipulation Benchmark, Robotic Generalization Evaluation, High-Fidelity Simulation Environment, Task Perturbation Factors
REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting ongoing generalization challenges.
Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
Published:9/26/2025
Fine-Tuning of Visual Language Models, Vision-Language-Action Model, Catastrophic Forgetting Prevention, Low-Rank Adaptation Method, Robot Teleoperation Data
The paper presents VLM2VLA, a method for fine-tuning vision-language models (VLMs) into vision-language-action models (VLAs) without catastrophic forgetting by representing low-level robot actions in natural language, achieving zero-shot generalization in real experiments.
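The core idea above, representing low-level actions as ordinary text so a VLM can be fine-tuned without leaving its native language interface, can be illustrated with a toy round-trip encoding. The field names and number formatting below are assumptions for illustration; VLM2VLA's actual action-to-language encoding is not reproduced here.

```python
# Illustrative only: one simple way to round-trip a low-level action through text,
# in the spirit of "actions as language". A 7-DoF action becomes an ordinary
# sentence that lives in the VLM's token space, and can be parsed back.

def action_to_text(action: list[float]) -> str:
    # e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]
    names = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
    return "move " + " ".join(f"{n}={v:+.3f}" for n, v in zip(names, action))

def text_to_action(text: str) -> list[float]:
    return [float(tok.split("=")[1]) for tok in text.split()[1:]]

a = [0.01, -0.02, 0.0, 0.0, 0.0, 0.1, 1.0]
s = action_to_text(a)
print(s)                 # "move dx=+0.010 dy=-0.020 ... gripper=+1.000"
assert text_to_action(s) == [round(x, 3) for x in a]
```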
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
Published:3/16/2025
Vision-Language-Action Model, Robotic Video Synthesis, Real-to-Sim-to-Real Approach, Robot Dataset Scaling, Robotic Manipulation Tasks
ReBot enhances robot learning with a real-to-sim-to-real video synthesis method that addresses data-scaling challenges. It replays real robot movements in simulators and combines them with inpainted real-world backgrounds, significantly improving VLA model performance.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Published:8/27/2025
Vision-Language-Action Model, Robotic Manipulation, Long-Term Memory and Anticipatory Action, Memory-Conditioned Diffusion Models, Short-Term Memory and Cognition Fusion
MemoryVLA is a memory-centric Vision-Language-Action framework for non-Markovian robotic manipulation that integrates working memory and episodic memory. It significantly improves performance across 150 tasks, achieving up to a 26% increase in success rate in simulation and real-world settings.
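A small sketch of the episodic-memory idea mentioned above: store past perceptual features in a bank, retrieve the entries most similar to the current observation, and hand them to the policy as extra context. Feature sizes, capacity, and the cosine-similarity retrieval are illustrative choices; MemoryVLA's memory-conditioned diffusion policy itself is omitted.

```python
# Toy episodic memory with cosine-similarity retrieval (illustrative sizes).
import numpy as np

class EpisodicMemory:
    def __init__(self, feature_dim: int = 64, capacity: int = 512):
        self.bank = np.zeros((0, feature_dim))
        self.capacity = capacity

    def write(self, feature: np.ndarray) -> None:
        # Append the new feature and keep only the most recent `capacity` entries.
        self.bank = np.vstack([self.bank, feature[None]])[-self.capacity:]

    def retrieve(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        if len(self.bank) == 0:
            return np.zeros((0, query.shape[0]))
        sims = self.bank @ query / (
            np.linalg.norm(self.bank, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argsort(sims)[::-1][:k]             # k most similar past steps
        return self.bank[top]

rng = np.random.default_rng(0)
memory = EpisodicMemory()
for _ in range(20):                                  # pretend 20 past observation features
    memory.write(rng.standard_normal(64))
context = memory.retrieve(rng.standard_normal(64))   # (4, 64) retrieved context
print(context.shape)
```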
Real-Time Execution of Action Chunking Flow Policies
Published:6/9/2025
Real-Time Action Chunking Policy Execution, Vision-Language-Action Model, High-Frequency Control Tasks, Kinetix Simulator, Action Chunking Algorithm
This paper introduces real-time chunking (RTC), an algorithm that addresses inference-latency issues in real-time control of vision-language-action models, showing improved task throughput and high success rates in dynamic, real-world tasks.
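A simplified sketch of the latency-hiding idea behind real-time chunking: trigger inference for the next action chunk a few control steps before the current chunk runs out, so execution never stalls on the model. The chunk length and latency figures are arbitrary, and the consistency machinery RTC adds for the overlapping actions of consecutive chunks is deliberately left out.

```python
# Latency-hiding control loop: start the next inference before the current
# chunk is exhausted (chunk length and latency are assumed values).
CHUNK_LEN = 8        # actions per predicted chunk
LATENCY = 3          # control steps one inference call is assumed to take

def infer_chunk(start_step: int) -> list[str]:
    return [f"action[{start_step + i}]" for i in range(CHUNK_LEN)]

def control_loop(total_steps: int = 24) -> None:
    chunk, next_chunk, infer_done_at = infer_chunk(0), None, None
    idx = 0
    for t in range(total_steps):
        # Kick off inference for the next chunk LATENCY steps before we run out.
        if next_chunk is None and idx == CHUNK_LEN - LATENCY:
            next_chunk, infer_done_at = infer_chunk(t + LATENCY), t + LATENCY
        print(f"t={t:02d} executing {chunk[idx]}")
        idx += 1
        if idx == CHUNK_LEN:                         # old chunk exhausted: swap in the new one
            assert infer_done_at is not None and t + 1 >= infer_done_at
            chunk, next_chunk, infer_done_at, idx = next_chunk, None, None, 0

control_loop()
```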
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for Spatiotemporally Coherent Robotic Manipulation
Published:11/21/2025
Vision-Language-Action Model, Spatiotemporal Coherent Robotic Manipulation, 4D-aware Visual Representation, Multimodal Action Representation, VLA Dataset Extension
The VLA-4D model introduces 4D awareness into Vision-Language-Action models for coherent robotic manipulation, integrating spatial and temporal information to ensure smooth and consistent actions in robot tasks.
See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Published:12/8/2025
Vision-Language-Action Model, Robotic Manipulation Policy Learning, One-Shot Learning from Demonstration, Learning from Human Video Demonstrations, Expert Demonstration Video Generation
ViVLA is a general robotic manipulation model that learns new tasks from a single expert video demonstration. By processing the video alongside robot observations, it distills expertise for improved performance in unseen tasks, showing significant experimental gains.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Published:3/19/2025
Generalist Humanoid Robot Foundation Model, Vision-Language-Action Model, Diffusion Transformer Module, Humanoid Robot Manipulation Tasks, Multimodal Data Training
GR00T N1 is an open foundation model for humanoid robots that integrates a reasoning module and a motion-generation module. Trained end-to-end on a pyramid of heterogeneous data, it outperforms existing imitation learning methods in simulation benchmarks, demonstrating high performance.
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
Published:10/2/2025
Vision-Language-Action Model, Robotic Manipulation Failure Recovery, Failure Generation and Recovery System, Robotic Manipulation Dataset, Large-Scale Robot Training Data
The paper introduces FailSafe, a system that generates diverse failure scenarios and executable recovery actions for Vision-Language-Action (VLA) models, achieving up to a 22.6% performance improvement in robotic failure recovery across various tasks.
RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Published:5/18/2025
Robotic Failure Analysis and Correction Framework, Vision-Language-Action Model, Task Understanding for Failure Correction, RoboFAC Dataset, Open-World Robotic Manipulation
The RoboFAC framework improves robotic failure analysis and correction for Vision-Language-Action models in open-world scenarios. It includes a large dataset and a model capable of task understanding, with experiments showing significant performance improvements.
TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Published:10/8/2025
Vision-Language-Action Model, Spatial Reasoning Mechanism, Target Identification Memory, Long-Horizon Consistency Modeling, Autoregressive Reasoning Model
TrackVLA++ is a novel Vision-Language-Action model that enhances embodied visual tracking by introducing a spatial reasoning mechanism and a Target Identification Memory. It effectively addresses tracking failures under severe occlusion and achieves state-of-the-art performance.