Tag Filter: Vision-Language-Action Model
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Published:6/2/2025
Vision-Language-Action Model, Efficient Robotics Model, Compact Model Design, Consumer-Grade Hardware Deployment
SmolVLA is a compact and efficient vision-language-action model that achieves competitive performance at reduced computational cost, enabling deployment on consumer-grade hardware and promoting broader participation in robotics research through community-driven dataset pretraining.
CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human
Published:9/18/2025
Vision-Language-Action Model, Self-Reflective Framework, Diffusion-Based Action Generation, Mixture-of-Experts Design, Action Grounding and Reflection Tuning
CollabVLA transforms standard visuomotor policies into collaborative assistants by integrating reflective reasoning with diffusion-based action generation, addressing key limitations such as overfitting and limited interpretability while enabling self-reflection and human guidance.
Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
Published:5/21/2025
Vision-Language-Action Model, Cross-Task Zero-Shot Generalization, Robotic Manipulation Benchmark, LLM Task Prediction, AGNOSTOS Benchmark
This study introduces the AGNOSTOS benchmark for evaluating cross-task zero-shot generalization in Vision-Language-Action models and proposes the X-ICM method, which enhances action sequence prediction for unseen tasks, showing significant performance improvements.
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
Published:10/22/2025
Vision-Language-Action Model, Natural Interaction Framework, Parallel Behavioral Models, Humanoid Robot Interaction, Real-Time User Interruption Handling
VITA-E introduces a dual-model framework that enables simultaneous seeing, hearing, speaking, and acting, overcoming limitations of existing VLA models. The design supports efficient multitasking while handling real-time interruptions, improving interaction fluidity and responsiveness.
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Published:7/23/2025
Vision-Language-Action Model, Multimodal Reasoning and Action Generation, Vision-Language-Action Instruction Tuning, Mixture-of-Experts Adaptation Training, Robotic Operation and High-Level Instruction Understanding
InstructVLA is a novel Vision-Language-Action model that introduces Vision-Language-Action Instruction Tuning (VLA-IT) to mitigate catastrophic forgetting in robotic tasks. It achieves significant performance improvements while maintaining robust multimodal understanding and precise manipulation.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Published:11/29/2024
Vision-Language-Action Model, Robotic Manipulation Task Execution, Diffusion Action Transformer, Large Vision-Language Model, Robotic Adaptation and Generalization
CogACT is a novel Vision-Language-Action model that enhances robotic manipulation by decoupling cognition and action. Its componentized architecture uses a powerful Vision-Language Model for comprehension and a Diffusion Action Module for precise control, significantly outperforming existing models.
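To make the "cognition vs. action" split described above concrete, here is a minimal, purely illustrative Python sketch: a stubbed vision-language backbone produces a single cognition feature, and a separate action head maps it to a short chunk of low-level actions. All class names, dimensions, and the linear action head are assumptions for illustration; CogACT's actual VLM and diffusion-based action module are not reproduced here.

```python
# Hypothetical sketch of a cognition/action split: a stand-in vision-language
# backbone yields one "cognition feature", and a separate action head turns it
# into a chunk of low-level actions. Names and sizes are illustrative only.
import numpy as np

class DummyVLMBackbone:
    """Stand-in for a large vision-language model (image + instruction -> feature)."""
    def __init__(self, feature_dim: int = 512, seed: int = 0):
        self.feature_dim = feature_dim
        self.rng = np.random.default_rng(seed)

    def cognition_feature(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real model would fuse vision and language; here we just pool and perturb.
        token = (hash(instruction) % 1000) / 1000.0
        pooled = image.mean() + token
        return self.rng.standard_normal(self.feature_dim) * 0.01 + pooled

class DummyActionHead:
    """Stand-in for the action module: cognition feature -> action chunk.
    (The diffusion process is reduced to one deterministic mapping for brevity.)"""
    def __init__(self, feature_dim: int = 512, action_dim: int = 7, horizon: int = 16):
        rng = np.random.default_rng(1)
        self.w = rng.standard_normal((feature_dim, action_dim * horizon)) * 0.01
        self.action_dim, self.horizon = action_dim, horizon

    def predict_chunk(self, feature: np.ndarray) -> np.ndarray:
        flat = feature @ self.w                      # (action_dim * horizon,)
        return flat.reshape(self.horizon, self.action_dim)

backbone, head = DummyVLMBackbone(), DummyActionHead()
image = np.zeros((224, 224, 3))                      # placeholder camera frame
chunk = head.predict_chunk(backbone.cognition_feature(image, "pick up the red block"))
print(chunk.shape)                                   # (16, 7): 16 steps of 7-DoF actions
```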
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Published:2/27/2025
Vision-Language-Action Model, Multi-Task Robotic Manipulation, Open-Ended Instruction Following, Complex Instruction Processing, Robotic Feedback Mechanism
The system uses hierarchical vision-language models to handle complex instructions and feedback, reasoning about the next steps in task execution, and is demonstrated across multiple robotic platforms on tasks such as cleaning and making sandwiches.
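A hedged sketch of the kind of hierarchical loop described above: a slow high-level policy turns an open-ended instruction into language subgoals, and a fast low-level policy turns the current subgoal plus an observation into motor commands. Both policies are stubs; the names, the comma-splitting "planner", and the step counts are made up for illustration and are not Hi Robot's implementation.

```python
# Toy hierarchical control loop: slow subgoal selection, fast action generation.
class HighLevelPolicy:
    """Toy 'reasoning': split the open-ended instruction into sequential subgoals."""
    def plan(self, instruction: str) -> list[str]:
        return [step.strip() for step in instruction.split(",") if step.strip()]

class LowLevelPolicy:
    """Toy visuomotor policy: a subgoal plus an observation becomes a motor command."""
    steps_per_subgoal = 3   # pretend each subgoal takes a few control steps

    def act(self, subgoal: str, observation: dict) -> str:
        return f"motor command for '{subgoal}' at t={observation['t']}"

def run_episode(instruction: str) -> None:
    high, low = HighLevelPolicy(), LowLevelPolicy()
    for subgoal in high.plan(instruction):           # slow loop: subgoal selection
        for t in range(low.steps_per_subgoal):       # fast loop: action generation
            print(low.act(subgoal, {"t": t}))

run_episode("pick up the bread, add cheese, close the sandwich")
```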
Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning
Published:8/14/2025
Large Model Empowered Embodied AI, Embodied Learning Methodologies, Autonomous Decision-Making Mechanisms, Vision-Language-Action Model, Imitation Learning and Reinforcement Learning
This survey discusses large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. It explores hierarchical and end-to-end decision paradigms, highlighting how large models enhance decision-making processes and Vision-Language-Action models.
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Published:12/2/2025
Vision-Language-Action Model, Robot Manipulation Failure Diagnosis, ViFailback Dataset, Visual Question Answering Tasks, Robotic Manipulation Correction Guidance
The paper introduces the ViFailback framework for diagnosing and correcting robotic manipulation failures using explicit visual symbols. A large dataset with 58,126 VQA pairs and 5,202 trajectories is released to validate the ViFailback-8B model's effectiveness in real-world failure recovery.
REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
Published:12/23/2025
Vision-Language-Action Model, Robotic Manipulation Benchmark, Robotic Generalization Evaluation, High-Fidelity Simulation Environment, Task Perturbation Factors
REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting ongoing generalization challenges.
Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
Published:9/26/2025
Fine-Tuning of Visual Language Models, Vision-Language-Action Model, Catastrophic Forgetting Prevention, Low-Rank Adaptation Method, Robot Teleoperation Data
The paper presents VLM2VLA, a method for fine-tuning vision-language models (VLMs) into vision-language-action models (VLAs) without catastrophic forgetting by representing low-level robot actions in natural language, achieving zero-shot generalization in real experiments.
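The core idea above, representing low-level actions as ordinary text so a VLM can be fine-tuned without leaving its native language interface, can be illustrated with a toy round-trip encoding. The field names and number formatting below are assumptions for illustration; VLM2VLA's actual action-to-language encoding is not reproduced here.

```python
# Illustrative only: one simple way to round-trip a low-level action through text,
# in the spirit of "actions as language". A 7-DoF action becomes an ordinary
# sentence that lives in the VLM's token space, and can be parsed back.

def action_to_text(action: list[float]) -> str:
    # e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]
    names = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
    return "move " + " ".join(f"{n}={v:+.3f}" for n, v in zip(names, action))

def text_to_action(text: str) -> list[float]:
    return [float(tok.split("=")[1]) for tok in text.split()[1:]]

a = [0.01, -0.02, 0.0, 0.0, 0.0, 0.1, 1.0]
s = action_to_text(a)
print(s)                 # "move dx=+0.010 dy=-0.020 ... gripper=+1.000"
assert text_to_action(s) == [round(x, 3) for x in a]
```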
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
Published:3/16/2025
Vision-Language-Action Model, Robotic Video Synthesis, Real-to-Sim-to-Real Approach, Robot Dataset Scaling, Robotic Manipulation Tasks
ReBot enhances robot learning with a real-to-sim-to-real video synthesis method that addresses data-scaling challenges. It replays real robot movements in simulators and combines them with inpainted real-world backgrounds, significantly improving VLA model performance.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Published:8/27/2025
Vision-Language-Action Model, Robotic Manipulation, Long-Term Memory and Anticipatory Action, Memory-Conditioned Diffusion Models, Short-Term Memory and Cognition Fusion
MemoryVLA is a memory-centric Vision-Language-Action framework for non-Markovian robotic manipulation that integrates working memory and episodic memory. It significantly improves performance across 150 tasks, achieving up to a 26% increase in success rate in simulation and real-world settings.
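A small sketch of the episodic-memory idea mentioned above: store past perceptual features in a bank, retrieve the entries most similar to the current observation, and hand them to the policy as extra context. Feature sizes, capacity, and the cosine-similarity retrieval are illustrative choices; MemoryVLA's memory-conditioned diffusion policy itself is omitted.

```python
# Toy episodic memory with cosine-similarity retrieval (illustrative sizes).
import numpy as np

class EpisodicMemory:
    def __init__(self, feature_dim: int = 64, capacity: int = 512):
        self.bank = np.zeros((0, feature_dim))
        self.capacity = capacity

    def write(self, feature: np.ndarray) -> None:
        # Append the new feature and keep only the most recent `capacity` entries.
        self.bank = np.vstack([self.bank, feature[None]])[-self.capacity:]

    def retrieve(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        if len(self.bank) == 0:
            return np.zeros((0, query.shape[0]))
        sims = self.bank @ query / (
            np.linalg.norm(self.bank, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argsort(sims)[::-1][:k]             # k most similar past steps
        return self.bank[top]

rng = np.random.default_rng(0)
memory = EpisodicMemory()
for _ in range(20):                                  # pretend 20 past observation features
    memory.write(rng.standard_normal(64))
context = memory.retrieve(rng.standard_normal(64))   # (4, 64) retrieved context
print(context.shape)
```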
Real-Time Execution of Action Chunking Flow Policies
Published:6/9/2025
Real-Time Action Chunking Policy Execution, Vision-Language-Action Model, High-Frequency Control Tasks, Kinetix Simulator, Action Chunking Algorithm
This paper introduces real-time chunking (RTC), an algorithm that addresses inference-latency issues in real-time control of vision-language-action models, showing improved task throughput and high success rates in dynamic, real-world tasks.
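A simplified sketch of the latency-hiding idea behind real-time chunking: trigger inference for the next action chunk a few control steps before the current chunk runs out, so execution never stalls on the model. The chunk length and latency figures are arbitrary, and the consistency machinery RTC adds for the overlapping actions of consecutive chunks is deliberately left out.

```python
# Latency-hiding control loop: start the next inference before the current
# chunk is exhausted (chunk length and latency are assumed values).
CHUNK_LEN = 8        # actions per predicted chunk
LATENCY = 3          # control steps one inference call is assumed to take

def infer_chunk(start_step: int) -> list[str]:
    return [f"action[{start_step + i}]" for i in range(CHUNK_LEN)]

def control_loop(total_steps: int = 24) -> None:
    chunk, next_chunk, infer_done_at = infer_chunk(0), None, None
    idx = 0
    for t in range(total_steps):
        # Kick off inference for the next chunk LATENCY steps before we run out.
        if next_chunk is None and idx == CHUNK_LEN - LATENCY:
            next_chunk, infer_done_at = infer_chunk(t + LATENCY), t + LATENCY
        print(f"t={t:02d} executing {chunk[idx]}")
        idx += 1
        if idx == CHUNK_LEN:                         # old chunk exhausted: swap in the new one
            assert infer_done_at is not None and t + 1 >= infer_done_at
            chunk, next_chunk, infer_done_at, idx = next_chunk, None, None, 0

control_loop()
```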
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for Spatiotemporally Coherent Robotic Manipulation
Published:11/21/2025
Vision-Language-Action Model, Spatiotemporal Coherent Robotic Manipulation, 4D-aware Visual Representation, Multimodal Action Representation, VLA Dataset Extension
The VLA-4D model introduces 4D awareness into Vision-Language-Action models for coherent robotic manipulation, integrating spatial and temporal information to ensure smooth and consistent actions in robot tasks.
See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Published:12/8/2025
Vision-Language-Action Model, Robotic Manipulation Policy Learning, One-Shot Learning from Demonstration, Learning from Human Video Demonstrations, Expert Demonstration Video Generation
ViVLA is a general robotic manipulation model that learns new tasks from a single expert video demonstration. By processing the video alongside robot observations, it distills expertise for improved performance in unseen tasks, showing significant experimental gains.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Published:3/19/2025
Generalist Humanoid Robot Foundation Model, Vision-Language-Action Model, Diffusion Transformer Module, Humanoid Robot Manipulation Tasks, Multimodal Data Training
GR00T N1 is an open foundation model for humanoid robots that integrates a reasoning module and a motion-generation module. Trained end-to-end on a pyramid of heterogeneous data, it outperforms existing imitation learning methods in simulation benchmarks, demonstrating high performance.
FailSafe: Reasoning and Recovery from Failures in Vision-Language-Action Models
Published:10/2/2025
Vision-Language-Action Model, Robotic Manipulation Failure Recovery, Failure Generation and Recovery System, Robotic Manipulation Dataset, Large-Scale Robot Training Data
The paper introduces FailSafe, a system that generates diverse failure scenarios and executable recovery actions for Vision-Language-Action (VLA) models, achieving up to a 22.6% performance improvement in robotic failure recovery across various tasks.
RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Published:5/18/2025
Robotic Failure Analysis and Correction Framework, Vision-Language-Action Model, Task Understanding for Failure Correction, RoboFAC Dataset, Open-World Robotic Manipulation
The RoboFAC framework improves robotic failure analysis and correction for Vision-Language-Action models in open-world scenarios. It includes a large dataset and a model capable of task understanding, with experiments showing significant performance improvements.
TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Published:10/8/2025
Vision-Language-Action Model, Spatial Reasoning Mechanism, Target Identification Memory, Long-Horizon Consistency Modeling, Autoregressive Reasoning Model
TrackVLA++ is a novel Vision-Language-Action model that enhances embodied visual tracking by introducing a spatial reasoning mechanism and a Target Identification Memory. It effectively addresses tracking failures under severe occlusion and achieves state-of-the-art performance.