SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
TL;DR Summary
SmolVLA is a compact and efficient vision-language-action model that achieves competitive performance at reduced computational costs, enabling deployment on consumer-grade hardware and promoting broader participation in robotics research through community-driven dataset pretraining.
Abstract
SmolVLA is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. It introduces a novel, compact, and computationally efficient model designed to democratize robotics by enabling deployment on consumer-grade hardware.
1.2. Authors
The paper lists numerous authors primarily affiliated with Hugging Face, with additional contributions from Sorbonne University, valeo.ai, and École Normale Supérieure Paris-Saclay. The core team members are highlighted with an asterisk: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. Their diverse affiliations suggest a collaborative effort spanning academic research and industry, particularly in the domain of large language models and machine learning for robotics.
1.3. Journal/Conference
This paper was published as a preprint, indicated by its presence on Hugging Face Papers and arXiv. The publication date in the source metadata (2001-06-01T16:00:00.000Z) appears to be a placeholder or an error, as the arXiv identifier 2506.01844 implies a publication date in 2025. As a preprint, it has not yet undergone formal peer review in a journal or conference proceedings. However, Hugging Face is a highly influential platform in the machine learning and natural language processing community, making it a prominent venue for sharing research, particularly for models and datasets.
1.4. Publication Year
Based on the arXiv identifier 2506.01844, the intended publication year is 2025. The provided UTC timestamp (2001-06-01T16:00:00.000Z) is inconsistent with the arXiv ID and likely an error.
1.5. Abstract
The paper introduces SmolVLA, a compact and efficient vision-language-action (VLA) model. The primary objective is to develop a VLA that achieves competitive performance while significantly reducing computational costs, enabling its deployment on consumer-grade hardware like standard GPUs or even CPUs. The model incorporates an asynchronous inference stack to enhance responsiveness by decoupling perception and action prediction from action execution, leading to higher control rates. A notable aspect is SmolVLA's pretraining on publicly available, community-contributed datasets. Despite its small size, SmolVLA demonstrates performance comparable to VLAs that are up to 10 times larger, as validated through evaluations on both simulated and real-world robotic benchmarks. The authors commit to releasing all code, pretrained models, and training data to foster reproducibility and broader participation in robotics research.
1.6. Original Source Link
- Original Source Link: https://huggingface.co/papers/2506.01844
- PDF Link: https://arxiv.org/pdf/2506.01844.pdf
The paper is available as a preprint on Hugging Face Papers and arXiv.
2. Executive Summary
2.1. Background & Motivation
The field of robotics is increasingly moving towards foundation models, particularly Vision-Language Models (VLMs), as a strong basis for robotic policies. These models, pretrained on large-scale multimodal datasets, encode rich visual and linguistic knowledge, enabling natural language-driven perception and control. However, the core problem is that existing Vision-Language-Action (VLA) models are typically massive, often comprising billions of parameters. This leads to several significant challenges:
-
High training costs: The computational resources required for training these large models are prohibitive for most researchers and institutions.
-
Limited real-world deployability: Their immense size makes them impractical for deployment on resource-constrained platforms, such as affordable robots or consumer-grade hardware.
-
Reliance on academic/industrial datasets: Many existing VLAs are trained on proprietary or specialized datasets, overlooking the growing availability of
community-collected data from more accessible robotic platforms. This limits accessibility and reproducibility within the broader robotics research community.
The problem is important because it hinders the democratization of robotics: high computational and hardware barriers prevent wider participation and innovation in robot learning. The paper's entry point is to address these limitations by developing a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. The innovative idea is to achieve this efficiency through specific architectural choices and a novel inference strategy, making advanced robotics more accessible.
2.2. Main Contributions / Findings
The paper makes several primary contributions aimed at making VLAs more affordable and efficient:
-
Lightweight Architecture (
SmolVLA): They present SmolVLA, a compact and efficient vision-language-action model optimized for training on consumer-grade GPUs and deployment even on CPUs. Key design choices include: - Skipping layers in the VLM: Reducing computational load during inference by using features from only the initial layers.
- Using a minimal number of visual tokens: Reducing the input dimension and processing requirements.
- Leveraging small pretrained VLMs: Building upon already efficient vision-language backbones.
- Interleaving self-attention and cross-attention layers: Optimizing the interaction between visual features and action tokens for better performance and speed.
-
Pretraining on Community-Driven Datasets:
SmolVLA is trained end-to-end on fewer than 30,000 episodes (approximately 10.6 million frames) sourced exclusively from publicly available, community-contributed datasets. This demonstrates strong performance with significantly less data than prior art and highlights the value of open-source data. -
Asynchronous Inference Stack: They introduce an asynchronous inference stack that decouples perception and action prediction from action execution. This allows for higher control rates and more responsive control by enabling
chunked action generation and predictive processing, avoiding idle lags in robot operation. -
Competitive Performance with Reduced Costs: Despite its compact size (0.45 billion parameters, with a 2.25 billion parameter variant also tested),
SmolVLA achieves performance comparable to, or even surpassing, VLAs that are up to 10 times larger (e.g., the 3.3-billion-parameter π0). This is demonstrated across a range of simulated (LIBERO, Meta-World) and real-world (SO-100, SO-101) robotic benchmarks. -
Reproducibility and Open-Source Release: The authors release all code, pretrained models, and training data, providing reproducible and efficient training and inference recipes to foster community engagement and research.
The key findings are that significant reductions in model size and computational requirements for
VLA models are achievable without sacrificing competitive performance. The use of community-contributed data and an asynchronous inference strategy are crucial enablers for affordable and efficient robotics.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several machine learning and robotics concepts is essential for a beginner:
-
Vision-Language Models (VLMs):
- Conceptual Definition: VLMs are a type of artificial intelligence model designed to process and understand information from both visual (e.g., images, videos) and textual (e.g., natural language descriptions) modalities simultaneously. They learn to associate visual content with linguistic meaning.
- How they work: Typically, a VLM consists of a
vision encoder that processes images and a language model (often a decoder-only transformer) that processes text. A projection layer or adapter often connects these two components, allowing them to communicate. VLMs are often pretrained on massive datasets of image-text pairs (e.g., "a cat sitting on a mat" with a corresponding image of a cat). - Example: If you show a VLM an image of a dog and ask "What is in this picture?", it can respond "A dog." Or, if you provide the text "a red car" and ask it to generate an image, it can do so.
-
Large Language Models (LLMs):
- Conceptual Definition: LLMs are deep learning models, typically based on the
Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide array of language tasks, from answering questions to writing essays. - Key Feature: Their "largeness" refers to billions of
parameters (trainable weights) and the immense scale of their training data, which allows them to capture complex patterns and knowledge about language. - Example: GPT-3, Llama, Falcon.
-
Vision-Language-Action (VLA) Models:
- Conceptual Definition: VLAs extend
VLMs by adding an "action" component, making them capable of controlling robots. These models take multimodal inputs (visual observations from cameras, natural language instructions) and output physical actions that a robot can execute. - Goal: To enable robots to understand high-level human commands (e.g., "pick up the red block") and translate them into a sequence of low-level motor commands (e.g., joint angles, gripper movements).
- Architecture: Often built by adapting or finetuning a pretrained
VLM with robotics-specific data, adding an action head or action expert module.
-
Transformers:
- Conceptual Definition: A neural network architecture introduced in 2017, which revolutionized sequence-to-sequence tasks, particularly in natural language processing. It's known for its efficiency in parallel processing compared to recurrent neural networks.
- Key Mechanism: Attention: The core of a Transformer is the
self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. - Attention Formula (Scaled Dot-Product Attention):
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
- $Q$ (Query): Represents the query vector for the current element, used to compare against other elements.
- $K$ (Key): Represents the key vectors for all elements in the sequence, used to match with the query.
- $V$ (Value): Represents the value vectors for all elements, carrying the actual information to be weighted and aggregated.
- $d_k$: The dimension of the key vectors, used to scale the dot products to prevent vanishing gradients.
- $\mathrm{softmax}$: A function that converts a vector of numbers into a probability distribution, ensuring weights sum to 1.
- $QK^T$: Calculates the dot product between queries and keys, measuring their similarity.
- The output is a weighted sum of the value vectors, where the weights are determined by the attention scores.
- Cross-Attention: Similar to self-attention, but
queries come from one sequence (e.g., action tokens) and keys and values come from another sequence (e.g., visual features). This allows one modality to "attend to" or query information from another. - Causal Attention Mask: A mechanism used in autoregressive models (like language decoders) to prevent tokens from attending to future tokens in the sequence. This ensures that the prediction of the current token only depends on past tokens (see the short sketch below).
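To make these attention variants concrete, the following minimal PyTorch sketch (illustrative only, not the paper's implementation) computes scaled dot-product attention and shows how self-attention with a causal mask and cross-attention differ only in where the queries, keys, and values come from.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q: (..., T_q, d_k), k: (..., T_k, d_k), v: (..., T_k, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # similarity between queries and keys
    if causal:
        # mask out future positions so token i only attends to tokens <= i
        t_q, t_k = scores.shape[-2:]
        mask = torch.triu(torch.ones(t_q, t_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                   # attention distribution over keys
    return weights @ v                                        # weighted sum of the values

# Self-attention: queries, keys, and values all come from the action tokens.
actions = torch.randn(1, 8, 64)
self_attn_out = scaled_dot_product_attention(actions, actions, actions, causal=True)

# Cross-attention: queries come from action tokens, keys/values from VLM features.
vlm_features = torch.randn(1, 100, 64)
cross_attn_out = scaled_dot_product_attention(actions, vlm_features, vlm_features)
```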
-
Flow Matching:
- Conceptual Definition: A generative modeling technique used to learn continuous-time dynamics that transform a simple distribution (e.g., Gaussian noise) into a complex target distribution (e.g., a distribution of actions). It trains a neural network to predict a vector field that moves samples from the noise distribution to the data distribution along a continuous path.
- Advantage: It can be more stable and efficient than traditional diffusion models for certain tasks, particularly in action generation, as it directly learns the vector field mapping.
-
Action Chunking:
- Conceptual Definition: Instead of predicting a single action at each timestep, action chunking involves predicting a sequence or "chunk" of future actions all at once.
- Benefits: This can improve efficiency by reducing the frequency of computationally expensive model inferences and can lead to smoother, more coordinated robot movements by providing a short-term plan.
- Trade-off: Longer chunks can lead to less reactive control if environmental conditions change rapidly during the execution of the chunk.
3.2. Previous Works
The paper contextualizes SmolVLA within the landscape of VLMs and VLAs, highlighting the trajectory of research in this area.
Vision-Language Models (VLMs)
- Early VLMs: Often built by integrating pretrained
vision encoders (like CLIP (Radford et al., 2021) or SigLIP (Zhai et al., 2023)) with pretrained LLMs (like Llama (AI@Meta, 2024; Touvron et al., 2023)). - Training Paradigms: Typically involve multi-stage training:
- Large-scale pretraining on image-caption datasets (
LAION-COCO (Schuhmann et al., 2022)). - Instruction-tuning on conversational datasets (Llava (Liu et al., 2023a), MiniGPT-4 (Zhu et al., 2023)).
- Large-scale pretraining on image-caption datasets (
- Efficient VLMs: Recent efforts focus on reducing computational costs by training on smaller, more diverse datasets (
Moondream (Korrapati, 2024), Qwen-VL (Bai et al., 2025)) or adapting unimodal models with minimal tuning (Fuyu-8B (Bavishi et al., 2023)). SmolVLM-2 (Marafioti et al., 2025), which SmolVLA uses as its backbone, is an example of an efficient VLM optimized for multimodal video inputs. - Limitations of existing VLMs for Robotics: While powerful for perception, most VLMs are not designed for direct action generation and often lack the real-time responsiveness and specific inductive biases needed for robotic control.
Vision-Language-Action Models (VLAs)
- Emergence of VLAs: A growing area aiming to imbue robots with generalist skills by leveraging the reasoning and world knowledge embedded in pretrained
LLMs and VLMs. - Octo (Team et al., 2024) and RT-X (O'Neil et al., 2024): These are prominent early
VLA systems. They typically finetune pretrained VLMs on robotics-specific datasets. However, they are known for being resource-intensive and often depend on costly robotic platforms, limiting their accessibility. RT-X notably released the Open X-Embodiment dataset, a large collection of robotics data. - OpenVLA (Kim et al., 2024): Released a 7-billion parameter
VLA trained on publicly available data, generating discrete action tokens. This discrete nature can be a limitation for continuous control tasks. - π0 (Black et al., 2024) and DexVLA (Wen et al., 2025): These approaches address the
continuous control limitation by proposing diffusion-based decoders for action generation. They adapt pretrained VLMs (like RDT-1B) and introduce large diffusion components (termed action experts) trained directly on robot demonstrations. π0 is a key baseline for SmolVLA. - ACT (Zhao et al., 2023): A
Conditional Variational Autoencoder (CVAE) policy model with a transformer architecture, using a ResNet vision encoder. It generates action chunks and is optimized with a regression objective. It's another important baseline. - TinyVLA (Zhou et al., 2024): Another small-scale
VLA model aiming for efficiency. - Limitations of Prior VLAs:
- Massive Model Sizes: Many
VLAs are very large, leading to high training and inference costs. - High Data Requirements: They often rely on enormous datasets, frequently collected by industrial labs, which can be hard to reproduce or access.
- Computational Expense: Inference often requires powerful GPUs, making real-world deployment on affordable robots challenging.
- Limited Responsiveness: Synchronous inference strategies can introduce blind lags, reducing robot reactivity.
- Massive Model Sizes: Many
3.3. Technological Evolution
The field has seen a rapid evolution from general-purpose Large Language Models (LLMs) to Vision-Language Models (VLMs), and now to Vision-Language-Action (VLA) models.
- Foundation of LLMs: The success of
LLMs (e.g., GPT-4, Llama) demonstrated the power of large models trained on vast internet-scale datasets to acquire general reasoning and language generation capabilities. - Multimodal Shift to VLMs: Researchers recognized the potential of integrating visual perception with
LLMs. This led to VLMs, where vision encoders are combined with LLMs to process both images and text. Initial VLMs often involved complex multi-stage training and large model sizes. - Action Integration for VLAs: The next logical step was to connect
VLMs to robotic control, giving rise to VLAs. The goal was to leverage the rich world knowledge and reasoning of VLMs to enable natural language instructions for robots. Early VLAs demonstrated impressive generalization but inherited the challenges of VLMs regarding size and computational cost. - Democratization and Efficiency (SmolVLA's Niche):
SmolVLA fits into the current trajectory by addressing the critical need for affordability and efficiency. While prior VLAs focused on maximizing performance, SmolVLA prioritizes compactness, reduced computational costs, and deployability on consumer-grade hardware, all while maintaining competitive performance. This represents a shift towards making these powerful capabilities accessible to a broader robotics research community. The use of community-driven datasets further reinforces this democratizing trend.
3.4. Differentiation Analysis
SmolVLA distinguishes itself from the main methods in related work through its core emphasis on efficiency and accessibility, without significantly compromising performance:
-
Model Size and Computational Footprint:
- Differentiation:
SmolVLA is explicitly designed to be small (0.45 billion parameters) and efficient, capable of training on a single GPU and deploying on consumer-grade GPUs or CPUs. This stands in stark contrast to Octo and RT-X, which are typically massive, requiring significant computational resources. π0, a key baseline, has 3.3-3.5 billion parameters, making SmolVLA roughly 7-8 times smaller. OpenVLA has 7 billion parameters. - Innovation:
SmolVLA achieves this through architectural innovations like skipping layers in the VLM backbone, using a minimal number of visual tokens, and employing a smaller VLM (SmolVLM-2).
- Differentiation:
-
Data Strategy:
- Differentiation: Unlike many industrial or academic
VLA efforts that rely on vast, often proprietary or hard-to-access datasets (e.g., RT-X's Open X-Embodiment dataset), SmolVLA is pretrained exclusively on publicly available, community-contributed datasets. It uses an order of magnitude less data (fewer than 30,000 episodes) compared to models like OpenVLA (around 1 million trajectories). - Innovation: The paper also introduces methods for
standardizing and improving the quality of community data, such as VLM-based task annotation and camera viewpoint normalization, addressing the inherent noise and heterogeneity of crowd-sourced data.
- Differentiation: Unlike many industrial or academic
-
Inference Mechanism:
- Differentiation: While many
VLA systems might operate in a synchronous, open-loop fashion between observations, SmolVLA introduces an asynchronous inference stack. - Innovation: This decouples
perception and action prediction from action execution, allowing the robot to execute actions while a new chunk is being computed, thus reducing blind lags and increasing responsiveness and control rates. This is particularly critical for real-world deployment scenarios where latency is a concern.
- Differentiation: While many
-
Action Generation Method:
-
Differentiation:
SmolVLA utilizes a Flow Matching Transformer as its action expert for continuous action generation, similar to π0 and DexVLA. This is an improvement over OpenVLA, which generates discrete action tokens, a limitation for fine-grained continuous control. -
Innovation:
SmolVLA further optimizes this by interleaving cross-attention and causal self-attention layers within its action expert, which they find provides higher success rates and faster inference times.
In essence, SmolVLA offers a paradigm shift towards making powerful VLA capabilities more accessible and practical for a wider range of users and robotic platforms by focusing on rigorous efficiency without sacrificing performance.
4. Methodology
4.1. Principles
The core idea behind SmolVLA is to build a Vision-Language-Action (VLA) model that is inherently small, efficient, and capable, optimized for affordable and efficient robotics. This is achieved by combining a compact, pretrained Vision-Language Model (VLM) for perception with an optimized action expert for continuous action generation. A key principle is to leverage community-contributed datasets for pretraining, making the model more accessible and reproducible. Furthermore, to enhance real-world responsiveness, SmolVLA employs an asynchronous inference stack that decouples computation from execution, allowing for higher control rates. The theoretical basis lies in adapting powerful transformer architectures and generative modeling techniques (Flow Matching) to efficiently predict robot actions conditioned on multimodal observations and language instructions.
4.2. Core Methodology In-depth (Layer by Layer)
SmolVLA is composed of two main interacting components: (i) a pretrained VLM responsible for perception, and (ii) an action expert that conditions and generates the actions. It also incorporates specific strategies for data handling and inference optimization.
The following figure (Figure 1 from the original paper) illustrates the overall architecture of SmolVLA:
Figure 1 (from the original paper): Schematic of the SmolVLA architecture. The model consists of a pretrained vision-language model and an action expert, using cross-attention and self-attention to process inputs drawn from community datasets for low-cost robotic tasks, and outputs low-level actions.
4.2.1. Vision-Language Model (VLM)
The VLM serves as the primary backbone for perceiving the robot's environment. It processes sensorimotor states, including images from multiple RGB cameras, and a language instruction.
- Choice of VLM Backbone: The authors choose
SmolVLM-2 (Marafioti et al., 2025) due to its efficiency and optimization for multimodal video inputs. SmolVLM-2 uses SigLIP (Zhai et al., 2023) as its vision encoder to extract visual features, which are then fed into a SmolLM2 language decoder. - Image Sequence Processing: The
VLM component processes sequences of images using its vision encoder. - Visual Token Reduction: To ensure efficiency,
SmolVLA significantly reduces the number of visual tokens processed per frame. While SmolVLM-2 typically processes high-resolution images, SmolVLA limits visual tokens to 64 per frame. This reduction lowers the input sequence length and overall computational load. - Input Concatenation: Visual features, language tokens (from the instruction), and state tokens are concatenated and passed to the language decoder. The resulting features from the decoder layers are then used to condition the
action expert.
4.2.2. State, Action, and State Projectors
Linear projection layers are used at several points to ensure dimensional compatibility between different components:
- To project input
states (e.g., robot joint positions, gripper state) to match the language model (LM)'s hidden dimension. - To project
actions to match the action expert's input dimension. - To adapt
VLM features to the action expert's dimension.
4.2.3. Faster Inference Through Layer Skipping
A key efficiency optimization involves skipping computations within the VLM.
- Principle: Based on prior work (Shukor and Cord, 2024; Tang et al., 2023) showing that not all layers of a pretrained model are equally important for downstream tasks.
- Mechanism:
SmolVLA extracts features from the VLM up to a specified layer N, effectively discarding the top L−N layers of the language-model decoder (where L is the total number of decoder layers). - Configuration: The authors found that setting N to half the total layers (N = L/2) provides a good trade-off between speed and performance, essentially halving the computational cost of the
LLM and action expert.
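As an illustration of this layer-skipping idea, the following sketch (under the assumption of a Hugging Face-style decoder that can expose per-layer hidden states; the checkpoint name is a placeholder, not the released SmolVLA code) takes the hidden states at layer N and ignores the deeper layers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical backbone identifier used purely for illustration.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

inputs = tokenizer("pick up the red cube", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

L = model.config.num_hidden_layers        # total decoder layers
N = L // 2                                # keep only the first half
# hidden_states[0] is the embedding output, so index N gives the output of layer N.
features_for_action_expert = outputs.hidden_states[N]
print(features_for_action_expert.shape)   # (batch, seq_len, hidden_dim)

# Note: this forward pass still executes every layer; an efficient implementation
# would truncate the layer stack so the top L - N layers are never run at all.
```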
4.2.4. Flow Matching Action Expert
The action expert, denoted as $\mathbf{v}_{\theta}$, is responsible for predicting a chunk of low-level actions.
- Input:
VLM features ($\mathbf{o}_t$) extracted from an observation at the $N$-th VLM layer, and noisy actions $\mathbf{A}_t^{\tau}$. - Output: An action chunk $\mathbf{A}_t$, representing a sequence of low-level commands. - Architecture: The
- Architecture: The
action expertis built upon theTransformerarchitecture (Vaswani, 2017). - Training Objective (Flow Matching): The expert is trained using the objective defined by:
$
\mathcal{L}^{\tau}(\theta) = \mathbb{E}_{p(\mathbf{A}_t \mid \mathbf{o}_t),\, q(\mathbf{A}_t^{\tau} \mid \mathbf{A}_t)} \left[ \left\| \mathbf{v}_{\theta}\big(\mathbf{A}_t^{\tau}, \mathbf{o}_t\big) - \mathbf{u}\big(\mathbf{A}_t^{\tau} \mid \mathbf{A}_t\big) \right\|^2 \right]
$
Where:
- $\theta$: The parameters of the action expert model.
- $\mathbf{o}_t$: Represents the VLM features extracted from an observation at time $t$ from the $N$-th VLM layer.
- $\mathbf{A}_t$: The ground-truth action chunk that the robot should execute.
- $\mathbf{A}_t^{\tau}$: A noisy version of the action chunk, created by interpolating between the ground-truth action chunk and Gaussian noise.
- $\tau$: A scalar parameter sampled from a Beta distribution, controlling the interpolation factor.
- $\epsilon$: Standard Gaussian noise, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix.
- $\mathbf{v}_{\theta}(\mathbf{A}_t^{\tau}, \mathbf{o}_t)$: The prediction of the action expert, which takes the noisy action chunk and VLM features as input.
- $\mathbf{u}(\mathbf{A}_t^{\tau} \mid \mathbf{A}_t)$: The target vector field that the expert is trained to output. This vector field effectively points from the noisy action $\mathbf{A}_t^{\tau}$ towards the ground-truth action $\mathbf{A}_t$.
- Efficiency in Expert: To further improve inference efficiency, the
action expert uses a reduced hidden size of $0.75 \times d$, where $d$ is the VLM's hidden dimension.
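The following PyTorch sketch illustrates a single flow-matching training step for an action expert. It is a minimal illustration consistent with the loss above, not the paper's implementation: the expert is a toy MLP instead of a Transformer, and the interpolation and target conventions (a common choice in flow-matching action models) are assumptions.

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Toy stand-in for the flow-matching action expert (an MLP instead of a Transformer)."""
    def __init__(self, chunk_size=50, action_dim=6, cond_dim=256, hidden=512):
        super().__init__()
        in_dim = chunk_size * action_dim + cond_dim + 1   # noisy chunk + VLM features + tau
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, chunk_size * action_dim),
        )

    def forward(self, noisy_actions, cond, tau):
        x = torch.cat([noisy_actions.flatten(1), cond, tau], dim=-1)
        return self.net(x).view_as(noisy_actions)

def flow_matching_step(expert, actions, vlm_features, optimizer):
    # actions: (B, chunk, action_dim) ground-truth chunk; vlm_features: (B, cond_dim)
    B = actions.shape[0]
    tau = torch.distributions.Beta(1.5, 1.0).sample((B, 1))     # interpolation factor (Beta params illustrative)
    noise = torch.randn_like(actions)                           # epsilon ~ N(0, I)
    tau_b = tau.view(B, 1, 1)
    noisy = tau_b * actions + (1.0 - tau_b) * noise             # noisy chunk A^tau (one common convention)
    target = noise - actions                                    # target vector field u (same convention)
    pred = expert(noisy, vlm_features, tau)
    loss = ((pred - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

expert = ActionExpert()
opt = torch.optim.AdamW(expert.parameters(), lr=1e-4)
loss = flow_matching_step(expert, torch.randn(8, 50, 6), torch.randn(8, 256), opt)
```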
4.2.5. Interleaved Cross and Causal Self-Attention Layers
Within the action expert's Transformer architecture, a specific attention mechanism is employed to efficiently integrate VLM features and model action token dependencies:
- Mechanism: Unlike prior works relying exclusively on
self-attention (SA) or cross-attention (CA), SmolVLA interleaves SA and CA layers. This design choice is adapted from standard VLM architectures where each decoder block typically includes both SA and CA layers. - Cross-Attention (CA): Allows the action tokens (as queries) to
attend to the VLM features (acting as keys and values). This is how the action expert conditions its predictions on the visual and linguistic context provided by the VLM. - Causal Self-Attention (SA): Allows the action tokens within the predicted chunk to attend to each other, but only to preceding tokens in the sequence (using a
causal attention mask). This ensures that the prediction of each action token depends only on the preceding tokens in the chunk, preventing future action leakage. - Benefit: The authors found that interleaving
CA and SA layers provides higher success rates and faster inference times for robotic tasks.
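A minimal sketch of such an interleaved decoder is shown below (illustrative only; the layer count, dimensions, ordering, and the omission of feed-forward blocks are assumptions rather than the released architecture). It simply alternates cross-attention to the VLM features with causal self-attention over the action chunk:

```python
import torch
import torch.nn as nn

class InterleavedActionDecoder(nn.Module):
    """Alternates cross-attention (to VLM features) and causal self-attention over action tokens."""
    def __init__(self, dim=384, num_heads=6, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, action_tokens, vlm_features):
        # action_tokens: (B, n, dim) noisy action chunk; vlm_features: (B, T, dim) prefix features
        n = action_tokens.size(1)
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = action_tokens
        for i, (attn, norm) in enumerate(zip(self.layers, self.norms)):
            if i % 2 == 0:
                # even layers: cross-attention, queries = action tokens, keys/values = VLM features
                out, _ = attn(x, vlm_features, vlm_features)
            else:
                # odd layers: causal self-attention over the action chunk
                out, _ = attn(x, x, x, attn_mask=causal_mask)
            x = norm(x + out)   # residual + norm (feed-forward blocks omitted for brevity)
        return x

decoder = InterleavedActionDecoder()
out = decoder(torch.randn(2, 50, 384), torch.randn(2, 120, 384))
```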
4.2.6. Pretraining Data Collected by the Community
SmolVLA is pretrained on community-contributed datasets to improve accessibility and reproducibility.
- Challenges of Community Data:
- Heterogeneity: High variability in robot morphologies, sensors, actuation modes, and control schemes.
- Noise: Substantial noise in task annotations and inconsistencies in camera naming conventions.
- Data Collection Methods: Reliance on teleoperation by human experts.
- Dataset Curation: A subset of 481 community datasets from Hugging Face was selected, totaling approximately 22,900 episodes and 10.6 million frames. This is an order of magnitude smaller than typical datasets for larger
VLAs. - Task Annotation with VLM: To address noisy and vague task descriptions, an off-the-shelf
VLM (Qwen2.5-VL-3B-Instruct) was used to auto-generate concise task descriptions. - Process: Representative frames from each dataset were sampled and provided to the
VLM along with the original (often noisy) instruction. - Prompt Example: The model was prompted to produce short, action-oriented sentences. The full prompt is:
Here is a current task description: {current_task}. Generate a very short, clear, and complete one-sentence describing the action performed by the robot arm (max 30 characters). Do not include unnecessary words. Be concise. Here is some examples: Pick up the cube and place it in the box, open the drawer and so on. Start directly with an action verb like "Pick", "Place", "Open", etc. Similar to the provided examples, what is the main action done by the robot arm?
- Camera Viewpoint Normalization: To standardize inconsistent camera naming (e.g.,
images.laptop could be a top, side, or wrist view), cameras were manually mapped to standard viewpoints: top, wrist, and side perspectives, and the corresponding image keys were renamed accordingly.
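A small sketch of what such a normalization step could look like is given below. The mapping and key names are hypothetical placeholders; the paper's actual mapping was curated manually per dataset.

```python
# Hypothetical example of normalizing heterogeneous camera keys to standard viewpoints.
VIEWPOINT_MAP = {
    "observation.images.laptop": "top",
    "observation.images.phone": "side",
    "observation.images.gripper": "wrist",
}

def normalize_camera_keys(frame: dict) -> dict:
    """Rename a frame's camera keys to the standardized top/wrist/side viewpoints."""
    normalized = {}
    for key, value in frame.items():
        if key in VIEWPOINT_MAP:
            normalized[f"observation.images.{VIEWPOINT_MAP[key]}"] = value
        else:
            normalized[key] = value  # non-camera keys pass through unchanged
    return normalized
```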
4.2.7. Asynchronous Inference
To overcome the limitations of synchronous inference (which introduces blind lags and reduces responsiveness), SmolVLA implements an asynchronous (async) inference stack.
The following figure (Figure 2 from the original paper) depicts the asynchronous inference architecture:
Figure 2 (from the original paper): Schematic of the interaction between the PolicyServer and the RobotClient in SmolVLA's asynchronous inference stack. It shows the initial state, the policy execution flow, and how the RobotClient receives action chunks from the PolicyServer and executes actions after each observation, illustrating the effects of environment and inference latency and how the robot updates its action queue in time for effective decision-making.
- Problem with Synchronous Inference:
- In a typical synchronous setup, a policy predicts an action chunk $\mathbf{A}_t$ from an observation $\mathbf{o}_t$. The robot then executes all actions in the chunk before a new observation is passed to the policy. This results in
open-loop inference between observations and introduces blind lags while waiting for the next chunk to be computed.
- Asynchronous Solution: Decouples
action chunk prediction from action execution. -
Architecture: A
RobotClient sends an observation to a PolicyServer (potentially remote, with GPUs). The RobotClient receives an action chunk once inference is complete. -
Goal: To avoid execution lags by triggering new chunk evaluations while the robot is still executing actions from the previous chunk. This allows for continuous robot operation, especially beneficial in latency-constrained scenarios.
The following is Algorithm 1,
Asynchronous inference control-loop, from the original paper:
-
Explanation of Algorithm 1:
-
Inputs:
horizon T (total timesteps for execution), chunk size n (number of actions in one predicted chunk), threshold g (a fraction in (0, 1], determining when to trigger a new prediction).
Initialization (Lines 1-2):
- The robot captures its initial observation, $\mathbf{o}_0$.
- $\mathbf{o}_0$ is sent to the PolicyServer.
- The PolicyServer computes the first action chunk, $\mathbf{A}_0$, using the policy $\pi$, and sends it back to the RobotClient.
-
Main Control Loop (Lines 3-17):
-
Action Execution (Lines 4-5): At each timestep $t$, the robot takes the first action from the current action queue (using PopFront, which extracts and removes the first element), and then executes it. -
Triggering New Prediction (Lines 6-13):
- Condition Check (Line 6): The core of the asynchronous mechanism. If the number of remaining actions in the current queue falls below a certain fraction of the original
chunk size n (i.e., fewer than g · n actions remain), it indicates that a new action chunk might be needed soon. - Capture Observation (Line 7): A new observation, $\mathbf{o}_t$, is captured from the environment.
- Similarity Filter (Line 8): A check (e.g., comparing observations in joint space) to avoid sending redundant or nearly identical observations to the server, saving computational resources. If the observation is sufficiently different, or if the queue is about to become empty regardless of similarity, processing is triggered.
- Asynchronous Inference Call (Line 9): A non-blocking call is made to the PolicyServer to start computing a new action chunk based on $\mathbf{o}_t$. This operation runs in the background. A handle async_handle is returned to track its completion. - New Queue Prediction (Line 10): The new chunk is predicted by the policy $\pi$.
- Aggregate Overlaps (Line 11): If there are overlapping timesteps between the currently executing chunk and the newly predicted chunk, an aggregation function merges them. This helps smooth transitions and combine information from both chunks.
-
Handling Uncompleted Inference (Lines 14-16): If the asynchronous prediction is not yet completed (NotCompleted(async_handle)) when the RobotClient needs a new chunk, the robot continues to use the existing queue. This prevents the robot from becoming idle.
The following figure (Figure 3 from the original paper) illustrates the action queue size evolution at runtime for various levels of g:
Figure 3 (from the original paper): Evolution of the action queue size over inference timesteps for different values of g. Panel (A) shows the case without observation filtering, while panel (B) shows the effect of applying observation filtering.
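Since the pseudocode of Algorithm 1 is not reproduced here, the following Python sketch captures the control loop described above. It is a simplified, single-process illustration that uses a background thread as the "PolicyServer" call; the names `policy`, `get_observation`, `execute`, and `is_similar` are placeholders, not the LeRobot API, and overlap aggregation is simplified to replacing the queue.

```python
import threading
from collections import deque

def async_control_loop(policy, get_observation, execute, is_similar,
                       horizon_T=200, chunk_size_n=50, threshold_g=0.5):
    """Simplified asynchronous inference loop: actions are executed from a queue while
    new chunks are computed in a background thread acting as the 'PolicyServer'."""
    obs = get_observation()
    queue = deque(policy(obs))                      # first chunk, computed synchronously
    pending = {"thread": None, "result": None}
    last_obs = obs

    def infer(o):                                   # non-blocking inference call
        pending["result"] = policy(o)

    for t in range(horizon_T):
        if queue:
            execute(queue.popleft())                # PopFront: take and run the next action

        # if a background prediction has finished, adopt it
        # (the paper aggregates overlapping timesteps; here we simply replace the queue)
        if pending["thread"] is not None and not pending["thread"].is_alive():
            queue = deque(pending["result"])
            pending["thread"], pending["result"] = None, None

        # trigger a new prediction once fewer than g * n actions remain in the queue
        if len(queue) <= threshold_g * chunk_size_n and pending["thread"] is None:
            obs = get_observation()
            if not queue or not is_similar(obs, last_obs):
                last_obs = obs
                pending["thread"] = threading.Thread(target=infer, args=(obs,))
                pending["thread"].start()

# Toy usage with dummy components (replace with a real policy and robot interface).
async_control_loop(policy=lambda o: [o] * 50,
                   get_observation=lambda: 0.0,
                   execute=lambda a: None,
                   is_similar=lambda a, b: abs(a - b) < 1e-3)
```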
-
-
Analytical Study of Asynchronous Inference:
- Let the chunk latency be the random variable modeling the time needed to receive an action chunk after sending an observation.
- This latency comprises: (i) the time to send the observation from the RobotClient to the PolicyServer, (ii) the inference latency on the PolicyServer, and (iii) the time to send the chunk from the PolicyServer back to the RobotClient.
- Assuming independence and negligible communication time, the latency is dominated by the server-side inference time.
- Let the environment's control cycle be the time between consecutive actions (e.g., 33 ms at 30 frames per second).
- To avoid exhausting the queue, the chunk latency must be smaller than the time needed to execute the portion of the chunk that remains when a new prediction is triggered (which depends on the chunk size and the threshold g).
-
Role of Threshold g: the value of g dictates when a new observation is sent for processing.
- Sequential Limit (g → 0): The client waits for the entire chunk to be drained before requesting a new one. This reproduces fully sequential deployment, leading to idle periods, as shown in Figure 3(A).
- Asynchronous Inference (0 < g < 1): The client triggers new inference when a fraction g of the chunk remains. This amortizes computation, keeping the queue from emptying. The new chunk is aggregated with the overlapping portion of the old one.
- Compute-Intensive Limit (g = 1): New inference is triggered at every timestep. This is maximally reactive but incurs one forward pass per control tick, making it computationally very expensive. If the inference latency exceeds the control cycle, the queue will eventually deplete because prediction cannot keep up with execution.
-
Observation Similarity Filter: Observations are compared in
joint space. If sufficiently similar (their distance falls below a threshold), processing is skipped to save resources. However, if the queue becomes empty, the most recent observation is processed regardless of similarity (Figure 3(B)). This balances efficiency with ensuring the robot always has actions to execute.
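As a rough, illustrative calculation (the numbers below are assumptions, not measurements from the paper): with a 30 Hz control loop (control cycle Δt ≈ 33 ms), a chunk of n = 50 actions, and a threshold of g = 0.5, a new prediction is triggered while roughly 25 actions remain in the queue, so inference must finish within about

$
g \cdot n \cdot \Delta t = 0.5 \times 50 \times 33\ \mathrm{ms} \approx 825\ \mathrm{ms}
$

for the queue never to run empty.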
5. Experimental Setup
5.1. Datasets
SmolVLA is evaluated on a combination of simulated and real-world robotic manipulation tasks.
-
Pretraining Data:
- Source: A subset of 481
community datasets obtained from Hugging Face. - Scale: Approximately 22,900 episodes and 10.6 million frames. This is an order of magnitude smaller than datasets used by prior large
VLA models (e.g., OpenVLA uses ~1 million trajectories). - Characteristics: These datasets are "noisy" in terms of task annotations and camera naming conventions, reflecting real-world heterogeneity. They were filtered based on modality type, episode count, overall data quality, and frame count.
- Selection Rationale: Chosen to demonstrate
SmolVLA's ability to learn from affordable, open-source, and diverse community contributions, rather than relying on curated, resource-intensive datasets.
- Source: A subset of 481
-
Simulation Environments (for Evaluation):
- LIBERO (Li et al., 2023a):
- Description: A benchmark assessing diverse visuomotor skills across four categories:
Spatial,Object,Goal, andLong, with 10 tasks per category (40 tasks total). - Dataset Used: A dataset (Kim et al., 2024; Pertsch et al., 2025) containing 1,693 episodes covering 40 tasks.
- Evaluation: 10 trials per task, reporting average success rates based on binary completion criteria.
- Robot:
Franka Emika Panda robot.
- Description: A benchmark assessing diverse visuomotor skills across four categories:
- Meta-World (Yu et al., 2020):
- Description: Evaluates generalization across 50 tasks of varying difficulty:
easy, medium, hard, and very hard (Seo et al., 2023). - Dataset Used: A new dataset collected by the authors, comprising 50 demonstrations for each of the 50 tasks, totaling 2,500 episodes.
- Evaluation: Assesses success rates based on whether the task is completed.
- Robot:
Sawyer robot (simulated).
- Description: Evaluates generalization across 50 tasks of varying difficulty:
- LIBERO (Li et al., 2023a):
-
Real-World Tasks (for Evaluation):
-
SO-100 Robot Arm: Three datasets collected using the
SO-100 robot arm (Knight et al., 2022). Each contains more than 50 demonstrations. - Pick-Place: Robot picks up a cube and places it in a box.
- Stacking: Robot picks up a red cube and places it on top of a blue cube.
- Sorting: Robot sorts cubes by color (red in right box, blue in left box). This is a longer-horizon task.
-
SO-101 Robot Arm: One dataset collected using the
SO-101 arm (Knight et al., 2022) for the Pick-Place-Lego task. More than 50 demonstrations. - Pick-Place-Lego: Robot picks up a small Lego brick and places it into a transparent box. This task requires high precision and advanced vision due to transparency.
SmolVLA is not pretrained on any datasets recorded with the SO-101.
-
Data Description: The datasets record trajectories relative to robot joint positions, gripper state, and camera images.
-
Source: These datasets are open-sourced on Hugging Face.
The following figure (Figure 4 from the original paper) shows the visual setup for real-world tasks with SO100 and SO101 robots:
Figure 4 (from the original paper): Initial and final frames of the real-world tasks. The three panels show the pick-place, stacking, and sorting tasks, illustrating how the model handles these manipulations.
-
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
5.2.1. Success Rate (%) (SR)
- Conceptual Definition: Success rate is a common metric in robotics and reinforcement learning that quantifies the percentage of trials or episodes in which an agent successfully completes a predefined task. It directly measures the agent's ability to achieve its objective.
- Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
- Symbol Explanation:
- Number of Successful Trials: The count of individual attempts or episodes where the robot achieved the task's completion criteria.
- Total Number of Trials: The total count of all attempts or episodes conducted for the given task.
Specific Scoring for Real-world Tasks:
-
- Pick-Place: Assessed with a fine-grained score: 0 for failure, 0.5 for successfully grasping the cube, and 1 for successfully placing it into the box.
- Stacking: Assessed with a fine-grained score: 0 for failure, 0.5 for successfully grasping the red cube, and 1 for successfully placing it on top of the blue cube.
- Sorting: The paper doesn't detail a fine-grained score for sorting but uses a binary completion criterion, similar to simulation benchmarks.
- Pick-Place-Lego: Binary completion (0 or 1) indicating whether the Lego brick was successfully placed in the box.
5.2.2. Task Completion Time (s)
- Conceptual Definition: This metric measures the duration, in seconds, taken by the robot to complete a specific task from its start to its successful conclusion. It's a direct indicator of the efficiency and speed of the robot's policy.
- Mathematical Formula: The paper reports
Total(sum of times for multiple trials),Avg(average time per trial), andStd(standard deviation of time per trial). No specific single formula for this aggregated metric is provided, but it's the elapsed real-world time. - Symbol Explanation:
Total: Sum of completion times across all successful trials.Avg: Average completion time per successful trial.Std: Standard deviation of completion times across successful trials.
5.2.3. Performance in Fixed Time (# of Cubes)
- Conceptual Definition: This metric assesses the robot's throughput or productivity by counting how many sub-tasks (e.g., picking and placing a cube) it can successfully complete within a predetermined fixed time window. It highlights the policy's sustained efficiency and speed under time constraints.
- Mathematical Formula: The paper reports
Total(sum of cubes across trials),Avg(average cubes per trial), andStd(standard deviation of cubes per trial). This is a count-based metric within a fixed duration. - Symbol Explanation:
Total: Sum of cubes successfully manipulated across all trials within the fixed time.Avg: Average number of cubes manipulated per trial within the fixed time.Std: Standard deviation of cubes manipulated per trial within the fixed time.
5.3. Baselines
The paper compares SmolVLA against several popular and strong baseline models available in the LeRobot library (Cadene et al., 2024):
-
π0 (Black et al., 2024):
- Description: A
VLA model that combines a VLM backbone with Flow Matching for action chunk prediction. It takes observations (RGB images from multiple cameras), proprioceptive states, and a language instruction as inputs. - Parameters: Has a total model size of 3.3 billion parameters (a 3.5B variant is also mentioned).
- Pretraining: Pre-trained on 10,000 hours of cross-embodiment robotics data. Variants are tested: one initialized from
Paligemma-3B (a VLM) without robotics pretraining, and another with robotics pretraining (weights released by the authors). - Why Representative: It's a state-of-the-art
VLA that also uses Flow Matching for continuous action generation, making it a strong and directly comparable baseline for SmolVLA's core methodology. Its large size provides a contrast to SmolVLA's efficiency goals.
- Description: A
-
ACT (Zhao et al., 2023):
- Description: A
Conditional Variational Autoencoder (CVAE) policy model featuring an encoder-decoder transformer architecture. It uses a ResNet vision encoder pretrained on ImageNet, while the CVAE itself is trained from scratch. It generates action chunks and is optimized using a regression objective to directly predict continuous actions. - Parameters: Approximately 80 million parameters.
- Why Representative: A well-established and efficient imitation learning baseline for continuous control in robotics, representing a different architectural choice compared to
VLM-based VLAs.
- Description: A
-
Diffusion Policy (Khazatsky et al., 2024; Chi et al., 2023):
- Description: A general approach to visuomotor policy learning via
action diffusion. It uses a diffusion model to generate actions. - Why Representative: Diffusion models are a popular and strong class of generative models recently adapted for robot control, providing another benchmark for
SmolVLA's Flow Matching-based action expert.
- Description: A general approach to visuomotor policy learning via
-
Octo (Team et al., 2024):
- Description: An
open-source generalist robot policy developed by a large collaboration. It's a VLA system pretrained on diverse robotics datasets. - Parameters: 0.09 billion parameters (a relatively smaller
VLA variant). - Why Representative: Represents efforts towards generalist robot policies and is a common benchmark in simulation.
- Description: An
-
OpenVLA (Kim et al., 2024):
- Description: A 7-billion parameter
VLA trained on publicly available data, designed to generate discrete action tokens. - Why Representative: A large-scale, open
VLA that SmolVLA aims to compete with in performance while being significantly smaller.
- Description: A 7-billion parameter
-
TinyVLA (Zhou et al., 2024):
- Description: As its name suggests, another
small-scale large multimodal model(SMM) that aims for efficiency. - Why Representative: A direct comparison for
SmolVLAin terms of small model size and performance on benchmarks like Meta-World.
- Description: As its name suggests, another
5.4. Implementation Details
- Framework: All experiments are conducted using the
LeRobot(Cadene et al., 2024) PyTorch-based framework, designed for real-world robotics. - Pretraining:
- Steps: Trained for 60,000 steps.
- Batch Size: Global batch size of 64.
- Learning Rate Schedule: Cosine learning rate schedule starting at
1e-4and decaying to2.5e-5after a 100-step warmup. - Optimizer:
AdamWwith and . - Image Resizing: Images are resized to 512x512 for consistency with the
VLMinput size. - VLM Backbone:
SmolVLM-2(Marafioti et al., 2025) is used, with theVLMlayers frozen during training of the action expert. - Action Expert: Trained with
Flow Matchingto output chunks of actions.
- Model Size: The main
SmolVLAmodel contains 450 million parameters, with approximately 100 million dedicated to theaction expert. Other variants with 0.24B and 2.25B parameters are also explored. - Efficiency Optimizations:
- Mixed Precision:
bfloat16precision is utilized. - JIT Compilation:
torch.compile()(Paszke, 2019) is used to JIT-compile PyTorch code into optimized kernels. - Sequence Length and Batch Size Management: To ensure compatibility with optimizations, sequence length and batch size are maintained, discarding excess frames.
- Multi-GPU Training:
Hugging Face Acceleratelibrary is used for distributed training. - Compute Cost: Pretraining was conducted using 4 GPUs and consumed approximately 30,000 GPU hours.
- Mixed Precision:
- Inference Modes:
- Real-world Evaluation:
Asynchronous inferenceis performed, where the model samples new observations and predicts action chunks at predetermined thresholds. - Simulation Evaluation:
Synchronous inferenceis performed, where a new action is predicted after each executed action, for a more reactive control loop.
- Real-world Evaluation:
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that SmolVLA achieves competitive, and often superior, performance compared to significantly larger VLA models, while operating at a fraction of their computational cost.
Simulation Evaluation
The following are the results from Table 2 of the original paper (success rates in %; for LIBERO the four task columns are Spatial, Object, Goal, and Long, while for Meta-World they correspond to the easy, medium, hard, and very hard difficulty levels):
| Benchmark | Policy (# Params) | VLA Pt. | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| LIBERO | Diffusion Policy (Khazatsky et al., 2024) | No | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| LIBERO | Octo (0.09B) (Team et al., 2024) | Yes | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| LIBERO | OpenVLA (7B) (Kim et al., 2024) | Yes | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| LIBERO | π0 (Paligemma-3B) | No | 87 | 63 | 89 | 48 | 71.8 |
| LIBERO | π0 (3.3B) | Yes | 90 | 86 | 95 | 73 | 86.0 |
| LIBERO | SmolVLA (0.24B) | No | 87 | 93 | 88 | 63 | 82.75 |
| LIBERO | SmolVLA (0.45B) | No | 90 | 96 | 92 | 71 | 87.3 |
| LIBERO | SmolVLA (2.25B) | No | 93 | 94 | 91 | 77 | 88.75 |
| Meta-World | Diffusion Policy (Chi et al., 2023) | No | 23.1 | 10.7 | 1.9 | 6.1 | 10.5 |
| Meta-World | TinyVLA (Zhou et al., 2024) | No | 77.6 | 21.5 | 11.4 | 15.8 | 31.6 |
| Meta-World | π0 (3.5B-Paligemma) | No | 80.4 | 40.9 | 36.7 | 44.0 | 50.5 |
| Meta-World | π0 (3.5B) | Yes | 71.8 | 48.2 | 41.7 | 30.0 | 47.9 |
| Meta-World | SmolVLA (0.24B) | No | 86.43 | 46.36 | 35 | 60 | 56.95 |
| Meta-World | SmolVLA (0.45B) | No | 82.5 | 41.8 | 45.0 | 60.0 | 57.3 |
| Meta-World | SmolVLA (2.25B) | No | 87.14 | 51.82 | 70 | 64 | 68.24 |
-
LIBERO Benchmark:
SmolVLA (0.45B) achieves an average success rate of 87.3%, which is higher than Diffusion Policy (72.4%), Octo (0.09B, 75.1%), OpenVLA (7B, 76.5%), and π0 (3.3B, 86.0%), a model approximately 7 times larger and pretrained on extensive robotics data.
SmolVLA (2.25B)variant further improves to 88.75%, demonstrating scalability with size. - Notably,
SmolVLAis not pretrained on robotics data, yet its performance is competitive. It is also reported to be40% faster to trainand consumes6x less memorythan .
-
Meta-World Benchmark:
- SmolVLA (0.45B) achieves an average success rate of 57.3%, significantly outperforming Diffusion Policy (10.5%), TinyVLA (31.6%), and both π0 variants (Paligemma-3.5B at 50.5% and the robotics-pretrained 3.5B at 47.9%).
- The SmolVLA (2.25B) variant achieves the highest average success rate at 68.24%.
- This highlights SmolVLA's strong generalization capabilities across diverse tasks, especially on the more challenging environments.
Real-World Evaluation
The following are the results from Table 3 of the original paper:
| Policy | Pick-Place | Stacking | Sorting | Avg. |
|---|---|---|---|---|
| Single-task Training | | | | |
| ACT | 70 | 50 | 25 | 48.3 |
| Multi-task Training | | | | |
| π0 (3.5B) | 100 | 40 | 45 | 61.7 |
| SmolVLA (0.45B) | 75 | 90 | 70 | 78.3 |
-
SO100 Robot (Pick-Place, Stacking, Sorting):
- SmolVLA (0.45B) achieves an average success rate of 78.3% in the multi-task setting.
- This outperforms ACT (48.3%), which is trained individually on each task, and π0, a significantly larger model.
- SmolVLA shows particularly strong performance on Stacking (90%) and Sorting (70%), indicating robust control for complex and longer-horizon tasks.
The following are the results from Table 4 of the original paper:
| Policy | In Distribution | Out of Distribution |
|---|---|---|
| Single-task Training | | |
| ACT | 70 | 40 |
| SmolVLA (0.45B) | 90 | 50 |
-
SO101 Robot (Pick-Place-Lego):
SmolVLA (0.45B) surpasses ACT in both in-distribution (90% vs 70%) and out-of-distribution (50% vs 40%) settings. - This demonstrates
SmolVLA's ability to generalize to a different robot embodiment and a task requiring higher precision, even with unseen object placements.
Effect of Pretraining and Multitask Learning
The following are the results from Table 5 of the original paper:
| Policy | VLA pt. | Pick-Place | Stacking | Sorting | Avg. |
|---|---|---|---|---|---|
| Single-task Training | |||||
| SmolVLA (0.45B) | No | 55 | 45 | 20 | 40 |
| Multi-task Training | |||||
| SmolVLA (0.45B) | No | 80 | 40 | 35 | 51.7 |
| SmolVLA (0.45B) | Yes | 75 | 90 | 70 | 78.3 |
- Pretraining on
community datasets (indicated by VLA pt. = Yes) significantly boosts SmolVLA's performance, increasing the average success rate from 51.7% (multi-task training without VLA pretraining) to 78.3%. This highlights the substantial knowledge-transfer benefit from pretraining. - Multi-task finetuning also yields gains (51.7% vs 40% for single-task training without VLA pretraining), underscoring the importance of knowledge transfer across tasks for improved generalization.
6.2. Asynchronous Inference
The asynchronous inference strategy is evaluated for its impact on performance and speed.
The following are the results from Figure 5(a), (b), (c) of the original paper, presented as tables:
(a) | Performance (success rates).
| Inference | Pick-Place | Stacking | Sorting | Avg |
|---|---|---|---|---|
| Sync | 75 | 90 | 70 | 78.3 |
| Async | 80 | 90 | 50 | 73.3 |
(b) | Task completion time.
| Inference | Total (s) | Avg (s) | Std (s) |
|---|---|---|---|
| Sync | 137.5 | 13.75 | 2.42 |
| Async | 97.0 | 9.70 | 2.95 |
(c) | Performance in fixed time.
| Inference | Total (# cubes) | Avg | Std |
|---|---|---|---|
| Sync | 9 | 1.8 | 0.45 |
| Async | 19 | 3.8 | 1.3 |
- Performance (Success Rates):
Synchronous (Sync) inference achieves a slightly higher average success rate (78.3%) compared to Asynchronous (Async) inference (73.3%). This suggests that while Async is faster, the Sync mode might offer a slight edge in control accuracy for tasks where immediate, fresh predictions are critical. However, for Pick-Place, Async actually performs better (80% vs 75%), while the Sorting task shows the biggest drop for Async (50% vs 70%). - Task Completion Time:
Asynchronous inference demonstrates a substantial speed advantage. It completes tasks on average in 9.7 seconds, which is approximately 30% faster than the synchronous setting (13.75 seconds). This confirms the efficiency benefits of decoupling prediction from execution. - Performance in Fixed Time: In a fixed-time evaluation (presumably for the
Pick-Place task, based on cube count), the Async mode allows the robot to complete a significantly larger number of pick-and-place cycles (3.8 cubes on average vs. 1.8 for Sync). This indicates that the Async mode, by avoiding prediction lags, enables the robot to remain active and execute more tasks within the same time frame. The higher standard deviation for Async in this metric suggests more variability in behavior, possibly due to the interaction of the threshold g and observation filtering.
6.3. Ablation Studies / Parameter Analysis
A comprehensive ablation study was conducted on the LIBERO benchmark to assess the impact of key design choices.
6.3.1. Cross-Attention (CA) vs. Self-Attention (SA) between the VLM and the Action Expert
The following are the results from Table 6 of the original paper:
| Attention mechanism | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| CA | 87 | 92 | 83 | 54 | 79.0 |
| SA | 80 | 94 | 84 | 40 | 74.5 |
| CA+SA (ours) | 86 | 99 | 90 | 67 | 85.5 |
Cross-attention (CA) (79.0% Avg) outperforms self-attention (SA) (74.5% Avg) for the interaction between VLM features and the action expert. - The
interleaved CA+SA approach (ours) yields the best results with an average of 85.5%, demonstrating that both mechanisms offer complementary strengths. CA likely provides effective conditioning on global visual context, while SA allows for robust internal action-sequence modeling.
6.3.2. Causal vs. Bidirectional Attention on Action Tokens within the Action Expert
The following are the results from Table 7 of the original paper:
| Attention mask | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| Bidir | 79 | 86 | 82 | 23 | 67.5 |
| Causal | 80 | 94 | 84 | 40 | 74.5 |
- Using a
causal attention mask (74.5% Avg) on action tokens within the action expert significantly outperforms bidirectional attention (67.5% Avg). - This suggests that preventing future action leakage (i.e., making predictions only based on past actions) is crucial for improved performance in sequential action-generation tasks.
6.3.3. Usage of LLM layers in VLM
The following are the results from Table 8 of the original paper:
| N | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| 8 | 77 | 88 | 86 | 49 | 75.0 |
| 16 | 88 | 91 | 91 | 44 | 78.5 |
| 24 | 86 | 97 | 86 | 49 | 79.5 |
| 32 | 89 | 94 | 85 | 53 | 80.3 |
| Skip %2 | 84 | 90 | 83 | 45 | 75.5 |
| VLM-256M | 86 | 83 | 75 | 59 | 75.8 |
- The table investigates the effect of
skipping VLM layers by using features from only the first N layers or by skipping every second layer (Skip %2).
- Using features from the first N layers yields the best performance among the layer-skipping options (up to 80.3% Avg for the largest N listed). This supports the idea that deeper layers are not always necessary for good task performance, offering a useful speed/performance trade-off.
- Skipping every second layer (Skip %2, 75.5% Avg) is a competitive baseline but performs worse than using the first layers directly.
- Training a smaller VLM (VLM-256M, 75.8% Avg) performs worse than skipping layers from a larger VLM (e.g., the roughly 500M-parameter backbone used for SmolVLA), suggesting that pruning a large model is more effective than training a small one from scratch for this task.
6.3.4. Action Expert Capacity
The following are the results from Table 9 of the original paper:
| Expert width (w.r.t. VLM) | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| ×1.00 | 87 | 96 | 90 | 56 | 82.3 |
| ×0.75 | 82 | 89 | 84 | 55 | 77.5 |
| ×0.50 | 89 | 94 | 85 | 53 | 80.3 |
| ×0.25 | 76 | 97 | 83 | 39 | 73.8 |
- Adjusting the
hidden size (width) of the action expert relative to the VLM's dimension affects performance.
- However, reducing the expert's hidden size to (77.5% Avg) or even (80.3% Avg) still yields competitive results, indicating a good balance between performance and efficiency can be struck at smaller capacities. The initial choice of
0.75xfor efficiency is a reasonable trade-off.
6.3.5. Regression vs. Flow Matching Training Objective
The following are the results from Table 10 of the original paper (Success Rate, %, on LIBERO):

| Training objective | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| Flow matching | 89 | 94 | 85 | 53 | 80.25 |
| Regression | 92 | 85 | 86 | 38 | 75.25 |
- Flow Matching (80.25% Avg) significantly outperforms a standard regression L1 loss (75.25% Avg) for training the action expert; both objectives are sketched below.
- This suggests that Flow Matching provides a better inductive bias for modeling complex, multimodal action distributions, leading to more robust and accurate action predictions. This finding aligns with prior work such as Black et al. (2024) and Chi et al. (2024).
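To contrast the two objectives, here is a hedged sketch (the interpolation path, time sampling, and network signatures are common conventions, not necessarily the authors' exact formulation): regression predicts the action chunk directly under an L1 loss, while flow matching regresses the velocity of a noise-to-action interpolation.

```python
import torch
import torch.nn.functional as F

def regression_loss(policy, obs, actions):
    """L1 regression: directly predict the ground-truth action chunk."""
    pred = policy(obs)                              # (B, chunk, action_dim); hypothetical API
    return F.l1_loss(pred, actions)

def flow_matching_loss(velocity_net, obs, actions):
    """Flow matching: predict the velocity of a linear noise-to-action path."""
    b = actions.size(0)
    noise = torch.randn_like(actions)               # epsilon ~ N(0, I)
    t = torch.rand(b, 1, 1, device=actions.device)  # interpolation time in (0, 1)
    x_t = (1 - t) * actions + t * noise             # point on the path
    target_v = noise - actions                      # d x_t / d t for this path
    pred_v = velocity_net(x_t, t, obs)              # expert conditioned on obs and t; hypothetical API
    return F.mse_loss(pred_v, target_v)
```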
6.3.6. States as Prefix vs. Suffix
The following are the results from Table 11 of the original paper (Success Rate, %, on LIBERO):

| States | Attention | S | O | G | 10 | Avg |
|---|---|---|---|---|---|---|
| Prefix | CA | 89 | 94 | 85 | 53 | 80.3 |
| Suffix | CA | 86 | 82 | 78 | 47 | 73.3 |
| Prefix | SA | 62 | 74 | 57 | 20 | 53.3 |
| Suffix | SA | 80 | 92 | 80 | 47 | 74.8 |
- Feeding state information as a prefix to the VLM (before the visual and language tokens) generally leads to better performance than feeding it as a suffix (appended on the action-expert side); the two token orderings are illustrated below.
- For the CA (cross-attention) variant, Prefix (80.3% Avg) is much better than Suffix (73.3% Avg).
- However, for the SA (self-attention) variant, Suffix (74.8% Avg) actually outperforms Prefix (53.3% Avg). This suggests that the optimal placement of state information interacts with the attention mechanism used. Since SmolVLA primarily uses CA to condition the expert, feeding Prefix states to the VLM is the chosen strategy.
6.3.7. Action Chunk Size
The following are the results from Table 12 of the original paper (Success Rate, %, on LIBERO):

| Chunk size | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| 1 | 45 | 77 | 54 | 24 | 50.0 |
| 10 | 90 | 94 | 94 | 58 | 84.0 |
| 30 | 85 | 94 | 87 | 48 | 78.5 |
| 50 | 89 | 94 | 85 | 53 | 80.3 |
| 100 | 83 | 88 | 85 | 42 | 74.5 |
- The chunk size (the number of actions in a predicted sequence) has a significant impact on performance.
- Very small (1, 50.0% Avg) and very large (100, 74.5% Avg) chunk sizes degrade performance. A chunk size of 1 severely limits future planning and leads to poor results.
- Chunk sizes between 10 and 50 provide a good balance, with a chunk size of 10 achieving the highest average success rate of 84.0%. This suggests that an intermediate planning horizon is optimal for LIBERO tasks, balancing reactivity with coherent action sequences. The schematic control loop after the next ablation shows where this hyperparameter enters.
6.3.8. Action Execution Steps (Observation Update Frequency)
The following are the results from Table 13 of the original paper (Success Rate, %, on LIBERO):

| Action steps | S | O | G | 10 | Avg |
|---|---|---|---|---|---|
| 1 | 89 | 94 | 85 | 53 | 80.3 |
| 10 | 89 | 94 | 91 | 57 | 82.8 |
| 30 | 76 | 91 | 74 | 42 | 70.8 |
| 50 | 54 | 70 | 58 | 25 | 51.8 |
- This ablation investigates how frequently new observations are sampled and used to predict new action chunks (i.e., how many actions are executed before a new observation is processed); a schematic control loop covering both this and the chunk-size hyperparameter is sketched below.
- Sampling new observations more frequently (e.g., every 1 or 10 steps) significantly improves performance compared to less frequent updates (e.g., every 30 or 50 steps).
- Updating observations every 10 steps achieves the highest average success rate of 82.8%. This highlights a critical trade-off: while less frequent updates reduce the number of inference calls, they lead to stale observations and degraded control accuracy, whereas more frequent updates let the robot react to environmental changes.
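The following schematic loop (environment and policy interfaces are hypothetical) shows how the two hyperparameters interact: `chunk_size` is the number of actions predicted per inference call (6.3.7), and `exec_steps` is how many of them are executed before a fresh observation triggers re-prediction (6.3.8).

```python
def control_loop(env, policy, chunk_size=50, exec_steps=10, max_steps=1000):
    """Synchronous chunked control: predict a chunk, execute its first
    `exec_steps` actions, then refresh the observation and re-predict."""
    obs = env.get_observation()                          # hypothetical env API
    steps = 0
    while steps < max_steps:
        chunk = policy.predict_chunk(obs, n=chunk_size)  # (chunk_size, action_dim); hypothetical
        for action in chunk[:exec_steps]:                # keep only the first exec_steps actions
            env.step(action)
            steps += 1
            if steps >= max_steps:
                break
        obs = env.get_observation()                      # new observation before replanning
```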
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces SmolVLA, a compact, efficient, and lightweight Vision-Language-Action (VLA) model designed to make advanced robotics more accessible and affordable. SmolVLA achieves competitive performance, often matching or surpassing much larger VLA counterparts, while being capable of deployment on consumer-grade hardware (GPUs or even CPUs). Key to its efficiency are architectural innovations such as layer skipping in the VLM, using a minimal number of visual tokens, and an interleaved cross-attention and causal self-attention mechanism within its Flow Matching action expert. The model is pretrained on community-contributed datasets, demonstrating effective learning from diverse, open-source data. Furthermore, the proposed asynchronous inference stack enhances real-world responsiveness by decoupling action prediction from execution, leading to faster task completion times. SmolVLA represents a significant step towards democratizing robot learning by lowering computational barriers.
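To make the decoupling concrete, here is a hedged sketch of asynchronous inference using a background worker thread and queues (this is not the released implementation; the policy/environment interfaces and the refill heuristic are assumptions): the robot keeps consuming actions from a queue while the next chunk is computed from a recent observation.

```python
import queue
import threading

def inference_worker(policy, obs_q, act_q, chunk_size=50):
    """Consume observation snapshots and push predicted action chunks."""
    while True:
        obs = obs_q.get()
        if obs is None:                     # sentinel: shut down the worker
            break
        for action in policy.predict_chunk(obs, n=chunk_size):  # hypothetical API
            act_q.put(action)

def run_async(env, policy, total_steps=1000, refill_threshold=10):
    obs_q, act_q = queue.Queue(maxsize=1), queue.Queue()
    worker = threading.Thread(target=inference_worker,
                              args=(policy, obs_q, act_q), daemon=True)
    worker.start()
    obs_q.put(env.get_observation())        # request the first chunk
    for _ in range(total_steps):
        # Request the next chunk early so execution rarely stalls on inference.
        if act_q.qsize() <= refill_threshold and obs_q.empty():
            obs_q.put(env.get_observation())
        env.step(act_q.get())               # blocks only if no actions are ready
    obs_q.put(None)                         # stop the worker
```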
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Limited Robot Type in Pretraining: SmolVLA was primarily pretrained on data from a single robot type (SO100). While finetuning to other robots (SO101) was demonstrated, pretraining on multi-robot embodiments is crucial for truly enhancing generalization to new robotic platforms.
- Dataset Size and Scalability: The current pretraining dataset of approximately 23,000 trajectories is significantly smaller than those used by large VLAs (e.g., OpenVLA's 1 million trajectories). Expanding the dataset could substantially improve performance and generalization.
- Architecture Scalability: Although efficient for consumer-grade hardware, exploring ways to scale the SmolVLA architecture further without sacrificing speed or accessibility is an important direction.
- VLM Backbone Choice: The reliance on an off-the-shelf VLM backbone (SmolVLM-2) pretrained on document reading and OCR tasks might not be optimal for real-world robotic action scenarios. Future work could explore VLM backbones specifically pretrained to align better with the demands of robotic environments.
- Multimodal Datasets: Integrating diverse multimodal datasets (e.g., video and audio) could further improve generalization and instruction-following abilities.
- Task Complexity and Longer Horizons: While effective for relatively simple, short-horizon tasks, scaling SmolVLA to longer-horizon problems remains a challenge. Incorporating hierarchical policies or multi-level planning mechanisms may be necessary.
- Learning Paradigms (Imitation Learning vs. Reinforcement Learning): The current approach relies primarily on imitation learning. Exploring reinforcement learning (RL) techniques for VLAs (Chen et al., 2025) could offer significant performance benefits and more dexterous policy adaptation, especially for complex or long-horizon tasks.
7.3. Personal Insights & Critique
SmolVLA makes a highly valuable contribution to the field of robot learning by addressing the critical issue of accessibility. The current trend of increasingly massive foundation models risks centralizing research to only well-funded institutions. SmolVLA's focus on efficiency and deployability on consumer-grade hardware is a direct counter to this, fostering democratization and potentially spurring innovation from a broader community.
The architectural innovations, particularly layer skipping and the interleaved attention in the action expert, are clever ways to prune computational cost without a drastic performance hit. The asynchronous inference stack is a practical and necessary solution for real-world robotics, where latency directly impacts safety and task success; the accompanying analysis and the measured real-world speed gains are convincing.
Potential Issues/Areas for Improvement:
- Generalization to Diverse Robot Morphologies: While the paper acknowledges this limitation, pretraining on a single robot type, even with diverse data, might struggle with significant changes in robot kinematics or sensor configurations. Future work could explore robot-agnostic representations or sim-to-real adaptation techniques in conjunction with SmolVLA's efficiency.
- Impact of Auto-Generated Annotations: Using VLMs for task annotation is an innovative way to handle noisy community data, but the quality of these auto-generated annotations could carry subtle biases or limitations. An ablation comparing human-verified and VLM-generated annotations would help clarify this impact.
- Long-Horizon Task Performance: The performance drop on the Sorting task with async inference (from 70% to 50%) suggests that, for tasks requiring continuous monitoring and complex sequential decisions, the async mechanism's thresholds and similarity filtering might need dynamic tuning or more sophisticated planning to maintain accuracy. The trade-off between speed and accuracy will be more pronounced in such scenarios.
- Robustness to Real-World Noise: While community datasets introduce noise, real-world deployment faces even greater challenges (lighting variations, unexpected object interactions, sensor failures). SmolVLA's current evaluation on relatively controlled real-world tasks might not fully capture its robustness to extreme conditions.
Transferability and Applications:
- Resource-Constrained Edge Devices: The architectural design principles (e.g., layer skipping, minimal visual tokens) could be applied to other multimodal models deployed on edge devices beyond robotics, such as smart cameras for surveillance or smart home assistants.
- Real-Time Control Systems: The asynchronous inference stack is a broadly applicable design pattern for any real-time system in which computationally expensive predictions must be made without interrupting continuous operation. This could extend to autonomous vehicles, industrial automation, or real-time human-computer interaction.
- Foundation for Open Robotics: SmolVLA's emphasis on open-source data and reproducible recipes positions it as a strong foundation for future open robotics research, allowing smaller labs and individual researchers to contribute to and benefit from VLA advancements.

Overall, SmolVLA is a well-executed and thoughtfully designed model that pushes the boundaries of efficient VLA development. Its commitment to open science and practical deployability makes it a highly relevant and inspiring piece of work for the future of robotics.