Paper status: completed

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SmolVLA is a compact and efficient vision-language-action model that achieves competitive performance at reduced computational costs, enabling deployment on consumer-grade hardware and promoting broader participation in robotics research through community-driven dataset pretraining.

Abstract

SmolVLA is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. It introduces a novel, compact, and computationally efficient model designed to democratize robotics by enabling deployment on consumer-grade hardware.

1.2. Authors

The paper lists numerous authors primarily affiliated with Hugging Face, with additional contributions from Sorbonne University, valeo.ai, and École Normale Supérieure Paris-Saclay. The core team members are highlighted with an asterisk: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. Their diverse affiliations suggest a collaborative effort spanning academic research and industry, particularly in the domain of large language models and machine learning for robotics.

1.3. Journal/Conference

This paper was published as a preprint, indicated by its presence on Hugging Face Papers and arXiv. The publication date listed in the page metadata (2001-06-01T16:00:00.000Z) appears to be a placeholder or an error, as the arXiv identifier 2506.01844 implies a June 2025 submission. As a preprint, it has not yet undergone formal peer review in a journal or conference proceedings. However, Hugging Face is a highly influential platform in the machine learning and natural language processing community, making it a prominent venue for sharing research, particularly for models and datasets.

1.4. Publication Year

Based on the arXiv identifier 2506.01844, the intended publication year is 2025. The provided UTC timestamp (2001-06-01T16:00:00.000Z) is inconsistent with the arXiv ID and likely an error.

1.5. Abstract

The paper introduces SmolVLA, a compact and efficient vision-language-action (VLA) model. The primary objective is to develop a VLA that achieves competitive performance while significantly reducing computational costs, enabling its deployment on consumer-grade hardware like standard GPUs or even CPUs. The model incorporates an asynchronous inference stack to enhance responsiveness by decoupling perception and action prediction from action execution, leading to higher control rates. A notable aspect is SmolVLA's pretraining on publicly available, community-contributed datasets. Despite its small size, SmolVLA demonstrates performance comparable to VLAs that are up to 10 times larger, as validated through evaluations on both simulated and real-world robotic benchmarks. The authors commit to releasing all code, pretrained models, and training data to foster reproducibility and broader participation in robotics research.

  • Original Source Link: https://huggingface.co/papers/2506.01844
  • PDF Link: https://arxiv.org/pdf/2506.01844.pdf (available as a preprint on Hugging Face Papers and arXiv)

2. Executive Summary

2.1. Background & Motivation

The field of robotics is increasingly moving towards foundation models, particularly Vision-Language Models (VLMs), as a strong basis for robotic policies. These models, pretrained on large-scale multimodal datasets, encode rich visual and linguistic knowledge, enabling natural language-driven perception and control. However, the core problem is that existing Vision-Language-Action (VLA) models are typically massive, often comprising billions of parameters. This leads to several significant challenges:

  • High training costs: The computational resources required for training these large models are prohibitive for most researchers and institutions.

  • Limited real-world deployability: Their immense size makes them impractical for deployment on resource-constrained platforms, such as affordable robots or consumer-grade hardware.

  • Reliance on academic/industrial datasets: Many existing VLAs are trained on proprietary or specialized datasets, overlooking the growing availability of community-collected data from more accessible robotic platforms. This limits accessibility and reproducibility within the broader robotics research community.

    The problem is important because it hinders the democratization of robotics. High computational and hardware barriers prevent wider participation and innovation in robot learning. The paper's entry point is to address these limitations by developing a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. The innovative idea is to achieve this efficiency through specific architectural choices and a novel inference strategy, making advanced robotics more accessible.

2.2. Main Contributions / Findings

The paper makes several primary contributions aimed at making VLAs more affordable and efficient:

  • Lightweight Architecture (SmolVLA): They present SmolVLA, a compact and efficient vision-language-action model optimized for training on consumer-grade GPUs and deployment even on CPUs. Key design choices include:

    • Skipping layers in the VLM: Reducing computational load during inference by using features from only the initial layers.
    • Using a minimal number of visual tokens: Reducing the input dimension and processing requirements.
    • Leveraging small pretrained VLMs: Building upon already efficient vision-language backbones.
    • Interleaving self-attention and cross-attention layers: Optimizing the interaction between visual features and action tokens for better performance and speed.
  • Pretraining on Community-Driven Datasets: SmolVLA is trained end-to-end on fewer than 30,000 episodes (approximately 10.6 million frames) sourced exclusively from publicly available, community-contributed datasets. This demonstrates strong performance with significantly less data than prior art and highlights the value of open-source data.

  • Asynchronous Inference Stack: They introduce an asynchronous inference stack that decouples perception and action prediction from action execution. This allows for higher control rates and more responsive control by enabling chunked action generation and predictive processing, avoiding idle lags in robot operation.

  • Competitive Performance with Reduced Costs: Despite its compact size (0.45 billion parameters, with a 2.25 billion parameter variant also tested), SmolVLA achieves performance comparable to, or even surpassing, VLAs that are up to 10 times larger (e.g., the 3.3-billion-parameter $\pi_0$). This is demonstrated across a range of simulated (LIBERO, Meta-World) and real-world robotic benchmarks (SO-100, SO-101).

  • Reproducibility and Open-Source Release: The authors release all code, pretrained models, and training data, providing reproducible and efficient training and inference recipes to foster community engagement and research.

    The key findings are that significant reductions in model size and computational requirements for VLA models are achievable without sacrificing competitive performance. The use of community-contributed data and an asynchronous inference strategy are crucial enablers for affordable and efficient robotics.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several machine learning and robotics concepts is essential for a beginner:

  • Vision-Language Models (VLMs):

    • Conceptual Definition: VLMs are a type of artificial intelligence model designed to process and understand information from both visual (e.g., images, videos) and textual (e.g., natural language descriptions) modalities simultaneously. They learn to associate visual content with linguistic meaning.
    • How they work: Typically, a VLM consists of a vision encoder that processes images and a language model (often a decoder-only transformer) that processes text. A projection layer or adapter often connects these two components, allowing them to communicate. VLMs are often pretrained on massive datasets of image-text pairs (e.g., "a cat sitting on a mat" with a corresponding image of a cat).
    • Example: If you show a VLM an image of a dog and ask "What is in this picture?", it can respond "A dog." Or, if you provide the text "a red car" and ask it to generate an image, it can do so.
  • Large Language Models (LLMs):

    • Conceptual Definition: LLMs are deep learning models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide array of language tasks, from answering questions to writing essays.
    • Key Feature: Their "largeness" refers to billions of parameters (trainable weights) and the immense scale of their training data, which allows them to capture complex patterns and knowledge about language.
    • Example: GPT-3, Llama, Falcon.
  • Vision-Language-Action (VLA) Models:

    • Conceptual Definition: VLAs extend VLMs by adding an "action" component, making them capable of controlling robots. These models take multimodal inputs (visual observations from cameras, natural language instructions) and output physical actions that a robot can execute.
    • Goal: To enable robots to understand high-level human commands (e.g., "pick up the red block") and translate them into a sequence of low-level motor commands (e.g., joint angles, gripper movements).
    • Architecture: Often built by adapting or finetuning a pretrained VLM with robotics-specific data, adding an action head or action expert module.
  • Transformers:

    • Conceptual Definition: A neural network architecture introduced in 2017, which revolutionized sequence-to-sequence tasks, particularly in natural language processing. It's known for its efficiency in parallel processing compared to recurrent neural networks.
    • Key Mechanism: Attention: The core of a Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
    • Attention Formula (Scaled Dot-Product Attention): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
      • $Q$ (Query): The query vector for the current element, used to compare against other elements.
      • $K$ (Key): The key vectors for all elements in the sequence, used to match against the query.
      • $V$ (Value): The value vectors for all elements, carrying the actual information to be weighted and aggregated.
      • $d_k$: The dimension of the key vectors, used to scale the dot products and keep the softmax in a well-behaved range.
      • $\mathrm{softmax}(\cdot)$: A function that converts a vector of numbers into a probability distribution, ensuring the weights sum to 1.
      • $QK^T$: The dot product between queries and keys, measuring their similarity.
      • The output is a weighted sum of the value vectors, where the weights are determined by the attention scores (see the minimal PyTorch sketch at the end of this list).
    • Cross-Attention: Similar to self-attention, but queries come from one sequence (e.g., action tokens) and keys and values come from another sequence (e.g., visual features). This allows one modality to "attend to" or query information from another.
    • Causal Attention Mask: A mechanism used in autoregressive models (like language decoders) to prevent tokens from attending to future tokens in the sequence. This ensures that the prediction of the current token only depends on past tokens.
  • Flow Matching:

    • Conceptual Definition: A generative modeling technique used to learn continuous-time dynamics that transform a simple distribution (e.g., Gaussian noise) into a complex target distribution (e.g., a distribution of actions). It trains a neural network to predict a vector field that moves samples from the noise distribution to the data distribution along a continuous path.
    • Advantage: It can be more stable and efficient than traditional diffusion models for certain tasks, particularly in action generation, as it directly learns the vector field mapping.
  • Action Chunking:

    • Conceptual Definition: Instead of predicting a single action at each timestep, action chunking involves predicting a sequence or "chunk" of future actions $(a_t, a_{t+1}, \ldots, a_{t+n})$ all at once.
    • Benefits: This can improve efficiency by reducing the frequency of computationally expensive model inferences and can lead to smoother, more coordinated robot movements by providing a short-term plan.
    • Trade-off: Longer chunks can lead to less reactive control if environmental conditions change rapidly during the execution of the chunk.
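
The scaled dot-product attention defined above can be written in a few lines of PyTorch. This is a minimal, illustrative sketch (tensor shapes and the optional causal mask are assumptions for demonstration, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (batch, seq_len, d_k) tensors
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5      # QK^T / sqrt(d_k): pairwise similarities
    if causal:
        # Causal mask: each position may only attend to itself and earlier positions.
        n_q, n_k = scores.shape[-2:]
        mask = torch.triu(torch.ones(n_q, n_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # attention weights sum to 1 over keys
    return weights @ v                               # weighted sum of value vectors

# Cross-attention is the same computation with q coming from one sequence
# (e.g., action tokens) and k, v from another (e.g., VLM features).
x = torch.randn(2, 8, 64)
print(scaled_dot_product_attention(x, x, x, causal=True).shape)  # torch.Size([2, 8, 64])
```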

3.2. Previous Works

The paper contextualizes SmolVLA within the landscape of VLMs and VLAs, highlighting the trajectory of research in this area.

Vision-Language Models (VLMs)

  • Early VLMs: Often built by integrating pretrained vision encoders (like CLIP (Radford et al., 2021) or SigLIP (Zhai et al., 2023)) with pretrained LLMs (like Llama (AI@Meta, 2024; Touvron et al., 2023)).
  • Training Paradigms: Typically involve multi-stage training:
    1. Large-scale pretraining on image-caption datasets (LAION-COCO (Schuhmann et al., 2022)).
    2. Instruction-tuning on conversational datasets (Llava (Liu et al., 2023a), MiniGPT-4 (Zhu et al., 2023)).
  • Efficient VLMs: Recent efforts focus on reducing computational costs by training on smaller, more diverse datasets (Moondream (Korrapati, 2024), Qwen-VL (Bai et al., 2025)) or adapting unimodal models with minimal tuning (Fuyu-8B (Bavishi et al., 2023)). SmolVLM-2 (Marafioti et al., 2025), which SmolVLA uses as its backbone, is an example of an efficient VLM optimized for multimodal video inputs.
  • Limitations of existing VLMs for Robotics: While powerful for perception, most VLMs are not designed for direct action generation and often lack the real-time responsiveness and specific inductive biases needed for robotic control.

Vision-Language-Action Models (VLAs)

  • Emergence of VLAs: A growing area aiming to imbue robots with generalist skills by leveraging the reasoning and world knowledge embedded in pretrained LLMs and VLMs.
  • Octo (Team et al., 2024) and RT-X (O'Neill et al., 2024): These are prominent early VLA systems. They typically finetune pretrained VLMs on robotics-specific datasets. However, they are known for being resource-intensive and often depend on costly robotic platforms, limiting their accessibility. RT-X notably released the Open X-Embodiment dataset, a large collection of robotics data.
  • OpenVLA (Kim et al., 2024): Released a 7-billion parameter VLA trained on publicly available data, generating discrete action tokens. This discrete nature can be a limitation for continuous control tasks.
  • $\pi_0$ (Black et al., 2024) and DexVLA (Wen et al., 2025): These approaches address the continuous control limitation by proposing diffusion-based decoders for action generation. They adapt pretrained VLMs (like RDT-1B) and introduce large diffusion components (termed action experts) trained directly on robot demonstrations. $\pi_0$ is a key baseline for SmolVLA.
  • ACT (Zhao et al., 2023): A Conditional Variational Autoencoder (CVAE) policy model with a transformer architecture, using a ResNet vision encoder. It generates action chunks and is optimized with a regression objective. It's another important baseline.
  • TinyVLA (Zhou et al., 2024): Another small-scale VLA model aiming for efficiency.
  • Limitations of Prior VLAs:
    • Massive Model Sizes: Many VLAs are very large, leading to high training and inference costs.
    • High Data Requirements: They often rely on enormous datasets, frequently collected by industrial labs, which can be hard to reproduce or access.
    • Computational Expense: Inference often requires powerful GPUs, making real-world deployment on affordable robots challenging.
    • Limited Responsiveness: Synchronous inference strategies can introduce blind lags, reducing robot reactivity.

3.3. Technological Evolution

The field has seen a rapid evolution from general-purpose Large Language Models (LLMs) to Vision-Language Models (VLMs), and now to Vision-Language-Action (VLA) models.

  1. Foundation of LLMs: The success of LLMs (e.g., GPT-4, Llama) demonstrated the power of large models trained on vast internet-scale datasets to acquire general reasoning and language generation capabilities.
  2. Multimodal Shift to VLMs: Researchers recognized the potential of integrating visual perception with LLMs. This led to VLMs, where vision encoders are combined with LLMs to process both images and text. Initial VLMs often involved complex multi-stage training and large model sizes.
  3. Action Integration for VLAs: The next logical step was to connect VLMs to robotic control, giving rise to VLAs. The goal was to leverage the rich world knowledge and reasoning of VLMs to enable natural language instructions for robots. Early VLAs demonstrated impressive generalization but inherited the challenges of VLMs regarding size and computational cost.
  4. Democratization and Efficiency (SmolVLA's Niche): SmolVLA fits into the current trajectory by addressing the critical need for affordability and efficiency. While prior VLAs focused on maximizing performance, SmolVLA prioritizes compactness, reduced computational costs, and deployability on consumer-grade hardware, all while maintaining competitive performance. This represents a shift towards making these powerful capabilities accessible to a broader robotics research community. The use of community-driven datasets further reinforces this democratizing trend.

3.4. Differentiation Analysis

SmolVLA distinguishes itself from the main methods in related work through its core emphasis on efficiency and accessibility, without significantly compromising performance:

  • Model Size and Computational Footprint:

    • Differentiation: SmolVLA is explicitly designed to be small (0.45 billion parameters) and efficient, capable of training on a single GPU and deploying on consumer-grade GPUs or CPUs. This stands in stark contrast to Octo and RT-X, which are typically massive, requiring significant computational resources. $\pi_0$, a key baseline, has 3.3-3.5 billion parameters, making SmolVLA roughly 7-8 times smaller. OpenVLA has 7 billion parameters.
    • Innovation: SmolVLA achieves this through architectural innovations like skipping layers in the VLM backbone, using a minimal number of visual tokens, and employing a smaller VLM (SmolVLM-2).
  • Data Strategy:

    • Differentiation: Unlike many industrial or academic VLA efforts that rely on vast, often proprietary or hard-to-access datasets (e.g., RT-X's Open X-Embodiment dataset), SmolVLA is pretrained exclusively on publicly available, community-contributed datasets. It uses an order of magnitude less data (fewer than 30,000 episodes) compared to models like OpenVLA (around 1 million trajectories).
    • Innovation: The paper also introduces methods for standardizing and improving the quality of community data, such as VLM-based task annotation and camera viewpoint normalization, addressing the inherent noise and heterogeneity of crowd-sourced data.
  • Inference Mechanism:

    • Differentiation: While many VLA systems might operate in a synchronous, open-loop fashion between observations, SmolVLA introduces an asynchronous inference stack.
    • Innovation: This decouples perception and action prediction from action execution, allowing the robot to execute actions while a new chunk is being computed, thus reducing blind lags and increasing responsiveness and control rates. This is particularly critical for real-world deployment scenarios where latency is a concern.
  • Action Generation Method:

    • Differentiation: SmolVLA utilizes a Flow Matching Transformer as its action expert for continuous action generation, similar to $\pi_0$ and DexVLA. This is an improvement over OpenVLA, which generates discrete action tokens, a limitation for fine-grained continuous control.

    • Innovation: SmolVLA further optimizes this by interleaving cross-attention and causal self-attention layers within its action expert, which they find provides higher success rates and faster inference times.

      In essence, SmolVLA offers a paradigm shift towards making powerful VLA capabilities more accessible and practical for a wider range of users and robotic platforms by focusing on rigorous efficiency without sacrificing performance.

4. Methodology

4.1. Principles

The core idea behind SmolVLA is to build a Vision-Language-Action (VLA) model that is inherently small, efficient, and capable, optimized for affordable and efficient robotics. This is achieved by combining a compact, pretrained Vision-Language Model (VLM) for perception with an optimized action expert for continuous action generation. A key principle is to leverage community-contributed datasets for pretraining, making the model more accessible and reproducible. Furthermore, to enhance real-world responsiveness, SmolVLA employs an asynchronous inference stack that decouples computation from execution, allowing for higher control rates. The theoretical basis lies in adapting powerful transformer architectures and generative modeling techniques (Flow Matching) to efficiently predict robot actions conditioned on multimodal observations and language instructions.

4.2. Core Methodology In-depth (Layer by Layer)

SmolVLA is composed of two main interacting components: (i) a pretrained VLM responsible for perception, and (ii) an action expert that conditions and generates the actions. It also incorporates specific strategies for data handling and inference optimization.

The following figure (Figure 1 from the original paper) illustrates the overall architecture of SmolVLA:

Figure 1 | SmolVLA. SmolVLA consists of a compact pretrained vision-language model (with the last $L - N$ layers discarded, indicated by the scissors icon) and an action expert that uses interleaved cross-attention and self-attention to output a chunk of $n$ low-level actions $a_t, \ldots, a_{t+n}$. SmolVLA is pretrained on public community datasets and evaluated on low-cost robots.

4.2.1. Vision-Language Model (VLM)

The VLM serves as the primary backbone for perceiving the robot's environment. It processes sensorimotor states, including images from multiple RGB cameras, and a language instruction.

  • Choice of VLM Backbone: The authors choose SmolVLM-2 (Marafioti et al., 2025) due to its efficiency and optimization for multimodal video inputs. SmolVLM-2 uses SigLIP (Zhai et al., 2023) as its vision encoder to extract visual features, which are then fed into a SmolLM2 language decoder.
  • Image Sequence Processing: The VLM component processes sequences of images using its vision encoder.
  • Visual Token Reduction: To ensure efficiency, SmolVLA significantly reduces the number of visual tokens processed per frame. While SmolVLM-2 typically processes high-resolution images, SmolVLA limits visual tokens to 64 per frame, which shortens the input sequence and lowers the overall computational load.
  • Input Concatenation: Visual features, language tokens (from the instruction), and state tokens are concatenated and passed to the language decoder. The resulting features from the decoder layers are then used to condition the action expert.

4.2.2. State, Action, and State Projectors

Linear projection layers are used at several points to ensure dimensional compatibility between different components:

  • To project input states (e.g., robot joint positions, gripper state) to match the language model (LM)'s hidden dimension.
  • To project actions to match the action expert's input dimension.
  • To adapt VLM features to the action expert's dimension.
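
To make the role of these adapters concrete, here is a hedged sketch of the three projection layers; all dimensions below are illustrative placeholders, not the paper's exact values:

```python
import torch
import torch.nn as nn

class Projectors(nn.Module):
    """Illustrative linear adapters between SmolVLA's components (dimensions are assumptions)."""
    def __init__(self, state_dim=6, action_dim=6, vlm_dim=960, expert_dim=720):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, vlm_dim)        # robot state -> LM hidden size
        self.action_proj = nn.Linear(action_dim, expert_dim)   # actions -> action-expert width
        self.vlm_to_expert = nn.Linear(vlm_dim, expert_dim)    # VLM features -> action-expert width

proj = Projectors()
state_token = proj.state_proj(torch.randn(1, 6))  # (1, 960) token appended to the VLM input
```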

4.2.3. Faster Inference Through Layer Skipping

A key efficiency optimization involves skipping computations within the VLM.

  • Principle: Based on prior work (Shukor and Cord, 2024; Tang et al., 2023) showing that not all layers of a pretrained model are equally important for downstream tasks.
  • Mechanism: SmolVLA extracts features from the VLM up to a specified layer $N$, effectively discarding the top $L - N$ layers of the language model decoder.
  • Configuration: The authors found that setting $N$ to half the total layers ($N = L/2$) provides a good trade-off between speed and performance, essentially halving the computational cost of the LLM and action expert.
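
A minimal sketch of the layer-skipping idea, assuming a Hugging Face-style decoder that stores its transformer blocks in an `nn.ModuleList` (attribute names vary by model and are placeholders here):

```python
import torch.nn as nn

def truncate_decoder_layers(layers: nn.ModuleList, keep_n: int) -> nn.ModuleList:
    """Keep only the first `keep_n` transformer blocks, discarding the top L - N layers."""
    return nn.ModuleList(list(layers)[:keep_n])

# Hypothetical usage with L = 32 total decoder layers and N = L / 2:
# vlm.text_model.layers = truncate_decoder_layers(vlm.text_model.layers, keep_n=16)
```

Truncating the module list, rather than running all layers and reading an intermediate hidden state, is what actually saves compute, since the discarded layers are never executed.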

4.2.4. Flow Matching Action Expert

The action expert, denoted as $\mathbf{v}_\theta$, is responsible for predicting a chunk of low-level actions.

  • Input: VLM features $\mathbf{o}_t$, extracted from an observation $o_t$ at the $N$-th VLM layer, and noisy actions $\mathbf{A}_t^\tau$.
  • Output: An action chunk $\mathbf{A}_t = (a_t, \ldots, a_{t+n})$, representing a sequence of $n$ low-level commands.
  • Architecture: The action expert is built upon the Transformer architecture (Vaswani et al., 2017).
  • Training Objective (Flow Matching): The expert $\mathbf{v}_\theta$ is trained using the objective defined by: $ \mathcal{L}^\tau(\theta) = \mathbb{E}_{p(\mathbf{A}_t \mid \mathbf{o}_t),\, q(\mathbf{A}_t^\tau \mid \mathbf{A}_t)} \left[ \left\| \mathbf{v}_\theta\big(\mathbf{A}_t^\tau, \mathbf{o}_t\big) - \mathbf{u}\big(\mathbf{A}_t^\tau \mid \mathbf{A}_t\big) \right\|^2 \right] $ Where:
    • $\theta$: The parameters of the action expert model.
    • $\mathbf{o}_t$: The VLM features extracted from an observation $o_t$ at time $t$ from the $N$-th VLM layer.
    • $\mathbf{A}_t$: The ground-truth action chunk that the robot should execute.
    • $\mathbf{A}_t^\tau$: A noisy version of the action chunk, created by interpolating between the ground-truth action chunk and Gaussian noise: $\mathbf{A}_t^\tau = \tau \mathbf{A}_t + (1 - \tau) \epsilon$.
      • $\tau$: A scalar parameter sampled from a Beta distribution, controlling the interpolation factor.
      • $\epsilon$: Standard Gaussian noise, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix.
    • $\mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$: The prediction of the action expert, which takes the noisy action chunk and VLM features as input.
    • $\mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) = \epsilon - \mathbf{A}_t$: The target vector field that the expert is trained to output. This vector field effectively points from the noisy action $\mathbf{A}_t^\tau$ towards the ground-truth action $\mathbf{A}_t$.
  • Efficiency in Expert: To further improve inference efficiency, the action expert $\mathbf{v}_\theta$ uses a reduced hidden size of $0.75 \times d$, where $d$ is the VLM's hidden dimension.
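
The objective above translates almost directly into a training step. Below is a minimal sketch assuming a generic `action_expert(noisy_actions, vlm_features)` callable; the Beta distribution parameters are placeholders, not the paper's values:

```python
import torch

def flow_matching_loss(action_expert, actions, vlm_features,
                       tau_dist=torch.distributions.Beta(1.5, 1.0)):
    """One flow matching training step (illustrative sketch).

    actions:      (B, n, action_dim) ground-truth action chunk A_t
    vlm_features: conditioning features o_t from the N-th VLM layer
    """
    B = actions.shape[0]
    tau = tau_dist.sample((B, 1, 1)).to(actions)   # interpolation factor per sample
    eps = torch.randn_like(actions)                # Gaussian noise
    noisy = tau * actions + (1.0 - tau) * eps      # A_t^tau = tau * A_t + (1 - tau) * eps
    target = eps - actions                         # u(A_t^tau | A_t) = eps - A_t
    pred = action_expert(noisy, vlm_features)      # v_theta(A_t^tau, o_t)
    return ((pred - target) ** 2).mean()           # mean squared flow matching error
```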

4.2.5. Interleaved Cross and Causal Self-Attention Layers

Within the action expert's Transformer architecture, a specific attention mechanism is employed to efficiently integrate VLM features and model action token dependencies:

  • Mechanism: Unlike prior works relying exclusively on self-attention (SA) or cross-attention (CA), SmolVLA interleaves SA and CA layers. This design choice is adapted from standard VLM architectures, where each decoder block typically includes both SA and CA layers.
  • Cross-Attention (CA): Allows the action tokens (as queries) to co-attend to the VLM features (acting as keys and values). This is how the action expert conditions its predictions on the visual and linguistic context provided by the VLM.
  • Causal Self-Attention (SA): Allows the action tokens within the predicted chunk $\mathbf{A}_t$ to attend to each other, but only to preceding tokens in the sequence (using a causal attention mask). This ensures that the prediction of $a_{t+k}$ only depends on $a_t, \ldots, a_{t+k-1}$, preventing future action leakage.
  • Benefit: The authors found that interleaving CA and SA layers provides higher success rates and faster inference times for robotic tasks.
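
A hedged sketch of one interleaved block of the action expert (layer norms, MLPs, and residual connections are omitted for brevity; the hidden size and head count are assumptions):

```python
import torch
import torch.nn as nn

class InterleavedExpertBlock(nn.Module):
    """Cross-attention to VLM features followed by causal self-attention over the action chunk."""
    def __init__(self, dim=720, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, action_tokens, vlm_features):
        # CA: action tokens (queries) attend to the VLM features (keys/values).
        x, _ = self.cross_attn(action_tokens, vlm_features, vlm_features)
        # Causal SA: a_{t+k} may only attend to a_t, ..., a_{t+k}.
        n = x.shape[1]
        causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), diagonal=1)
        y, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        return y
```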

4.2.6. Pretraining Data Collected by the Community

SmolVLA is pretrained on community-contributed datasets to improve accessibility and reproducibility.

  • Challenges of Community Data:
    • Heterogeneity: High variability in robot morphologies, sensors, actuation modes, and control schemes.
    • Noise: Substantial noise in task annotations and inconsistencies in camera naming conventions.
    • Data Collection Methods: Reliance on teleoperation by human experts.
  • Dataset Curation: A subset of 481 community datasets from Hugging Face was selected, totaling approximately 22,900 episodes and 10.6 million frames. This is an order of magnitude smaller than typical datasets for larger VLAs.
  • Task Annotation with VLM: To address noisy and vague task descriptions, an off-the-shelf VLM (Qwen2.5-VL-3B-Instruct) was used to auto-generate concise task descriptions.
    • Process: Representative frames from each dataset were sampled and provided to the VLM along with the original (often noisy) instruction.
    • Prompt Example: The model was prompted to produce short, action-oriented sentences. The full prompt is:
      Here is a current task description: {current_task}. Generate a very short, clear, and complete one-sentence describing the action performed by the robot arm (max 30 characters). Do not include unnecessary words. Be concise.
      
      Here is some examples: Pick up the cube and place it in the box, open the drawer and so on.
      
      Start directly with an action verb like "Pick", "Place", "Open", etc.
      
      Similar to the provided examples, what is the main action done by the robot arm?
      
  • Camera Viewpoint Normalization: To standardize inconsistent camera naming (e.g., images.laptop could be a top, side, or wrist view), cameras were manually mapped to standard viewpoints: top, wrist, and side perspectives. These were then renamed as OBS_IMAGE_1, OBS_IMAGE_2, and OBS_IMAGE_3, respectively.
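
A minimal sketch of the viewpoint-normalization step described above; the per-dataset mapping and feature keys are illustrative assumptions:

```python
# Hypothetical mapping from one dataset's heterogeneous camera keys to standardized names
# (each dataset gets its own mapping after manual inspection of the viewpoints).
VIEWPOINT_MAP = {
    "observation.images.laptop": "OBS_IMAGE_1",  # verified to be the top view
    "observation.images.wrist":  "OBS_IMAGE_2",  # wrist-mounted view
    "observation.images.side":   "OBS_IMAGE_3",  # side view
}

def normalize_camera_keys(frame: dict) -> dict:
    """Rename camera keys to OBS_IMAGE_1/2/3 (top/wrist/side) and leave all other keys untouched."""
    return {VIEWPOINT_MAP.get(key, key): value for key, value in frame.items()}
```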

4.2.7. Asynchronous Inference

To overcome the limitations of synchronous inference (which introduces blind lags and reduces responsiveness), SmolVLA implements an asynchronous (async) inference stack.

The following figure (Figure 2 from the original paper) depicts the asynchronous inference architecture:

Figure 2 | Asynchronous inference. Schematic of the interaction between the PolicyServer and the RobotClient: starting from the initial observation $o_0$, the RobotClient receives a chunk of $n$ actions from the PolicyServer and keeps executing actions while new observations are sent for inference, illustrating how environment and inference latency affect the action queue.

  • Problem with Synchronous Inference:
    • In a typical synchronous setup, a policy $\pi$ predicts an action chunk $\mathbf{A}_t = (a_t, \ldots, a_{t+n})$ from an observation $o_t$. The robot then executes all $n$ actions before a new observation $o_{t+n}$ is passed to the policy. This results in open-loop inference between observations and introduces blind lags while waiting for the next chunk to be computed.
  • Asynchronous Solution: Decouples action chunk prediction ($\mathbf{A}_t$) from action execution ($a_t \gets \mathrm{PopFront}(\mathbf{A}_t)$).
    • Architecture: A RobotClient sends an observation $o_t$ to a PolicyServer (potentially remote, with GPUs). The RobotClient receives an action chunk $\mathbf{A}_t$ once inference is complete.

    • Goal: To avoid execution lags by triggering new chunk evaluations while the robot is still executing actions from the previous chunk. This allows for continuous robot operation, especially beneficial in latency-constrained scenarios.

      The following is Algorithm 1, Asynchronous inference control-loop, from the original paper:

Algorithm 1: Asynchronous inference control loop

 1: Input: horizon T, chunk size n, threshold g ∈ [0, 1]
 2: Init: capture o_0; send o_0 to PolicyServer; receive A_0 ← π(o_0)
 3: for t = 1 to T do
 4:   a_t ← PopFront(A_t)
 5:   Execute(a_t)                              ▷ execute action at step t
 6:   if |A_t| / n < g then                     ▷ queue below threshold
 7:     capture new observation o_{t+1}
 8:     if NeedsProcessing(o_{t+1}) then        ▷ similarity filter, or trigger direct processing
 9:       async_handle ← AsyncInfer(o_{t+1})    ▷ trigger new chunk prediction (non-blocking)
10:       Ã_{t+1} ← π(o_{t+1})                  ▷ new queue is predicted with the policy
11:       A_{t+1} ← f(A_t, Ã_{t+1})             ▷ aggregate overlaps (if any)
12:     end if
13:   end if
14:   if NotCompleted(async_handle) then
15:     A_{t+1} ← A_t                           ▷ no update on queue (inference is not over yet)
16:   end if
17: end for

Explanation of Algorithm 1:

  • Inputs: horizon T (total timesteps for execution), chunk size n (number of actions in one predicted chunk), and threshold g (a fraction, $0 \le g \le 1$, determining when to trigger a new prediction).

  • Initialization (Lines 1-2):

    • The robot captures its initial observation, $o_0$.
    • $o_0$ is sent to the PolicyServer.
    • The PolicyServer computes the first action chunk, $\mathbf{A}_0$, using the policy $\pi(o_0)$, and sends it back to the RobotClient.
  • Main Control Loop (Lines 3-17):

    • Action Execution (Lines 4-5): At each timestep $t$, the robot takes the first action $a_t$ from the current action queue $\mathbf{A}_t$ (using PopFront, which extracts and removes the first element), and then executes it.

    • Triggering New Prediction (Lines 6-13):

      • Condition Check (Line 6): The core of the asynchronous mechanism. If the number of remaining actions in the current queue $\mathbf{A}_t$ falls below a certain fraction $g$ of the original chunk size $n$ (i.e., $\frac{|\mathbf{A}_t|}{n} < g$), it indicates that a new action chunk might be needed soon.
      • Capture Observation (Line 7): A new observation, $o_{t+1}$, is captured from the environment.
      • Similarity Filter (Line 8): NeedsProcessing($o_{t+1}$) is a check (e.g., comparing observations in joint space) to avoid sending redundant or nearly identical observations to the server, saving computational resources. If the observation is sufficiently different, or if the queue is about to become empty regardless of similarity, processing is triggered.
      • Asynchronous Inference Call (Line 9): A non-blocking call AsyncInfer($o_{t+1}$) is made to the PolicyServer to start computing a new action chunk $\tilde{\mathbf{A}}_{t+1}$ based on $o_{t+1}$. This operation runs in the background, and a handle async_handle is returned to track its completion.
      • New Queue Prediction (Line 10): The new chunk $\tilde{\mathbf{A}}_{t+1}$ is predicted by the policy $\pi(o_{t+1})$.
      • Aggregate Overlaps (Line 11): If there are overlapping timesteps between the currently executing chunk $\mathbf{A}_t$ and the newly predicted chunk $\tilde{\mathbf{A}}_{t+1}$, a function $f(\cdot)$ aggregates them. This helps smooth transitions and combine information from both chunks.
    • Handling Uncompleted Inference (Lines 14-16): If the asynchronous prediction of $\tilde{\mathbf{A}}_{t+1}$ is not yet completed (NotCompleted(async_handle)) when the RobotClient needs a new chunk, the robot continues to use the existing queue $\mathbf{A}_t$. This prevents the robot from becoming idle.

      The following figure (Figure 3 from the original paper) illustrates the action queue size evolution at runtime for various levels of $g$:

      Figure 3 | Action queue size evolution at runtime for various levels of $g$: (A) without filtering out observations based on joint-space similarity, and (B) with the observation similarity filter applied.

  • Analytical Study of Asynchronous Inference:

    • Let $\ell$ be the random variable modeling the time needed to receive an action chunk $\mathbf{A}$ after sending an observation $o$.
    • $\ell$ comprises: (i) the time to send $o$ from the RobotClient to the PolicyServer ($t_{C \to S}$), (ii) the inference latency on the PolicyServer ($\ell_S$), and (iii) the time to send $\mathbf{A}$ from the PolicyServer back to the RobotClient ($t_{S \to C}$).
    • Assuming independence and negligible communication time: $\mathbb{E}[\ell] \simeq \mathbb{E}[\ell_S]$.
    • Let $\Delta t$ be the environment's control cycle (e.g., roughly 33 ms at 30 frames per second).
    • To avoid exhausted queues, $\mathbb{E}[\ell_S] < n \cdot \Delta t$ must hold, where $n$ is the chunk size.
  • Role of Threshold $g$: $g$ dictates when a new observation is sent for processing.

    • Sequential Limit ($g = 0$): The client waits for the entire chunk to be drained before requesting a new one. This reproduces fully sequential deployment, leading to $\mathbb{E}[\ell_S]$ idle seconds per chunk, as shown in Figure 3(A).
    • Asynchronous Inference ($g = 0.7$): The client triggers a new inference once the remaining queue drops below a fraction $g$ of the chunk size, i.e., after $1 - g = 0.3$ of the chunk has been consumed. This amortizes computation and keeps the queue from emptying; the new chunk is aggregated with the overlapping portion of the old one.
    • Compute-Intensive Limit ($g = 1$): A new inference is triggered at every timestep. This is maximally reactive but incurs one forward pass per control tick, making it computationally very expensive. If $\Delta t / \mathbb{E}[\ell_S] < 1$, the queue will eventually deplete because prediction cannot keep up with execution.
  • Observation Similarity Filter: Observations are compared in joint space. If they are sufficiently similar (below a threshold $\epsilon$), processing is skipped to save resources. However, if the queue becomes empty, the most recent observation is processed regardless of similarity (Figure 3(B)). This balances efficiency with ensuring the robot always has actions to execute.
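
For concreteness: with the chunk size of $n = 50$ used in this paper and a 30 Hz control loop ($\Delta t \approx 33$ ms), the condition $\mathbb{E}[\ell_S] < n \cdot \Delta t$ means the server's average latency must stay below roughly 1.65 s for the queue never to empty. The client-side loop of Algorithm 1 can be sketched as follows; threading, queue handling, the similarity filter, and names such as `policy_server.infer` are illustrative placeholders rather than the LeRobot API, and the overlap-aggregation step is simplified to appending:

```python
import threading
from collections import deque

import numpy as np

def joint_distance(obs_a, obs_b):
    """Placeholder joint-space similarity measure between two observations."""
    return float(np.linalg.norm(np.asarray(obs_a["state"]) - np.asarray(obs_b["state"])))

def async_control_loop(robot, policy_server, horizon, chunk_size, g=0.7, eps=1e-2):
    """Keep executing actions from a queue while new chunks are computed remotely."""
    queue = deque(policy_server.infer(robot.capture_observation()))  # initial chunk A_0
    worker, last_obs = None, None

    for _ in range(horizon):
        while not queue:                                  # queue exhausted: wait or re-request
            if worker is not None:
                worker.join()                             # block until the pending chunk arrives
                worker = None
            else:
                queue.extend(policy_server.infer(robot.capture_observation()))
        robot.execute(queue.popleft())                    # a_t <- PopFront(A_t); execute it

        if worker is not None and not worker.is_alive():
            worker = None                                 # previous request finished; clear handle

        if len(queue) / chunk_size < g and worker is None:   # queue below threshold g
            obs = robot.capture_observation()
            # Joint-space similarity filter; always process if the queue is (almost) empty.
            if not queue or last_obs is None or joint_distance(obs, last_obs) > eps:
                last_obs = obs

                def _predict(o=obs):
                    # The paper aggregates overlapping timesteps; this sketch simply appends.
                    queue.extend(policy_server.infer(o))

                worker = threading.Thread(target=_predict)
                worker.start()                            # non-blocking chunk prediction
```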

5. Experimental Setup

5.1. Datasets

SmolVLA is evaluated on a combination of simulated and real-world robotic manipulation tasks.

  • Pretraining Data:

    • Source: A subset of 481 community datasets obtained from Hugging Face.
    • Scale: Approximately 22,900 episodes and 10.6 million frames. This is an order of magnitude smaller than datasets used by prior large VLA models (e.g., OpenVLA uses ~1 million trajectories).
    • Characteristics: These datasets are "noisy" in terms of task annotations and camera naming conventions, reflecting real-world heterogeneity. They were filtered based on modality type, episode count, overall data quality, and frame count.
    • Selection Rationale: Chosen to demonstrate SmolVLA's ability to learn from affordable, open-source, and diverse community contributions, rather than relying on curated, resource-intensive datasets.
  • Simulation Environments (for Evaluation):

    • LIBERO (Li et al., 2023a):
      • Description: A benchmark assessing diverse visuomotor skills across four categories: Spatial, Object, Goal, and Long, with 10 tasks per category (40 tasks total).
      • Dataset Used: A dataset (Kim et al., 2024; Pertsch et al., 2025) containing 1,693 episodes covering 40 tasks.
      • Evaluation: 10 trials per task, reporting average success rates based on binary completion criteria.
      • Robot: Franka Emika Panda robot.
    • Meta-World (Yu et al., 2020):
      • Description: Evaluates generalization across 50 tasks of varying difficulty: easy, medium, hard, and very hard (Seo et al., 2023).
      • Dataset Used: A new dataset collected by the authors, comprising 50 demonstrations for each of the 50 tasks, totaling 2,500 episodes.
      • Evaluation: Assesses success rates based on whether the task is completed.
      • Robot: Sawyer robot (simulator).
  • Real-World Tasks (for Evaluation):

    • SO-100 Robot Arm: Three datasets collected using the SO-100 robot arm (Knight et al., 2022). Each contains more than 50 demonstrations.

      • Pick-Place: Robot picks up a cube and places it in a box.
      • Stacking: Robot picks up a red cube and places it on top of a blue cube.
      • Sorting: Robot sorts cubes by color (red in right box, blue in left box). This is a longer-horizon task.
    • SO-101 Robot Arm: One dataset collected using the SO-101 arm (Knight et al., 2022) for the Pick-Place-Lego task. More than 50 demonstrations.

      • Pick-Place-Lego: Robot picks up a small Lego brick and places it into a transparent box. This task requires high precision and advanced vision due to transparency. SmolVLA is not pretrained on any datasets recorded for the SO101.
    • Data Description: The datasets record trajectories relative to robot joint positions, gripper state, and camera images.

    • Source: These datasets are open-sourced on Hugging Face.

      The following figure (Figure 4 from the original paper) shows the visual setup for real-world tasks with SO100 and SO101 robots:

      Figure 4 | Visual setup for the real-world tasks on the SO-100 and SO-101 robots, showing initial and final frames for the pick-place, stacking, and sorting tasks.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a complete explanation:

5.2.1. Success Rate (%) (SR)

  • Conceptual Definition: Success rate is a common metric in robotics and reinforcement learning that quantifies the percentage of trials or episodes in which an agent successfully completes a predefined task. It directly measures the agent's ability to achieve its objective.
  • Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  • Symbol Explanation:
    • Number of Successful Trials\text{Number of Successful Trials}: The count of individual attempts or episodes where the robot achieved the task's completion criteria.

    • Total Number of Trials\text{Total Number of Trials}: The total count of all attempts or episodes conducted for the given task.

      Specific Scoring for Real-world Tasks:

  • Pick-Place: Assessed with a fine-grained score: 0 for failure, 0.5 for successfully grasping the cube, and 1 for successfully placing it into the box.
  • Stacking: Assessed with a fine-grained score: 0 for failure, 0.5 for successfully grasping the red cube, and 1 for successfully placing it on top of the blue cube.
  • Sorting: The paper doesn't detail a fine-grained score for sorting but uses a binary completion criterion, similar to simulation benchmarks.
  • Pick-Place-Lego: Binary completion (0 or 1) indicating whether the Lego brick was successfully placed in the box.
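
As a small worked example of the fine-grained scoring, the reported percentage is simply the mean per-trial score (the trial outcomes below are hypothetical):

```python
def pick_place_score(grasped: bool, placed: bool) -> float:
    """Fine-grained score used for Pick-Place and Stacking: 0, 0.5 (grasped), or 1 (placed)."""
    return 1.0 if placed else (0.5 if grasped else 0.0)

# Four hypothetical trials: (grasped, placed)
trials = [(True, True), (True, False), (False, False), (True, True)]
success_rate = 100 * sum(pick_place_score(g, p) for g, p in trials) / len(trials)
print(success_rate)  # 62.5 (%)
```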

5.2.2. Task Completion Time (s)

  • Conceptual Definition: This metric measures the duration, in seconds, taken by the robot to complete a specific task from its start to its successful conclusion. It's a direct indicator of the efficiency and speed of the robot's policy.
  • Mathematical Formula: The paper reports Total (sum of times for multiple trials), Avg (average time per trial), and Std (standard deviation of time per trial). No specific single formula for this aggregated metric is provided, but it's the elapsed real-world time.
  • Symbol Explanation:
    • Total: Sum of completion times across all successful trials.
    • Avg: Average completion time per successful trial.
    • Std: Standard deviation of completion times across successful trials.

5.2.3. Performance in Fixed Time (# of Cubes)

  • Conceptual Definition: This metric assesses the robot's throughput or productivity by counting how many sub-tasks (e.g., picking and placing a cube) it can successfully complete within a predetermined fixed time window. It highlights the policy's sustained efficiency and speed under time constraints.
  • Mathematical Formula: The paper reports Total (sum of cubes across trials), Avg (average cubes per trial), and Std (standard deviation of cubes per trial). This is a count-based metric within a fixed duration.
  • Symbol Explanation:
    • Total: Sum of cubes successfully manipulated across all trials within the fixed time.
    • Avg: Average number of cubes manipulated per trial within the fixed time.
    • Std: Standard deviation of cubes manipulated per trial within the fixed time.

5.3. Baselines

The paper compares SmolVLA against several popular and strong baseline models available in the LeRobot library (Cadene et al., 2024):

  • $\pi_0$ (Black et al., 2024):

    • Description: A VLA model that combines a VLM backbone with Flow Matching for action chunk prediction. It takes observations (RGB images from multiple cameras), proprioceptive states, and a language instruction as inputs.
    • Parameters: Has a total model size of 3.3 billion parameters (a 3.5B variant is also mentioned).
    • Pretraining: Pre-trained on 10,000 hours of cross-embodiment robotics data. Variants are tested: one initialized from Paligemma-3B (a VLM) without robotics pretraining, and another with robotics pretraining (weights released by authors).
    • Why Representative: It's a state-of-the-art VLA that also uses Flow Matching for continuous action generation, making it a strong and directly comparable baseline for SmolVLA's core methodology. Its large size provides a contrast to SmolVLA's efficiency goals.
  • ACT (Zhao et al., 2023):

    • Description: A Conditional Variational Autoencoder (CVAE) policy model featuring an encoder-decoder transformer architecture. It uses a ResNet vision encoder pretrained on ImageNet, while the CVAE itself is trained from scratch. It generates action chunks and is optimized using a regression objective to directly predict continuous actions.
    • Parameters: Approximately 80 million parameters.
    • Why Representative: A well-established and efficient imitation learning baseline for continuous control in robotics, representing a different architectural choice compared to VLM-based VLAs.
  • Diffusion Policy (Khazatsky et al., 2024; Chi et al., 2023):

    • Description: A general approach to visuomotor policy learning via action diffusion. It uses a diffusion model to generate actions.
    • Why Representative: Diffusion models are a popular and strong class of generative models recently adapted for robot control, providing another benchmark for SmolVLA's Flow Matching-based action expert.
  • Octo (Team et al., 2024):

    • Description: An open-source generalist robot policy developed by a large collaboration. It's a VLA system pretrained on diverse robotics datasets.
    • Parameters: 0.09 billion parameters (a relatively smaller VLA variant).
    • Why Representative: Represents efforts towards generalist robot policies and is a common benchmark in simulation.
  • OpenVLA (Kim et al., 2024):

    • Description: A 7-billion parameter VLA trained on publicly available data, designed to generate discrete action tokens.
    • Why Representative: A large-scale, open VLA that SmolVLA aims to compete with in performance while being significantly smaller.
  • TinyVLA (Zhou et al., 2024):

    • Description: As its name suggests, another small-scale VLA model that aims for efficiency.
    • Why Representative: A direct comparison for SmolVLA in terms of small model size and performance on benchmarks like Meta-World.

5.4. Implementation Details

  • Framework: All experiments are conducted using the LeRobot (Cadene et al., 2024) PyTorch-based framework, designed for real-world robotics.
  • Pretraining:
    • Steps: Trained for 60,000 steps.
    • Batch Size: Global batch size of 64.
    • Learning Rate Schedule: Cosine learning rate schedule starting at 1e-4 and decaying to 2.5e-5 after a 100-step warmup.
    • Optimizer: AdamW with β1=0.9\beta_1 = 0.9 and β2=0.95\beta_2 = 0.95.
    • Image Resizing: Images are resized to 512x512 for consistency with the VLM input size.
    • VLM Backbone: SmolVLM-2 (Marafioti et al., 2025) is used, with the VLM layers frozen during training of the action expert.
    • Action Expert: Trained with Flow Matching to output chunks of $n = 50$ actions.
  • Model Size: The main SmolVLA model contains 450 million parameters, with approximately 100 million dedicated to the action expert. Other variants with 0.24B and 2.25B parameters are also explored.
  • Efficiency Optimizations:
    • Mixed Precision: bfloat16 precision is utilized.
    • JIT Compilation: torch.compile() (Paszke, 2019) is used to JIT-compile PyTorch code into optimized kernels.
    • Sequence Length and Batch Size Management: To remain compatible with these optimizations, the sequence length and batch size are kept fixed, with excess frames discarded.
    • Multi-GPU Training: Hugging Face Accelerate library is used for distributed training.
    • Compute Cost: Pretraining was conducted using 4 GPUs and consumed approximately 30,000 GPU hours.
  • Inference Modes:
    • Real-world Evaluation: Asynchronous inference is performed, where the model samples new observations and predicts action chunks at predetermined thresholds.
    • Simulation Evaluation: Synchronous inference is performed, where a new action is predicted after each executed action, for a more reactive control loop.
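
The optimization recipe above corresponds roughly to the following PyTorch configuration. This is a hedged sketch: `model` is assumed to return the flow matching loss for a batch, `dataloader` is a placeholder, and the exact warmup/scheduler composition is an assumption:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def train(model, dataloader, total_steps=60_000, warmup_steps=100):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))
    scheduler = SequentialLR(
        optimizer,
        schedulers=[
            LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),                # 100-step warmup
            CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=2.5e-5),  # decay to 2.5e-5
        ],
        milestones=[warmup_steps],
    )
    for step, batch in enumerate(dataloader):  # global batch size 64 across GPUs
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bfloat16 mixed precision
            loss = model(batch)                # flow matching loss over chunks of n = 50 actions
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
        if step + 1 == total_steps:
            break
```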

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that SmolVLA achieves competitive, and often superior, performance compared to significantly larger VLA models, while operating at a fraction of their computational cost.

Simulation Evaluation

The following are the results from Table 2 of the original paper:

Success Rate (%) - Simulation

LIBERO:

| Policy (# Params) | VLA Pt. | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| Diffusion Policy (Khazatsky et al., 2024) | No | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Octo (0.09B) (Team et al., 2024) | Yes | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA (7B) (Kim et al., 2024) | Yes | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| $\pi_0$ (Paligemma-3B) | No | 87 | 63 | 89 | 48 | 71.8 |
| $\pi_0$ (3.3B) | Yes | 90 | 86 | 95 | 73 | 86.0 |
| SmolVLA (0.24B) | No | 87 | 93 | 88 | 63 | 82.75 |
| SmolVLA (0.45B) | No | 90 | 96 | 92 | 71 | 87.3 |
| SmolVLA (2.25B) | No | 93 | 94 | 91 | 77 | 88.75 |

Meta-World (columns correspond to the easy, medium, hard, and very hard task groups):

| Policy (# Params) | VLA Pt. | Easy | Medium | Hard | Very Hard | Avg. |
|---|---|---|---|---|---|---|
| Diffusion Policy (Chi et al., 2023) | No | 23.1 | 10.7 | 1.9 | 6.1 | 10.5 |
| TinyVLA (Zhou et al., 2024) | No | 77.6 | 21.5 | 11.4 | 15.8 | 31.6 |
| $\pi_0$ (3.5B-Paligemma) | No | 80.4 | 40.9 | 36.7 | 44.0 | 50.5 |
| $\pi_0$ (3.5B) | Yes | 71.8 | 48.2 | 41.7 | 30.0 | 47.9 |
| SmolVLA (0.24B) | No | 86.43 | 46.36 | 35 | 60 | 56.95 |
| SmolVLA (0.45B) | No | 82.5 | 41.8 | 45.0 | 60.0 | 57.3 |
| SmolVLA (2.25B) | No | 87.14 | 51.82 | 70 | 64 | 68.24 |

  • LIBERO Benchmark:

    • SmolVLA (0.45B) achieves an average success rate of 87.3%, which is higher than Diffusion Policy (72.4%), Octo (0.09B, 75.1%), OpenVLA (7B, 76.5%), and $\pi_0$ (Paligemma-3B, 71.8%).
    • It performs almost on par with $\pi_0$ (3.3B, 86.0%), a model approximately 7 times larger and pretrained on extensive robotics data.
    • The SmolVLA (2.25B) variant further improves to 88.75%, demonstrating scalability with size.
    • Notably, SmolVLA is not pretrained on robotics data, yet its performance is competitive. It is also reported to be 40% faster to train and to consume 6x less memory than $\pi_0$.
  • Meta-World Benchmark:

    • SmolVLA (0.45B) achieves an average success rate of 57.3%, significantly outperforming Diffusion Policy (10.5%), TinyVLA (31.6%), and both $\pi_0$ variants (Paligemma-3.5B at 50.5% and the robotics-pretrained 3.5B at 47.9%).
    • The SmolVLA (2.25B) variant achieves the highest average success rate at 68.24%.
    • This highlights SmolVLA's strong generalization capabilities across diverse tasks, especially in more challenging environments.

Real-World Evaluation

The following are the results from Table 3 of the original paper:

Success Rate (%) - Real World (SO-100)

| Policy | Pick-Place | Stacking | Sorting | Avg. |
|---|---|---|---|---|
| *Single-task training* | | | | |
| ACT | 70 | 50 | 25 | 48.3 |
| *Multi-task training* | | | | |
| $\pi_0$ (3.5B) | 100 | 40 | 45 | 61.7 |
| SmolVLA (0.45B) | 75 | 90 | 70 | 78.3 |

  • SO-100 Robot (Pick-Place, Stacking, Sorting):

    • SmolVLA (0.45B) achieves an average success rate of 78.3% in a multi-task setting.
    • This outperforms ACT (48.3%), which is trained individually on each task, and $\pi_0$ (3.5B, 61.7%), a significantly larger model.

    • SmolVLA shows particularly strong performance on Stacking (90%) and Sorting (70%), indicating robust control for complex and longer-horizon tasks.

      The following are the results from Table 4 of the original paper:

Success Rate (%) - Real World (SO-101)

| Policy | In Distribution | Out of Distribution |
|---|---|---|
| *Single-task training* | | |
| ACT | 70 | 40 |
| SmolVLA (0.45B) | 90 | 50 |

  • SO101 Robot (Pick-Place-Lego):

    • SmolVLA (0.45B) surpasses ACT in both in-distribution (90% vs 70%) and out-of-distribution (50% vs 40%) settings.
    • This demonstrates SmolVLA's ability to generalize to a different robot embodiment and a task requiring higher precision, even with unseen object placements.

Effect of Pretraining and Multitask Learning

The following are the results from Table 5 of the original paper:

Success Rate (%) - Real World

| Policy | VLA Pt. | Pick-Place | Stacking | Sorting | Avg. |
|---|---|---|---|---|---|
| *Single-task training* | | | | | |
| SmolVLA (0.45B) | No | 55 | 45 | 20 | 40 |
| *Multi-task training* | | | | | |
| SmolVLA (0.45B) | No | 80 | 40 | 35 | 51.7 |
| SmolVLA (0.45B) | Yes | 75 | 90 | 70 | 78.3 |

  • Pretraining on community datasets (indicated by VLA pt. Yes) significantly boosts SmolVLA's performance, increasing the average success rate from 51.7% (multi-task training without VLA pretraining) to 78.3%. This highlights the substantial knowledge transfer benefit from pretraining.
  • Multi-task finetuning also yields gains (51.7% vs 40% for single-task training without VLA pretraining), underscoring the importance of knowledge transfer across tasks for improved generalization.

6.2. Asynchronous Inference

The asynchronous inference strategy is evaluated for its impact on performance and speed.

The following are the results from Figure 5(a), (b), (c) of the original paper, presented as tables:

(a) | Performance (success rates).

Success Rate (%) - Real World

| Inference | Pick-Place | Stacking | Sorting | Avg. |
|---|---|---|---|---|
| Sync | 75 | 90 | 70 | 78.3 |
| Async | 80 | 90 | 50 | 73.3 |

(b) | Task completion time.

Time (s) - Real World

| Inference | Total | Avg | Std |
|---|---|---|---|
| Sync | 137.5 | 13.75 | 2.42 |
| Async | 97.0 | 9.70 | 2.95 |

(c) | Performance in fixed time.

# of Cubes - Real World

| Inference | Total | Avg | Std |
|---|---|---|---|
| Sync | 9 | 1.8 | 0.45 |
| Async | 19 | 3.8 | 1.3 |

  • Performance (Success Rates): Synchronous (Sync) inference achieves a slightly higher average success rate (78.3%) than asynchronous (Async) inference (73.3%). This suggests that while Async is faster, the Sync mode might offer a slight edge in control accuracy for some tasks where immediate, fresh predictions are critical. However, for Pick-Place, Async actually performs better (80% vs. 75%), while the Sorting task shows the biggest drop under Async (50% vs. 70%).
  • Task Completion Time: Asynchronous inference demonstrates a substantial speed advantage. It completes tasks on average in 9.7 seconds, which is approximately 30% faster than the synchronous setting (13.75 seconds). This confirms the efficiency benefits of decoupling prediction from execution.
  • Performance in Fixed Time: In a fixed-time evaluation (presumably for the Pick-Place task, based on cube count), Async mode allows the robot to complete a significantly larger number of actions (3.8 cubes on average vs. 1.8 for Sync). This indicates that the Async mode, by avoiding prediction lags, enables the robot to remain active and execute more tasks within the same time frame. The higher standard deviation for Async in this metric suggests more variability in behavior, possibly due to the interaction of the threshold $g$ and observation filtering.

6.3. Ablation Studies / Parameter Analysis

A comprehensive ablation study was conducted on the LIBERO benchmark to assess the impact of key design choices.

6.3.1. Cross-attention (CA) vs. Self-attention (SA) between VLM and vθ

The following are the results from Table 6 of the original paper:

Success Rate (%) - LIBERO (S = Spatial, O = Object, G = Goal, 10 = Long):

| Attention mechanism | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| CA | 87 | 92 | 83 | 54 | 79.0 |
| SA | 80 | 94 | 84 | 40 | 74.5 |
| CA+SA (ours) | 86 | 99 | 90 | 67 | 85.5 |
  • Cross-attention (CA) (79.0% Avg) outperforms Self-attention (SA) (74.5% Avg) for the interaction between VLM features and the action expert.
  • The interleaved CA+SA approach (ours) yields the best results with an average of 85.5%, demonstrating that both mechanisms offer complementary strengths. CA likely provides effective conditioning on global visual context, while SA allows for robust internal action sequence modeling.
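
As an illustration of how cross-attention to VLM features can be interleaved with self-attention over action tokens, here is a hedged PyTorch sketch; the dimensions, normalization placement, and layer layout are assumptions and do not reproduce SmolVLA's exact expert block.

```python
# Sketch of an action-expert block interleaving cross-attention to VLM
# features with self-attention over action tokens (illustrative only).
import torch
import torch.nn as nn

class InterleavedExpertBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, actions, vlm_features, causal_mask=None):
        # Cross-attention: action tokens read from the cached VLM features.
        q = self.norm1(actions)
        actions = actions + self.cross_attn(q, vlm_features, vlm_features)[0]
        # Self-attention over action tokens; a causal mask can be passed in
        # so each token only sees earlier actions (see the next ablation).
        x = self.norm2(actions)
        actions = actions + self.self_attn(x, x, x, attn_mask=causal_mask)[0]
        return actions

actions = torch.randn(2, 50, 512)        # a batch of 50-step action chunks
vlm_features = torch.randn(2, 128, 512)  # projected VLM tokens (vision + language)
block = InterleavedExpertBlock()
print(block(actions, vlm_features).shape)  # torch.Size([2, 50, 512])
```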

6.3.2. Causal vs. Bidirectional Attention on Action Tokens within vθ

The following are the results from Table 7 of the original paper:

Success Rate (%) - LIBERO:

| Attention mask | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| Bidir | 79 | 86 | 82 | 23 | 67.5 |
| Causal | 80 | 94 | 84 | 40 | 74.5 |
  • Using a causal attention mask (74.5% Avg) on action tokens within the action expert significantly outperforms bidirectional attention (67.5% Avg).
  • This suggests that preventing future action leakage (i.e., making predictions only based on past actions) is crucial for improved performance in sequential action generation tasks.
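
For completeness, such a causal mask can be built as an upper-triangular boolean matrix and passed as `attn_mask` to the self-attention call in the sketch above (again, purely illustrative):

```python
# Boolean causal mask for n action tokens: True entries are masked out, so
# token t can only attend to tokens 0..t (no future-action leakage).
import torch

n = 50  # action chunk length
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
# e.g. block(actions, vlm_features, causal_mask=causal_mask)
```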

6.3.3. Usage of LLM layers in VLM

The following are the results from Table 8 of the original paper:

Success Rate (%) - LIBERO:

| N | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| 8 | 77 | 88 | 86 | 49 | 75.0 |
| 16 | 88 | 91 | 91 | 44 | 78.5 |
| 24 | 86 | 97 | 86 | 49 | 79.5 |
| 32 | 89 | 94 | 85 | 53 | 80.3 |
| Skip %2 | 84 | 90 | 83 | 45 | 75.5 |
| VLM-256M | 86 | 83 | 75 | 59 | 75.8 |
  • The table investigates the effect of skipping VLM layers by using features from only the first N layers or by skipping every second layer (Skip %2).
  • Using features from the first N=32 layers (80.3% Avg) yields the best performance among the listed options for layer skipping. This supports the idea that deeper layers might not always be necessary for optimal task performance, offering a good speed/performance trade-off.
  • Skipping every second layer (Skip %2, 75.5% Avg) is a competitive baseline but performs worse than using the first N layers directly.
  • Training a smaller VLM (VLM-256M, 75.8% Avg) performs worse than skipping layers from a larger VLM (e.g., the 500M-parameter model used for the N-layer rows), suggesting that pruning a large model is more effective than training a small one from scratch for this task.
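
To illustrate the layer-skipping idea, here is a minimal sketch that keeps only the first N blocks of a transformer stack; the toy `TransformerEncoderLayer` stack is a stand-in for the VLM's language model and does not reflect how SmolVLA's code is actually organized.

```python
# Sketch: keep only the first N transformer layers of the VLM backbone and
# use that hidden state to condition the action expert. The toy encoder
# stack is a placeholder; a real backbone would be loaded pretrained.
import torch
import torch.nn as nn

class TruncatedBackbone(nn.Module):
    def __init__(self, layers: nn.ModuleList, n_keep: int):
        super().__init__()
        self.layers = layers[:n_keep]  # deeper layers are simply never run

    def forward(self, hidden):
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden  # features from layer N, fed to the action expert

full_stack = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    for _ in range(32)
)
backbone = TruncatedBackbone(full_stack, n_keep=16)  # e.g. N = 16 of 32 layers
print(backbone(torch.randn(2, 64, 512)).shape)       # torch.Size([2, 64, 512])
```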

6.3.4. Action Expert Capacity

The following are the results from Table 9 of the original paper:

Success Rate (%) - LIBERO:

| Expert width (w.r.t. VLM) | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| ×1.00 | 87 | 96 | 90 | 56 | 82.3 |
| ×0.75 | 82 | 89 | 84 | 55 | 77.5 |
| ×0.50 | 89 | 94 | 85 | 53 | 80.3 |
| ×0.25 | 76 | 97 | 83 | 39 | 73.8 |
  • Adjusting the hidden size (width) of the action expert relative to the VLM's hidden dimension d affects performance.
  • An expert width of ×1.00 (matching the VLM dimension) achieves the highest average success rate of 82.3%.
  • However, reducing the expert's hidden size to ×0.75 (77.5% Avg) or even ×0.50 (80.3% Avg) still yields competitive results, indicating that a good balance between performance and efficiency can be struck at smaller capacities. The initial choice of ×0.75 for efficiency is a reasonable trade-off.

6.3.5. Regression vs. Flow Matching Training Objective

The following are the results from Table 10 of the original paper:

Success Rate (%) - LIBERO:

| Training objective | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| Flow matching | 89 | 94 | 85 | 53 | 80.25 |
| Regression | 92 | 85 | 86 | 38 | 75.25 |
  • Flow Matching (80.25% Avg) significantly outperforms a standard L1 regression loss (75.25% Avg) for training the action expert.
  • This suggests that Flow Matching provides a better inductive bias for modeling complex, multimodal action distributions, leading to more robust and accurate action predictions. This finding aligns with prior work like Black et al. (2024) and Chi et al. (2024).
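
To make the two objectives concrete, here is a hedged sketch of a flow-matching training step under a simple linear interpolation path, with the L1 regression baseline noted in a comment. `expert` is a placeholder, and the exact parameterization in the paper (time sampling, path, velocity target) may differ.

```python
# Sketch of a flow-matching training step under a simple linear path; the
# expert is a placeholder mapping (noisy actions, time, conditioning) to a
# predicted velocity.
import torch
import torch.nn.functional as F

def flow_matching_loss(expert, actions, cond):
    noise = torch.randn_like(actions)            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)       # per-sample time in [0, 1]
    x_t = (1 - t) * noise + t * actions          # point on the linear path
    target_velocity = actions - noise            # constant velocity of that path
    return F.mse_loss(expert(x_t, t, cond), target_velocity)

# The L1 regression baseline of Table 10 would instead predict the chunk
# directly, e.g. F.l1_loss(expert(zeros, ones, cond), actions).

expert = lambda x_t, t, cond: x_t                # toy stand-in so the sketch runs
actions = torch.randn(4, 50, 7)                  # (batch, chunk length, action dim)
cond = torch.randn(4, 128, 512)                  # VLM conditioning features
print(flow_matching_loss(expert, actions, cond).item())
```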

6.3.6. States as Prefix vs. Suffix

The following are the results from Table 11 of the original paper:

Success Rate (%) - LIBERO:

| States | Attention | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Prefix | CA | 89 | 94 | 85 | 53 | 80.3 |
| Suffix | CA | 86 | 82 | 78 | 47 | 73.3 |
| Prefix | SA | 62 | 74 | 57 | 20 | 53.3 |
| Suffix | SA | 80 | 92 | 80 | 47 | 74.8 |
  • Feeding state information as a prefix to the VLM (before visual and language tokens) generally leads to better performance than feeding it as a suffix (to the action expert).
  • For the CA (cross-attention) variant, Prefix (80.3% Avg) is much better than Suffix (73.3% Avg).
  • However, for the SA (self-attention) variant, Suffix (74.8% Avg) actually outperforms Prefix (53.3% Avg). This suggests that the optimal placement of state information interacts with the specific attention mechanism used. Since SmolVLA relies primarily on CA to condition the expert, feeding states as a prefix to the VLM is the chosen strategy; the two placements are sketched below.
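
The two placements compared in Table 11 can be visualized with a small tensor sketch; the single state token and all shapes are illustrative assumptions rather than SmolVLA's actual tokenization.

```python
# Sketch of the two state placements: prefix feeds the state token to the
# VLM input sequence; suffix appends it to the action expert's input.
import torch

state_tok = torch.randn(2, 1, 512)    # projected proprioceptive state
image_tok = torch.randn(2, 64, 512)   # visual tokens
lang_tok = torch.randn(2, 16, 512)    # instruction tokens
action_tok = torch.randn(2, 50, 512)  # noisy action tokens for the expert

# Prefix: VLM sees [state, images, language]; the expert sees only actions.
vlm_in_prefix = torch.cat([state_tok, image_tok, lang_tok], dim=1)
expert_in_prefix = action_tok

# Suffix: VLM sees [images, language]; the state joins the expert's input.
vlm_in_suffix = torch.cat([image_tok, lang_tok], dim=1)
expert_in_suffix = torch.cat([state_tok, action_tok], dim=1)
print(vlm_in_prefix.shape, expert_in_suffix.shape)
```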

6.3.7. Action Chunk Size, n

The following are the results from Table 12 of the original paper:

Success Rate (%) - LIBERO:

| Chunk size | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| 1 | 45 | 77 | 54 | 24 | 50.0 |
| 10 | 90 | 94 | 94 | 58 | 84.0 |
| 30 | 85 | 94 | 87 | 48 | 78.5 |
| 50 | 89 | 94 | 85 | 53 | 80.3 |
| 100 | 83 | 88 | 85 | 42 | 74.5 |
  • The chunk size (the number of actions n in a predicted sequence) has a significant impact on performance.
  • Very small (n=1, 50.0% Avg) and very large (n=100, 74.5% Avg) chunk sizes degrade performance. A chunk size of 1 severely limits future planning and leads to poor performance.
  • Chunk sizes between 10 and 50 provide a good balance, with n=10 achieving the highest average success rate of 84.0%. This suggests that an intermediate planning horizon is optimal for LIBERO tasks, balancing reactivity with coherent action sequences.

6.3.8. Action Execution Steps (Observation Update Frequency)

The following are the results from Table 13 of the original paper:

Success Rate (%) - LIBERO:

| Action steps | S | O | G | 10 | Avg |
| --- | --- | --- | --- | --- | --- |
| 1 | 89 | 94 | 85 | 53 | 80.3 |
| 10 | 89 | 94 | 91 | 57 | 82.8 |
| 30 | 76 | 91 | 74 | 42 | 70.8 |
| 50 | 54 | 70 | 58 | 25 | 51.8 |
  • This ablation investigates how frequently new observations are sampled and used to predict new action chunks, i.e., how many actions are executed before a new observation is processed.
  • Sampling new observations more frequently (e.g., every 1 or 10 steps) significantly improves performance compared to less frequent updates (e.g., every 30 or 50 steps).
  • Updating observations every 10 steps achieves the highest average success rate of 82.8%. This highlights a critical trade-off: while less frequent updates might seem more efficient by reducing inference calls, they lead to stale observations and degraded control accuracy. More frequent updates allow the robot to react better to environmental changes. A sketch of this control loop, combining chunk size and execution steps, follows below.
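
A hedged sketch of the synchronous control loop behind Tables 12 and 13, parameterized by the chunk size and the number of actions executed per chunk. `policy` and `env` are placeholders, and the assumed `env.step` return signature is illustrative, not the benchmark's actual API.

```python
# Sketch of the synchronous control loop ablated in Tables 12 and 13:
# predict a chunk of n actions, execute only the first k, then re-observe.
def run_episode(policy, env, chunk_size=50, exec_steps=10, max_steps=400):
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = policy(obs, n=chunk_size)   # predict n future actions
        for action in chunk[:exec_steps]:   # execute only the first k of them
            obs, done = env.step(action)    # assumed (obs, done) return
            steps += 1
            if done or steps >= max_steps:
                return done
        # loop back: take a fresh observation and predict a new chunk
    return False
```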

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces SmolVLA, a compact, efficient, and lightweight Vision-Language-Action (VLA) model designed to make advanced robotics more accessible and affordable. SmolVLA achieves competitive performance, often matching or surpassing much larger VLA counterparts, while being capable of deployment on consumer-grade hardware (GPUs or even CPUs). Key to its efficiency are architectural innovations such as layer skipping in the VLM, using a minimal number of visual tokens, and an interleaved cross-attention and causal self-attention mechanism within its Flow Matching action expert. The model is pretrained on community-contributed datasets, demonstrating effective learning from diverse, open-source data. Furthermore, the proposed asynchronous inference stack enhances real-world responsiveness by decoupling action prediction from execution, leading to faster task completion times. SmolVLA represents a significant step towards democratizing robot learning by lowering computational barriers.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Limited Robot Type in Pretraining: SmolVLA was primarily pretrained on data from a single robot type (SO100). While finetuning to other robots (SO101) was demonstrated, pretraining on multi-robot embodiments is crucial for truly enhancing generalization to new robotic platforms.
  • Dataset Size and Scalability: The current pretraining dataset of approximately 23,000 trajectories is significantly smaller than those used by large VLAs (e.g., OpenVLA's 1 million trajectories). Expanding the dataset size could substantially improve performance and generalization.
  • Architecture Scalability: Although efficient for consumer-grade hardware, exploring ways to scale the SmolVLA architecture further without sacrificing speed or accessibility is an important direction.
  • VLM Backbone Choice: The reliance on an off-the-shelf VLM backbone (SmolVLM-2) pretrained on document reading and OCR tasks might not be optimal for real-world robotic action scenarios. Future work could explore VLM backbones specifically pretrained to align better with robotic environment demands.
  • Multimodal Datasets: Integrating diverse multimodal datasets (e.g., video and audio) could further improve generalization and instruction-following abilities.
  • Task Complexity and Longer Horizon: While effective for relatively simple and short-horizon tasks, scaling SmolVLA to tackle longer-horizon problems remains a challenge. Incorporating hierarchical policies or multi-level planning mechanisms may be necessary.
  • Learning Paradigms Limitation (Imitation Learning vs. Reinforcement Learning): The current approach primarily relies on imitation learning. Exploring reinforcement learning (RL) techniques (Chen et al., 2025) for VLAs could offer significant performance benefits and more dexterous policy adaptation, especially for complex or long-horizon tasks.

7.3. Personal Insights & Critique

SmolVLA makes a highly valuable contribution to the field of robot learning by addressing the critical issue of accessibility. The current trend of increasingly massive foundation models risks centralizing research to only well-funded institutions. SmolVLA's focus on efficiency and deployability on consumer-grade hardware is a direct counter to this, fostering democratization and potentially spurring innovation from a broader community.

The architectural innovations, particularly layer skipping and the interleaved attention in the action expert, are clever ways to prune computational costs without a drastic performance hit. The asynchronous inference stack is a practical and necessary solution for real-world robotics, where latency directly impacts safety and task success. Its analytical study and performance gains in real-world speed are convincing.

Potential Issues/Areas for Improvement:

  1. Generalization to Diverse Robot Morphologies: While the paper acknowledges this limitation, the current approach of pretraining on a single robot type, even with diverse data, might struggle with significant changes in robot kinematics or sensor configurations. Future work could explore robot-agnostic representations or sim-to-real adaptation techniques in conjunction with SmolVLA's efficiency.
  2. Impact of Auto-Generated Annotations: The use of VLMs for task annotation is innovative for handling noisy community data. However, the quality of these auto-generated annotations could have subtle biases or limitations. An ablation on human-verified vs. VLM-generated annotations could further clarify this impact.
  3. Long-Horizon Task Performance: The performance drop in the Sorting task with async inference (from 70% to 50%) suggests that for tasks requiring more continuous monitoring and complex sequential decisions, the async mechanism's thresholds (g and the observation-similarity filtering) might need dynamic tuning or more sophisticated planning mechanisms to maintain accuracy. The trade-off between speed and accuracy will be more pronounced in such scenarios.
  4. Robustness to Real-world Noise: While community datasets introduce noise, real-world deployment faces even greater challenges (lighting variations, unexpected object interactions, sensor failures). SmolVLA's current evaluation on relatively controlled real-world tasks might not fully capture its robustness to extreme conditions.

Transferability and Applications:

  • Resource-Constrained Edge Devices: The architectural design principles (e.g., layer skipping, minimal visual tokens) could be directly applied to other multimodal models deployed on edge devices beyond robotics, such as smart cameras for surveillance or smart home assistants.

  • Real-time Control Systems: The asynchronous inference stack is a broadly applicable design pattern for any real-time system where computationally expensive predictions need to be made without interrupting continuous operation. This could extend to autonomous vehicles, industrial automation, or even real-time human-computer interaction systems.

  • Foundation for Open Robotics: SmolVLA's emphasis on open-source data and reproducible recipes positions it as a strong foundation for future open robotics research, allowing smaller labs and individual researchers to contribute to and benefit from VLA advancements.

    Overall, SmolVLA is a well-executed and thoughtfully designed model that pushes the boundaries of efficient VLA development. Its commitment to open science and practical deployability makes it a highly relevant and inspiring piece of work for the future of robotics.
