Hugging Face Releases SmolVLA: A Compact Vision-Language-Action Model for Affordable and Efficient Robotics
Introduction
Recent advances in robotic control with large-scale vision-language-action (VLA) models have been hampered by high hardware and data requirements. Traditional VLA models often rely on transformer-based architectures with billions of parameters, leading to substantial memory and computational costs. This restricts experimentation to well-funded labs and leaves practitioners working with cost-effective hardware at a disadvantage. Additionally, the proprietary nature of much VLA research and inconsistent methodologies have created barriers to open research. Furthermore, data heterogeneity across robotic platforms complicates generalization and cross-platform learning.
Hugging Face Introduces SmolVLA
Hugging Face has unveiled SmolVLA, a compact vision-language-action model designed to be affordable and efficient to deploy. In contrast to traditional VLA models, SmolVLA is trained exclusively on community-collected datasets and is optimized to run on a single GPU or even on CPU. The architecture combines a trimmed-down version of a pretrained vision-language model (SmolVLM-2) with a transformer-based action expert, enabling effective low-level control from natural language instructions and RGB camera inputs.
Architectural Overview and Design Trade-Offs
SmolVLA is composed of two main components:
- Perception Module (SmolVLM-2): This pretrained compact vision-language encoder processes sequences of RGB images, sensorimotor states, and language instructions. For efficiency, it downsamples visual tokens and uses only the lower half of its transformer layers, based on empirical findings that earlier layers yield more transferable features.
- Action Expert: This lightweight transformer, trained with flow matching, predicts sequences of continuous control actions. It alternates self-attention and cross-attention layers, balancing internal action coherence against conditioning on perception inputs, and applies causal masking to maintain temporal consistency (a minimal sketch of this layout follows the list).
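The released implementation and exact layer configuration live in the SmolVLA repository; the PyTorch snippet below is only a minimal sketch of the ideas named above: a stack that alternates causally masked self-attention over the action chunk with cross-attention to perception features, trained with a simple linear-interpolation flow-matching objective. All module names, dimensions, and the assumption that perception features are already projected to the expert's hidden size are illustrative, not the actual architecture.

```python
# Minimal sketch (not the released implementation) of an action expert that
# alternates causal self-attention over the action chunk with cross-attention
# to perception features, trained with a standard flow-matching objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionExpert(nn.Module):
    def __init__(self, act_dim=7, chunk=50, dim=256, heads=8, depth=6):
        super().__init__()
        self.in_proj = nn.Linear(act_dim + 1, dim)   # noisy action + timestep t
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, act_dim)      # predicted velocity
        self.chunk = chunk

    def forward(self, noisy_actions, t, percept):
        # noisy_actions: (B, chunk, act_dim); t: (B, 1);
        # percept: (B, N, dim) -- assumed already projected to the expert width.
        B, H, _ = noisy_actions.shape
        x = self.in_proj(torch.cat([noisy_actions, t[:, None, :].expand(B, H, 1)], dim=-1))
        causal = torch.triu(torch.ones(H, H, dtype=torch.bool, device=x.device), diagonal=1)
        for sa, ca, mlp in zip(self.self_attn, self.cross_attn, self.mlps):
            x = x + sa(x, x, x, attn_mask=causal, need_weights=False)[0]  # causal self-attn
            x = x + ca(x, percept, percept, need_weights=False)[0]        # condition on perception
            x = x + mlp(self.norm(x))
        return self.out_proj(x)

def flow_matching_loss(expert, actions, percept):
    # Linear-interpolation flow matching: regress the velocity (actions - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), 1, device=actions.device)
    x_t = (1 - t[:, :, None]) * noise + t[:, :, None] * actions
    v_pred = expert(x_t, t, percept)
    return F.mse_loss(v_pred, actions - noise)
```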
To minimize computational demands, linear projections align the token dimensions across modalities. Instead of generating single-step predictions, the model produces action chunks, reducing the frequency of inference calls (a rough illustration follows below). Training leverages bfloat16 precision and Torch's JIT compilation for optimized runtime performance.
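To illustrate why chunking amortizes inference cost, the sketch below queries a placeholder policy once per chunk of actions and then executes the whole chunk before calling it again. The `robot` methods, the chunk length, and the bfloat16 autocast wrapper are assumptions for this example, not the released control stack.

```python
# Sketch of chunked (synchronous) control: one forward pass yields a chunk of
# actions, so the policy is queried once every CHUNK control steps, not every step.
import torch

CHUNK = 50  # hypothetical chunk length

@torch.inference_mode()
def predict_chunk(policy, observation):
    # Mixed precision as described above; the policy could additionally be
    # compiled (e.g. with torch.compile) before use for faster runtime.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return policy(observation)           # -> tensor of shape (CHUNK, act_dim)

def control_loop(policy, robot, steps=500):
    pending = []
    for _ in range(steps):
        if not pending:                       # refill only when the chunk is used up
            obs = robot.get_observation()     # placeholder robot API
            pending = list(predict_chunk(policy, obs))
        robot.send_action(pending.pop(0))     # execute one action per control tick
```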
Empirical Evaluation
SmolVLA has been assessed in both simulation frameworks (LIBERO and Meta-World) and real-world robotic tasks using low-cost SO100 and SO101 platforms. The model was trained from scratch on approximately 23,000 episodes across 481 community datasets, with task labels generated automatically through a vision-language model (VLM). Evaluation metrics focus on task-level success rates in both in-distribution and out-of-distribution conditions.
In the LIBERO benchmark, SmolVLA (0.45B) achieves an average success rate of 87.3%, closely competing with or outperforming larger models such as π₀ (3.3B). In the Meta-World framework, SmolVLA surpasses diffusion policies and smaller VLA models across various task difficulties. These outcomes highlight the model’s efficiency despite its smaller training footprint.
In practical applications, SmolVLA records an average success rate of 78.3% in pick-and-place, stacking, and sorting tasks, outperforming ACT (trained from scratch) and π₀ (fine-tuned). Additionally, SmolVLA demonstrates robust generalization across robotic embodiments, maintaining consistent performance on SO101 despite being exclusively trained on SO100 data.
Performance Implications of Asynchronous Inference
SmolVLA's asynchronous inference stack improves control efficiency by allowing prediction and execution to overlap. Compared with traditional synchronous inference, this reduces average task time by approximately 30% and doubles the number of completed actions in fixed-time scenarios. That margin matters for edge deployments, where inference delays can degrade real-time performance; a schematic sketch of the approach follows.
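The snippet below is only a schematic sketch of that idea, assuming a simple background-thread setup: the next action chunk is predicted while the current one is still being executed, so inference latency is hidden behind execution. The `policy` and `robot` objects, the refill threshold, and the thread-based design are illustrative; details a production stack would need, such as reconciling overlapping chunks and thread-safe robot I/O, are omitted here.

```python
# Schematic asynchronous control loop: execution of the current action chunk
# overlaps with prediction of the next one, hiding inference latency.
import threading, queue

def async_control_loop(policy, robot, steps=500, refill_at=10):
    chunks = queue.Queue(maxsize=1)           # holds the next predicted chunk

    def worker():
        while True:
            obs = robot.get_observation()     # placeholder robot API
            chunks.put(policy(obs))           # blocks until the consumer takes it

    threading.Thread(target=worker, daemon=True).start()

    current = list(chunks.get())              # first chunk (blocking)
    for _ in range(steps):
        if not current:                       # fall back to blocking if we ran out
            current = list(chunks.get())
        robot.send_action(current.pop(0))
        # When the remaining actions run low, swap in the next chunk if ready.
        if len(current) <= refill_at and not chunks.empty():
            current = list(chunks.get())
```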
Conclusion
SmolVLA illustrates how compact, reproducible, and open-source VLA models can facilitate competent robotic control on low-cost hardware. Through strategic architectural decisions such as layer pruning, chunked action prediction, and asynchronous execution, SmolVLA achieves strong performance while significantly lowering computational costs.
The open nature of the training and deployment stack, alongside real-world evaluations, establishes a valuable foundation for ongoing research in efficient and accessible robotic learning. Future research directions include expanding datasets for cross-embodiment training, enhancing model capacity without increasing latency, and exploring joint training on multimodal datasets beyond just robotics.
Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.