SmolVLM: Redefining Small and Efficient Multimodal Models
SmolVLM is a series of compact, high-performing multimodal models introduced by Hugging Face and Stanford. Designed for resource-constrained environments, it challenges the notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. Qwen2.5-VL, for example, performs excellently but typically requires around 15 GB of VRAM and is designed primarily to run on GPUs; even when quantized to Int4 with Intel's OpenVINO and run on an Intel CPU, its response time exceeds 20 seconds. There is therefore a growing need for lightweight models that run quickly on CPUs or edge devices such as mobile phones. In contrast, SmolVLM, similarly quantized to Int4 and run on a CPU, delivers a response time of around 5 seconds.
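As a rough illustration of that CPU path, the sketch below exports the model to OpenVINO with Int4 weight compression via optimum-intel. The class `OVModelForVisualCausalLM`, the `export`/`quantization_config` arguments, and support for this particular architecture are assumptions that depend on the installed optimum-intel version; treat this as a sketch, not the project's official recipe.

```python
# Hedged sketch: Int4 weight-only compression for CPU inference via optimum-intel.
# Assumes the installed optimum-intel release supports this model architecture.
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig
from transformers import AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,                                             # convert from PyTorch to OpenVINO IR
    quantization_config=OVWeightQuantizationConfig(bits=4),  # Int4 weight compression
)
model.save_pretrained("smolvlm-instruct-int4-ov")            # reuse the compressed model later
```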
1. Motivation and Objective
Large-scale VLMs like Flamingo and Idefics-80B offer impressive capabilities but demand extreme computational resources. SmolVLM addresses several key issues:
- Excessive visual tokenization resulting in GPU inefficiency
- Inheritance of large-model architectural patterns unsuited for small-scale deployment
- Lack of an optimized compute-memory balance for on-device inference
2. Architecture Overview
The SmolVLM pipeline integrates visual and textual modalities via the following stages:
- Image Splitting & Frame Sampling: High-resolution images are split into sub-images; video inputs are downsampled temporally.
- Vision Encoder (SigLIP): Frozen encoder generates rich embeddings.
- Pixel Shuffle: Reduces token count by rearranging spatial information into depth channels (see the code sketch below the figure).
- MLP Projection: Projects visual features into the language model input space.
- Multimodal Integration: Visual and text tokens are jointly processed using self-attention in SmolLM2.
*Figure: Pixel shuffle rearranges the encoded image, trading spatial resolution for increased channel depth. This reduces the visual token count while preserving information density.*
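To make the pixel-shuffle step concrete, here is a minimal space-to-depth sketch in PyTorch. The tensor shapes, the ratio, and the exact ordering inside each block are illustrative assumptions rather than the model's actual configuration.

```python
import torch

def pixel_shuffle(x: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """Space-to-depth sketch: merge each ratio x ratio block of visual tokens
    into a single token whose channel depth grows by ratio^2."""
    b, n, c = x.shape                       # (batch, num_tokens, channels)
    h = w = int(n ** 0.5)                   # assume a square grid of tokens
    x = x.view(b, h, w, c)
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)         # group each ratio x ratio block together
    return x.reshape(b, (h // ratio) * (w // ratio), c * ratio * ratio)

tokens = torch.randn(1, 1024, 768)          # e.g. a 32x32 grid of 768-d visual tokens
compressed = pixel_shuffle(tokens, ratio=4)
print(compressed.shape)                     # torch.Size([1, 64, 12288]): 16x fewer, 16x deeper tokens
```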
3. Architectural Insights
- Balanced Parameter Allocation: A smaller LM benefits more from compact vision encoders.
- Extended Context: Scaling the RoPE base extends the context window to 16k tokens for the larger models (sketched after this list).
- Pixel Shuffle: A ratio of 4 is most effective in small models due to lower attention complexity.
- Frame Averaging: Found to harm video understanding—selective frame sampling is better.
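As a rough illustration of the RoPE-base extension, the sketch below raises `rope_theta` and the maximum position count in a SmolLM2 config before loading the backbone with transformers. The specific base value is only an example, and long-context fine-tuning would still be needed; this is an assumption-laden sketch, not the paper's exact recipe.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch only: stretch the RoPE frequencies by raising the base, then widen
# the positional range so the backbone can attend over ~16k tokens.
base_id = "HuggingFaceTB/SmolLM2-1.7B"
config = AutoConfig.from_pretrained(base_id)
config.rope_theta = 273_000              # example value; the original base is much smaller
config.max_position_embeddings = 16_384  # extended context window
model = AutoModelForCausalLM.from_pretrained(base_id, config=config)
# Fine-tuning on long-context data is still required for the model to use the extra range well.
```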
4. Instruction Tuning & Tokenization Strategy
- Learned Positional Tokens: Significantly outperform string-based tokens, especially for OCR
- System Prompts and Media Markers: Improve zero-shot generalization for vision tasks (see the example after this list)
- Chain-of-Thought Regulation: Optimal when used sparingly (0.02–0.05%) for small VLMs
- Video Length Control: Best results achieved with average 3.5-minute clips
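To illustrate how system prompts and media markers appear in practice, here is a hypothetical message layout in the chat format used by the demo code in Section 8. The wording of the system prompt is an illustrative assumption, not the template used in training.

```python
# Hypothetical example: a system prompt plus explicit media markers.
# The processor later expands each {"type": "image"} marker into the model's
# learned image and positional tokens when building the prompt.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a concise visual assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},  # media marker for the attached image
            {"type": "text", "text": "What text is written on the sign?"},
        ],
    },
]
```

This `messages` structure would then be passed to `processor.apply_chat_template(messages, add_generation_prompt=True)` as in the Section 8 demo, assuming the installed chat template accepts a system turn.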
5. Equation Example
The multimodal input sequence is represented as:
$ \text{Input} = [V_1, V_2, \ldots, V_n; T_1, T_2, \ldots, T_m] $
where $V_i$ are visual tokens and $T_j$ are textual tokens, processed jointly by the LLM.
6. Benchmark Comparisons
SmolVLM achieves remarkable efficiency and performance:
| Model | TextVQAv2 | MathVista | OCRBench | GPU RAM |
|---|---|---|---|---|
| SmolVLM-256M | 50.2% | 35.9% | 52.6% | 0.8 GB |
| SmolVLM-500M | 60.2% | 40.1% | 61.0% | 4.9 GB |
| SmolVLM-2.2B | 73.0% | 51.5% | 72.9% | 4.9 GB |
7. Author Highlights & Key Takeaways
- Compact VLMs need deliberate architectural balance and token compression strategies
- RoPE extension and aggressive pixel shuffle are critical for compact LLM success
- On-device deployment enabled via ONNX and WebGPU support
8. How to Use the SmolVLM-Instruct Model
The sample demo code below, provided on the model's Hugging Face page, shows how to run SmolVLM-Instruct with transformers.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
9. References
- Marafioti et al. (2025), "SmolVLM: Redefining Small and Efficient Multimodal Models"
- Live Demo: HuggingFace SmolVLM2 Space
- Huggingface: HuggingFace SmolVLM2-Instruct
- Source Code: GitHub - smollm