
SmolVLM: Redefining Small and Efficient Multimodal Models


SmolVLM Architecture. Images are split into subimages, frames are sampled from videos, and then encoded into visual features. These features are first rearranged via a pixel-shuffle operation, then mapped into the LLM input space as visual tokens using an MLP projection. Visual tokens are then concatenated/interleaved with text embeddings (orange/red). This combined sequence is passed to the LLM for text output.

SmolVLM is a series of compact, high-performing multimodal models introduced by Hugging Face and Stanford. Designed for resource-constrained environments, it challenges the traditional notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. Qwen2.5-VL, for example, performs excellently but typically requires around 15 GB of VRAM and is primarily designed to run on GPUs. Although it can be quantized to Int4 with Intel's OpenVINO and run on Intel CPUs, its response time in that setting exceeds 20 seconds. There is therefore a growing need for lightweight models that run quickly on CPUs or on edge devices such as mobile phones. SmolVLM, similarly quantized to Int4 and run on a CPU, responds in around 5 seconds.

1. Motivation and Objective

Large-scale VLMs like Flamingo and Idefics-80B offer impressive capabilities but demand enormous computational resources. SmolVLM addresses several key issues:

  • Excessive visual tokenization resulting in GPU inefficiency
  • Inheritance of large-model architectural patterns unsuited for small-scale deployment
  • Lack of an optimized compute-memory balance for on-device inference


2. Architecture Overview

The SmolVLM pipeline integrates visual and textual modalities via the following stages:

  • Image Splitting & Frame Sampling: High-resolution images are split into sub-images; video inputs are downsampled temporally.
  • Vision Encoder (SigLIP): Frozen encoder generates rich embeddings.
  • Pixel Shuffle: Reduces token count by rearranging spatial information into depth channels (see the sketch below).
  • MLP Projection: Projects visual features into the language model input space.
  • Multimodal Integration: Visual and text tokens are jointly processed using self-attention in SmolLM2.
Pixel shuffle: Rearranges encoded images, trading spatial resolution for increased channel depth. This reduces visual token count while preserving information density.
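
As a rough sketch of the pixel-shuffle and MLP-projection steps above, the code below applies a space-to-depth rearrangement with ratio r and then projects the resulting tokens into the language model's embedding space. The grid size, shuffle ratio, and hidden dimensions are illustrative placeholders, not the exact SmolVLM configuration.

import torch
import torch.nn as nn

def pixel_shuffle(x: torch.Tensor, r: int) -> torch.Tensor:
    """Space-to-depth: fold each r x r spatial block into the channel dimension."""
    b, h, w, c = x.shape
    x = x.view(b, h // r, r, w // r, r, c)        # split each spatial axis into blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each r x r block together
    return x.view(b, h // r, w // r, c * r * r)   # merge the block into the channels

# Illustrative input: a 32x32 grid of 768-dim patch embeddings from the vision encoder.
feats = torch.randn(1, 32, 32, 768)
shuffled = pixel_shuffle(feats, r=4)              # (1, 8, 8, 12288): 16x fewer positions
visual_tokens = shuffled.flatten(1, 2)            # (1, 64, 12288) visual tokens

# MLP projection into the LLM input space (the hidden size 2048 is a placeholder).
projector = nn.Sequential(nn.Linear(12288, 2048), nn.GELU(), nn.Linear(2048, 2048))
llm_visual_tokens = projector(visual_tokens)      # (1, 64, 2048), ready to interleave with text
print(llm_visual_tokens.shape)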


3. Architectural Insights

  • Balanced Parameter Allocation: Smaller LMs pair best with compact vision encoders, while larger LMs gain more from powerful encoders.
  • Extended Context: Scaling the RoPE base allows up to 16k-token contexts for larger models (see the sketch below).
  • Pixel Shuffle: A ratio of 4 is most effective in small models due to lower attention complexity.
  • Frame Averaging: Found to harm video understanding—selective frame sampling is better.
Performance analysis of SmolVLM configurations. (Left) Impact of vision encoder and language model sizes. Smaller language models (135M) benefit less from larger vision encoders (SigLIP-SO-400M, 428M) compared to SigLIP-B/16 (93M), while larger language models gain more from powerful encoders. (Middle-left) Performance significantly improves with increased context lengths (2k to 16k tokens). (Middle-right) Optimal pixel shuffle factor (PS=2 vs. PS=4) varies by model size. (Right) Frame averaging reduces video performance, with a rapid decline as more frames are averaged. Metrics average CIDEr (captioning) and accuracy (visual question answering).
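
The context extension above comes from enlarging the RoPE base so that rotary frequencies remain informative over longer sequences. The snippet below is a minimal sketch of the standard RoPE inverse-frequency computation; the head dimension and base values are illustrative, not the exact numbers used for SmolVLM.

import torch

def rope_inverse_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies: 1 / base^(2i/d) for each rotary pair."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

# A larger base slows down the low-frequency rotations, so distant positions stay
# distinguishable when the context is stretched toward 16k tokens.
default_freqs = rope_inverse_frequencies(64, base=10_000.0)
extended_freqs = rope_inverse_frequencies(64, base=100_000.0)  # illustrative enlarged base
print(default_freqs[-1].item(), extended_freqs[-1].item())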


4. Instruction Tuning & Tokenization Strategy

  • Learned Positional Tokens: Significantly outperform string-based tokens, especially for OCR (see the sketch below)
  • System Prompts and Media Markers: Improve zero-shot generalization for vision tasks
  • Chain-of-Thought Regulation: Optimal when used sparingly (0.02–0.05%) for small VLMs
  • Video Length Control: Best results achieved with average 3.5-minute clips
Tokenization Strategy Comparisons. (Left) Training loss curves illustrating the “OCR loss plague” when using string-based tokens in smaller models. (Center) Aggregated evaluation metrics showing consistently higher scores with learned tokens (orange). (Right) Scatter plot of OpenCompass-Image vs. OpenCompass-Video: learned tokens dominate the higher-scoring region, especially in image-intensive tasks.
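
To make the learned-token point concrete, the sketch below registers sub-image position markers as dedicated special tokens, so each marker receives a single id with its own learnable embedding instead of being split into several string pieces. The checkpoint and marker names are illustrative placeholders, not the exact SmolVLM vocabulary.

from transformers import AutoTokenizer

# Illustrative backbone tokenizer; SmolVLM builds on SmolLM2, but the exact repo and
# marker format used here are assumptions for the sake of the example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")

position_markers = [f"<row_{r}_col_{c}>" for r in range(1, 5) for c in range(1, 5)]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": position_markers})

# Each marker now maps to one token id (a learnable embedding once the model's
# embedding matrix is resized), rather than a multi-piece string the LM must reassemble.
print(num_added, tokenizer.tokenize("<row_1_col_1>"))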


5. Equation Example

The multimodal input sequence is represented as:

$ \text{Input} = [V_1, V_2, \ldots, V_n; T_1, T_2, \ldots, T_m] $

where \(V_i\) are visual tokens and \(T_j\) are textual tokens processed jointly in the LLM.
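
A tiny sketch of this joint sequence with placeholder dimensions: projected visual tokens and text embeddings are simply concatenated along the sequence axis before entering the LLM.

import torch

hidden = 2048                                  # placeholder LLM hidden size
visual = torch.randn(1, 64, hidden)            # V_1 ... V_n  (n = 64 visual tokens)
text = torch.randn(1, 16, hidden)              # T_1 ... T_m  (m = 16 text embeddings)
llm_input = torch.cat([visual, text], dim=1)   # [V_1, ..., V_n; T_1, ..., T_m]
print(llm_input.shape)                         # torch.Size([1, 80, 2048])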


6. Benchmark Comparisons

SmolVLM achieves remarkable efficiency and performance:

Model        | TextVQA | MathVista | OCRBench | GPU RAM
SmolVLM-256M | 50.2%   | 35.9%     | 52.6%    | 0.8 GB
SmolVLM-500M | 60.2%   | 40.1%     | 61.0%    | 4.9 GB
SmolVLM-2.2B | 73.0%   | 51.5%     | 72.9%    | 4.9 GB


7. Author Highlights & Key Takeaways

  • Compact VLMs need deliberate architectural balance and token compression strategies
  • RoPE extension and aggressive pixel shuffle are critical for compact LLM success
  • On-device deployment enabled via ONNX and WebGPU support


8. How to Use the SmolVLM-Instruct Model

The following sample demo code for SmolVLM-Instruct is taken from the model's Hugging Face page.


import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])


