
SmolVLM: Redefining Small and Efficient Multimodal Models

SmolVLM Architecture. Images are split into subimages, frames are sampled from videos, and then encoded into visual features. These features are first rearranged via a pixel-shuffle operation, then mapped into the LLM input space as visual tokens using an MLP projection. Visual tokens are then concatenated/interleaved with text embeddings (orange/red). This combined sequence is passed to the LLM for text output.

SmolVLM is a series of compact, high-performing multimodal models introduced by Hugging Face and Stanford. Designed for resource-constrained environments, it challenges the traditional notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. Qwen2.5-VL, for example, performs excellently but typically requires around 15 GB of VRAM and is primarily designed to run on GPUs. It can be quantized to Int4 with Intel's OpenVINO and run on Intel CPUs, but the response time then exceeds 20 seconds. There is therefore a growing need for lightweight models that run quickly on CPUs or on edge devices such as mobile phones. SmolVLM, quantized to Int4 and run on a CPU in the same way, responds in around 5 seconds.
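As a rough sketch of the Int4-on-CPU workflow mentioned above, the snippet below uses optimum-intel to export SmolVLM to OpenVINO with 4-bit weight compression. It assumes your installed optimum-intel version supports the SmolVLM architecture and exposes OVModelForVisualCausalLM and OVWeightQuantizationConfig; treat it as a starting point rather than the exact setup behind the timings quoted here.

from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig
from transformers import AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"

# Export to OpenVINO IR with Int4 weight compression for CPU inference
# (assumes optimum-intel supports this architecture)
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("smolvlm-instruct-int4-ov")  # reload later without re-exporting

Inference then follows the usual processor + generate() pattern shown in Section 8.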

1. Motivation and Objective

Large-scale VLMs like Flamingo and Idefics-80B offer impressive capabilities but demand extreme computational resources. SmolVLM addresses several key issues:

  • Excessive visual tokenization resulting in GPU inefficiency
  • Inheritance of large-model architectural patterns unsuited for small-scale deployment
  • Lack of an optimized compute-memory balance for on-device inference


2. Architecture Overview

The SmolVLM pipeline integrates visual and textual modalities via the following stages:

  • Image Splitting & Frame Sampling: High-resolution images are split into sub-images; video inputs are downsampled temporally.
  • Vision Encoder (SigLIP): Frozen encoder generates rich embeddings.
  • Pixel Shuffle: Reduces token count by rearranging spatial information into depth channels (see the sketch below).
  • MLP Projection: Projects visual features into the language model input space.
  • Multimodal Integration: Visual and text tokens are jointly processed using self-attention in SmolLM2.

Pixel shuffle: Rearranges encoded images, trading spatial resolution for increased channel depth. This reduces visual token count while preserving information density.
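A minimal PyTorch sketch of this space-to-depth rearrangement (shapes and the ratio r are illustrative, not SmolVLM's actual code): each r x r block of spatial positions is folded into the channel dimension, so the number of visual tokens drops by a factor of r².

import torch

def pixel_shuffle(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Space-to-depth: (B, H, W, C) -> (B, H/r, W/r, C*r*r)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // r, r, W // r, r, C)       # split H and W into blocks of r
    x = x.permute(0, 1, 3, 2, 4, 5)                 # group each r x r block together
    return x.reshape(B, H // r, W // r, C * r * r)  # fold the block into the channels

features = torch.randn(1, 32, 32, 768)              # e.g. a SigLIP patch grid
tokens = pixel_shuffle(features, r=2)               # (1, 16, 16, 3072): 4x fewer tokens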


3. Architectural Insights

  • Balanced Parameter Allocation: A smaller LM benefits more from compact vision encoders.
  • Extended Context: Scaling the RoPE base allows up to 16k token contexts for larger models.
  • Pixel Shuffle: A ratio of 4 is most effective in small models due to lower attention complexity.
  • Frame Averaging: Found to harm video understanding; selective frame sampling works better (see the sampling sketch after the figure caption)

Performance analysis of SmolVLM configurations. (Left) Impact of vision encoder and language model sizes. Smaller language models (135M) benefit less from larger vision encoders (SigLIP-SO-400M, 428M) compared to SigLIP-B/16 (93M), while larger language models gain more from powerful encoders. (Middle-left) Performance improves significantly with longer context lengths (2k to 16k tokens). (Middle-right) The optimal pixel shuffle factor (PS=2 vs. PS=4) varies by model size. (Right) Frame averaging reduces video performance, with a rapid decline as more frames are averaged. Metrics are the average of CIDEr (captioning) and accuracy (visual question answering).
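A minimal sketch of selective frame sampling, i.e. picking frames spread evenly across a clip rather than averaging them (the helper is illustrative, not SmolVLM's actual video preprocessing):

import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    # Spread num_frames indices uniformly across the clip instead of averaging frames
    idx = np.linspace(0, total_frames - 1, num=num_frames)
    return np.round(idx).astype(int).tolist()

print(sample_frame_indices(total_frames=300, num_frames=8))
# [0, 43, 85, 128, 171, 214, 256, 299]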


4. Instruction Tuning & Tokenization Strategy

  • Learned Positional Tokens: Significantly outperform string-based tokens, especially for OCR (a tokenizer sketch follows the figure caption)
  • System Prompts and Media Markers: Improve zero-shot generalization for vision tasks
  • Chain-of-Thought Regulation: Optimal when used sparingly (0.02–0.05%) for small VLMs
  • Video Length Control: Best results achieved with average 3.5-minute clips

Tokenization Strategy Comparisons. (Left) Training loss curves illustrating the “OCR loss plague” when using string-based tokens in smaller models. (Center) Aggregated evaluation metrics showing consistently higher scores with learned tokens (orange). (Right) Scatter plot of OpenCompass-Image vs. OpenCompass-Video: learned tokens dominate the higher-scoring region, especially in image-intensive tasks.
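As a sketch of what learned positional tokens look like in practice, the snippet below registers dedicated <row_r_col_c> entries in a tokenizer so that each sub-image position becomes a single trainable embedding instead of a string split into many subword pieces. The tokenizer and the 3x3 grid are illustrative choices, not SmolVLM's actual training setup.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Register one dedicated token per sub-image position (3x3 grid, illustrative)
position_tokens = [f"<row_{r}_col_{c}>" for r in range(1, 4) for c in range(1, 4)]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": position_tokens})

print(num_added)                            # 9 new vocabulary entries
print(tokenizer.tokenize("<row_1_col_2>"))  # a single learned token, not many subwords
# The model's embedding matrix must then be resized:
# model.resize_token_embeddings(len(tokenizer))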


5. Equation Example

The multimodal input sequence is represented as:

$ \text{Input} = [V_1, V_2, \ldots, V_n; T_1, T_2, \ldots, T_m] $

where $V_i$ are visual tokens and $T_j$ are textual tokens, processed jointly in the LLM.
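A minimal sketch of this concatenation with dummy tensors (the hidden size and sequence lengths are illustrative, not SmolVLM's actual configuration):

import torch

hidden = 2048                               # LLM hidden size (illustrative)
visual_tokens = torch.randn(1, 64, hidden)  # V_1 ... V_n after the MLP projection
text_tokens = torch.randn(1, 16, hidden)    # T_1 ... T_m from the text embedding table

# The combined sequence is processed jointly by the LLM's self-attention layers
inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
print(inputs_embeds.shape)                  # torch.Size([1, 80, 2048])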


6. Benchmark Comparisons

SmolVLM achieves remarkable efficiency and performance:

Model        | TextVQA | MathVista | OCRBench | GPU RAM
SmolVLM-256M | 50.2%   | 35.9%     | 52.6%    | 0.8 GB
SmolVLM-500M | 60.2%   | 40.1%     | 61.0%    | 4.9 GB
SmolVLM-2.2B | 73.0%   | 51.5%     | 72.9%    | 4.9 GB


7. Author Highlights & Key Takeaways

  • Compact VLMs need deliberate architectural balance and token compression strategies
  • RoPE extension and aggressive pixel shuffle are critical for compact LLM success
  • On-device deployment enabled via ONNX and WebGPU support


8. How to Use the SmolVLM-Instruct Model

The sample demo code for the SmolVLM-Instruct model, shown below, comes from the model's Hugging Face page.


import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
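If GPU memory is tight, a common transformers pattern (not taken from the model card, and assuming a CUDA GPU with bitsandbytes installed) is to load the weights in 4-bit:

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Load SmolVLM-Instruct with 4-bit (NF4) quantized weights to reduce GPU memory use
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)

The processor, chat template, and generate() call stay the same as in the snippet above.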


9. References


