
SmolVLM: Redefining Small and Efficient Multimodal Models

SmolVLM Architecture. Images are split into subimages, frames are sampled from videos, and then encoded into visual features. These features are first rearranged via a pixel-shuffle operation, then mapped into the LLM input space as visual tokens using an MLP projection. Visual tokens are then concatenated/interleaved with text embeddings (orange/red). This combined sequence is passed to the LLM for text output.

SmolVLM is a series of compact, high-performing multimodal models introduced by Hugging Face and Stanford. Designed for resource-constrained environments, it challenges the traditional notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. Qwen2.5-VL, for example, performs excellently but typically requires around 15 GB of VRAM and is primarily designed to run on GPUs. It can be quantized to Int4 with Intel's OpenVINO and run on Intel CPUs, but the response time then exceeds 20 seconds. There is therefore a growing need for lightweight models that run quickly on CPUs or on edge devices such as mobile phones. SmolVLM, quantized to Int4 and run on a CPU in the same way, responds in around 5 seconds.
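As a rough sketch of the Int4-on-CPU workflow mentioned above, the snippet below uses optimum-intel to export SmolVLM to OpenVINO with 4-bit weight compression. It assumes your installed optimum-intel version supports the SmolVLM architecture and exposes OVModelForVisualCausalLM and OVWeightQuantizationConfig; treat it as a starting point rather than the exact setup behind the timings quoted here.

from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig
from transformers import AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"

# Export to OpenVINO IR with Int4 weight compression for CPU inference
# (assumes optimum-intel supports this architecture)
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("smolvlm-instruct-int4-ov")  # reload later without re-exporting

Inference then follows the usual processor + generate() pattern shown in Section 8.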

1. Motivation and Objective

Large-scale VLMs like Flamingo and Idefics-80B offer impressive capabilities but demand extreme computational resources. SmolVLM addresses several key issues:

  • Excessive visual tokenization resulting in GPU inefficiency
  • Inheritance of large-model architectural patterns unsuited for small-scale deployment
  • Lack of an optimized compute-memory balance for on-device inference


2. Architecture Overview

The SmolVLM pipeline integrates visual and textual modalities via the following stages:

  • Image Splitting & Frame Sampling: High-resolution images are split into sub-images; video inputs are downsampled temporally.
  • Vision Encoder (SigLIP): Frozen encoder generates rich embeddings.
  • Pixel Shuffle: Reduces token count by rearranging spatial information into depth channels (see the sketch below).
  • MLP Projection: Projects visual features into the language model input space.
  • Multimodal Integration: Visual and text tokens are jointly processed using self-attention in SmolLM2.

Pixel shuffle: Rearranges encoded images, trading spatial resolution for increased channel depth. This reduces visual token count while preserving information density.
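A minimal PyTorch sketch of this space-to-depth rearrangement (shapes and the ratio r are illustrative, not SmolVLM's actual code): each r x r block of spatial positions is folded into the channel dimension, so the number of visual tokens drops by a factor of r².

import torch

def pixel_shuffle(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Space-to-depth: (B, H, W, C) -> (B, H/r, W/r, C*r*r)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // r, r, W // r, r, C)       # split H and W into blocks of r
    x = x.permute(0, 1, 3, 2, 4, 5)                 # group each r x r block together
    return x.reshape(B, H // r, W // r, C * r * r)  # fold the block into the channels

features = torch.randn(1, 32, 32, 768)              # e.g. a SigLIP patch grid
tokens = pixel_shuffle(features, r=2)               # (1, 16, 16, 3072): 4x fewer tokens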


3. Architectural Insights

  • Balanced Parameter Allocation: A smaller LM benefits more from compact vision encoders.
  • Extended Context: Scaling the RoPE base allows up to 16k token contexts for larger models.
  • Pixel Shuffle: A ratio of 4 is most effective in small models due to lower attention complexity.
  • Frame Averaging: Found to harm video understanding; selective frame sampling works better (see the sampling sketch after the figure caption)

Performance analysis of SmolVLM configurations. (Left) Impact of vision encoder and language model sizes. Smaller language models (135M) benefit less from larger vision encoders (SigLIP-SO-400M, 428M) compared to SigLIP-B/16 (93M), while larger language models gain more from powerful encoders. (Middle-left) Performance improves significantly with longer context lengths (2k to 16k tokens). (Middle-right) The optimal pixel shuffle factor (PS=2 vs. PS=4) varies by model size. (Right) Frame averaging reduces video performance, with a rapid decline as more frames are averaged. Metrics are the average of CIDEr (captioning) and accuracy (visual question answering).
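A minimal sketch of selective frame sampling, i.e. picking frames spread evenly across a clip rather than averaging them (the helper is illustrative, not SmolVLM's actual video preprocessing):

import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    # Spread num_frames indices uniformly across the clip instead of averaging frames
    idx = np.linspace(0, total_frames - 1, num=num_frames)
    return np.round(idx).astype(int).tolist()

print(sample_frame_indices(total_frames=300, num_frames=8))
# [0, 43, 85, 128, 171, 214, 256, 299]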


4. Instruction Tuning & Tokenization Strategy

  • Learned Positional Tokens: Significantly outperform string-based tokens, especially for OCR (a tokenizer sketch follows the figure caption)
  • System Prompts and Media Markers: Improve zero-shot generalization for vision tasks
  • Chain-of-Thought Regulation: Optimal when used sparingly (0.02–0.05%) for small VLMs
  • Video Length Control: Best results achieved with average 3.5-minute clips

Tokenization Strategy Comparisons. (Left) Training loss curves illustrating the “OCR loss plague” when using string-based tokens in smaller models. (Center) Aggregated evaluation metrics showing consistently higher scores with learned tokens (orange). (Right) Scatter plot of OpenCompass-Image vs. OpenCompass-Video: learned tokens dominate the higher-scoring region, especially in image-intensive tasks.
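As a sketch of what learned positional tokens look like in practice, the snippet below registers dedicated <row_r_col_c> entries in a tokenizer so that each sub-image position becomes a single trainable embedding instead of a string split into many subword pieces. The tokenizer and the 3x3 grid are illustrative choices, not SmolVLM's actual training setup.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Register one dedicated token per sub-image position (3x3 grid, illustrative)
position_tokens = [f"<row_{r}_col_{c}>" for r in range(1, 4) for c in range(1, 4)]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": position_tokens})

print(num_added)                            # 9 new vocabulary entries
print(tokenizer.tokenize("<row_1_col_2>"))  # a single learned token, not many subwords
# The model's embedding matrix must then be resized:
# model.resize_token_embeddings(len(tokenizer))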


5. Equation Example

The multimodal input sequence is represented as:

$ \text{Input} = [V_1, V_2, \ldots, V_n; T_1, T_2, \ldots, T_m] $

where $V_i$ are visual tokens and $T_j$ are textual tokens, processed jointly in the LLM.
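A minimal sketch of this concatenation with dummy tensors (the hidden size and sequence lengths are illustrative, not SmolVLM's actual configuration):

import torch

hidden = 2048                               # LLM hidden size (illustrative)
visual_tokens = torch.randn(1, 64, hidden)  # V_1 ... V_n after the MLP projection
text_tokens = torch.randn(1, 16, hidden)    # T_1 ... T_m from the text embedding table

# The combined sequence is processed jointly by the LLM's self-attention layers
inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
print(inputs_embeds.shape)                  # torch.Size([1, 80, 2048])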


6. Benchmark Comparisons

SmolVLM achieves remarkable efficiency and performance:

Model        | TextVQA | MathVista | OCRBench | GPU RAM
SmolVLM-256M | 50.2%   | 35.9%     | 52.6%    | 0.8 GB
SmolVLM-500M | 60.2%   | 40.1%     | 61.0%    | 4.9 GB
SmolVLM-2.2B | 73.0%   | 51.5%     | 72.9%    | 4.9 GB


7. Author Highlights & Key Takeaways

  • Compact VLMs need deliberate architectural balance and token compression strategies
  • RoPE extension and aggressive pixel shuffle are critical for compact LLM success
  • On-device deployment enabled via ONNX and WebGPU support


8. How to Use the SmolVLM-Instruct Model

The sample demo code for the SmolVLM-Instruct model, shown below, comes from the model's Hugging Face page.


import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
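If GPU memory is tight, a common transformers pattern (not taken from the model card, and assuming a CUDA GPU with bitsandbytes installed) is to load the weights in 4-bit:

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Load SmolVLM-Instruct with 4-bit (NF4) quantized weights to reduce GPU memory use
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)

The processor, chat template, and generate() call stay the same as in the snippet above.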


9. References


