
SmolVLM: Redefining Small and Efficient Multimodal Models

Figure: SmolVLM architecture. Images are split into subimages and frames are sampled from videos; both are encoded into visual features. These features are rearranged via a pixel-shuffle operation, then mapped into the LLM input space as visual tokens using an MLP projection. The visual tokens are concatenated/interleaved with text embeddings (orange/red), and the combined sequence is passed to the LLM for text output.

SmolVLM is a series of compact, high-performing multimodal models introduced by HuggingFace and Stanford. Designed for resource-constrained environments, it challenges the traditional notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. Qwen2.5-VL, for example, delivers excellent performance but typically requires around 15 GB of VRAM and is primarily designed to run on GPUs. Although it can be quantized to Int4 using Intel's OpenVINO and run on Intel CPUs, the ...
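
To make the vision-to-language connector concrete, here is a minimal PyTorch sketch of the pixel-shuffle rearrangement and projection described in the figure caption. The class name, dimensions, and shuffle ratio are illustrative assumptions, not SmolVLM's exact implementation; the idea is that space-to-depth folding cuts the visual token count by ratio² before projecting into the LLM's embedding space.

```python
import torch
import torch.nn as nn

def pixel_shuffle(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Space-to-depth rearrangement: trades spatial resolution for channels,
    reducing the number of visual tokens by a factor of ratio**2."""
    b, seq, c = x.shape
    h = w = int(seq ** 0.5)  # assume a square vision feature grid
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # (b, h/r, w/r, r, r, c)
    return x.reshape(b, (h // ratio) * (w // ratio), c * ratio * ratio)

class VisionConnector(nn.Module):
    """Hypothetical connector: pixel-shuffle the vision encoder's output,
    then project the fattened channels into the LLM input space."""
    def __init__(self, vision_dim: int, llm_dim: int, ratio: int = 2):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(vision_dim * ratio * ratio, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(pixel_shuffle(vision_feats, self.ratio))
```

The resulting visual tokens can then be interleaved with text embeddings and fed to the language model as one sequence, exactly as the figure shows.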
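
For readers who want to try SmolVLM on modest hardware, the sketch below shows one way to load and prompt a checkpoint with the transformers library. It assumes the HuggingFaceTB/SmolVLM-Instruct model ID and a recent transformers version; check the model card on the Hub for the current API and available variants.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]},
]
# Build the chat prompt, encode text + image, and generate a reply
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```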