
Posts

Showing posts from May, 2025

How to Fine-Tune LLaMA 3.2-1B-Instruct for Korean Instruction Tasks with LoRA and Hugging Face

LLaMA 3.2-1B-Instruct is a lightweight instruction-tuned language model released by Meta. It is designed to handle a wide range of instruction-based tasks with relatively low computational resources. Although the model was trained with multilingual capabilities, its performance on languages it does not officially support, such as Korean, is limited. This tutorial demonstrates how to fine-tune this open-source model on a Korean dataset using Hugging Face Transformers and PEFT (specifically LoRA), enabling it to respond better to Korean instructions.

1. Prerequisites

Before running the example code below, ensure you have the following libraries installed:

pip install torch transformers datasets peft accelerate mlflow huggingface_hub

To use the LLaMA model or the KoAlpaca dataset, you'll need a Hugging Face token. Additionally, you may need to handle potential CUDA Out-Of-Memory (OOM) errors. The following code takes care of both:

from huggingface_hub import login login("y...
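The excerpt cuts off at the login call, but the heart of the post is attaching a LoRA adapter with PEFT. The following is a minimal sketch of that step, assuming the public meta-llama/Llama-3.2-1B-Instruct checkpoint; the LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative choices, not necessarily the post's exact settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # gated model: requires a Hugging Face token
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (illustrative)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```

Because only the small adapter matrices receive gradients, the memory footprint stays far below full fine-tuning, which is what makes training this model on a single modest GPU practical.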

Understanding SentencePiece: A Language-Independent Tokenizer for AI Engineers

In the realm of Natural Language Processing (NLP), tokenization plays a pivotal role in preparing text data for machine learning models. Traditional tokenization methods often rely on language-specific rules and pre-tokenized inputs, which can be limiting when dealing with diverse languages and scripts. Enter SentencePiece, a language-independent tokenizer and detokenizer designed to address these challenges and streamline the preprocessing pipeline for neural text processing systems.

What is SentencePiece?

SentencePiece is an open-source tokenizer and detokenizer developed by Google, tailored for neural-based text processing tasks such as Neural Machine Translation (NMT). Unlike conventional tokenizers that depend on whitespace and language-specific rules, SentencePiece treats the input text as a raw stream of characters, handling whitespace as an ordinary symbol, which enables it to process languages such as Japanese, Chinese, and Korean without relying on explicit word boundaries. This approach allows SentencePiece to train subword models di...
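As a concrete illustration of this raw-text workflow, here is a short sketch using the sentencepiece package; corpus.txt is a hypothetical raw-text file (one sentence per line), and the vocabulary size and model type are arbitrary choices for the example.

```python
import sentencepiece as spm

# Train a subword model directly on raw text -- no pre-tokenization required.
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # hypothetical raw-text corpus, one sentence per line
    model_prefix="spm_demo",
    model_type="bpe",         # "unigram", "char", and "word" are also supported
    vocab_size=8000,          # must suit the size of the training corpus
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
pieces = sp.encode("자연어 처리는 재미있다", out_type=str)
print(pieces)                                          # whitespace is kept as the ▁ meta symbol
print(sp.decode(sp.encode("자연어 처리는 재미있다")))  # lossless detokenization
```

Because whitespace is encoded as the ▁ symbol rather than discarded, decoding reproduces the original text exactly, which is what makes the tokenizer fully reversible and language independent.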

Mastering the Byte Pair Encoding (BPE) Tokenizer for NLP and LLMs

Byte Pair Encoding (BPE) is one of the most important and widely adopted subword tokenization algorithms in modern Natural Language Processing (NLP), especially in training Large Language Models (LLMs) like GPT. This guide provides a deep technical dive into how BPE works, compares it with other tokenizers like WordPiece and SentencePiece, and explains its practical implementation with Python code. This article is aimed at AI engineers building real-world models and systems.

1. What is Byte Pair Encoding?

BPE was originally introduced as a data compression algorithm by Gage in 1994. It replaces the most frequent pair of bytes in a sequence with a single, unused byte. In 2015, Sennrich et al. adapted BPE for NLP to address the out-of-vocabulary (OOV) problem in neural machine translation. Instead of working with full words, BPE decomposes them into subword units that can be recombined to represent rare or unseen words.

2. Why Tokenization Matters in LLMs

Tokenization is th...
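To make the merge procedure concrete, here is a toy sketch of the BPE training loop in plain Python, using the small word-frequency corpus popularized by Sennrich et al.; it is a teaching example, not a production tokenizer.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_corpus = {}
    for word, freq in corpus.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_corpus[" ".join(merged)] = freq
    return new_corpus

# Words pre-split into characters, with </w> marking the end of a word.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(5):
    pair_counts = get_pair_counts(corpus)
    best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    print(f"merge {step + 1}: {best}")
```

Each iteration greedily merges the single most frequent adjacent pair, so frequent words end up as whole tokens while rare words remain decomposable into known subwords, which is how BPE sidesteps the OOV problem described above.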

RoFormer and Rotary Position Embedding: Revolutionizing Positional Encoding in Transformers

Implementation of RoPE

Rotary Position Embedding (RoPE) is a positional encoding method introduced in the 2021 RoFormer paper ( https://arxiv.org/pdf/2104.09864 ). This technique overcomes the limitations of absolute positional encoding and enhances a Transformer model's ability to capture sequence order and relative positions effectively.

1. Limitations of Traditional Positional Encoding

Since Transformers cannot inherently model token order, positional encodings are added to token embeddings. Early models used sinusoidal encodings, and later learnable embeddings were introduced. However, these approaches have several drawbacks:
- They encode absolute rather than relative positions, reducing contextual precision
- They struggle to generalize to sequences of varying lengths
- They increase model parameters and often degrade on long-range dependencies ...
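Before the full implementation, the rotation itself fits in a few lines. The sketch below applies RoPE to a single head in PyTorch using the "rotate half" pairing common in open-source implementations (the paper pairs adjacent dimensions instead, which is equivalent up to a fixed permutation of dimensions); the shapes are illustrative.

```python
import torch

def rope(x, base=10000.0):
    """Apply Rotary Position Embedding to x of shape (seq_len, dim); dim must be even."""
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per dimension pair, one angle per (position, pair).
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)   # 16 positions, head dimension 64
print(rope(q).shape)      # torch.Size([16, 64])
```

Because the rotation angle depends only on a token's position, the dot product between a rotated query and a rotated key depends only on their relative offset, which is the property that gives RoPE its relative-position behavior.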

ZeRO: Deep Memory Optimization for Training Trillion-Parameter Models

In 2020, Microsoft researchers introduced ZeRO (Zero Redundancy Optimizer) in the paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (arXiv:1910.02054). ZeRO is a memory optimization technique that eliminates redundancy in distributed training, enabling efficient scaling to trillion-parameter models. This article provides an in-depth technical breakdown of ZeRO's partitioning strategies, memory usage analysis, and integration with DeepSpeed.

1. What is ZeRO?

ZeRO eliminates redundant copies of model states across GPUs. Instead of replicating parameters, gradients, and optimizer states on every GPU, ZeRO partitions them across all devices. This results in near-linear memory savings as the number of GPUs increases.

2. Limitations of Traditional Data Parallelism

In standard data-parallel training, every GPU maintains:
- Model parameters $\theta$
- Gradients $\nabla \theta$
- Optimizer states $O(\theta)$

This causes memory usage ...
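The memory argument comes down to simple per-parameter accounting. The sketch below reproduces that arithmetic for the paper's 7.5B-parameter, 64-GPU example, assuming mixed-precision Adam training (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter); it counts model states only, not activations or temporary buffers.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Rough per-GPU memory (GB) for model states under ZeRO stages 0-3."""
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus   # ZeRO-3: also partition parameters
    return (params + grads + optim) / 1e9

for stage in range(4):
    print(f"ZeRO stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Running this shows the familiar progression from roughly 120 GB of model state per GPU under plain data parallelism down to a couple of gigabytes once parameters, gradients, and optimizer states are all partitioned.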

FlashAttention: High-Speed, Memory-Efficient Attention for Transformers

Transformers have become the standard architecture in NLP and vision, but the quadratic complexity of attention in both computation and memory makes it a bottleneck for long sequences. FlashAttention, introduced in 2022, is a memory-aware exact attention algorithm that significantly boosts performance by optimizing GPU memory usage.

2. Bottlenecks in Standard Attention

The classic attention operation is defined as:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$

Here, intermediate results such as the $QK^T$ matrix are materialized and stored in GPU HBM, leading to extensive memory I/O and $O(n^2)$ memory consumption, which severely limits sequence length and throughput.

3. Core Ideas of FlashAttention 1

FlashAttention rethinks the attention operation with the following ideas:
- Tile-based streaming computation: avoids storing $QK^T$ by breaking the computation into tiles and using GPU SRAM and registers.
- Online softmax accumulation: uses a streaming algorithm t...
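The tiling and online-softmax ideas can be illustrated outside of a GPU kernel. The following is a single-head PyTorch toy that processes keys and values one tile at a time and never materializes the full n-by-n score matrix; the real FlashAttention fuses this logic into a CUDA kernel that keeps each tile in SRAM.

```python
import math
import torch

def tiled_attention(q, k, v, tile=128):
    """Exact attention computed tile by tile with a numerically stable online softmax."""
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], tile):
        kj, vj = k[start:start + tile], v[start:start + tile]
        s = (q @ kj.T) * scale                            # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)         # rescale previous accumulators
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vj
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / math.sqrt(64), dim=-1) @ v
print((tiled_attention(q, k, v) - ref).abs().max())       # tiny: float rounding error only
```

The printed difference is at floating-point noise level: the tiled version computes exactly the same result as standard attention, just without ever holding the full score matrix in memory.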

KV Cache in Large Language Models: Design, Optimization, and Inference Acceleration

Transformers are the backbone of modern large language models (LLMs), but inference over long contexts becomes computationally expensive. The KV Cache (short for Key/Value Cache) is one of the most important optimizations enabling scalable, low-latency LLM deployment, particularly for autoregressive decoding. This article explores how the KV Cache works, how it is designed, and how to implement and optimize it, drawing on the paper Efficiently Scaling Transformer Inference.

1. What is the KV Cache?

In Transformer-based models, each attention layer computes attention weights over the entire input sequence. When generating one token at a time during inference, recomputing the keys and values of the whole prefix at every step results in quadratic computational cost per generated token. KV caching stores the intermediate key and value tensors of past tokens so they do not need to be recomputed at each step. This enables linear time generation and is especiall...
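A toy, framework-level version of the idea: keep the keys and values of every processed token around and compute attention only for the newest query. The sketch below covers a single attention head with made-up dimensions, not the memory layout a real inference engine would use.

```python
import math
import torch

class ToyKVCache:
    """Append-only cache of past keys and values for a single attention head."""
    def __init__(self):
        self.k, self.v = None, None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)
        return self.k, self.v

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive step: the new query attends over all cached keys/values."""
    k, v = cache.append(k_new, v_new)              # past tokens are reused, not recomputed
    scores = (q_new @ k.T) / math.sqrt(q_new.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

cache, d = ToyKVCache(), 64
for _ in range(5):                                 # generate 5 tokens, one at a time
    q, k, v = (torch.randn(1, d) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(out.shape, cache.k.shape)                    # (1, 64) and (5, 64): one cached row per step
```

The trade-off is memory: the cache grows linearly with context length and with the number of layers and heads, so managing its size becomes the next optimization problem.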

SmolVLM: Redefining Small and Efficient Multimodal Models

Figure: SmolVLM architecture. Images are split into subimages, frames are sampled from videos, and both are encoded into visual features. These features are first rearranged via a pixel-shuffle operation, then mapped into the LLM input space as visual tokens using an MLP projection. The visual tokens are then concatenated/interleaved with the text embeddings, and the combined sequence is passed to the LLM for text output.

SmolVLM is a series of compact, high-performing multimodal models introduced by Hugging Face and Stanford. Designed for resource-constrained environments, it challenges the traditional notion that only massive vision-language models (VLMs) can achieve state-of-the-art performance. The Qwen2.5-VL model, for example, delivers excellent performance but typically requires around 15GB of VRAM and is primarily designed to run on GPUs. Although it can be quantized to Int4 using Intel's OpenVINO and run on Intel CPUs, the ...
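The pixel-shuffle plus projection step from the caption can be sketched in a few lines of PyTorch. The grid size, hidden dimensions, shuffle ratio, and the single linear projection below are assumptions for illustration and do not match SmolVLM's actual configuration.

```python
import torch
import torch.nn as nn

def pixel_shuffle_tokens(x, ratio=2):
    """Space-to-depth rearrangement: trade spatial resolution for channel depth,
    reducing the number of visual tokens by ratio**2 before projection."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)                       # assume a square grid of patch features
    x = x.view(b, h, w, c)
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)             # group each ratio x ratio block together
    return x.reshape(b, (h // ratio) * (w // ratio), c * ratio * ratio)

vision_dim, llm_dim = 768, 2048                 # illustrative sizes, not SmolVLM's real dims
projector = nn.Linear(vision_dim * 4, llm_dim)  # stand-in for the MLP projection

patches = torch.randn(1, 1024, vision_dim)      # e.g. a 32 x 32 grid of encoded patches
visual_tokens = projector(pixel_shuffle_tokens(patches, ratio=2))
print(visual_tokens.shape)                      # torch.Size([1, 256, 2048]): 4x fewer tokens
```

After this step the visual tokens live in the same embedding space as the text tokens, so they can simply be interleaved with the text embeddings before being fed to the language model, as the caption describes.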

Complete Guide to XGBoost Algorithm with Python and Scikit-learn

Understanding the XGBoost Algorithm with Detailed Explanation and Python Implementation

XGBoost, short for "Extreme Gradient Boosting", is a powerful algorithm widely used in machine learning, especially for regression and classification problems. It is known for delivering high performance and is frequently used in Kaggle competitions. In this article, we'll explore XGBoost's key features, a basic Python implementation, and a practical example using the Scikit-learn library.

Key Features of XGBoost
- Boosting: Combines multiple weak learners (typically decision trees) sequentially to create a strong learner. Each tree corrects the errors of the previous one.
- Gradient Boosting: Adds trees that fit the gradient of the loss function, optimizing via gradient descent in function space.
- Regularization: Applies L1 and L2 regularization to control model complexity and prevent overfitting.
- Tree Pruning: Limits tree depth and prunes splits whose gain is too small to justify the added complexity.
- Handling Missing Values: Aut...
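A minimal end-to-end example with the xgboost scikit-learn wrapper is shown below; the dataset and hyperparameters are illustrative stand-ins, not necessarily the ones used in the post's own walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=200,      # number of boosted trees
    max_depth=4,           # caps tree complexity
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    reg_lambda=1.0,        # L2 regularization on leaf weights
    eval_metric="logloss",
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The same estimator drops into scikit-learn pipelines, cross-validation, and grid search, which is a large part of why XGBoost is so common in applied work.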

llama-prompt-ops: Comprehensive Guide to Meta's Llama Prompt Optimization Toolkit

llama-prompt-ops: A Full Guide to Meta's Prompt Optimization Toolkit for Llama

Source: https://github.com/meta-llama/llama-prompt-ops

1. What is llama-prompt-ops?

llama-prompt-ops is an open-source Python package developed by Meta AI to streamline prompt optimization and conversion tailored for Llama models (such as Llama 2 and Llama 3). It helps automatically convert prompts written for other LLMs (like GPT or Claude) into a structure and format that performs better with Llama models. It also supports template-based rewrites and best practices recommended by Meta.

2. Key Features
- Cross-LLM prompt conversion: Automatically rewrite prompts from other models into Llama-compatible format
- Prompt structure optimization: Aligns prompts with Meta's recommended instruction templates
- Template-based generation: Predefined prompt templates for various use cases
- Instruction enhancement: Refines wording and formatting for better Llama co...

Gradient-Free Explanation AI for CNN Models

1. Introduction to Explainable Artificial Intelligence (XAI)

Explainable Artificial Intelligence (XAI) refers to techniques that make the decision-making process of AI models interpretable and understandable to humans. Despite their high performance, image classification models based on Convolutional Neural Networks (CNNs) have often been criticized for operating as opaque "black boxes." To address this challenge, Class Activation Mapping (CAM) techniques have been developed. CAM enables visual interpretation of which specific regions of an input image influenced a model's classification decision. These techniques are widely used for model interpretability, especially in critical fields like medical imaging, autonomous driving, and security, where trust and explainability are crucial. Methods such as CAM, Grad-CAM, and Score-CAM visually highlight the regions in an image that most contributed to the model's prediction, helping explain what features the CNN has focused o...
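Of the methods mentioned, Score-CAM fits the "gradient-free" theme directly: it masks the input with each upsampled activation map, scores the masked inputs on the target class, and combines the maps weighted by those scores, with no backward pass anywhere. The sketch below uses an untrained torchvision ResNet-18 purely to show the mechanics, so the resulting map is not meaningful; with pretrained weights and a real image it would highlight class-relevant regions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()    # untrained weights keep the sketch self-contained
image = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed input image
target_class = 0

# 1) Capture activation maps from the last conv block with a forward hook.
acts = {}
hook = model.layer4.register_forward_hook(lambda m, i, o: acts.update(out=o))
with torch.no_grad():
    model(image)
hook.remove()
maps = acts["out"][0]                    # (512, 7, 7) activation maps

# 2) Upsample each map to the input size and normalize it to [0, 1] to use as a mask.
masks = F.interpolate(maps[:, None], size=image.shape[-2:], mode="bilinear", align_corners=False)
lo, hi = masks.amin(dim=(2, 3), keepdim=True), masks.amax(dim=(2, 3), keepdim=True)
masks = (masks - lo) / (hi - lo + 1e-8)

# 3) Score each masked input on the target class -- no gradients involved (512 forward passes).
with torch.no_grad():
    scores = torch.stack([model(image * m)[0, target_class] for m in masks])

# 4) Weight the maps by the softmax of their scores and keep positive evidence.
weights = torch.softmax(scores, dim=0)
cam = F.relu((weights[:, None, None] * maps).sum(dim=0))
print(cam.shape)                         # torch.Size([7, 7]) saliency map for layer4
```

In practice the final map is upsampled to the image resolution and overlaid as a heatmap; the key point is that every step is a plain forward pass, which is what distinguishes Score-CAM from gradient-based methods like Grad-CAM.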

Retrieval-Augmented Generation (RAG) for Advanced ML Engineers

Understanding Retrieval-Augmented Generation (RAG): Architecture, Variants, and Best Practices

Retrieval-Augmented Generation (RAG) is a hybrid approach that combines large language models (LLMs) with external knowledge retrieval systems. Instead of relying solely on the parametric knowledge embedded in the LLM's weights, RAG enables dynamic, non-parametric access to external sources, most commonly via vector databases, allowing LLMs to generate factually grounded and context-rich responses.

The simplest form of RAG can be seen when a user of generative AI includes specific domain knowledge, such as a URL or a PDF document, along with their prompt to get more accurate responses. In this case, the user manually attaches external references to help the AI generate answers based on specialized information. A RAG system automates this process: it stores domain-specific documents in a database and, whenever a user asks a question, retrieves relevant information and appends it...
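The retrieve-then-augment loop can be sketched in a few lines. Here, TF-IDF similarity over an in-memory list stands in for the embedding model and vector database a production RAG system would use, and the documents and question are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical domain documents standing in for a vector database.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
    "Support is available Monday through Friday, 9am to 6pm KST.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, top_k=1):
    """Return the top_k documents most similar to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [documents[i] for i in sims.argsort()[::-1][:top_k]]

question = "How many API calls can I make per minute?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)   # this augmented prompt, not the bare question, is what gets sent to the LLM
```

Sending the augmented prompt instead of the bare question is exactly the automation of the manual "attach a document to the prompt" workflow described above.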