Learn about the z-test and p-value in statistics with detailed examples and Python code. Understand how they apply to Machine Learning and Deep Learning for model evaluation.

## What is a P-Value?

The p-value is a probability that measures the strength of the evidence against the null hypothesis. Specifically, it is the probability of observing a test statistic (like the z-score) at least as extreme as the one computed from your sample, assuming that the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.

Common thresholds to reject the null hypothesis are:

- p < 0.05: statistically significant
- p < 0.01: highly significant

## Python Example of Z-Test

Let's assume we want to test whether the mean of a sample differs from a known population mean:

```python
import numpy as np
from scipy import stats

# Sample data
sample = [2.9, 3.0, 2.5, 3.2, 3.8, 3.5]
mu = 3.0     # Population mean
sigma = 0.5  # Population std dev
```
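The snippet above stops at the setup, so here is a minimal sketch of the remaining computation, assuming a two-tailed one-sample z-test that reuses `sample`, `mu`, and `sigma` from above:

```python
# z-score: how many standard errors the sample mean lies from mu
z = (np.mean(sample) - mu) / (sigma / np.sqrt(len(sample)))

# Two-tailed p-value: twice the upper-tail probability of |z|
# under the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the sample mean differs from mu.")
else:
    print("Fail to reject the null hypothesis.")
```

For this particular sample the p-value comes out well above 0.05, so the test fails to reject the null hypothesis.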
Distance metrics play a crucial role in machine learning, especially in tasks like clustering, classification, and recommendation systems. In this blog, we will explore popular distance metrics including Cosine, Euclidean, Mahalanobis, Hellinger, Jaccard, Manhattan, Correlation, Dice, and Hamming distances. We will also provide PyTorch implementations for each metric.

## 1. Cosine Distance

Measures dissimilarity as one minus the cosine of the angle between two non-zero vectors. Often used in text similarity and document clustering.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])

# cosine_similarity expects a batch dimension, hence unsqueeze(0)
cosine_distance = 1 - torch.nn.functional.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))
```

## 2. Euclidean Distance

Represents the straight-line distance between two points in Euclidean space.

```python
euclidean_distance = torch.dist(x, y, p=2)
```

## 3. Mahalanobis Distance

Accounts for the correlation between variables and scales distances accordingly. Useful in anomaly detection.
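The Mahalanobis distance needs the inverse covariance matrix of the feature space, which cannot be meaningfully estimated from the two vectors alone. Here is a minimal sketch, assuming a hypothetical reference dataset `data` from which the covariance is estimated (replace it with your actual observations):

```python
# Hypothetical reference sample used to estimate the 3x3 covariance
# of the feature space
data = torch.randn(100, 3)

cov = torch.cov(data.T)          # torch.cov expects variables as rows
cov_inv = torch.linalg.inv(cov)  # inverse covariance matrix

diff = (x - y).unsqueeze(0)      # shape (1, 3)
# d(x, y) = sqrt((x - y)^T * cov_inv * (x - y))
mahalanobis_distance = torch.sqrt(diff @ cov_inv @ diff.T).squeeze()
```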