
Posts

Showing posts with the label neural networks

What is a Transformer? Understanding Transformer Architecture in NLP

What is a Transformer? The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." It revolutionized natural language processing by replacing sequential models such as RNNs and LSTMs. Transformers process entire sequences in parallel using self-attention, which captures long-range dependencies that sequential models struggle to learn and removes the computational bottleneck of step-by-step processing. 1. Transformer Components and Overcoming RNN/LSTM Limitations The Transformer is composed of an encoder and a decoder, with each block consisting of the following key components: Self-Attention: Learns the relationships between tokens within the input sequence by letting each token attend to all others, capturing long-range dependencies and rich contextual information. Multi-Head Attention (MHA): Splits self-attention into multiple parallel heads. Each head focuses o...
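As a rough illustration of the self-attention described above, here is a minimal NumPy sketch of single-head scaled dot-product attention. The matrix names and dimensions are assumptions chosen for the example, not the paper's full multi-head formulation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted mix of value vectors

# Toy example: 4 tokens with embedding size 8, using the embeddings as Q, K, and V
tokens = np.random.randn(4, 8)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8) -- each token is now a context-aware mixture of all tokens
```

Because every token attends to every other token in one matrix multiplication, the whole sequence is processed in parallel rather than step by step.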

Understanding Softmax in Deep Learning: A Beginner's Guide

What is Softmax? Softmax is a mathematical function that transforms a vector of real-valued scores (logits) into a probability distribution over predicted output classes. It is commonly used in the output layer of classification models, especially in multi-class classification problems. Mathematical Definition Given a vector of real numbers $z=[z_1,z_2,...,z_K]$, the softmax function outputs a vector $\sigma(z)$ where: $\sigma(z)_i=\frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \quad \text{for } i=1,...,K$ Each element $\sigma(z)_i\in (0,1)$ and the elements sum to 1: $\sum_{i=1}^{K}\sigma(z)_i=1$. Why Use Softmax? It converts raw scores (logits) into probabilities. It helps the model assign confidence to predictions. It is differentiable, enabling gradient-based optimization during training. Impact on Model Performance Classification Accuracy In combination with the cross-entropy loss, softmax all...
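Here is a short NumPy sketch of the formula above. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick added for illustration; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a vector of logits z to a probability distribution."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()          # subtract max for numerical stability (same output)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)         # e.g. [0.659 0.242 0.099]
print(probs.sum())   # 1.0
```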

Convolution Operator and Layer Explained in Deep Learning

What is a Convolution Layer in Deep Learning? A convolution layer is a building block of Convolutional Neural Networks (CNNs). It is mostly used to process image data. Instead of connecting every pixel of the input to every neuron (as in a fully connected layer), a convolution layer slides a small filter (kernel) across the image and extracts features such as edges, textures, or patterns. Key Terms Input: The image or feature map (e.g., 6x6 pixels). Kernel (Filter): A small matrix (e.g., 3x3 or 5x5) that moves across the image. Stride: How many steps the filter moves at a time. Padding: Adding extra pixels around the image to control the output size. Feature Map: The result of the convolution operation. How Convolution Works Let's walk through an example with no padding and stride = 1. 1. Input: 5x5 Matrix Input: [ [9, 4, 1, 6, 5], [1, 1, 1, 0, 2], [1, 2, 1, 1, 3], [2, 1, 0, 3, 0], [1, 4, 2, 5, 6] ] 2. Kernel: ...
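As a sketch of the sliding-window operation described above, here is a plain NumPy implementation with stride 1 and no padding, applied to the 5x5 input shown. The 3x3 kernel values are assumed for illustration, since the post's own kernel is cut off in this excerpt.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution, as performed by a convolution layer."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the current window by the kernel, then sum
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.array([[9, 4, 1, 6, 5],
                  [1, 1, 1, 0, 2],
                  [1, 2, 1, 1, 3],
                  [2, 1, 0, 3, 0],
                  [1, 4, 2, 5, 6]])
kernel = np.array([[1, 0, -1],      # example edge-style kernel (assumed values)
                   [1, 0, -1],
                   [1, 0, -1]])
print(conv2d(image, kernel))        # 3x3 feature map from a 5x5 input and 3x3 kernel
```

With no padding and stride 1, a 5x5 input and a 3x3 kernel give a (5-3+1) x (5-3+1) = 3x3 feature map.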

What is the Vanishing Gradient Problem in Deep Learning?

Vanishing Gradient is a common problem in training deep neural networks, especially in very deep architectures. It makes it difficult for the model to learn from data during training. What is Vanishing Gradient? In deep learning, training happens through a method called backpropagation, where the model adjusts its weights using gradients (a kind of slope) of the loss function with respect to each weight. These gradients tell the model how much to change each weight to improve performance. However, in deep neural networks (many layers), the gradients can get very small as they are propagated backward through the layers. This is called vanishing gradient. As a result: Early layers (closer to the input) receive almost no updates. The network stops learning or learns very slowly. When Does Vanishing Gradient Happen? Very Deep Networks: The more layers, the more chance gradients will shrink as th...
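A small sketch of why gradients shrink: backpropagation multiplies one local derivative per layer (chain rule), and if each factor is small the product collapses toward zero for the early layers. The sigmoid activation, whose derivative never exceeds 0.25, is an assumption chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # at most 0.25

# Multiply one local derivative per layer, as backpropagation does via the chain rule.
np.random.seed(0)
pre_activations = np.random.randn(30)   # one toy pre-activation per layer
gradient = 1.0
for depth, x in enumerate(pre_activations, start=1):
    gradient *= sigmoid_derivative(x)
    if depth in (1, 5, 10, 20, 30):
        print(f"after {depth:2d} layers: gradient factor ~ {gradient:.2e}")
```

The printed factor drops by orders of magnitude with depth, which is why the earliest layers receive almost no weight updates.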

Overfitting vs Underfitting in Deep Learning: Key Differences

When training a deep learning model, you want it to learn patterns from the training data so it can make accurate predictions on new, unseen data. However, sometimes a model learns too little or too much; this leads to underfitting or overfitting. Let's break them down in simple terms, backed by examples, visuals, and some light math. 1. What Is the Goal of Training a Model? Imagine you're trying to teach a model to predict house prices based on features like size, location, and number of rooms. Your goal is to find a function $f(x)$ that maps your input features $x$ (like size, rooms) to a prediction $\hat{y}$ (the house price), such that the prediction is close to the actual price $y$: $\hat{y} = f(x;\theta)$, with the error measured by $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. 2. Underfitting Underfitting happens when your model is too simple to capture the patterns in the data. It doesn't learn enough from the training data and performs poorl...
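Here is a small sketch contrasting an underfit and an overfit model on a toy regression task scored with the MSE above. The quadratic ground truth, the noise level, and the polynomial degrees are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: quadratic ground truth plus noise (assumed for this example)
x_train = np.linspace(-3, 3, 20)
y_train = 0.5 * x_train**2 + rng.normal(0, 1.0, size=x_train.shape)
x_test = np.linspace(-3, 3, 50)
y_test = 0.5 * x_test**2 + rng.normal(0, 1.0, size=x_test.shape)

def mse(y_true, y_pred):
    """Mean squared error, as defined above."""
    return np.mean((y_true - y_pred) ** 2)

# Fit polynomials of increasing flexibility to the same training data
for degree in (1, 2, 9):   # too simple, about right, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: train MSE = {train_err:.2f}, test MSE = {test_err:.2f}")

# Typical pattern: degree 1 underfits (high error on both sets),
# degree 9 overfits (low training error, noticeably higher test error).
```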