
Posts

Showing posts with the label neural networks

What is a Transformer? Understanding Transformer Architecture in NLP

What is a Transformer? The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." It revolutionized natural language processing by replacing sequential models such as RNNs and LSTMs. Transformers process entire sequences in parallel using self-attention, which captures long-range dependencies that sequential models struggle to learn and removes the computational bottleneck of step-by-step processing. 1. Transformer Components and Overcoming RNN/LSTM Limitations The Transformer is composed of an encoder and a decoder, with each block consisting of the following key components: Self-Attention: Learns the relationships between tokens within the input sequence by letting each token attend to all others, capturing long-range dependencies and rich contextual information. Multi-Head Attention (MHA): Splits self-attention into multiple parallel heads. Each head focuses o...
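As a rough illustration of the self-attention described above, here is a minimal NumPy sketch of single-head scaled dot-product attention. The matrix names and dimensions are assumptions chosen for the example, not the paper's full multi-head formulation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted mix of value vectors

# Toy example: 4 tokens with embedding size 8, using the embeddings as Q, K, and V
tokens = np.random.randn(4, 8)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8) -- each token is now a context-aware mixture of all tokens
```

Because every token attends to every other token in one matrix multiplication, the whole sequence is processed in parallel rather than step by step.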

Understanding Softmax in Deep Learning: A Beginner's Guide

What is Softmax? Softmax is a mathematical function that transforms a vector of real-valued scores (logits) into a probability distribution over predicted output classes. It is commonly used in the output layer of classification models, especially in multi-class classification problems. Mathematical Definition Given a vector of real numbers $z=[z_1,z_2,...,z_K]$, the softmax function outputs a vector $\sigma(z)$ where: $\sigma(z)_i=\frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \quad \text{for } i=1,...,K$ Each element $\sigma(z)_i\in (0,1)$ and the elements sum to 1: $\sum_{i=1}^{K}\sigma(z)_i=1$. Why Use Softmax? It converts raw scores (logits) into probabilities. It helps the model assign confidence to predictions. It is differentiable, enabling gradient-based optimization during training. Impact on Model Performance Classification Accuracy In combination with the cross-entropy loss, softmax all...
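Here is a short NumPy sketch of the formula above. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick added for illustration; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a vector of logits z to a probability distribution."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()          # subtract max for numerical stability (same output)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)         # e.g. [0.659 0.242 0.099]
print(probs.sum())   # 1.0
```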

Convolution Operator and Layer Explained in Deep Learning

What is a Convolution Layer in Deep Learning? A convolution layer is a building block of Convolutional Neural Networks (CNNs). It is mostly used to process image data. Instead of connecting every pixel of the input to every neuron (as in a fully connected layer), a convolution layer slides a small filter (kernel) across the image and extracts features such as edges, textures, or patterns. Key Terms Input: The image or feature map (e.g., 6x6 pixels). Kernel (Filter): A small matrix (e.g., 3x3 or 5x5) that moves across the image. Stride: How many steps the filter moves at a time. Padding: Adding extra pixels around the image to control the output size. Feature Map: The result of the convolution operation. How Convolution Works Let's walk through an example with no padding and stride = 1. 1. Input: 5x5 Matrix Input: [ [9, 4, 1, 6, 5], [1, 1, 1, 0, 2], [1, 2, 1, 1, 3], [2, 1, 0, 3, 0], [1, 4, 2, 5, 6] ] 2. Kernel: ...
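As a sketch of the sliding-window operation described above, here is a plain NumPy implementation with stride 1 and no padding, applied to the 5x5 input shown. The 3x3 kernel values are assumed for illustration, since the post's own kernel is cut off in this excerpt.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid (no-padding) 2D convolution, as performed by a convolution layer."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the current window by the kernel, then sum
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.array([[9, 4, 1, 6, 5],
                  [1, 1, 1, 0, 2],
                  [1, 2, 1, 1, 3],
                  [2, 1, 0, 3, 0],
                  [1, 4, 2, 5, 6]])
kernel = np.array([[1, 0, -1],      # example edge-style kernel (assumed values)
                   [1, 0, -1],
                   [1, 0, -1]])
print(conv2d(image, kernel))        # 3x3 feature map from a 5x5 input and 3x3 kernel
```

With no padding and stride 1, a 5x5 input and a 3x3 kernel give a (5-3+1) x (5-3+1) = 3x3 feature map.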

What is the Vanishing Gradient Problem in Deep Learning?

Vanishing Gradient is a common problem in training deep neural networks, especially in very deep architectures. It makes it difficult for the model to learn from data during training. What is Vanishing Gradient? In deep learning, training happens through a method called backpropagation, where the model adjusts its weights using gradients (a kind of slope) of the loss function with respect to each weight. These gradients tell the model how much to change each weight to improve performance. However, in deep neural networks (many layers), the gradients can get very small as they are propagated backward through the layers. This is called vanishing gradient. As a result: Early layers (closer to the input) receive almost no updates. The network stops learning or learns very slowly. When Does Vanishing Gradient Happen? Very Deep Networks: The more layers, the more chance gradients will shrink as th...
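A small sketch of why gradients shrink: backpropagation multiplies one local derivative per layer (chain rule), and if each factor is small the product collapses toward zero for the early layers. The sigmoid activation, whose derivative never exceeds 0.25, is an assumption chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # at most 0.25

# Multiply one local derivative per layer, as backpropagation does via the chain rule.
np.random.seed(0)
pre_activations = np.random.randn(30)   # one toy pre-activation per layer
gradient = 1.0
for depth, x in enumerate(pre_activations, start=1):
    gradient *= sigmoid_derivative(x)
    if depth in (1, 5, 10, 20, 30):
        print(f"after {depth:2d} layers: gradient factor ~ {gradient:.2e}")
```

The printed factor drops by orders of magnitude with depth, which is why the earliest layers receive almost no weight updates.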

Overfitting vs Underfitting in Deep Learning: Key Differences

When training a deep learning model, you want it to learn patterns from the training data so it can make accurate predictions on new, unseen data. However, sometimes a model learns too little or too much; this leads to underfitting or overfitting. Let's break them down in simple terms, backed by examples, visuals, and some light math. 1. What Is the Goal of Training a Model? Imagine you're trying to teach a model to predict house prices based on features like size, location, and number of rooms. Your goal is to find a function $f(x)$ that maps your input features $x$ (like size, rooms) to a prediction $\hat{y}$ (the house price), such that the prediction is close to the actual price $y$: $\hat{y} = f(x;\theta)$, with the error measured by $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. 2. Underfitting Underfitting happens when your model is too simple to capture the patterns in the data. It doesn't learn enough from the training data and performs poorl...
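Here is a small sketch contrasting an underfit and an overfit model on a toy regression task scored with the MSE above. The quadratic ground truth, the noise level, and the polynomial degrees are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: quadratic ground truth plus noise (assumed for this example)
x_train = np.linspace(-3, 3, 20)
y_train = 0.5 * x_train**2 + rng.normal(0, 1.0, size=x_train.shape)
x_test = np.linspace(-3, 3, 50)
y_test = 0.5 * x_test**2 + rng.normal(0, 1.0, size=x_test.shape)

def mse(y_true, y_pred):
    """Mean squared error, as defined above."""
    return np.mean((y_true - y_pred) ** 2)

# Fit polynomials of increasing flexibility to the same training data
for degree in (1, 2, 9):   # too simple, about right, too flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: train MSE = {train_err:.2f}, test MSE = {test_err:.2f}")

# Typical pattern: degree 1 underfits (high error on both sets),
# degree 9 overfits (low training error, noticeably higher test error).
```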