
What is a Transformer? Understanding Transformer Architecture in NLP

What is a Transformer?

The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper "Attention is All You Need." It revolutionized natural language processing by replacing sequential models like RNNs and LSTMs. Transformers process entire sequences in parallel using self-attention, which makes long-range dependencies easier to learn and removes the computational bottleneck of step-by-step sequential processing.


1. Transformer Components and Overcoming RNN/LSTM Limitations

The Transformer is composed of an encoder and a decoder, with each block consisting of the following key components:

  • Self-Attention: Learns the relationships between tokens within the input sequence by enabling each token to attend to all others, effectively capturing long-range dependencies and rich contextual information.
  • Multi-Head Attention (MHA): Splits self-attention into multiple parallel heads, each attending to different aspects of the relationships in the input. The head outputs are concatenated and combined through a learned output projection.
  • Positional Encoding: Since the Transformer does not inherently process sequences in order, positional encoding provides information about the relative or absolute position of tokens in the input.
  • Feedforward Network (FFN): Applies non-linear transformations to the features from the attention layers to deepen the network and enhance semantic understanding.
  • Layer Normalization & Residual Connections (Add & Norm): Improves training stability and convergence by combining skip connections with layer normalization, which normalizes across the feature dimension and therefore works independently of batch size.

Traditional RNNs and LSTMs parallelize poorly because they process tokens one at a time, and they struggle to retain information across long sequences. The Transformer overcomes both issues with parallel computation and self-attention, enabling efficient training of large-scale models on large datasets.
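
To make these components concrete, below is a minimal PyTorch sketch of a single encoder block that wires self-attention, the feedforward network, and Add & Norm together. The dimensions (d_model = 512, 8 heads, d_ff = 2048) follow the original paper, but the class itself is an illustrative sketch, not a reference implementation.

```python
# A minimal encoder-block sketch, assuming PyTorch is available.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer, then Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feedforward sub-layer, then Add & Norm
        return self.norm2(x + self.drop(self.ffn(x)))

x = torch.randn(2, 10, 512)       # (batch, sequence length, d_model)
print(EncoderBlock()(x).shape)    # torch.Size([2, 10, 512])
```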


2. Self-Attention Mechanism: Query, Key, Value

Self-Attention lets each token in a sequence weigh its relationships with all other tokens, capturing semantic associations and context that depend on the whole sentence. For example, in the sentence “The cat sat on the mat,” the word “cat” may be more closely related to “sat” than to “mat.” Each token is transformed into three vectors:

  • Query (Q): Represents what the token is looking for, determining how much attention it pays to other tokens.
  • Key (K): Represents what the token offers for other tokens' queries to match against, determining how much attention it receives.
  • Value (V): The actual information carried by the token.

These are computed using learnable matrices $W^Q, W^K, W^V$.

2.1. Embedding to Query, Key, and Value

$Q = XW^Q,\quad K = XW^K,\quad V = XW^V$

Where:

  • X: Embedding matrix of the input sequence (size: $n \times d$, where $n$ is the number of tokens and $d$ is the embedding dimension)
  • $W^Q, W^K, W^V$: Learnable matrices of size $d \times d_k$
  • Q, K, V: Output matrices of size $n \times d_k$
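
As a concrete illustration, here is a small NumPy sketch of these projections. The sizes ($n=4$, $d=8$, $d_k=4$) and the random matrices are purely illustrative stand-ins for learned parameters.

```python
# Projecting token embeddings X into Q, K, V with (here random) weight matrices.
import numpy as np

n, d, d_k = 4, 8, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))        # embedding matrix of the input sequence
W_Q = rng.normal(size=(d, d_k))    # stand-in for the learnable W^Q
W_K = rng.normal(size=(d, d_k))    # stand-in for the learnable W^K
W_V = rng.normal(size=(d, d_k))    # stand-in for the learnable W^V

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # (4, 4) (4, 4) (4, 4)
```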

2.2. Attention Score Computation (Dot Product of Q and K)

Computing similarity by taking the dot product between each Query and all Keys:

$\text{Attention Scores} = QK^T$

Size: $n \times n$ (similarity matrix for all token pairs)

2.3. Scaling

Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large; without it, the softmax saturates and produces near-zero gradients:

$\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}$

2.4. Applying Softmax

Softmax is applied to transform scores into attention weights:

$\alpha_{ij} = \text{softmax}_j\left( \frac{q_i \cdot k_j}{\sqrt{d_k}} \right)$

These weights indicate how strongly token $i$ attends to token $j$; each row sums to 1.

2.5. Computing the Final Output with Weighted Sum

Multiply the weights with corresponding Value vectors to get the output:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Output: A new representation of size $n \times d_k$
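
Putting steps 2.2 through 2.5 together, a compact NumPy sketch (with illustrative random inputs standing in for real projections) looks like this:

```python
# Scaled dot-product attention: scores, scaling, softmax, weighted sum.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # n x n scaled similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # n x d_k output representation

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 4)
```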


3. Multi-Head Attention vs Single-Head Attention

Multi-Head Attention is a core component of the Transformer that allows simultaneous learning of different contextual relationships. It captures linguistic subtleties and semantic diversity that a single attention head might miss. Simply put, it performs multiple self-attentions in parallel and combines the results. Each head learns its own Query (Q), Key (K), and Value (V) representations.

3.1. Computation Process

For input embedding X, each attention head computes:

  • Learnable projection matrices: $W_i^Q, W_i^K, W_i^V$

$Q_i = XW_i^Q,\quad K_i = XW_i^K,\quad V_i = XW_i^V$

$\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$

This is done for h heads in parallel. The outputs are concatenated and passed through a final linear projection:


$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$

Where $W^O$ is the learned output projection matrix.
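
A minimal NumPy sketch of the whole multi-head computation, with random matrices standing in for the learned projections $W_i^Q, W_i^K, W_i^V$ and $W^O$, might look like this:

```python
# Multi-head attention: h independent heads, concatenated and projected by W^O.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, h=4, seed=0):
    n, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        # Per-head projections (stand-ins for learned W_i^Q, W_i^K, W_i^V)
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # head_i: n x d_k
    W_O = rng.normal(size=(d_model, d_model))                # output projection
    return np.concatenate(heads, axis=-1) @ W_O              # n x d_model

X = np.random.default_rng(1).normal(size=(6, 16))
print(multi_head_attention(X).shape)                         # (6, 16)
```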

3.2. Advantages of Multi-Head Attention

  • Learn diverse relationships: Each head can focus on different types of information like grammar, semantics, and position.
  • Parallel computation: All attention operations are performed in parallel, unlike RNNs.
  • Improved expressiveness: Combines multiple perspectives for a richer representation.

3.3. Comparison: Single-Head vs Multi-Head

| Aspect | Single-Head Attention | Multi-Head Attention |
|---|---|---|
| Learning scope | One representation space | Multiple representation spaces |
| Diversity of information | Limited | Rich |
| Efficiency | Simpler | Efficient via parallelism |
| Expressiveness | Lower | Higher |

3.4. Example Understanding

Sentence: “The cat didn’t cross the road because it was tired.”

  • One head may focus on the relationship between “cat” and “tired” (resolving that “it” refers to the cat)
  • Another head may capture the causal link between “tired” and “didn’t cross” signaled by “because”

3.5. Summary of Multi-Head Attention Core Ideas

  • Enables learning of multiple relationships simultaneously
  • Computes Q, K, V in separate subspaces
  • Concatenates the results and applies a final linear transformation

4. The Necessity of Positional Encoding and Absolute vs Relative Encoding

Unlike RNNs, Transformers process all input tokens in parallel and therefore have no built-in notion of token order. Positional Encoding was introduced to address this: explicit position information is added to each token's embedding so the model can recover the grammatical and semantic structure that depends on word order.

4.1. Absolute Positional Encoding

In the original Transformer paper ("Attention is All You Need"), absolute positional encoding using sine and cosine functions is used. For each position pos and embedding dimension i, the values are calculated as follows:

$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Because nearby positions produce similar sinusoidal patterns, this method indirectly encodes the relative distance between positions, and the model can infer order from the periodic functions. It is static and non-trainable, adding order information directly to the embeddings. Its advantages are simplicity and the fact that encodings for any position can be computed at inference time without extra parameters; its drawbacks are weaker generalization to sequence lengths not seen during training and the lack of an explicit representation of relative position.
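
For illustration, here is a short NumPy sketch that builds the sinusoidal table defined above; the maximum length and model dimension are arbitrary example values.

```python
# Sinusoidal (absolute) positional encoding: sin on even dims, cos on odd dims.
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices (2i)
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                    # (50, 16)
```

The resulting matrix is simply added to the token embeddings before the first encoder layer.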

4.2. Relative Positional Encoding

Relative positional encoding encodes the distance between tokens rather than their absolute positions. This enables stronger generalization regardless of sentence length. Representative methods include RoPE (Rotary Positional Embedding) and ALiBi (Attention with Linear Biases).
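
To make the rotary idea concrete, here is a minimal NumPy sketch of a RoPE-style rotation applied to a single query or key vector; the base of 10000 follows the RoFormer paper, while the vector and positions are illustrative.

```python
# RoPE sketch: rotate consecutive dimension pairs by a position-dependent angle,
# so dot products between rotated queries and keys depend on relative offsets.
import numpy as np

def apply_rope(x, pos):
    d = x.shape[-1]                                    # must be even
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)       # per-pair base frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin             # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

q = np.ones(8)
print(apply_rope(q, pos=0)[:4])   # position 0: no rotation
print(apply_rope(q, pos=5)[:4])   # same content, rotated for position 5
```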

4.3. Summary Comparison: Absolute vs Relative Encoding

| Category | Absolute Encoding | Relative Encoding (e.g., RoPE, ALiBi) |
|---|---|---|
| Reference basis | Fixed token position | Distance between tokens |
| Generalization to unseen lengths | Lower | Higher |
| Trainability | Sinusoidal: fixed; learned absolute embeddings (e.g., BERT): trainable | RoPE, ALiBi: non-trainable; T5-style relative biases: learned |
| Representative models | BERT, original Transformer | GPT-NeoX, LLaMA (RoPE); T5 (learned relative biases) |

5. Comparison of Encoder-Only, Decoder-Only, and Encoder-Decoder Architectures

Encoder-only (e.g., BERT): Processes the entire input at once and is suitable for understanding-oriented tasks such as sentence classification and question answering.

Decoder-only (e.g., GPT): Uses an autoregressive structure to predict the next word based on previous words, making it strong for generation tasks.

Encoder-decoder (e.g., T5, BART): Encodes the input with the encoder and generates the output sequence with the decoder. It is suitable for sequence-to-sequence tasks like machine translation, summarization, and question answering.
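
The key mechanical difference for decoder-only models is the causal mask, which prevents each position from attending to later positions. Below is a small NumPy sketch of masked (autoregressive) attention with illustrative random inputs.

```python
# Causal self-attention: position i may only attend to positions j <= i.
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(causal_attention(Q, K, V).shape)                   # (5, 4)
```

Encoder-only models skip this mask entirely, and encoder-decoder models apply it only in the decoder, alongside cross-attention from the decoder to the encoder outputs.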


6. Computational Limitations of Standard Transformers and Their Solutions

Standard Transformers have time and memory complexity proportional to the square of the input length (O(n²)) in attention operations. Major approaches to address this include:

  • Sparse Attention: Compute attention only between selected token pairs, for example within a local window (e.g., Longformer, BigBird; Reformer restricts pairs via locality-sensitive hashing). A sliding-window mask sketch follows this list.
  • Linear Attention: Reformulate the attention computation so that time complexity grows linearly with sequence length (e.g., Performer, Linear Transformer)
  • Compressed Attention: Project keys and values into a lower-dimensional space to reduce computation (e.g., Linformer)
  • Segment-based Processing: Split sequences into chunks and process them recurrently (e.g., RetNet, RWKV)
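
As an illustration of the sparse-attention idea mentioned above, a sliding-window mask restricts each token to a fixed neighborhood, cutting the number of scored pairs from $n^2$ to roughly $n(2w+1)$; the window size below is an arbitrary example.

```python
# Sliding-window (local) attention mask: token i may attend to token j only if |i - j| <= w.
import numpy as np

def sliding_window_mask(n, w):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w    # True where attention is allowed

mask = sliding_window_mask(n=8, w=2)
print(mask.astype(int))
print("scored pairs:", int(mask.sum()), "of", 8 * 8)
```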

References

  • Vaswani et al. (2017), "Attention is All You Need"
  • Google AI Blog: Self-Attention Mechanism
  • HuggingFace Transformers Documentation
  • Papers on RoFormer, Performer, BigBird, Linformer
