What is a Transformer?
The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper "Attention is All You Need." It revolutionized natural language processing by replacing sequential models such as RNNs and LSTMs. Instead of processing tokens one at a time, a Transformer processes the entire sequence in parallel using self-attention, which makes long-range dependencies easier to learn and removes the computational bottleneck of step-by-step sequential processing.
1. Transformer Components and Overcoming RNN/LSTM Limitations
The Transformer is composed of an encoder and a decoder, with each block consisting of the following key components:
- Self-Attention: Learns the relationships between tokens within the input sequence by enabling each token to attend to all others, effectively capturing long-range dependencies and rich contextual information.
- Multi-Head Attention (MHA): Splits self-attention into multiple parallel heads, each of which focuses on a different subset of relationships in the input. The head outputs are concatenated and combined through a learned linear projection.
- Positional Encoding: Since the Transformer does not inherently process sequences in order, positional encoding provides information about the relative or absolute position of tokens in the input.
- Feedforward Network (FFN): Applies non-linear transformations to the features from the attention layers to deepen the network and enhance semantic understanding.
- Layer Normalization & Residual Connections (Add & Norm): Improve training stability and convergence through skip connections and layer normalization, which (unlike batch normalization) does not depend on the batch size.
Traditional RNNs and LSTMs suffer from poor parallelization and long-sequence memory degradation. The Transformer overcomes these issues with parallel operations and self-attention, enabling efficient handling of large-scale models and datasets.
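To make the data flow concrete, below is a minimal NumPy sketch of a single encoder block that combines these components (single-head attention for brevity; all names, dimensions, and random weights are illustrative assumptions, not a reference implementation):
```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (independent of batch size).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention (details in Section 2).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise FFN with a ReLU non-linearity.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def encoder_block(x, p):
    # Attention sub-layer with residual connection and layer norm ...
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # ... followed by the FFN sub-layer with its own Add & Norm.
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

# Toy example: 4 tokens, model width 8, FFN width 16.
n, d_model, d_ff = 4, 8, 16
rng = np.random.default_rng(0)
params = {
    "w_q": rng.normal(size=(d_model, d_model)),
    "w_k": rng.normal(size=(d_model, d_model)),
    "w_v": rng.normal(size=(d_model, d_model)),
    "w1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
x = rng.normal(size=(n, d_model))
print(encoder_block(x, params).shape)  # (4, 8)
```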
2. Self-Attention Mechanism: Query, Key, Value
Self-Attention allows each token in a sequence to learn the relationships with all other tokens, capturing semantic associations and dynamic context. For example, in the sentence “The cat sat on the mat,” the word “cat” may be more closely related to “sat” than to “mat.” Each token is transformed into three vectors:
- Query (Q): Determines how much attention the token should pay to other tokens.
- Key (K): Represents how much a token should be attended to by others.
- Value (V): The actual information carried by the token.
These are computed using learnable matrices $W^Q, W^K, W^V$.
2.1. Embedding to Query, Key, and Value
$Q = XW^Q,\quad K = XW^K,\quad V = XW^V$
Where:
- X: Embedding matrix of input sequence (size: $n \times d$)
- $W^Q, W^K, W^V$: Learnable matrices of size $d \times d_k$
- Q, K, V: Output matrices of size $n \times d_k$
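As a concrete illustration of these projections (the dimensions and random weights below are illustrative assumptions), each matrix comes from a single matrix multiplication:
```python
import numpy as np

n, d, d_k = 4, 8, 8                       # sequence length, embedding dim, projection dim
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))               # token embedding matrix
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each of shape (n, d_k)
print(Q.shape, K.shape, V.shape)          # (4, 8) (4, 8) (4, 8)
```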
2.2. Attention Score Computation (Dot Product of Q and K)
Computing similarity by taking the dot product between each Query and all Keys:
$\text{Attention Scores} = QK^T$
Size: $n \times n$ (similarity matrix for all token pairs)
2.3. Scaling
To keep the softmax from saturating when dot products become large (which would make gradients vanish), the scores are divided by $\sqrt{d_k}$:
$\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}$
2.4. Applying Softmax
Softmax is applied to transform scores into attention weights:
$\alpha_{ij} = \text{softmax}\left( \frac{q_i \cdot k_j}{\sqrt{d_k}} \right)$
Each weight $\alpha_{ij}$ indicates how strongly token $i$ attends to token $j$; the weights in each row sum to 1.
2.5. Computing the Final Output with Weighted Sum
Multiply the weights with corresponding Value vectors to get the output:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Output: A new representation of size $n \times d_k$
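Putting steps 2.1 through 2.5 together, a minimal NumPy sketch of scaled dot-product attention might look as follows (function and variable names are illustrative assumptions):
```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with max-subtraction for numerical stability.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) similarity matrix, scaled
    weights = softmax(scores)             # attention weights, rows sum to 1
    return weights @ V                    # (n, d_k) weighted sum of values

# Toy Q, K, V with the shapes from Section 2.1:
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```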
3. Multi-Head Attention vs Single-Head Attention
Multi-Head Attention is a core component of the Transformer that allows simultaneous learning of different contextual relationships. It captures linguistic subtleties and semantic diversity that a single attention head might miss. Simply put, it performs multiple self-attentions in parallel and combines the results. Each head learns its own Query (Q), Key (K), and Value (V) representations.
3.1. Computation Process
For input embedding $X$, each attention head $i$ has its own learnable weight matrices $W_i^Q, W_i^K, W_i^V$ and computes:
$Q_i = XW_i^Q,\quad K_i = XW_i^K,\quad V_i = XW_i^V$
$\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$
This is done for $h$ heads in parallel. The outputs are concatenated and passed through a final linear projection:
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$
where $W^O$ is the output projection matrix.
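A compact NumPy sketch of this computation (the head slicing, dimensions, and random weights are illustrative assumptions; production implementations batch all heads into a single tensor operation):
```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        # Each head uses its own slice of the projection matrices
        # (equivalent to separate W_i^Q, W_i^K, W_i^V).
        sl = slice(i * d_k, (i + 1) * d_k)
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    # Concatenate the h head outputs and apply the final projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o

n, d_model, h = 4, 8, 2
rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (4, 8)
```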
3.2. Advantages of Multi-Head Attention
- Learn diverse relationships: Each head can focus on different types of information like grammar, semantics, and position.
- Parallel computation: All attention operations are performed in parallel, unlike RNNs.
- Improved expressiveness: Combines multiple perspectives for a richer representation.
3.3. Comparison: Single-Head vs Multi-Head
Aspect | Single-Head Attention | Multi-Head Attention |
---|---|---|
Learning Scope | One representation space | Multiple representation spaces |
Diversity of Information | Limited | Rich |
Efficiency | Simpler | Efficient via parallelism |
Expressiveness | Lower | Higher |
3.4. Example Understanding
Sentence: “The cat didn’t cross the road because it was tired.”
- One head may focus on the relationship between “cat” and “tired” (resolving what “it” refers to)
- Another head may capture the causal link between “was tired” and “didn’t cross”
3.5. Summary of Multi-Head Attention Core Ideas
- Enables learning of multiple relationships simultaneously
- Computes Q, K, V in separate subspaces
- Concatenates the results and applies a final linear transformation
4. The Necessity of Positional Encoding and Absolute vs Relative Encoding
Transformers do not inherently consider the order of a sequence, so positional information must be provided explicitly. Unlike RNNs, which consume tokens one after another, Transformers process all input tokens in parallel and therefore cannot tell from the computation alone which word came first. Positional Encoding adds this ordering information to each token so the model can capture grammatical and semantic structure correctly.
4.1. Absolute Positional Encoding
In the original Transformer paper ("Attention is All You Need"), absolute positional encoding using sine and cosine functions is used. For each position $pos$ and embedding dimension index $i$, the values are calculated as follows:
$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
This method indirectly encodes the relative distance between positions, allowing the model to infer token positions from periodic functions. It is static and non-trainable, and directly adds order information. Advantages include simplicity, intuitiveness, and reusability during inference, but it may generalize poorly to sequences of different lengths and does not clearly encode relative positions within sentences.
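A short NumPy sketch of this sinusoidal encoding (the function name and toy dimensions are illustrative assumptions):
```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(n_positions)[:, None]             # (n, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(n_positions=6, d_model=8)
print(pe.shape)  # (6, 8); added element-wise to the token embeddings
```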
4.2. Relative Positional Encoding
Relative positional encoding encodes the distance between tokens rather than their absolute positions. This enables stronger generalization regardless of sentence length. Representative methods include RoPE (Rotary Positional Embedding) and ALiBi (Attention with Linear Biases).
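To give a flavour of RoPE, here is a simplified NumPy sketch of the core rotation (real implementations apply it per attention head; the names and dimensions here are illustrative assumptions). The key property is that after rotating Q and K, their dot product depends only on the relative offset between positions:
```python
import numpy as np

def rotary_embedding(x):
    # Rotate each (even, odd) dimension pair of the token at position `pos`
    # by an angle pos * theta_i, where theta_i = 10000^(-2i/d).
    n, d = x.shape
    pos = np.arange(n)[:, None]                      # (n, 1)
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)     # (d/2,)
    angles = pos * theta                             # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Applied to Q and K before the dot product, the score q_i . k_j then depends
# only on the relative offset i - j, not on the absolute positions.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
scores = rotary_embedding(Q) @ rotary_embedding(K).T / np.sqrt(8)
print(scores.shape)  # (4, 4)
```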
4.3. Summary Comparison: Absolute vs Relative Encoding
Category | Absolute Encoding | Relative Encoding (RoPE / ALiBi) |
---|---|---|
Reference Basis | Fixed token position | Distance between tokens |
Generalization to longer sequences | Low | High |
Trainability | Fixed (sinusoidal) or learned (e.g., BERT) | RoPE and ALiBi: non-trainable; T5-style relative biases: learned |
Recent Usage | BERT, original Transformer | GPT-NeoX, LLaMA (RoPE); T5 (learned relative bias) |
5. Comparison of Encoder-Only, Decoder-Only, and Encoder-Decoder Architectures
- Encoder-only (e.g., BERT): Processes the entire input at once and is suited to understanding-oriented tasks such as sentence classification and extractive question answering.
- Decoder-only (e.g., GPT): Uses an autoregressive structure that predicts the next token from the previous ones, making it strong for generation tasks (see the causal-mask sketch after this list).
- Encoder-decoder (e.g., T5, BART): Encodes the input with the encoder and generates the output sequence with the decoder, which suits sequence-to-sequence tasks such as machine translation, summarization, and question answering.
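The mechanical difference in a decoder-only model is the causal mask: token $i$ may attend only to tokens $j \le i$. A minimal NumPy sketch (names and shapes are illustrative assumptions):
```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: token i may only attend to tokens j <= i,
    # which is what makes autoregressive next-token prediction possible.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # (5, 8)
```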
6. Computational Limitations of Standard Transformers and Their Solutions
Standard Transformers have time and memory complexity proportional to the square of the input length (O(n²)) in attention operations. Major approaches to address this include:
- Sparse Attention: Compute attention only between selected token pairs (e.g., Longformer, BigBird); a minimal local-window sketch follows this list
- Linear Attention: Refactor attention operations to reduce time complexity to linear (e.g., Performer, Linear Transformer)
- Compressed Attention: Reduce the effective number of keys/values to decrease computation, e.g., Linformer (low-rank projection of keys/values) and Reformer (locality-sensitive hashing to restrict which queries and keys interact)
- Recurrent / Segment-based Processing: Process the sequence chunk by chunk (or token by token) while carrying a recurrent state instead of attending over the full sequence (e.g., RetNet, RWKV)
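As an illustration of the sparse-attention idea, the sketch below masks the score matrix to a local window around each token (NumPy; purely illustrative, since it still materializes the full $n \times n$ matrix and therefore shows the attention pattern rather than the memory savings of a real implementation such as Longformer):
```python
import numpy as np

def local_window_attention(Q, K, V, window=2):
    # Sparse (local) attention: each token attends only to neighbours
    # within +/- `window` positions instead of all n tokens.
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(local_window_attention(Q, K, V).shape)  # (6, 8)
```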
References
- Vaswani et al. (2017), "Attention is All You Need"
- Google AI Blog: Self-Attention Mechanism
- HuggingFace Transformers Documentation
- Papers on RoFormer, Performer, BigBird, Linformer