
What is a Transformer? Understanding Transformer Architecture in NLP

What is a Transformer?

The Transformer is a neural network architecture introduced by Vaswani et al. in the 2017 paper "Attention is All You Need." It revolutionized natural language processing by replacing sequential models like RNNs and LSTMs. Transformers process entire sequences in parallel using self-attention, which makes long-range dependencies easier to learn and removes the computational bottleneck of step-by-step sequential processing.


1. Transformer Components and Overcoming RNN/LSTM Limitations

The Transformer is composed of an encoder and a decoder, with each block consisting of the following key components:

  • Self-Attention: Learns the relationships between tokens within the input sequence by enabling each token to attend to all others, effectively capturing long-range dependencies and rich contextual information.
  • Multi-Head Attention (MHA): Splits self-attention into multiple parallel heads, each attending to different aspects of the relationships in the input. The head outputs are concatenated and combined through a learned output projection.
  • Positional Encoding: Since the Transformer does not inherently process sequences in order, positional encoding provides information about the relative or absolute position of tokens in the input.
  • Feedforward Network (FFN): Applies non-linear transformations to the features from the attention layers to deepen the network and enhance semantic understanding.
  • Layer Normalization & Residual Connections (Add & Norm): Improves training stability and convergence by combining skip connections with layer normalization, which normalizes across the feature dimension and therefore works independently of batch size.

Traditional RNNs and LSTMs parallelize poorly because they process tokens one at a time, and they struggle to retain information across long sequences. The Transformer overcomes both issues with parallel computation and self-attention, enabling efficient training of large-scale models on large datasets.
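
To make these components concrete, below is a minimal PyTorch sketch of a single encoder block that wires self-attention, the feedforward network, and Add & Norm together. The dimensions (d_model = 512, 8 heads, d_ff = 2048) follow the original paper, but the class itself is an illustrative sketch, not a reference implementation.

```python
# A minimal encoder-block sketch, assuming PyTorch is available.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer, then Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feedforward sub-layer, then Add & Norm
        return self.norm2(x + self.drop(self.ffn(x)))

x = torch.randn(2, 10, 512)       # (batch, sequence length, d_model)
print(EncoderBlock()(x).shape)    # torch.Size([2, 10, 512])
```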


2. Self-Attention Mechanism: Query, Key, Value

Self-Attention lets each token in a sequence weigh its relationships with all other tokens, capturing semantic associations and context that depend on the whole sentence. For example, in the sentence “The cat sat on the mat,” the word “cat” may be more closely related to “sat” than to “mat.” Each token is transformed into three vectors:

  • Query (Q): Represents what the token is looking for, determining how much attention it pays to other tokens.
  • Key (K): Represents what the token offers for other tokens' queries to match against, determining how much attention it receives.
  • Value (V): The actual information carried by the token.

These are computed using learnable matrices $W^Q, W^K, W^V$.

2.1. Embedding to Query, Key, and Value

$Q = XW^Q,\quad K = XW^K,\quad V = XW^V$

Where:

  • X: Embedding matrix of the input sequence (size: $n \times d$, where $n$ is the number of tokens and $d$ is the embedding dimension)
  • $W^Q, W^K, W^V$: Learnable matrices of size $d \times d_k$
  • Q, K, V: Output matrices of size $n \times d_k$
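
As a concrete illustration, here is a small NumPy sketch of these projections. The sizes ($n=4$, $d=8$, $d_k=4$) and the random matrices are purely illustrative stand-ins for learned parameters.

```python
# Projecting token embeddings X into Q, K, V with (here random) weight matrices.
import numpy as np

n, d, d_k = 4, 8, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))        # embedding matrix of the input sequence
W_Q = rng.normal(size=(d, d_k))    # stand-in for the learnable W^Q
W_K = rng.normal(size=(d, d_k))    # stand-in for the learnable W^K
W_V = rng.normal(size=(d, d_k))    # stand-in for the learnable W^V

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # (4, 4) (4, 4) (4, 4)
```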

2.2. Attention Score Computation (Dot Product of Q and K)

Computing similarity by taking the dot product between each Query and all Keys:

$\text{Attention Scores} = QK^T$

Size: $n \times n$ (similarity matrix for all token pairs)

2.3. Scaling

Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large; without it, the softmax saturates and produces near-zero gradients:

$\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}$

2.4. Applying Softmax

Softmax is applied to transform scores into attention weights:

$\alpha_{ij} = \text{softmax}_j\left( \frac{q_i \cdot k_j}{\sqrt{d_k}} \right)$

These weights indicate how strongly token $i$ attends to token $j$; each row sums to 1.

2.5. Computing the Final Output with Weighted Sum

Multiply the weights with corresponding Value vectors to get the output:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Output: A new representation of size $n \times d_k$
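
Putting steps 2.2 through 2.5 together, a compact NumPy sketch (with illustrative random inputs standing in for real projections) looks like this:

```python
# Scaled dot-product attention: scores, scaling, softmax, weighted sum.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # n x n scaled similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # n x d_k output representation

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 4)
```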


3. Multi-Head Attention vs Single-Head Attention

Multi-Head Attention is a core component of the Transformer that allows simultaneous learning of different contextual relationships. It captures linguistic subtleties and semantic diversity that a single attention head might miss. Simply put, it performs multiple self-attentions in parallel and combines the results. Each head learns its own Query (Q), Key (K), and Value (V) representations.

3.1. Computation Process

For input embedding X, each attention head computes:

  • Learnable projection matrices: $W_i^Q, W_i^K, W_i^V$

$Q_i = XW_i^Q,\quad K_i = XW_i^K,\quad V_i = XW_i^V$

$\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$

This is done for h heads in parallel. The outputs are concatenated and passed through a final linear projection:


$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$

Where $W^O$ is the learned output projection matrix.
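
A minimal NumPy sketch of the whole multi-head computation, with random matrices standing in for the learned projections $W_i^Q, W_i^K, W_i^V$ and $W^O$, might look like this:

```python
# Multi-head attention: h independent heads, concatenated and projected by W^O.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, h=4, seed=0):
    n, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        # Per-head projections (stand-ins for learned W_i^Q, W_i^K, W_i^V)
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # head_i: n x d_k
    W_O = rng.normal(size=(d_model, d_model))                # output projection
    return np.concatenate(heads, axis=-1) @ W_O              # n x d_model

X = np.random.default_rng(1).normal(size=(6, 16))
print(multi_head_attention(X).shape)                         # (6, 16)
```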

3.2. Advantages of Multi-Head Attention

  • Learn diverse relationships: Each head can focus on different types of information like grammar, semantics, and position.
  • Parallel computation: All attention operations are performed in parallel, unlike RNNs.
  • Improved expressiveness: Combines multiple perspectives for a richer representation.

3.3. Comparison: Single-Head vs Multi-Head

| Aspect | Single-Head Attention | Multi-Head Attention |
|---|---|---|
| Learning scope | One representation space | Multiple representation spaces |
| Diversity of information | Limited | Rich |
| Efficiency | Simpler | Efficient via parallelism |
| Expressiveness | Lower | Higher |

3.4. Example Understanding

Sentence: “The cat didn’t cross the road because it was tired.”

  • One head may focus on the relationship between “cat” and “tired” (resolving that “it” refers to the cat)
  • Another head may capture the causal link between “tired” and “didn’t cross” signaled by “because”

3.5. Summary of Multi-Head Attention Core Ideas

  • Enables learning of multiple relationships simultaneously
  • Computes Q, K, V in separate subspaces
  • Concatenates the results and applies a final linear transformation

4. The Necessity of Positional Encoding and Absolute vs Relative Encoding

Unlike RNNs, Transformers process all input tokens in parallel and therefore have no built-in notion of token order. Positional Encoding was introduced to address this: explicit position information is added to each token's embedding so the model can recover the grammatical and semantic structure that depends on word order.

4.1. Absolute Positional Encoding

In the original Transformer paper ("Attention is All You Need"), absolute positional encoding using sine and cosine functions is used. For each position pos and embedding dimension i, the values are calculated as follows:

$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Because nearby positions produce similar sinusoidal patterns, this method indirectly encodes the relative distance between positions, and the model can infer order from the periodic functions. It is static and non-trainable, adding order information directly to the embeddings. Its advantages are simplicity and the fact that encodings for any position can be computed at inference time without extra parameters; its drawbacks are weaker generalization to sequence lengths not seen during training and the lack of an explicit representation of relative position.
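
For illustration, here is a short NumPy sketch that builds the sinusoidal table defined above; the maximum length and model dimension are arbitrary example values.

```python
# Sinusoidal (absolute) positional encoding: sin on even dims, cos on odd dims.
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices (2i)
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                    # (50, 16)
```

The resulting matrix is simply added to the token embeddings before the first encoder layer.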

4.2. Relative Positional Encoding

Relative positional encoding encodes the distance between tokens rather than their absolute positions. This enables stronger generalization regardless of sentence length. Representative methods include RoPE (Rotary Positional Embedding) and ALiBi (Attention with Linear Biases).
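
To make the rotary idea concrete, here is a minimal NumPy sketch of a RoPE-style rotation applied to a single query or key vector; the base of 10000 follows the RoFormer paper, while the vector and positions are illustrative.

```python
# RoPE sketch: rotate consecutive dimension pairs by a position-dependent angle,
# so dot products between rotated queries and keys depend on relative offsets.
import numpy as np

def apply_rope(x, pos):
    d = x.shape[-1]                                    # must be even
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)       # per-pair base frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin             # 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

q = np.ones(8)
print(apply_rope(q, pos=0)[:4])   # position 0: no rotation
print(apply_rope(q, pos=5)[:4])   # same content, rotated for position 5
```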

4.3. Summary Comparison: Absolute vs Relative Encoding

| Category | Absolute Encoding | Relative Encoding (e.g., RoPE, ALiBi) |
|---|---|---|
| Reference basis | Fixed token position | Distance between tokens |
| Generalization to unseen lengths | Lower | Higher |
| Trainability | Sinusoidal: fixed; learned absolute embeddings (e.g., BERT): trainable | RoPE, ALiBi: non-trainable; T5-style relative biases: learned |
| Representative models | BERT, original Transformer | GPT-NeoX, LLaMA (RoPE); T5 (learned relative biases) |

5. Comparison of Encoder-Only, Decoder-Only, and Encoder-Decoder Architectures

Encoder-only (e.g., BERT): Processes the entire input at once and is suitable for understanding-oriented tasks such as sentence classification and question answering.

Decoder-only (e.g., GPT): Uses an autoregressive structure to predict the next word based on previous words, making it strong for generation tasks.

Encoder-decoder (e.g., T5, BART): Encodes the input with the encoder and generates the output sequence with the decoder. It is suitable for sequence-to-sequence tasks like machine translation, summarization, and question answering.
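
The key mechanical difference for decoder-only models is the causal mask, which prevents each position from attending to later positions. Below is a small NumPy sketch of masked (autoregressive) attention with illustrative random inputs.

```python
# Causal self-attention: position i may only attend to positions j <= i.
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(causal_attention(Q, K, V).shape)                   # (5, 4)
```

Encoder-only models skip this mask entirely, and encoder-decoder models apply it only in the decoder, alongside cross-attention from the decoder to the encoder outputs.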


6. Computational Limitations of Standard Transformers and Their Solutions

Standard Transformers have time and memory complexity proportional to the square of the input length (O(n²)) in attention operations. Major approaches to address this include:

  • Sparse Attention: Compute attention only between selected token pairs, for example within a local window (e.g., Longformer, BigBird; Reformer restricts pairs via locality-sensitive hashing). A sliding-window mask sketch follows this list.
  • Linear Attention: Reformulate the attention computation so that time complexity grows linearly with sequence length (e.g., Performer, Linear Transformer)
  • Compressed Attention: Project keys and values into a lower-dimensional space to reduce computation (e.g., Linformer)
  • Segment-based Processing: Split sequences into chunks and process them recurrently (e.g., RetNet, RWKV)
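
As an illustration of the sparse-attention idea mentioned above, a sliding-window mask restricts each token to a fixed neighborhood, cutting the number of scored pairs from $n^2$ to roughly $n(2w+1)$; the window size below is an arbitrary example.

```python
# Sliding-window (local) attention mask: token i may attend to token j only if |i - j| <= w.
import numpy as np

def sliding_window_mask(n, w):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w    # True where attention is allowed

mask = sliding_window_mask(n=8, w=2)
print(mask.astype(int))
print("scored pairs:", int(mask.sum()), "of", 8 * 8)
```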

References

  • Vaswani et al. (2017), "Attention is All You Need"
  • Google AI Blog: Self-Attention Mechanism
  • HuggingFace Transformers Documentation
  • Papers on RoFormer, Performer, BigBird, Linformer
