Activation functions are a core component of deep learning models, especially neural networks. They determine whether a neuron should be activated or not, introducing non-linearity into the model. Without activation functions, a neural network would behave like a linear regression model, no matter how many layers it had.
Role of Activation Functions
- Non-linearity: Real-world data is non-linear. Activation functions allow the model to learn complex patterns.
- Enabling deep learning: Deep networks rely on stacking layers. Without non-linearity, stacked layers collapse into a single linear transformation (see the sketch after this list).
- Gradient flow: The function affects how gradients propagate during backpropagation, impacting training speed and performance.
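To make the non-linearity point concrete, here is a minimal NumPy sketch (array shapes and values are arbitrary, chosen only for illustration) showing that two stacked linear layers without an activation collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "layers" with no activation function: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)
two_layers = W2 @ (W1 @ x)

# The composition is itself a single linear layer with weights W2 @ W1,
# so extra depth adds no expressive power without a non-linearity.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```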
Evolution of Activation Functions
Let’s walk through a timeline of activation functions and why newer ones replaced earlier ones.
1. Sigmoid Function
- Formula: $\sigma(x)=\frac{1}{1+e^{-x}}$
- Range: (0, 1)
- Problems:
  - Vanishing gradients for large/small inputs
  - Outputs not centered at 0
- Still commonly used in output layers for binary classification
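A minimal NumPy sketch of the sigmoid and its gradient (the function names and test values are my own, not from the post); the printed gradients illustrate how they vanish for inputs far from zero:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}); squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigma(x) = sigma(x) * (1 - sigma(x)); at most 0.25, and it
    # shrinks toward 0 for large |x| (the vanishing-gradient problem)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))       # ~[0.00  0.5   1.00]
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # ~[0.00  0.25  0.00]
```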
2. Tanh (Hyperbolic Tangent)
- Formula: $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$
- Range: (-1, 1)
- Zero-centered output
- Still suffers from vanishing gradients
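A short sketch of tanh and its gradient $1-\tanh^2(x)$, built on NumPy's `np.tanh` (illustration only):

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; largest at 0, saturating toward 0
    # for large |x|, so vanishing gradients remain an issue
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # zero-centered outputs in (-1, 1)
print(tanh_grad(x))  # ~[0.07  1.0  0.07]
```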
3. ReLU (Rectified Linear Unit)
- Formula: $ReLU(x)=\max(0,x)$
- Simple and efficient
- Sparse activation (some neurons deactivate)
- Dying ReLU problem (a neuron can get stuck outputting zero for all inputs, so its gradient is zero and its weights stop updating)
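A minimal sketch of ReLU and its (sub)gradient, illustrating both the sparsity and the dying-ReLU behavior (function names and test values are my own):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0; a neuron whose inputs stay
    # negative gets zero gradient everywhere and stops learning ("dying ReLU")
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))       # [0.  0.  0.  2.]  (sparse: negatives are zeroed out)
print(relu_grad(x))  # [0.  0.  0.  1.]
```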
4. Leaky ReLU
- Formula: $LeakyReLU(x)=\max(\alpha x,x)$, where $\alpha$ is small (e.g., 0.01)
- Solves dying ReLU by allowing small negative slope
- Still not perfect; the slope $\alpha$ is a hyperparameter that requires tuning
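A sketch of Leaky ReLU with the commonly used $\alpha = 0.01$ (the default value here is my own choice for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): negative inputs keep a small slope alpha,
    # so the gradient never becomes exactly zero
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-3.0, 0.0, 2.0])))  # [-0.03  0.    2.  ]
```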
5. ELU (Exponential Linear Unit)
- Formula: $ELU(x)=\begin{cases} x & \text{if } x > 0 \\ \alpha(e^x-1) & \text{if } x \leq 0 \end{cases}$
- Smooth at 0
- Can produce negative outputs, pushing mean activations closer to zero
- Slightly more expensive to compute
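A sketch of ELU with $\alpha = 1$ (a typical default, assumed here for illustration):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) for x <= 0: smooth around 0 and
    # saturates toward -alpha, which pulls mean activations closer to zero
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.99 -0.63  0.    2.  ]
```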
6. Swish (by Google)
- Formula: $Swish(x)=x\cdot \sigma(x)$
- Smooth, non-monotonic
- Often outperforms ReLU in deeper models
- Slightly more computational overhead
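A sketch of Swish, $x \cdot \sigma(x)$, showing its non-monotonic dip below zero for moderately negative inputs (test values are my own):

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigma(x) = x / (1 + e^{-x})
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.03 -0.27  0.    1.76]
```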
7. Mish (newer, similar to Swish)
- Formula: $Mish(x)=x\cdot \tanh(\ln(1+e^x))$
- Promotes smoother gradients
- Shows better generalization in some tasks
- Complex and computationally heavier
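A sketch of Mish, using `np.logaddexp(0, x)` to evaluate the softplus $\ln(1+e^x)$ without overflow (implementation detail is my own):

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.logaddexp(0.0, x))

print(mish(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.03 -0.30  0.    1.94]
```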
Comparison Diagram of Activation Functions
Here’s a plot comparing their shapes:
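The original figure is an image; the matplotlib sketch below (library choice and styling are my own assumption) reproduces a comparable plot of all seven functions:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)
sigmoid = 1.0 / (1.0 + np.exp(-x))
activations = {
    "Sigmoid": sigmoid,
    "Tanh": np.tanh(x),
    "ReLU": np.maximum(0.0, x),
    "Leaky ReLU": np.maximum(0.01 * x, x),
    "ELU": np.where(x > 0, x, np.exp(x) - 1.0),
    "Swish": x * sigmoid,
    "Mish": x * np.tanh(np.logaddexp(0.0, x)),
}

plt.figure(figsize=(8, 5))
for name, y in activations.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.title("Activation functions compared")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.grid(True)
plt.show()
```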
References
- Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer. (2020). PowerNorm: Rethinking Batch Normalization in Transformers. arXiv:2003.07845 [cs].
- Prajit Ramachandran, Barret Zoph, Quoc V. Le. (2017). Swish: A Self-Gated Activation Function. arXiv:1710.05941 [cs.NE].
- Diganta Misra. (2019). Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv:1908.08681 [cs.LG].