Activation functions are a core component of deep learning models, especially neural networks. They determine whether a neuron should be activated or not, introducing non-linearity into the model. Without activation functions, a neural network would behave like a linear regression model, no matter how many layers it had.
Role of Activation Functions
- Non-linearity: Real-world data is non-linear. Activation functions allow the model to learn complex patterns.
- Enabling deep learning: Deep networks rely on stacking layers. Without non-linearity, stacked layers collapse into a single linear transformation (see the sketch after this list).
- Gradient flow: The function affects how gradients propagate during backpropagation, impacting training speed and performance.
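To make the non-linearity point concrete, here is a minimal NumPy sketch (array shapes and values are arbitrary, chosen only for illustration) showing that two stacked linear layers without an activation collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "layers" with no activation function: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)
two_layers = W2 @ (W1 @ x)

# The composition is itself a single linear layer with weights W2 @ W1,
# so extra depth adds no expressive power without a non-linearity.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```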
Evolution of Activation Functions
Let’s walk through a timeline of activation functions and why newer ones replaced earlier ones.
1. Sigmoid Function
- Formula: $\sigma(x)=\frac{1}{1+e^{-x}}$
- Range: (0, 1)
- Problems:
  - Vanishing gradients for large/small inputs
  - Outputs not centered at 0
- Still commonly used in output layers for binary classification
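A minimal NumPy sketch of the sigmoid and its gradient (the function names and test values are my own, not from the post); the printed gradients illustrate how they vanish for inputs far from zero:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}); squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigma(x) = sigma(x) * (1 - sigma(x)); at most 0.25, and it
    # shrinks toward 0 for large |x| (the vanishing-gradient problem)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))       # ~[0.00  0.5   1.00]
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # ~[0.00  0.25  0.00]
```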
2. Tanh (Hyperbolic Tangent)
- Formula: $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$
- Range: (-1, 1)
- Zero-centered output
- Still suffers from vanishing gradients
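A short sketch of tanh and its gradient $1-\tanh^2(x)$, built on NumPy's `np.tanh` (illustration only):

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; largest at 0, saturating toward 0
    # for large |x|, so vanishing gradients remain an issue
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # zero-centered outputs in (-1, 1)
print(tanh_grad(x))  # ~[0.07  1.0  0.07]
```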
3. ReLU (Rectified Linear Unit)
- Formula: $ReLU(x)=\max(0,x)$
- Simple and efficient
- Sparse activation (some neurons deactivate)
- Dying ReLU problem (a neuron can get stuck outputting zero for all inputs, so its gradient is zero and its weights stop updating)
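A minimal sketch of ReLU and its (sub)gradient, illustrating both the sparsity and the dying-ReLU behavior (function names and test values are my own):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0; a neuron whose inputs stay
    # negative gets zero gradient everywhere and stops learning ("dying ReLU")
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))       # [0.  0.  0.  2.]  (sparse: negatives are zeroed out)
print(relu_grad(x))  # [0.  0.  0.  1.]
```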
4. Leaky ReLU
- Formula: $LeakyReLU(x)=\max(\alpha x,x)$, where $\alpha$ is small (e.g., 0.01)
- Solves dying ReLU by allowing small negative slope
- Still not perfect; the slope $\alpha$ is a hyperparameter that requires tuning
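A sketch of Leaky ReLU with the commonly used $\alpha = 0.01$ (the default value here is my own choice for illustration):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): negative inputs keep a small slope alpha,
    # so the gradient never becomes exactly zero
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-3.0, 0.0, 2.0])))  # [-0.03  0.    2.  ]
```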
5. ELU (Exponential Linear Unit)
- Formula: $ELU(x)=\begin{cases} x & \text{if } x > 0 \\ \alpha(e^x-1) & \text{if } x \leq 0 \end{cases}$
- Smooth at 0
- Can produce negative outputs, pushing mean activations closer to zero
- Slightly more expensive to compute
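A sketch of ELU with $\alpha = 1$ (a typical default, assumed here for illustration):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) for x <= 0: smooth around 0 and
    # saturates toward -alpha, which pulls mean activations closer to zero
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.99 -0.63  0.    2.  ]
```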
6. Swish (by Google)
- Formula: $Swish(x)=x\cdot \sigma(x)$
- Smooth, non-monotonic
- Often outperforms ReLU in deeper models
- Slightly more computational overhead
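A sketch of Swish, $x \cdot \sigma(x)$, showing its non-monotonic dip below zero for moderately negative inputs (test values are my own):

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigma(x) = x / (1 + e^{-x})
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.03 -0.27  0.    1.76]
```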
7. Mish (newer, similar to Swish)
- Formula: $Mish(x)=x\cdot \tanh(\ln(1+e^x))$
- Promotes smoother gradients
- Shows better generalization in some tasks
- Complex and computationally heavier
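A sketch of Mish, using `np.logaddexp(0, x)` to evaluate the softplus $\ln(1+e^x)$ without overflow (implementation detail is my own):

```python
import numpy as np

def mish(x):
    # Mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.logaddexp(0.0, x))

print(mish(np.array([-5.0, -1.0, 0.0, 2.0])))  # ~[-0.03 -0.30  0.    1.94]
```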
Comparison Diagram of Activation Functions
Here’s a plot comparing their shapes:
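The original figure is an image; the matplotlib sketch below (library choice and styling are my own assumption) reproduces a comparable plot of all seven functions:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)
sigmoid = 1.0 / (1.0 + np.exp(-x))
activations = {
    "Sigmoid": sigmoid,
    "Tanh": np.tanh(x),
    "ReLU": np.maximum(0.0, x),
    "Leaky ReLU": np.maximum(0.01 * x, x),
    "ELU": np.where(x > 0, x, np.exp(x) - 1.0),
    "Swish": x * sigmoid,
    "Mish": x * np.tanh(np.logaddexp(0.0, x)),
}

plt.figure(figsize=(8, 5))
for name, y in activations.items():
    plt.plot(x, y, label=name)
plt.legend()
plt.title("Activation functions compared")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.grid(True)
plt.show()
```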
References
- Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer. (2020). PowerNorm: Rethinking Batch Normalization in Transformers. arXiv:2003.07845 [cs].
- Prajit Ramachandran, Barret Zoph, Quoc V. Le. (2017). Swish: A Self-Gated Activation Function. arXiv:1710.05941 [cs.NE].
- Diganta Misra. (2019). Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv:1908.08681 [cs.LG].