What is Softmax?
Softmax is a mathematical function that transforms a vector of real-valued scores (logits) into a probability distribution over predicted output classes. It is commonly used in the output layer of classification models, especially in multi-class classification problems.
Mathematical Definition
Given a vector of real numbers $z=[z_1,z_2,...,z_K]$, the softmax function outputs a vector $\sigma(z)$ where:
$\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} \quad \text{for } i = 1, \dots, K$
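As a quick worked example with an arbitrary illustrative vector, take $z = [1, 2, 3]$: then $e^{1} \approx 2.718$, $e^{2} \approx 7.389$, $e^{3} \approx 20.086$, their sum is about $30.19$, and so $\sigma(z) \approx [0.090,\ 0.245,\ 0.665]$.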
Each element $\sigma(z_i)\in (0,1)$ and the elements sum to 1: $\sum_{i=1}^{K}\sigma(z_i)=1$.
Why Use Softmax?
- It converts raw scores (logits) into probabilities.
- It helps the model assign confidence to predictions.
- It is differentiable, enabling gradient-based optimization during training (see the closed-form derivative just below).
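For reference, the softmax derivative has a simple closed form, which is what makes backpropagation through it straightforward:
$\frac{\partial \sigma(z_i)}{\partial z_j}=\sigma(z_i)\left(\delta_{ij}-\sigma(z_j)\right)$
where $\delta_{ij}$ equals 1 if $i=j$ and 0 otherwise.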
Impact on Model Performance
Classification Accuracy
In combination with the cross-entropy loss, softmax allows effective training of deep models by penalizing confident wrong predictions more heavily than uncertain ones. This leads to better convergence and improved classification accuracy (Goodfellow et al., 2016).
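As a small sketch of that penalty (the logit values below are made up for illustration), the loss for a confident wrong prediction comes out several times larger than for an uncertain one:
import torch
import torch.nn.functional as F

# True class is index 0 in both cases (illustrative values only)
target = torch.tensor([0])

confident_wrong = torch.tensor([[0.1, 5.0, 0.1]])  # large logit on the wrong class
uncertain = torch.tensor([[0.5, 0.6, 0.4]])        # nearly uniform logits

# F.cross_entropy takes raw logits and applies log-softmax internally
print(F.cross_entropy(confident_wrong, target))  # ~4.9, heavily penalized
print(F.cross_entropy(uncertain, target))        # ~1.1, mildly penalized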
Overconfidence and Calibration
Softmax can amplify small differences in logits into large differences in probabilities. While this is good for decisiveness, it can lead to overconfidence, especially when the model is uncertain. Techniques like label smoothing, temperature scaling, or Bayesian modeling help in these cases (Guo et al., 2017).
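As a sketch of the simplest of these fixes, temperature scaling divides the logits by a temperature $T$ before applying softmax; $T > 1$ softens the resulting distribution (the logit values below are illustrative):
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Dividing logits by T > 1 flattens the distribution; T < 1 sharpens it
    z = np.asarray(logits, dtype=float) / T
    exp_shifted = np.exp(z - np.max(z))
    return exp_shifted / np.sum(exp_shifted)

logits = [1.0, 2.0, 3.0]
print(softmax_with_temperature(logits, T=1.0))  # ~[0.09, 0.24, 0.67], standard softmax
print(softmax_with_temperature(logits, T=2.0))  # ~[0.19, 0.31, 0.51], softer / less confident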
Python Implementation from Scratch
import numpy as np

def softmax(logits):
    """Compute softmax values for a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    exp_shifted = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp_shifted / np.sum(exp_shifted)

# Test cases
assert np.allclose(softmax([1.0, 2.0, 3.0]),
                   [0.09003057, 0.24472847, 0.66524096], atol=1e-6)
assert np.allclose(np.sum(softmax([2.0, 2.0, 2.0])), 1.0)
assert np.all(softmax([5, 1, -2]) > 0)
print("All tests passed!")
Softmax in PyTorch
In PyTorch, torch.nn.functional.softmax is commonly used in model definitions and evaluation. Here’s how you use it:
import torch
import torch.nn.functional as F
# Logits from model output
logits = torch.tensor([1.0, 2.0, 3.0])
probs = F.softmax(logits, dim=0)
print(probs) # Tensor with probabilities summing to 1
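The dim argument becomes important once the logits are batched: for a tensor of shape (batch, classes) you normalize over the class dimension (dim=1, or equivalently dim=-1). A small sketch with illustrative values:
import torch
import torch.nn.functional as F

batch_logits = torch.tensor([[1.0, 2.0, 3.0],
                             [2.0, 2.0, 2.0]])

batch_probs = F.softmax(batch_logits, dim=1)  # softmax over classes, separately per example
print(batch_probs)
print(batch_probs.sum(dim=1))  # each row sums to 1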
In a model:
import torch.nn as nn
import torch.nn.functional as F

class MyClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        logits = self.linear(x)
        return F.softmax(logits, dim=1)  # normalize over the class dimension
Important: In practice, don’t apply softmax before nn.CrossEntropyLoss; that loss expects raw logits and applies log-softmax internally, which is more numerically stable.
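Concretely, the common pattern (sketched below with an assumed model and made-up data shapes; the class name LogitClassifier is just for this sketch) is to return raw logits from forward(), feed them to nn.CrossEntropyLoss during training, and apply softmax only when probabilities are actually needed, e.g. at inference time:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)  # return raw logits; no softmax here

model = LogitClassifier(input_dim=4, output_dim=3)  # dimensions chosen for illustration
criterion = nn.CrossEntropyLoss()                   # expects raw logits

x = torch.randn(8, 4)            # dummy batch of 8 examples
y = torch.randint(0, 3, (8,))    # dummy integer class labels

loss = criterion(model(x), y)        # training: logits go straight into the loss
probs = F.softmax(model(x), dim=1)   # inference: convert logits to probabilities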
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.2, Output Units.
- Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4, Probabilistic Generative Models.