Softmax Activation Function
What is an Activation Function?
An activation (or transfer) function maps a neuron’s weighted inputs plus bias to its output, adding non-linearity so the model can learn complex patterns beyond simple linear ones.
Activation Functions are also known as Transfer Functions in the context of Neural Networks.
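For intuition, here is a minimal sketch of what that means for a single neuron (the numbers are made up, and sigmoid is chosen purely for illustration):

import numpy as np

# A single neuron with made-up example values
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus bias
output = 1 / (1 + np.exp(-z))    # sigmoid activation squashes z into (0, 1)
print(z, output)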
- Mathematical functions applied to the weighted sum of a neuron’s inputs plus its bias, giving non-linearity to the neuron’s output.
- Decides whether a neuron should be activated (“fired”) or not.
- This helps the Neural Network emphasize important information and suppress less useful signals.
- Adds non-linearity to Neural Network to tackle complex problems.
- Real-world problems, such as recognizing cats vs. dogs, are non-linear.
- Without activation functions, each layer just computes f(z) = z, and the network behaves like a linear regression model: multiple linear layers collapse into one big linear equation, which is useless for non-linear problems (see the sketch below).
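To see why stacking linear layers is futile, here is a small NumPy sketch (random weights, purely illustrative): two linear layers applied in sequence are exactly equivalent to a single linear layer, while adding a non-linearity in between breaks that equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                  # an arbitrary input vector
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)    # first "layer"
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)    # second "layer"

# Two linear layers stacked...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...collapse into one equivalent linear layer: W = W2 @ W1, b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))   # True: no extra expressive power

# With a non-linearity (ReLU) in between, the collapse no longer happens
with_relu = W2 @ np.maximum(W1 @ x + b1, 0) + b2
print(np.allclose(two_layers, with_relu))   # False in general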
What are linear and non-linear problems?
A linear pattern is like a straight-line rule of thumb.
If you study twice as long, you score twice as high on your exams. A simple analogy: neat and predictable.
A non-linear (complex) pattern is more like real life.
Starting to study a little earlier before exams helps a lot at first, then each extra hour gives a smaller boost, and after a point you may burn out, the exam does not go well, and your scores end up average.
The scenario bends, twists, and changes with real-life events, not just a straight line.
That’s why neural networks need non-linearity: life isn't straight-line simple.
Softmax Function
A non-linear function; an extension of the Sigmoid to more than two classes.
- Softmax converts a vector of raw scores (logits), which can be any real numbers, positive or negative, into a probability distribution.
- Used in the last layer (output layer) of a neural network for multi-class classification problems.
- Each output lies in the range (0, 1), and the outputs are normalized so they sum to 1.
- Especially suited to selecting one class out of many.
- Outputs a vector of probabilities: the class with the highest probability is chosen as the prediction.
Softmax - Mathematical Derivation
- A combination of multiple Sigmoid/Logistic functions.
- Calculates the relative probability of each class.
- Numerator exponentiates the input
- Denominator makes all outputs sum to 1.
Given logits z = [z1, z2, ..., zK], the softmax function for class i is:
softmax(zi) = exp(zi) / ∑j exp(zj)
where:
- zi = the raw score (logit) at output neuron i
- exp(zi) = the exponential of zi
- ∑j exp(zj) = the sum of exp(zj) over all K output neurons j
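As a quick sanity check, this formula translates directly into a few lines of plain Python (a minimal sketch, without the numerical-stability trick used later in the article):

import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]   # numerator: exponentiate each logit
    total = sum(exps)                      # denominator: sum of all the exponentials
    return [e / total for e in exps]       # exp(zi) / sum_j exp(zj) for each class i

print(softmax([3.2, 1.2, 0.5]))   # approximately [0.8317, 0.1125, 0.0558]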
How to apply Softmax?
Assume 3 classes, i.e. 3 neurons in the output layer. Suppose our output from the neurons is [3.2, 1.2, 0.5].
Applying the Softmax function:
Input: [3.2, 1.2, 0.5] (logits)
- Step 1: Subtract the max from all values (a common trick for numerical stability; it does not change the result). The max is 3.2, so: [3.2 - 3.2, 1.2 - 3.2, 0.5 - 3.2] = [0, -2, -2.7]
- Step 2: Exponentiate: e^0 = 1.0, e^-2 = 0.1353, e^-2.7 = 0.0672
- Step 3: Sum the exponentials: 1.0 + 0.1353 + 0.0672 = 1.2025
- Step 4: Divide each exponential by the sum:
z1 = 1.0 / 1.2025 = 0.8317
z2 = 0.1353 / 1.2025 = 0.1125
z3 = 0.0672 / 1.2025 = 0.0558
For example, to classify an image into one of three classes [bird, fruit, flower], if softmax(output) = [0.8317, 0.1125, 0.0558]:
Class 1: 83.17% probability
Class 2: 11.25% probability
Class 3: 5.58% probability
Show the algorithm an image of a bird.
The algorithm thinks there is an 83% probability that it is a bird, 11% that it is a fruit, and 5% that it is a flower.
The algorithm will predict bird.
Now, let’s try this example in Python with NumPy, PyTorch, and TensorFlow.
How to implement Softmax Function in Python?
We will write simple code implementing the Softmax activation function on the 3 most popular platforms: NumPy, PyTorch, and TensorFlow.
All code samples can easily be run in Google Colab.
Softmax in Numpy
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability (does not change the result)
    e_x = np.exp(x - np.max(x))
    # Normalize so the outputs form a probability distribution
    return e_x / e_x.sum(axis=0)

logits = np.array([3.2, 1.2, 0.5])
probabilities = softmax(logits)
print(probabilities)  # approximately [0.8317 0.1125 0.0558]
Softmax in PyTorch
import torch
import torch.nn.functional as F

logits = torch.tensor([3.2, 1.2, 0.5])
# dim=0 applies softmax across the three class scores in this 1-D tensor
probabilities = F.softmax(logits, dim=0)
print(probabilities)  # approximately tensor([0.8317, 0.1125, 0.0558])
Softmax in TensorFlow
import tensorflow as tf

logits = tf.constant([3.2, 1.2, 0.5])
# tf.nn.softmax normalizes the logits into probabilities that sum to 1
probabilities = tf.nn.softmax(logits)
print(probabilities)  # approximately [0.8317, 0.1125, 0.0558]
Applications of Softmax
- Multi-class classification problems
- NLP - next-word prediction over a vocabulary
- Reinforcement Learning - e.g., turning action scores into action probabilities when training a robot
- Knowledge distillation - training smaller student models on a larger model’s softened softmax outputs
- Sentiment analysis (+ve, -ve, neutral)
A classic use case for Softmax is the MNIST dataset: 70k grayscale images of handwritten digits (0-9), classified by an output layer of 10 neurons, one per digit.
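As a rough illustration (an untrained sketch with random weights, assuming PyTorch; not a full MNIST training pipeline), a tiny classifier for this dataset would end in 10 logits that softmax turns into digit probabilities:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: 28x28 grayscale image -> 10 digit classes
model = nn.Sequential(
    nn.Flatten(),            # 28x28 image -> 784-dim vector
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),      # 10 logits, one per digit 0-9
)

fake_image = torch.randn(1, 1, 28, 28)         # stand-in for an MNIST image
logits = model(fake_image)
probabilities = F.softmax(logits, dim=1)       # softmax over the 10 classes
print(probabilities.sum())                     # sums to 1
print(probabilities.argmax(dim=1))             # predicted digit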
Advancements in Softmax Function
Adaptive Softmax
- Faster and more memory-efficient when the number of classes is very large.
- For example, in language modeling it treats frequent and rare words differently instead of treating all words equally.
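One concrete realization is PyTorch's nn.AdaptiveLogSoftmaxWithLoss. The sketch below (made-up sizes, random data) splits a 10,000-word vocabulary so that the most frequent words sit in a cheap "head" while rarer words fall into tail clusters:

import torch
import torch.nn as nn

hidden_dim, vocab_size = 128, 10_000   # made-up sizes for illustration
# Words with ids < 100 form the frequent "head"; the rest go into cheaper tail clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(hidden_dim, vocab_size, cutoffs=[100, 1000])

hidden = torch.randn(32, hidden_dim)            # a batch of 32 hidden states
targets = torch.randint(0, vocab_size, (32,))   # their true next-word ids
result = adaptive(hidden, targets)
print(result.loss)                              # training loss (a scalar)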
Candidate Sampling
- Samples a few positive & negative examples (called candidates) during training.
- Computes the softmax loss over only a small, random set of candidate classes instead of over all classes.
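One API that implements this idea is TensorFlow's tf.nn.sampled_softmax_loss. The sketch below (random weights and data, made-up sizes) computes the training loss using the true class plus only 20 sampled candidates rather than all 10,000 classes:

import tensorflow as tf

num_classes, dim, batch, num_sampled = 10_000, 64, 32, 20   # made-up sizes

weights = tf.Variable(tf.random.normal([num_classes, dim]))  # output-layer weights
biases = tf.Variable(tf.zeros([num_classes]))                # output-layer biases
hidden = tf.random.normal([batch, dim])                      # activations feeding the output layer
labels = tf.random.uniform([batch, 1], maxval=num_classes, dtype=tf.int64)  # true class ids

# Sampled softmax is a training-time approximation; at inference time use the full softmax
loss = tf.nn.sampled_softmax_loss(weights, biases, labels, hidden,
                                  num_sampled=num_sampled, num_classes=num_classes)
print(tf.reduce_mean(loss))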
Sparsemax
- Produces sparse outputs
- Cuts off small values to exactly zero.
- For the earlier example, where softmax outputs [0.8317, 0.1125, 0.0558], sparsemax outputs exactly [1, 0, 0].
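For intuition, here is a minimal NumPy sketch of the sorting-based sparsemax projection (a rough implementation for illustration, not an optimized or batched one); with the logits from the earlier example it returns exactly [1, 0, 0]:

import numpy as np

def sparsemax(z):
    # Sort logits in descending order
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    z_cumsum = np.cumsum(z_sorted)
    # Support size: largest k with 1 + k * z_(k) > cumulative sum of the top k logits
    k_z = k[1 + k * z_sorted > z_cumsum][-1]
    # Threshold tau shifts the logits so the kept values sum to 1
    tau = (z_cumsum[k_z - 1] - 1) / k_z
    # Everything below the threshold is cut to exactly zero
    return np.maximum(z - tau, 0.0)

logits = np.array([3.2, 1.2, 0.5])
print(sparsemax(logits))   # [1. 0. 0.] -- smaller scores are cut to exactly zero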