Softmax Activation Function
What is an Activation Function?
An activation (or transfer) function maps a neuron’s weighted inputs plus bias to its output, adding non-linearity so the model can learn complex patterns beyond simple linear ones.
Activation Functions are also known as Transfer Functions in the context of Neural Networks.
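For intuition, here is a minimal sketch of what that means for a single neuron (the numbers are made up, and sigmoid is chosen purely for illustration):

import numpy as np

# A single neuron with made-up example values
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs plus bias
output = 1 / (1 + np.exp(-z))    # sigmoid activation squashes z into (0, 1)
print(z, output)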
- Mathematical functions applied to the weighted sum of a neuron’s inputs plus its bias, giving non-linearity to the neuron’s output.
- Decides whether a neuron should be activated (“fired”) or not.
- This helps the Neural Network emphasize important information and suppress less useful signals.
- Adds non-linearity to Neural Network to tackle complex problems.
- Real-world problems, such as recognizing cats vs. dogs, are non-linear.
- Without activation functions, each layer just computes f(z) = z, and the network behaves like a linear regression model: multiple linear layers collapse into one big linear equation, which is useless for non-linear problems (see the sketch below).
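To see why stacking linear layers is futile, here is a small NumPy sketch (random weights, purely illustrative): two linear layers applied in sequence are exactly equivalent to a single linear layer, while adding a non-linearity in between breaks that equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                  # an arbitrary input vector
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)    # first "layer"
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)    # second "layer"

# Two linear layers stacked...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...collapse into one equivalent linear layer: W = W2 @ W1, b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))   # True: no extra expressive power

# With a non-linearity (ReLU) in between, the collapse no longer happens
with_relu = W2 @ np.maximum(W1 @ x + b1, 0) + b2
print(np.allclose(two_layers, with_relu))   # False in general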
What are linear and non-linear problems?
A linear pattern is like a straight-line rule of thumb.
If you study twice as long, you score twice as high on your exams. A simple analogy: neat and predictable.
A non-linear (complex) pattern is more like real life.
Starting to study a little earlier before exams helps a lot at first, then each extra hour gives a smaller boost, and after a point you may burn out, the exam does not go well, and your scores end up average.
The scenario bends, twists, and changes with real-life events, not just a straight line.
That’s why neural networks need non-linearity: life isn't straight-line simple.
Softmax Function
A non-linear function; an extension of the Sigmoid to more than two classes.
- Softmax converts a vector of raw scores (logits), which can be any real numbers, positive or negative, into a probability distribution.
- Used in the last layer (output layer) of a neural network for multi-class classification problems.
- Each output lies in the range (0, 1), and the outputs are normalized so they sum to 1.
- Especially suited to selecting one class out of many.
- Outputs a vector of probabilities: the class with the highest probability is chosen as the prediction.
Softmax - Mathematical Derivation
- A combination of multiple Sigmoid/Logistic functions.
- Calculates the relative probability of each class.
- Numerator exponentiates the input
- Denominator makes all outputs sum to 1.
Given logits z = [z1, z2, ..., zK], the softmax function for class i is:
softmax(zi) = exp(zi) / ∑j exp(zj)
where:
- zi = the raw score (logit) at output neuron i
- exp(zi) = the exponential of zi
- ∑j exp(zj) = the sum of exp(zj) over all K output neurons j
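As a quick sanity check, this formula translates directly into a few lines of plain Python (a minimal sketch, without the numerical-stability trick used later in the article):

import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]   # numerator: exponentiate each logit
    total = sum(exps)                      # denominator: sum of all the exponentials
    return [e / total for e in exps]       # exp(zi) / sum_j exp(zj) for each class i

print(softmax([3.2, 1.2, 0.5]))   # approximately [0.8317, 0.1125, 0.0558]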
How to apply Softmax?
Assume 3 classes, i.e. 3 neurons in the output layer. Suppose our output from the neurons is [3.2, 1.2, 0.5].
Applying the Softmax function:
Input: [3.2, 1.2, 0.5] (logits)
- Step 1: Subtract the max from all values (a common trick for numerical stability; it does not change the result). The max is 3.2, so: [3.2 - 3.2, 1.2 - 3.2, 0.5 - 3.2] = [0, -2, -2.7]
- Step 2: Exponentiate: e^0 = 1.0, e^-2 = 0.1353, e^-2.7 = 0.0672
- Step 3: Sum the exponentials: 1.0 + 0.1353 + 0.0672 = 1.2025
- Step 4: Divide each exponential by the sum:
z1 = 1.0 / 1.2025 = 0.8317
z2 = 0.1353 / 1.2025 = 0.1125
z3 = 0.0672 / 1.2025 = 0.0558
For example, to classify an image into one of three classes [bird, fruit, flower], if softmax(output) = [0.8317, 0.1125, 0.0558]:
Class 1: 83.17% probability
Class 2: 11.25% probability
Class 3: 5.58% probability
Show the algorithm an image of a bird.
The algorithm thinks there is an 83% probability that it is a bird, 11% that it is a fruit, and 5% that it is a flower.
The algorithm will predict bird.
Now, let’s try this example in Python with NumPy, PyTorch, and TensorFlow.
How to implement Softmax Function in Python?
We will write simple code implementing the Softmax activation function on the 3 most popular platforms: NumPy, PyTorch, and TensorFlow.
All code samples can easily be run in Google Colab.
Softmax in Numpy
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability (does not change the result)
    e_x = np.exp(x - np.max(x))
    # Normalize so the outputs form a probability distribution
    return e_x / e_x.sum(axis=0)

logits = np.array([3.2, 1.2, 0.5])
probabilities = softmax(logits)
print(probabilities)  # approximately [0.8317 0.1125 0.0558]
Softmax in PyTorch
import torch
import torch.nn.functional as F

logits = torch.tensor([3.2, 1.2, 0.5])
# dim=0 applies softmax across the three class scores in this 1-D tensor
probabilities = F.softmax(logits, dim=0)
print(probabilities)  # approximately tensor([0.8317, 0.1125, 0.0558])
Softmax in TensorFlow
import tensorflow as tf

logits = tf.constant([3.2, 1.2, 0.5])
# tf.nn.softmax normalizes the logits into probabilities that sum to 1
probabilities = tf.nn.softmax(logits)
print(probabilities)  # approximately [0.8317, 0.1125, 0.0558]
Applications of Softmax
- Multi-class classification problems
- NLP - next-word prediction over a vocabulary
- Reinforcement Learning - e.g., turning action scores into action probabilities when training a robot
- Knowledge distillation - training smaller student models on a larger model’s softened softmax outputs
- Sentiment analysis (+ve, -ve, neutral)
A classic use case for Softmax is the MNIST dataset: 70k grayscale images of handwritten digits (0-9), classified by an output layer of 10 neurons, one per digit.
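As a rough illustration (an untrained sketch with random weights, assuming PyTorch; not a full MNIST training pipeline), a tiny classifier for this dataset would end in 10 logits that softmax turns into digit probabilities:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: 28x28 grayscale image -> 10 digit classes
model = nn.Sequential(
    nn.Flatten(),            # 28x28 image -> 784-dim vector
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),      # 10 logits, one per digit 0-9
)

fake_image = torch.randn(1, 1, 28, 28)         # stand-in for an MNIST image
logits = model(fake_image)
probabilities = F.softmax(logits, dim=1)       # softmax over the 10 classes
print(probabilities.sum())                     # sums to 1
print(probabilities.argmax(dim=1))             # predicted digit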
Advancements in Softmax Function
Adaptive Softmax
- Faster and more memory-efficient when the number of classes is very large.
- For example, in language modeling it treats frequent and rare words differently instead of treating all words equally.
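One concrete realization is PyTorch's nn.AdaptiveLogSoftmaxWithLoss. The sketch below (made-up sizes, random data) splits a 10,000-word vocabulary so that the most frequent words sit in a cheap "head" while rarer words fall into tail clusters:

import torch
import torch.nn as nn

hidden_dim, vocab_size = 128, 10_000   # made-up sizes for illustration
# Words with ids < 100 form the frequent "head"; the rest go into cheaper tail clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(hidden_dim, vocab_size, cutoffs=[100, 1000])

hidden = torch.randn(32, hidden_dim)            # a batch of 32 hidden states
targets = torch.randint(0, vocab_size, (32,))   # their true next-word ids
result = adaptive(hidden, targets)
print(result.loss)                              # training loss (a scalar)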
Candidate Sampling
- Samples a few positive & negative examples (called candidates) during training.
- Computes the softmax loss over only a small, random set of candidate classes instead of over all classes.
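One API that implements this idea is TensorFlow's tf.nn.sampled_softmax_loss. The sketch below (random weights and data, made-up sizes) computes the training loss using the true class plus only 20 sampled candidates rather than all 10,000 classes:

import tensorflow as tf

num_classes, dim, batch, num_sampled = 10_000, 64, 32, 20   # made-up sizes

weights = tf.Variable(tf.random.normal([num_classes, dim]))  # output-layer weights
biases = tf.Variable(tf.zeros([num_classes]))                # output-layer biases
hidden = tf.random.normal([batch, dim])                      # activations feeding the output layer
labels = tf.random.uniform([batch, 1], maxval=num_classes, dtype=tf.int64)  # true class ids

# Sampled softmax is a training-time approximation; at inference time use the full softmax
loss = tf.nn.sampled_softmax_loss(weights, biases, labels, hidden,
                                  num_sampled=num_sampled, num_classes=num_classes)
print(tf.reduce_mean(loss))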
Sparsemax
- Produces sparse outputs
- Cuts off small values to exactly zero.
- For the earlier example, where softmax outputs [0.8317, 0.1125, 0.0558], sparsemax outputs exactly [1, 0, 0].
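For intuition, here is a minimal NumPy sketch of the sorting-based sparsemax projection (a rough implementation for illustration, not an optimized or batched one); with the logits from the earlier example it returns exactly [1, 0, 0]:

import numpy as np

def sparsemax(z):
    # Sort logits in descending order
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    z_cumsum = np.cumsum(z_sorted)
    # Support size: largest k with 1 + k * z_(k) > cumulative sum of the top k logits
    k_z = k[1 + k * z_sorted > z_cumsum][-1]
    # Threshold tau shifts the logits so the kept values sum to 1
    tau = (z_cumsum[k_z - 1] - 1) / k_z
    # Everything below the threshold is cut to exactly zero
    return np.maximum(z - tau, 0.0)

logits = np.array([3.2, 1.2, 0.5])
print(sparsemax(logits))   # [1. 0. 0.] -- smaller scores are cut to exactly zero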