What is the difference between log_softmax and softmax?
The fundamental difference between log_softmax and softmax is that softmax transforms a vector of real numbers (logits) into a probability distribution, while log_softmax directly computes the natural logarithm of those probabilities. Formally, for an input vector **z**, the softmax function for element *i* is σ(**z**)_i = exp(z_i) / Σ_j exp(z_j), ensuring outputs are non-negative and sum to one. Log_softmax is precisely log(σ(**z**)_i), which simplifies to z_i - log(Σ_j exp(z_j)). This is not merely a sequential application of softmax followed by a logarithm; it is a single, numerically stable operation implemented in frameworks like PyTorch and TensorFlow to avoid the numerical underflow and overflow that could occur from separately exponentiating very large or small logits.
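The two definitions above can be sketched in plain NumPy (the helper names here are illustrative, not a framework API); both versions shift by the maximum logit first, which leaves the result unchanged because softmax is shift-invariant:

```python
import numpy as np

def softmax(z):
    # sigma(z)_i = exp(z_i) / sum_j exp(z_j); subtracting the max logit
    # before exponentiating prevents overflow without changing the result.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def log_softmax(z):
    # Single stabilized step: z_i - log(sum_j exp(z_j)),
    # again shifted by the max logit.
    shifted = z - np.max(z)
    return shifted - np.log(np.exp(shifted).sum())

z = np.array([2.0, 1.0, 0.1])
print(softmax(z).sum())                        # probabilities sum to 1
print(np.allclose(np.log(softmax(z)), log_softmax(z)))  # the logs agree
```

For well-behaved logits like these, `log(softmax(z))` and `log_softmax(z)` coincide; the difference only shows up at extreme magnitudes, as discussed next.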
The primary motivation for using log_softmax arises in machine learning, specifically when calculating the cross-entropy loss for classification tasks. The cross-entropy loss for a true class *i* is -log(p_i), where p_i is the predicted probability from softmax, so the loss can be computed directly and efficiently as the negative of the relevant log_softmax value. This combination is not only computationally efficient, avoiding redundant exponentiation and logarithm calls, but also crucial for numerical stability. Naively computing log(softmax(**z**)) can overflow the exponential for large logits and underflow the probabilities of unlikely classes to zero, producing inf or NaN values that poison the gradients during backpropagation. The log_softmax implementation instead applies the "log-sum-exp" trick, subtracting the maximum logit before exponentiation so that all intermediate quantities remain within a stable numerical range.
Consequently, the choice between the two functions is dictated by the immediate computational context. The raw softmax output is typically used when one needs to interpret or sample from the actual probability distribution, such as at inference time to obtain a class prediction or to report confidence scores. In contrast, log_softmax is almost exclusively employed during training as the direct input to loss functions like `nn.NLLLoss` (Negative Log Likelihood Loss) in PyTorch. The standard, optimized practice is `nn.CrossEntropyLoss`, which accepts raw logits and internally combines log_softmax with NLLLoss. The implication is that softmax and log_softmax are not interchangeable but complementary components of the training pipeline, with their separation allowing for greater efficiency and stability in large-scale model optimization.
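That division of labor can be mirrored in plain NumPy as a sketch (the `nll_loss` and `cross_entropy` functions below are illustrative stand-ins for PyTorch's `nn.NLLLoss` and `nn.CrossEntropyLoss`, not the real API):

```python
import numpy as np

def softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def nll_loss(log_probs, targets):
    # Negate the log-probability of the true class for each sample,
    # averaged over the batch (mirroring nn.NLLLoss's default reduction).
    return -log_probs[np.arange(len(targets)), targets].mean()

def cross_entropy(logits, targets):
    # Fused form: raw logits in, scalar loss out,
    # analogous to how nn.CrossEntropyLoss chains the two steps.
    return nll_loss(log_softmax(logits), targets)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0,  0.2]])
targets = np.array([0, 1])

fused = cross_entropy(logits, targets)
# For moderate logits the two-step softmax-then-log route agrees.
two_step = -np.log(softmax(logits))[np.arange(2), targets].mean()
print(fused, two_step)
```

The fused form is the one to prefer in practice: it gives identical results on well-behaved inputs but remains finite and differentiable where the two-step route breaks down.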
From an implementation perspective, the existence of log_softmax as a distinct operation underscores the importance of designing numerical algorithms around the specific requirements of gradient-based learning. It encapsulates a best practice: the log-probability is the fundamental quantity needed for probabilistic loss functions, so it should be computed in a single, stabilized step. This design prevents subtle numerical errors that could degrade model performance and ensures that the gradient flow through this critical part of the network is as reliable as possible, which is non-negotiable for training deep neural networks effectively on modern hardware.