What do the parameters of torch.nn.functional.softmax (logits, dim=-1) mean?

The parameters of `torch.nn.functional.softmax` are fundamental to applying the softmax function correctly within a PyTorch computational graph. The first parameter (named `input` in PyTorch's signature, though conventionally called the logits) is the tensor containing the raw, unnormalized scores from a model's final linear layer, prior to any activation function. It is crucial to understand that these are not probabilities; they can be any real number, and their scale directly influences the resulting probability distribution's confidence. The second parameter, `dim`, specifies the dimension along which the softmax is computed; in the call shown, `dim=-1` selects the last dimension (omitting `dim` is deprecated and triggers a warning). Along this dimension the exponentials are summed to normalize the values, ensuring that the output elements along it sum to 1.
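A minimal illustration of both parameters, using a hypothetical vector of four class scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) for one sample over 4 classes.
# These are not probabilities: they can be negative and need not sum to 1.
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])

# Normalize along the last (and only) dimension.
probs = F.softmax(logits, dim=-1)

# probs is non-negative and sums to 1 along dim=-1.
print(probs)
print(probs.sum())
```

Note how the largest logit (2.0) maps to the largest probability, and scaling all logits up would push that probability closer to 1.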

The `dim` parameter is particularly critical because it dictates the tensor's axis of normalization, which must align with the structure of the input data. For instance, in a common classification task where a batch of samples produces a 2D tensor of shape `[batch_size, num_classes]`, setting `dim=-1` (or equivalently `dim=1`) applies softmax across the class dimension for each sample independently. This yields an output tensor of the same shape where each row (the scores for one sample across all classes) is a valid probability distribution. Conversely, applying softmax along `dim=0` would normalize across the batch for each class, which is almost never the intended behavior. The flexibility of the `dim` argument allows softmax to be used in more complex scenarios, such as applying it across channels in a spatial output or across time steps in a sequence model.
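The contrast between the two axes can be seen directly on a hypothetical `[batch_size, num_classes]` tensor:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of 3 samples with 4 class scores each.
logits = torch.randn(3, 4)

# Correct for classification: each row becomes a probability distribution.
per_sample = F.softmax(logits, dim=-1)  # equivalent to dim=1 for a 2D tensor
print(per_sample.sum(dim=1))            # each of the 3 rows sums to 1

# Normalizing along dim=0 instead mixes scores across the batch:
# each *column* sums to 1, which is almost never what a classifier wants.
per_class = F.softmax(logits, dim=0)
print(per_class.sum(dim=0))             # each of the 4 columns sums to 1
```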

Mechanically, the function computes the softmax for each slice along the specified dimension. For a given 1D vector of logits `z` along that dimension, the operation is `softmax(z_i) = exp(z_i) / sum_j(exp(z_j))`, where `j` ranges over that dimension. This formulation is numerically stabilized in PyTorch's implementation, typically by subtracting the maximum logit in the dimension from each element before exponentiation to prevent overflow, without changing the mathematical result. The output is a tensor of the same shape as the input, with values transformed into probabilities that are non-negative and sum to 1 along the specified `dim`. This transformation is essential for interpreting model outputs as probabilities; note, however, that PyTorch's cross-entropy loss expects raw logits (or, in the `nll_loss` formulation, log-probabilities) rather than softmax outputs, for numerical stability reasons.
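The max-subtraction trick can be sketched by hand; this is an illustrative reimplementation (the function name `stable_softmax` is mine, not part of PyTorch), checked against `F.softmax` on logits large enough that a naive `exp` would overflow:

```python
import torch
import torch.nn.functional as F

def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtracting the per-slice maximum shifts every logit by a constant,
    # which cancels in the exp ratio, so the result is mathematically
    # identical -- but exp() can no longer overflow to inf.
    shifted = z - z.max(dim=dim, keepdim=True).values
    e = shifted.exp()
    return e / e.sum(dim=dim, keepdim=True)

# exp(1000) overflows float32 to inf, so a naive softmax would return NaN;
# the stabilized version agrees with PyTorch's built-in implementation.
logits = torch.tensor([[1000.0, 1001.0, 999.0]])
print(stable_softmax(logits))
print(F.softmax(logits, dim=-1))
```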

The implications of these parameters are directly tied to model design and training. Using the correct `dim` ensures that the probabilistic interpretation of the model's output is structurally valid. Furthermore, passing raw logits (rather than pre-softmaxed probabilities) to functions like `F.cross_entropy` is the standard and optimized practice, as such loss functions internally compute the softmax (or its log) in a numerically stable way. Mis-specifying the `dim` parameter is a common source of subtle bugs, leading to incorrect gradients and training failure, as the normalization will occur over an unintended axis, producing nonsensical probability distributions that do not correspond to the problem's logical groupings.
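The standard logits-to-loss pattern looks like this, with hypothetical batch and class sizes; the second form shows what `cross_entropy` does internally (log-softmax followed by negative log-likelihood):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)           # hypothetical: batch of 8, 5 classes
targets = torch.randint(0, 5, (8,))  # integer class labels in [0, 5)

# Standard practice: pass raw logits directly; cross_entropy applies
# log-softmax internally in a numerically stable way.
loss = F.cross_entropy(logits, targets)

# Equivalent decomposition: log_softmax + nll_loss. Computing softmax
# first and taking log() of the probabilities would be less stable.
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(loss, manual)
```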