What is the difference between `nn.Linear()` and `nn.Embedding()`?
The fundamental distinction between `nn.Linear()` and `nn.Embedding()` in PyTorch lies in their operational purpose and the structure of their parameter tensors. `nn.Linear()` performs a general affine transformation, applying a learned weight matrix and an optional bias vector to any input vector or batch of vectors. Its weight tensor is two-dimensional, with a size of `(out_features, in_features)`, and it operates on continuous-valued inputs through a matrix multiplication. In contrast, `nn.Embedding()` functions as a discrete lookup table, mapping integer indices to dense vector representations. Its parameter tensor is also two-dimensional, but with a size of `(num_embeddings, embedding_dim)`, where `num_embeddings` is the size of the vocabulary or categorical space and `embedding_dim` is the dimensionality of each embedding vector. The core operation is an index-select, not a matrix multiply; it retrieves the row of the embedding matrix corresponding to the input integer.
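The parameter shapes and the lookup-versus-matmul distinction can be verified directly. A minimal sketch (the layer sizes here are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# nn.Linear: weight has shape (out_features, in_features); forward is a matmul.
linear = nn.Linear(in_features=8, out_features=4)
print(linear.weight.shape)  # torch.Size([4, 8])

# nn.Embedding: weight has shape (num_embeddings, embedding_dim); forward is a lookup.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
print(embedding.weight.shape)  # torch.Size([10, 4])

# The embedding forward pass is equivalent to row indexing into the weight matrix.
idx = torch.tensor([3, 7])
assert torch.equal(embedding(idx), embedding.weight[idx])
```

Note that `nn.Linear` stores its weight transposed relative to the multiplication it performs, which is why its shape is `(out_features, in_features)` rather than the reverse.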
This difference dictates their primary application domains. `nn.Linear()` is a universal building block for learning any linear mapping between continuous spaces and is ubiquitous in the hidden layers and output layers of neural networks across all domains. `nn.Embedding()` is specialized for handling categorical or symbolic data, most famously the tokens in natural language processing. It converts sparse, one-hot-like integer representations into dense, continuous vectors that can be processed by subsequent layers like `nn.Linear`. While an `nn.Linear` layer could technically learn to emulate an embedding by multiplying a one-hot encoded vector by its weight matrix, this would be computationally prohibitive for large vocabularies, as the one-hot vectors would be extremely wide and mostly zeros. The `nn.Embedding` layer is the efficient, practical implementation of this concept, directly performing the lookup without materializing the large one-hot intermediate.
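The one-hot equivalence claimed above can be demonstrated numerically; a small sketch (vocabulary size and dimension chosen arbitrarily):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 10, 4
embedding = nn.Embedding(vocab_size, dim)

idx = torch.tensor([2, 5, 5, 9])

# Lookup path: direct row retrieval, no large intermediate tensor.
looked_up = embedding(idx)

# One-hot path: materialize (len(idx), vocab_size) one-hot vectors,
# then multiply by the same weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight

assert torch.allclose(looked_up, multiplied)
```

For a realistic vocabulary of 50,000 tokens, the one-hot path would allocate a 50,000-wide vector per token that is zero everywhere except one position, which is exactly the waste the lookup avoids.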
Mechanistically, the forward pass of each layer reveals their distinct nature. For a `Linear` layer applied to an `input` tensor of shape `(..., in_features)`, the operation is `torch.matmul(input, weight.t()) + bias`. For an `Embedding` layer applied to an `input` tensor of arbitrary shape containing integer indices, the operation is effectively `self.weight[input]`, producing a tensor whose last dimension is `embedding_dim`. This means an `Embedding` layer's output adds a dimension, transforming a shape `(batch_size, sequence_length)` of indices into `(batch_size, sequence_length, embedding_dim)`. A `Linear` layer instead transforms the last dimension in place, for example, from `(batch_size, in_features)` to `(batch_size, out_features)`.
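The shape behavior described above can be checked in a few lines (the concrete sizes are arbitrary for illustration):

```python
import torch
import torch.nn as nn

batch_size, seq_len = 2, 5
emb = nn.Embedding(num_embeddings=100, embedding_dim=16)
lin = nn.Linear(in_features=16, out_features=3)

tokens = torch.randint(0, 100, (batch_size, seq_len))  # (2, 5) integer indices
vectors = emb(tokens)       # (2, 5, 16): embedding appends a dimension
logits = lin(vectors)       # (2, 5, 3): linear remaps only the last dimension

assert vectors.shape == (2, 5, 16)
assert logits.shape == (2, 5, 3)
```

Note that `nn.Linear` broadcasts over all leading dimensions, which is why it can be applied directly to the three-dimensional embedding output.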
The implications for network design are significant. `nn.Embedding` serves as the crucial interface that translates discrete symbols into the continuous geometric space where neural networks operate, enabling models to learn semantic relationships between words or categories. The vectors it provides are then fed into sequence models or, after pooling, into linear layers for classification. `nn.Linear` layers subsequently perform the actual task-specific computation on these continuous representations. While `nn.Embedding` is almost exclusively an input-layer component for discrete data, `nn.Linear` is used throughout a network's depth. Understanding this division—embedding for efficient discrete-to-continuous conversion and linear for learned transformation within continuous space—is essential for architecting effective models, particularly in fields like NLP where the initial embedding step is foundational.
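The division of labor described above can be sketched as a minimal text classifier; the class name, sizes, and mean pooling are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Hypothetical sketch: embedding at the input, pooling, linear head."""
    def __init__(self, vocab_size=1000, embedding_dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)  # discrete -> continuous
        self.head = nn.Linear(embedding_dim, num_classes)     # task-specific mapping

    def forward(self, token_ids):          # (batch, seq_len) integer token ids
        vectors = self.embed(token_ids)    # (batch, seq_len, embedding_dim)
        pooled = vectors.mean(dim=1)       # (batch, embedding_dim) after pooling
        return self.head(pooled)           # (batch, num_classes) logits

model = TinyTextClassifier()
logits = model(torch.randint(0, 1000, (4, 12)))
assert logits.shape == (4, 2)
```

In a real model, a sequence encoder (RNN, Transformer, etc.) would typically sit between the embedding and the head, but the interface roles of the two layers remain exactly as described.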