What is the difference between nn and nn.functional in PyTorch?

The core distinction between `torch.nn` and `torch.nn.functional` in PyTorch is architectural, centering on state management. `torch.nn` provides classes that encapsulate learnable parameters (like weights and biases) as persistent attributes alongside the operations that use them. For instance, `nn.Linear` stores its weight matrix and bias vector as `nn.Parameter` objects, which are automatically registered for optimization. In contrast, `torch.nn.functional` (commonly imported as `F`) is a purely functional interface containing stateless implementations of the same core operations, such as `F.linear`. When using `F.linear`, the weight and bias tensors must be supplied explicitly as arguments on each call; the function itself holds no persistent state. This fundamental difference dictates their primary use cases: `nn.Module` subclasses are the building blocks for defining layers and models that manage their own parameters, while `F` is used for stateless operations inside custom `forward` methods or in more complex, dynamic computational graphs.
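A minimal sketch of the equivalence: the module form owns its parameters, while the functional form produces the same result only when handed those same tensors explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stateful: nn.Linear creates and registers weight and bias as nn.Parameters.
layer = nn.Linear(4, 2)
x = torch.randn(3, 4)
y_module = layer(x)

# Stateless: F.linear computes the identical affine map, but the caller
# must pass the weight and bias tensors on every call.
y_functional = F.linear(x, layer.weight, layer.bias)

assert torch.allclose(y_module, y_functional)

# The module version registers its tensors for the optimizer automatically.
param_names = [name for name, _ in layer.named_parameters()]
assert param_names == ["weight", "bias"]
```

Because the parameters are registered, `optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)` picks them up with no extra bookkeeping; with `F.linear` alone, you would have to track those tensors yourself.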

The practical implications of this design are significant for code structure and flexibility. Using `nn.Conv2d` in a model automatically gives that layer trainable parameters that are tracked by the optimizer, can be moved to different devices with `.to(device)`, and are saved and loaded via `state_dict`. It represents a self-contained, reusable component. Conversely, `F.conv2d` requires the programmer to pass the filter weights and biases manually, which offers maximum control but also demands manual management of those tensors' lifecycle and optimization. This makes `F` indispensable for implementing novel or non-standard layers where the operation is static or the parameters are computed dynamically elsewhere. For example, a custom attention mechanism would likely use `F.softmax` and `torch.matmul` within its `forward` pass (note that `matmul` lives in the top-level `torch` namespace, not in `F`), while the learnable projection matrices would be defined as `nn.Parameter` objects or as separate `nn.Linear` modules.
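To make the attention example concrete, here is a hypothetical minimal scaled dot-product attention head (the class name and structure are illustrative, not a standard PyTorch API): the learnable projections are stateful `nn.Linear` modules, while the forward pass is composed from stateless operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotAttention(nn.Module):
    """Illustrative single-head attention: stateful projections, stateless ops."""

    def __init__(self, dim):
        super().__init__()
        # Learnable state lives in nn.Linear modules (registered parameters).
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Stateless functional ops: no parameters, so no module is needed.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, v)

torch.manual_seed(0)
attn = ScaledDotAttention(8)
out = attn(torch.randn(2, 5, 8))  # (batch, sequence, dim)
assert out.shape == (2, 5, 8)
```

All six parameter tensors (three weights, three biases) are discoverable via `attn.parameters()`, so device moves and checkpointing work for free, while the softmax and matrix multiplications remain purely procedural.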

The choice between the two is not always mutually exclusive and often follows a consistent pattern within standard PyTorch code. For operations that inherently have no parameters—such as activation functions (`relu`), pooling (`max_pool2d`), or regularization (`dropout`)—both interfaces exist. The modern convention favors the class version from `nn` (e.g., `nn.ReLU()`) when the operation is used as a persistent layer within a sequential container or when its behavior depends on the training mode, like `nn.Dropout`, which automatically respects the model's `.train()` and `.eval()` states. The functional form `F.relu` is used inside custom `forward` methods for direct, inline invocation. This hybrid approach leverages the organizational benefits of `nn.Module` for parameter management and model hierarchy, while retaining the flexibility of `F` for detailed, procedural control over the forward computation. Understanding this duality is key to writing clean, efficient, and powerful PyTorch code, as it allows developers to seamlessly blend declarative layer definitions with imperative tensor manipulations.
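The mode-flag distinction can be sketched as follows (the `TinyNet` class is a made-up example): `nn.Dropout` tracks the module's train/eval state automatically, whereas the functional form `F.dropout` would require passing `training=self.training` by hand, a common source of bugs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        # Module form: switches off automatically when net.eval() is called.
        self.drop = nn.Dropout(p=0.5)

    def forward(self, x):
        # F.relu is fine inline: it has no parameters and no mode to track.
        return self.drop(F.relu(self.fc(x)))
        # Functional equivalent would need the flag threaded through manually:
        # F.dropout(F.relu(self.fc(x)), p=0.5, training=self.training)

torch.manual_seed(0)
net = TinyNet()
x = torch.randn(4, 10)

net.eval()  # nn.Dropout becomes the identity; outputs are deterministic
with torch.no_grad():
    a = net(x)
    b = net(x)
assert torch.equal(a, b)
```

In `net.train()` mode the same two calls would generally differ, because dropout samples a fresh random mask on each forward pass.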