Why do some operations in the PyTorch deep learning library, such as convolutional layers, need to be implemented as classes in `torch.nn` rather than as plain functions?

The requirement for class-based implementations in PyTorch's `torch.nn` module, such as for convolutional layers, is rooted in the library's design philosophy of blending object-oriented programming with dynamic computational graphs. While a purely functional approach could define a convolution as a standalone function, the class-based `nn.Conv2d` encapsulates not only the forward computation but also the learnable parameters, the weight and bias tensors, as attributes of an `nn.Module` object. This architectural choice creates a coherent, stateful container that PyTorch's training ecosystem can automatically recognize and manage. During training, an optimizer can retrieve all of a model's parameters via `model.parameters()` precisely because each module registers its parameters as `nn.Parameter` instances. A stateless function lacks this inherent ability to store and expose parameters, placing the burden of parameter management entirely on the programmer and breaking the seamless integration with PyTorch's autograd and optimization utilities.
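A minimal sketch of this parameter registration (the channel counts and kernel size below are arbitrary illustrations):

```python
import torch
import torch.nn as nn

# The class-based layer carries its own learnable state as registered attributes.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

# weight and bias are nn.Parameter instances attached to the module
print(type(conv.weight))   # <class 'torch.nn.parameter.Parameter'>
print(conv.weight.shape)   # torch.Size([8, 3, 3, 3])

# Because the parameters are registered, an optimizer can discover them all
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
print(sum(p.numel() for p in conv.parameters()))  # 8*3*3*3 + 8 = 224
```

A plain function has no such attributes, so an optimizer would have no uniform way to find its weights.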

Beyond parameter management, the class hierarchy centered on `nn.Module` provides a standardized mechanism for composing complex models, enabling behaviors that are cumbersome or impractical with pure functions. When you instantiate an `nn.Conv2d` layer, you create an object that can be moved between devices via `.to(device)`, serialized through the model's `state_dict()`, and integrated into a larger structure through containers like `nn.Sequential` or `nn.ModuleList`. This object-oriented design also supports essential training and evaluation behaviors: calling `train()` or `eval()` recursively switches the mode of every submodule, which in turn toggles behaviors such as dropout and batch normalization's use of running statistics. A functional convolution would be a black-box operation, unaware of its context or mode, requiring external flags and conditional logic to replicate this behavior, thereby scattering the model's state logic and making the code less robust and reusable.
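The composition and mode-switching described above can be sketched as follows; the tiny `nn.Sequential` model is an arbitrary illustration:

```python
import torch
import torch.nn as nn

# Composing class-based layers into a container
model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=0.5),
)

x = torch.randn(2, 1, 8, 8)

model.eval()                    # dropout becomes a no-op
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)
assert torch.equal(out1, out2)  # deterministic in eval mode

model.train()                   # dropout is active again
assert model[2].training        # the mode propagated to every submodule
```

One call on the container flips the state of every layer inside it; with free functions, each call site would need its own `is_training` flag.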

The design also reflects a clear separation of concerns between the definition of a layer's persistent state and the execution of its operation. The `__init__` method of `nn.Conv2d` defines the layer's configuration (e.g., kernel size, stride) and initializes its parameters, while the `forward` method defines the computation performed on an input using those parameters. This separation matters for both clarity and flexibility: the same layer instance can be called multiple times on different inputs while retaining its learned state, a pattern central to iterative training. Furthermore, the `nn.Module` base class implements essential behind-the-scenes machinery, such as hook registration for debugging or gradient manipulation, and recursive operations over all submodules. Implementing such a rich feature set in a purely functional paradigm would require a separate, likely more complex, framework for tracking each function's associated state and metadata, effectively reinventing the object-oriented wheel with less clarity and more room for error.
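A hypothetical minimal module can illustrate this `__init__`/`forward` split (`TinyConvNet` and its dimensions are invented for illustration):

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        # configuration and persistent state live here
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the computation lives here and reuses the stored parameters
        return torch.relu(self.conv(x))

net = TinyConvNet(3, 6)
x = torch.randn(1, 3, 16, 16)
y = net(x)          # calling the instance dispatches to forward (plus any hooks)
print(y.shape)      # torch.Size([1, 6, 16, 16])

# The same instance can be called again; its learned state persists between calls.
y2 = net(x)
assert torch.equal(y, y2)
```

Note that the instance is called like a function, yet each call runs through `nn.Module` machinery that a bare function would not provide.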

Therefore, the use of classes is not an arbitrary choice but a deliberate design that provides a unified, powerful abstraction for neural network components. It transforms a mathematical operation like convolution into a managed, stateful building block that interoperates predictably with the entire PyTorch ecosystem for training, evaluation, and deployment. While `torch.nn.functional` offers functional equivalents for many operations, they are typically used within the `forward` method of a module when an operation is stateless or requires finer-grained control. The class-based approach for core layers is the scaffolding that makes the practical construction and management of complex, trainable models both systematic and scalable.
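For contrast, a sketch of the functional route via `torch.nn.functional.conv2d`, where the caller must create and track the weight and bias tensors explicitly (the shapes are chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

# With the functional API, the caller owns the state.
weight = torch.randn(8, 3, 3, 3, requires_grad=True)  # (out_ch, in_ch, kH, kW)
bias = torch.zeros(8, requires_grad=True)

x = torch.randn(2, 3, 16, 16)
y = F.conv2d(x, weight, bias, stride=1, padding=1)
print(y.shape)  # torch.Size([2, 8, 16, 16])

# Nothing registers these tensors automatically; they must be passed
# to the optimizer by hand, which nn.Conv2d does for you via parameters().
optimizer = torch.optim.SGD([weight, bias], lr=0.1)
```

This is exactly the manual bookkeeping that the class-based layers abstract away, which is why `F.conv2d` is typically reserved for stateless or fine-grained use inside a module's `forward`.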