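PyTorch nn.Module and Model Building

A modular framework for building neural networks: a complete guide from the base class to practice

Topic: a comprehensive guide to building models with PyTorch nn.Module

Contents: the nn.Module base class, built-in layers, custom layers, weight initialization, model containers

Keywords: PyTorch, nn.Module, nn.Sequential, custom layers, weight initialization, nn.init, model containers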

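1. The nn.Module Base Class

In PyTorch, nn.Module is the base class of every neural network module. Built-in layers (such as nn.Linear and nn.Conv2d) and user-defined networks alike inherit from it, which makes it the foundation of PyTorch model building.

Core features of nn.Module:

Registers submodules and parameters automatically, so they can be traversed with parameters() and named_parameters()
Provides a standard interface built on __init__ and forward
Supports recursive initialization through apply()
Switches between training and evaluation behavior with train() / eval()
Moves every registered parameter and buffer between devices (CPU/GPU) with to(), and exposes requires_grad_() / zero_grad() for gradient bookkeeping
Provides state_dict() / load_state_dict() for serialization

1.1 Basic Structure: __init__ and forward

A class that inherits from nn.Module implements two core methods: __init__ defines the layers, and forward defines the forward pass.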

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # Define all layers in __init__
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # forward defines how data flows through the layers
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

model = SimpleNet(784, 256, 10)
print(model)

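Division of labor between __init__ and forward:

__init__ declares every learnable layer and every piece of persistent state in the network. An assignment such as self.xxx = nn.Linear(...) goes through nn.Module.__setattr__, which detects the module and registers it as a submodule, so its parameters become visible to the optimizer.

forward describes the computation of the forward pass. Calling the model (for example model(x)) runs forward, along with any registered hooks, and autograd records the operations so that gradients can be computed in the backward pass. For that reason, call model(x) rather than model.forward(x).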
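The registration rule is easiest to see by breaking it. The sketch below is illustrative only (the class names Registered and NotRegistered are made up for this example): a layer assigned as an attribute is registered, while the same layer hidden inside a plain Python list is invisible to parameters().

```python
import torch
import torch.nn as nn

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigned as an attribute: registered as a submodule.
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

class NotRegistered(nn.Module):
    def __init__(self):
        super().__init__()
        # Stored inside a plain Python list: nn.Module never sees it.
        self.layers = [nn.Linear(4, 2)]

    def forward(self, x):
        return self.layers[0](x)

n_registered = sum(p.numel() for p in Registered().parameters())
n_hidden = sum(p.numel() for p in NotRegistered().parameters())
print(n_registered, n_hidden)  # 10 0
```

The second model still runs a forward pass, but an optimizer built from its parameters() would have nothing to update, and the layer would be skipped by to(device) and state_dict() as well.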

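1.2 parameters() and named_parameters()

parameters() returns an iterator over all registered parameters, including frozen ones whose requires_grad is False; named_parameters() also yields each parameter's name. This is how parameters are handed to an optimizer.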

# Iterate over every parameter of the model
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")

# Sample output:
# fc1.weight: shape=torch.Size([256, 784]), requires_grad=True
# fc1.bias: shape=torch.Size([256]), requires_grad=True
# fc2.weight: shape=torch.Size([256, 256]), requires_grad=True
# fc2.bias: shape=torch.Size([256]), requires_grad=True
# fc3.weight: shape=torch.Size([10, 256]), requires_grad=True
# fc3.bias: shape=torch.Size([10]), requires_grad=True

# Pass the parameters to an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
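Because parameters() also yields frozen parameters, fine-tuning code usually filters on requires_grad. A minimal sketch, assuming a small stand-in network (net is not part of the original example):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Freeze the first Linear layer.
for p in net[0].parameters():
    p.requires_grad_(False)

all_params = sum(p.numel() for p in net.parameters())
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(all_params, trainable)  # 46 10

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=0.01
)
```

Passing the frozen parameters as well would not be an error, since the optimizer skips parameters without gradients, but filtering keeps the optimizer state small and the intent explicit.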

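1.3 Recursive Initialization with apply()

apply(fn) calls fn on every submodule recursively and then on the module itself. It is the usual way to apply one initialization policy to a whole model.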

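def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out',
                                nonlinearity='relu')

model.apply(init_weights)
print("Weight initialization done")

How apply() works:

apply() walks all submodules recursively, including nested ones, and calls the given function on each. The traversal is depth-first and visits children before their parent. Note that apply only reaches registered submodules (objects assigned as self.xxx = Module(...) or held in a module container); plain tensors and modules kept in ordinary Python lists are not visited.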
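The children-first order can be checked by recording the class name of each visited module. This is a small sketch with an arbitrary nested network, not code from the original article:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Sequential(nn.Linear(2, 2), nn.ReLU()),
    nn.Linear(2, 1),
)

visited = []
net.apply(lambda m: visited.append(type(m).__name__))
print(visited)
# ['Linear', 'ReLU', 'Sequential', 'Linear', 'Sequential']
```

The inner Sequential appears after its two children, and the outer Sequential comes last, so an init function that handles a container type can rely on its children having been processed already.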

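2. Built-in PyTorch Layers

torch.nn ships a large set of built-in layers covering fully connected, convolutional, recurrent, embedding, normalization, regularization and activation operations. Knowing what their arguments mean and when each layer applies is a prerequisite for building effective models.

2.1 nn.Linear: Fully Connected Layer

A fully connected (linear) layer applies the affine transform y = xW^T + b and is the most basic neural network layer.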

import torch
import torch.nn as nn

# nn.Linear(in_features, out_features, bias=True)
linear = nn.Linear(128, 64)
x = torch.randn(32, 128)  # batch_size=32, input_dim=128
output = linear(x)         # shape: (32, 64)
print(f"Input shape: {x.shape}, output shape: {output.shape}")
print(f"Weight shape: {linear.weight.shape}")  # (64, 128)
print(f"Bias shape: {linear.bias.shape}")      # (64,)

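2.2 nn.Conv2d: 2D Convolution Layer

The convolution layer is the core building block of convolutional neural networks (CNNs). It slides a kernel over the input to extract local features.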

# nn.Conv2d(in_channels, out_channels, kernel_size,
#           stride=1, padding=0, dilation=1, groups=1, bias=True)
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
x = torch.randn(16, 3, 224, 224)  # N, C, H, W
output = conv(x)
print(f"Conv2d output shape: {output.shape}")  # (16, 64, 224, 224)

# Common convolution configurations
conv1 = nn.Conv2d(3, 64, 3, padding=1)      # keeps the spatial size
conv2 = nn.Conv2d(64, 128, 3, stride=2)     # downsampling
conv3 = nn.Conv2d(64, 128, 3, dilation=2)   # dilated (atrous) convolution
depthwise = nn.Conv2d(64, 64, 3,
                      groups=64)             # depthwise convolution (add a 1x1 conv for depthwise separable)
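The output size of each configuration follows from the formula in the nn.Conv2d documentation, out = floor((in + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1. The helper conv2d_out_size below is a hypothetical name introduced for this sketch; it is compared against what the layer actually produces on a 56x56 input.

```python
import torch
import torch.nn as nn

def conv2d_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

x = torch.randn(1, 64, 56, 56)
configs = {
    "same":    dict(kernel_size=3, padding=1),
    "stride2": dict(kernel_size=3, stride=2),
    "dilated": dict(kernel_size=3, dilation=2),
}

results = {}
for name, kwargs in configs.items():
    predicted = conv2d_out_size(56, **kwargs)
    actual = nn.Conv2d(64, 8, **kwargs)(x).shape[-1]
    results[name] = (predicted, actual)
    print(name, predicted, actual)
# same 56 56
# stride2 27 27
# dilated 52 52
```

Note that stride=2 without padding gives 27 rather than 28; adding padding=1 would halve the size exactly.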

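2.3 nn.LSTM: Long Short-Term Memory

LSTM is a classic recurrent architecture for sequence data. Its forget, input and output gates mitigate the vanishing-gradient problem on long sequences.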

# nn.LSTM(input_size, hidden_size, num_layers=1,
#         batch_first=False, dropout=0, bidirectional=False)
lstm = nn.LSTM(input_size=100, hidden_size=256,
               num_layers=2, batch_first=True, dropout=0.3)

x = torch.randn(32, 50, 100)   # batch=32, seq_len=50, input_size=100
output, (h_n, c_n) = lstm(x)

print(f"LSTM output shape: {output.shape}")    # (32, 50, 256)
print(f"Final hidden state h_n: {h_n.shape}")  # (2, 32, 256)
print(f"Final cell state c_n: {c_n.shape}")    # (2, 32, 256)

# With batch_first=True the input is (batch, seq, feature)
# With batch_first=False (the default) the input is (seq, batch, feature)
# h_n and c_n are always (num_layers * num_directions, batch, hidden_size)

LSTM arguments:

input_size: number of input features
hidden_size: size of the hidden state
num_layers: number of stacked LSTM layers (default 1)
batch_first: when True, input and output have shape (batch, seq, feature); usually the more convenient layout
bidirectional: when True, runs the sequence in both directions and doubles the output feature size
dropout: dropout probability between layers (only takes effect when num_layers > 1)
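The effect of bidirectional=True on the shapes is worth checking once. The sketch below reuses the sizes from the example above; the variable last is an illustrative name for the usual sequence summary built from the top layer's final forward and backward states.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(32, 50, 100)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([32, 50, 512])  forward and backward concatenated
print(h_n.shape)     # torch.Size([4, 32, 256])   num_layers * 2 directions

# Top layer: h_n[-2] is the forward direction, h_n[-1] the backward direction.
last = torch.cat([h_n[-2], h_n[-1]], dim=-1)
print(last.shape)    # torch.Size([32, 512])
```

The forward state equals output[:, -1, :256] and the backward state equals output[:, 0, 256:], because the backward direction finishes at the first time step.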

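2.4 nn.Embedding: Embedding Layer

An embedding layer maps discrete tokens (words, categories and so on) to dense vectors. It is essentially a learnable lookup table.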

# nn.Embedding(num_embeddings, embedding_dim,
#              padding_idx=None, max_norm=None)
embedding = nn.Embedding(10000, 300, padding_idx=0)
x = torch.randint(0, 10000, (32, 50))  # a batch of indices
output = embedding(x)
print(f"Embedding output shape: {output.shape}")  # (32, 50, 300)

# padding_idx=0: the vector at index 0 is initialized to zeros
# and receives no gradient updates.
# Commonly used for the padding positions of variable-length sequences.
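A quick way to confirm the padding_idx behavior is to backpropagate through a batch that contains padding. This is a minimal sketch with a toy vocabulary of 10 tokens:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, padding_idx=0)
tokens = torch.tensor([[3, 5, 0, 0]])  # two real tokens, two padding positions

emb(tokens).sum().backward()

pad_row_is_zero = bool((emb.weight[0] == 0).all())        # row 0 starts at zero
pad_grad_is_zero = bool((emb.weight.grad[0] == 0).all())  # and gets no gradient
used_grad_nonzero = bool((emb.weight.grad[3] != 0).all()) # real tokens do
print(pad_row_is_zero, pad_grad_is_zero, used_grad_nonzero)  # True True True
```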

2.5 nn.Dropout and nn.BatchNorm1d

Dropout and Batch Normalization are two of the most widely used techniques for regularizing deep networks and speeding up their training.

# Dropout: randomly zeroes activations during training to reduce overfitting
dropout = nn.Dropout(p=0.5)
x = torch.randn(4, 10)
print(f"Training mode: {dropout.training}")
output_train = dropout(x)

dropout.eval()  # switch to evaluation mode; Dropout becomes a no-op
print(f"Training mode after eval(): {dropout.training}")
output_eval = dropout(x)
print(f"Fraction of non-zero elements (train): {(output_train != 0).float().mean():.2f}")
print(f"Fraction of non-zero elements (eval): {(output_eval != 0).float().mean():.2f}")

# BatchNorm1d: normalizes each feature over the mini-batch
bn = nn.BatchNorm1d(256)  # argument is the number of features
x = torch.randn(32, 256)  # (batch, features)
output = bn(x)
print(f"BatchNorm output mean: {output.mean().item():.4f}")
print(f"BatchNorm output variance: {output.var().item():.4f}")
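Two details are easy to miss in the example above: in training mode Dropout rescales the surviving activations by 1/(1-p), and BatchNorm normalizes with batch statistics while training but with its running statistics after eval(). The sketch below illustrates both with small toy tensors (the sizes and the shift by 10 are arbitrary choices for this example).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout: survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
drop = nn.Dropout(p=0.5)
kept = drop(torch.ones(1000))       # modules start in training mode
print(kept.unique())                # tensor([0., 2.])

# BatchNorm: one training step moves running_mean by momentum (default 0.1).
bn = nn.BatchNorm1d(3)
x = torch.randn(64, 3) * 5 + 10
y_train = bn(x)                     # normalized with the batch statistics
print(bn.running_mean)              # roughly 0.1 * batch mean, i.e. about 1.0

bn.eval()
y_eval = bn(x)                      # normalized with the running statistics
print(y_train.mean().item(), y_eval.mean().item())
```

After a single update the running statistics are still far from the data statistics, so y_eval is not centered; this is why evaluating a BatchNorm model after very little training can give surprising results.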

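2.6 nn.ReLU and Common Activation Functions

Activation functions introduce non-linearity, which is what lets a network approximate complex functions. PyTorch provides nearly all mainstream activations.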

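# Common activation functions at a glance
relu = nn.ReLU()           # ReLU: max(0, x)
leaky_relu = nn.LeakyReLU(negative_slope=0.01)  # ReLU with a small negative slope
sigmoid = nn.Sigmoid()     # Sigmoid: 1/(1+exp(-x)), used for binary classification
tanh = nn.Tanh()           # Tanh: hyperbolic tangent
softmax = nn.Softmax(dim=1) # Softmax: multi-class probabilities
gelu = nn.GELU()           # GELU: common in Transformers such as GPT/BERT

x = torch.randn(4, 10)
print(f"ReLU: {relu(x).shape}")
print(f"Softmax (each row sums to 1): {softmax(x).sum(dim=1)}")

Choosing an activation function:

Hidden layers: start with ReLU or a variant such as LeakyReLU; they are cheap to compute and mitigate vanishing gradients
Transformer architectures: GELU or Swish/SiLU
Binary classification output: Sigmoid (or raw logits with nn.BCEWithLogitsLoss, which applies the sigmoid itself)
Multi-class output: Softmax (or raw logits with nn.CrossEntropyLoss, which applies log-softmax itself)
RNN/LSTM: Tanh is the default internal activation
Avoid Sigmoid/Tanh as hidden activations in deep networks; they saturate and tend to cause vanishing gradients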
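One caveat about the output-layer suggestions above: nn.CrossEntropyLoss expects raw logits, because it applies log-softmax internally. The sketch below, with made-up logits and targets, shows that the loss on logits equals log-softmax followed by negative log-likelihood, and that applying Softmax first yields a different (incorrect) value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw model outputs, no Softmax
target = torch.tensor([0, 2, 1, 2])

loss_ce = nn.CrossEntropyLoss()(logits, target)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Softmax applied twice in effect: runs without error but is not the same loss.
loss_double = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), target)

print(loss_ce.item(), loss_manual.item(), loss_double.item())
```

In practice the model's forward returns logits, and Softmax is applied only when probabilities are needed for reporting or inference.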

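3. Custom Layers

When the built-in layers do not cover a requirement, you can create a custom layer by subclassing nn.Module. A custom layer can contain arbitrary computation, and its parameters and submodules are registered automatically.

3.1 A Basic Custom Layer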

class CustomLinear(nn.Module):
    """A hand-written fully connected layer."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        # nn.Parameter registers a trainable parameter
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * 0.01
        )
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)

    def forward(self, x):
        # y = x @ W^T + b
        output = x @ self.weight.T
        if self.bias is not None:
            output = output + self.bias
        return output

layer = CustomLinear(128, 64)
x = torch.randn(32, 128)
print(f"Custom layer output: {layer(x).shape}")
print(f"Number of parameters: {sum(p.numel() for p in layer.parameters())}")

What nn.Parameter does:

nn.Parameter is a subclass of Tensor. When it is assigned to an attribute of an nn.Module, it is registered as a parameter of that module. As a result:

It is returned by parameters()
It is updated by any optimizer built from parameters()
It moves with the module when model.cuda() or model.to(device) is called
It is included in state_dict()

3.2 Advanced Parameter Registration

class ComplexLayer(nn.Module):
    """Demonstrates the different ways of registering state."""
    def __init__(self):
        super().__init__()
        # 1: a plain Tensor is NOT registered as a parameter
        self.not_a_param = torch.randn(10, 10)

        # 2: nn.Parameter registers a trainable parameter
        self.trainable_weight = nn.Parameter(torch.randn(10, 10))

        # 3: register_parameter registers a parameter under an explicit name
        self.register_parameter(
            'custom_param',
            nn.Parameter(torch.ones(5))
        )

        # 4: register_buffer registers persistent state that gets no gradient
        self.register_buffer(
            'running_mean',
            torch.zeros(10)
        )

        # 5: nn.ParameterList / nn.ParameterDict
        # (square matrices so they can be chained in forward)
        self.params = nn.ParameterList([
            nn.Parameter(torch.randn(4, 4)) for _ in range(5)
        ])
        self.named_params = nn.ParameterDict({
            'w1': nn.Parameter(torch.randn(4, 4)),
            'w2': nn.Parameter(torch.randn(4, 4)),
        })

    def forward(self, x):
        # x: (batch, 4)
        for p in self.params:
            x = x @ p
        return x

layer = ComplexLayer()
print("Trainable parameters:")
for name, param in layer.named_parameters():
    print(f"  {name}: {param.shape}")
print("\nBuffers:")
for name, buf in layer.named_buffers():
    print(f"  {name}: {buf.shape}")
print("\nPlain Tensor (not registered):")
print(f"  not_a_param: {layer.not_a_param.shape}")

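When to use register_buffer:

A tensor registered with register_buffer is managed like a parameter (it follows device moves and is saved in state_dict), but it is not returned by parameters(), so it receives no gradient and is never updated by the optimizer. Typical uses include:

running_mean and running_var in BatchNorm
Fixed embeddings or positional encodings (such as the Transformer position encoding)
Running statistics that must be saved with the model (such as an update counter or a moving average)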
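As a concrete illustration of these properties, the following sketch keeps an exponential moving average in a buffer. RunningMean is a hypothetical module written for this example; the second buffer uses persistent=False, which keeps it on the module but out of state_dict.

```python
import torch
import torch.nn as nn

class RunningMean(nn.Module):
    """Subtracts a moving average of the inputs (illustrative only)."""
    def __init__(self, dim, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("mean", torch.zeros(dim))
        # Non-persistent buffer: moves with the module, not saved in state_dict.
        self.register_buffer("scratch", torch.zeros(dim), persistent=False)

    def forward(self, x):
        if self.training:
            with torch.no_grad():
                self.mean.mul_(1 - self.momentum).add_(self.momentum * x.mean(0))
        return x - self.mean

m = RunningMean(3)
m(torch.ones(8, 3))

print(list(m.state_dict().keys()))       # ['mean']
print([n for n, _ in m.named_buffers()]) # ['mean', 'scratch']
print(len(list(m.parameters())))         # 0
print(m.mean)                            # tensor([0.1000, 0.1000, 0.1000])
```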

3.3 Layers with Custom Forward Logic

class ResidualBlock(nn.Module):
    """Residual block."""
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim, dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection: output = x + F(x)
        residual = x
        out = self.net(x)
        out = self.dropout(out)
        return out + residual


class GatedLinearUnit(nn.Module):
    """Gated linear unit (GLU)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features * 2)

    def forward(self, x):
        x = self.fc(x)
        a, b = x.chunk(2, dim=-1)  # split into two halves
        return a * torch.sigmoid(b)  # gating


# Combining the blocks
class DeepResNet(nn.Module):
    def __init__(self, dim, num_blocks=6):
        super().__init__()
        self.blocks = nn.ModuleList([
            ResidualBlock(dim) for _ in range(num_blocks)
        ])
        self.glu = GatedLinearUnit(dim, dim)
        self.output = nn.Linear(dim, 10)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        x = self.glu(x)
        return self.output(x)

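4. Weight Initialization

Weight initialization strongly affects whether and how fast a deep network converges. A sensible scheme speeds up convergence and helps prevent exploding or vanishing gradients. PyTorch provides the common schemes in the nn.init module.

Why initialization matters:

Symmetry breaking: random initial weights make units differ, so each channel can learn a different feature
Variance control: a well-scaled initialization keeps the variance of each layer's output stable instead of growing or shrinking layer by layer
Convergence: a good initialization can noticeably reduce the number of epochs needed
Starting point: the initial weights determine where optimization starts, and a poor starting point can leave training stuck in a bad region of the loss surface

4.1 Core nn.init Functions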

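import torch.nn.init as init

# linear, conv, lstm and bn are the layers created in Section 2.

# --- Xavier (Glorot) uniform ---
# Samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out))
init.xavier_uniform_(linear.weight)

# --- Xavier (Glorot) normal ---
# Samples from N(0, std^2) with std = gain * sqrt(2 / (fan_in + fan_out))
init.xavier_normal_(linear.weight)

# --- Kaiming (He) initialization ---
# Recommended for the ReLU family of activations
init.kaiming_uniform_(conv.weight, mode='fan_in',
                      nonlinearity='relu')
init.kaiming_normal_(conv.weight, mode='fan_out',
                     nonlinearity='relu')

# --- Orthogonal initialization ---
# Fills the tensor with a (semi-)orthogonal matrix; common for RNN/LSTM
init.orthogonal_(lstm.weight_ih_l0, gain=1.0)

# --- Constant initialization ---
init.zeros_(linear.bias)          # zero the bias
init.ones_(bn.weight)             # set the BN gamma to 1
init.constant_(linear.weight, 0.1) # fill with a constant

# --- Identity initialization ---
init.eye_(linear.weight)  # useful for some special architectures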
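The scale formulas can be verified numerically. The sketch below uses an arbitrary 300-to-100 Linear layer and the default gain of 1: Xavier uniform samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), and Kaiming normal with mode='fan_in' and nonlinearity='relu' uses std = sqrt(2 / fan_in).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(300, 100)            # fan_in = 300, fan_out = 100

# Xavier uniform: every weight lies inside [-a, a].
nn.init.xavier_uniform_(layer.weight)
bound = math.sqrt(6.0 / (300 + 100))   # about 0.1225
max_abs = layer.weight.abs().max().item()
xavier_std_expected = bound / math.sqrt(3)   # std of U(-a, a), about 0.0707
xavier_std_actual = layer.weight.std().item()

# Kaiming normal (fan_in, relu): std = sqrt(2 / fan_in).
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
kaiming_std_expected = math.sqrt(2.0 / 300)  # about 0.0816
kaiming_std_actual = layer.weight.std().item()

print(f"Xavier bound {bound:.4f}, largest |w| {max_abs:.4f}")
print(f"Xavier std  expected {xavier_std_expected:.4f}, got {xavier_std_actual:.4f}")
print(f"Kaiming std expected {kaiming_std_expected:.4f}, got {kaiming_std_actual:.4f}")
```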

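Choosing an initialization scheme:

Activation / layer	Recommended initialization	Rationale
ReLU / LeakyReLU	Kaiming (He) initialization	Compensates for ReLU zeroing about half of the activations
Sigmoid / Tanh	Xavier (Glorot) initialization	Keeps input and output variance balanced
GELU / Swish	Kaiming initialization	Behaves like ReLU; use fan_in mode
LSTM / GRU	Orthogonal initialization	Helps against exploding/vanishing gradients and keeps the recurrent dynamics stable
Embedding	Normal distribution with a small standard deviation (e.g. 0.01)	A small-variance normal is usually sufficient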

4.2 A Custom Initialization Function

def weights_init(m):
    """A complete custom initialization function."""
    if isinstance(m, nn.Conv2d):
        # Conv2d: Kaiming normal
        init.kaiming_normal_(m.weight, mode='fan_out',
                             nonlinearity='relu')
        if m.bias is not None:
            init.zeros_(m.bias)
        print("  Conv2d initialized")

    elif isinstance(m, nn.BatchNorm2d):
        # BatchNorm: weight=1, bias=0
        init.ones_(m.weight)
        init.zeros_(m.bias)
        print("  BatchNorm2d initialized")

    elif isinstance(m, nn.Linear):
        # Linear: Xavier uniform
        init.xavier_uniform_(m.weight, gain=0.5)
        if m.bias is not None:
            init.zeros_(m.bias)
        print("  Linear initialized")

    elif isinstance(m, nn.LSTM):
        # LSTM: orthogonal weights, forget-gate bias set to 1
        for name, param in m.named_parameters():
            if 'weight' in name:
                init.orthogonal_(param)
            elif 'bias' in name:
                # PyTorch orders the gates as (input, forget, cell, output),
                # so the second quarter of each bias vector is the forget gate.
                # bias_ih and bias_hh are added, so the effective bias is 2.
                n = param.shape[0] // 4
                init.zeros_(param)
                with torch.no_grad():
                    param[n:2*n].fill_(1.0)
        print("  LSTM initialized (forget_gate_bias=1)")

# Apply to a complete model
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 10),
)
model.apply(weights_init)

Common initialization pitfalls:

All-zero (or all-constant) weights: every unit in a layer produces the same output and receives the same gradient, so the units never differentiate
Variance too large: gradients explode in deep networks and activations saturate
Variance too small: gradients vanish and the signal decays toward zero with depth
Forget-gate bias: initializing the LSTM forget-gate bias to 1 or larger (rather than 0) helps long-range memory
BatchNorm gamma/beta: initialize gamma to 1 and beta to 0 so the normalized distribution is preserved
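The symmetry problem in the first pitfall can be observed directly. In this sketch (a toy 4-3-1 network chosen for illustration), every parameter is set to the same constant, and the gradient rows of the first layer, one row per hidden unit, come out identical, so gradient descent can never make the units differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)      # identical value everywhere

x = torch.randn(16, 4)
net(x).pow(2).mean().backward()

grad = net[0].weight.grad          # shape (3, 4): one row per hidden unit
rows_identical = bool(
    torch.allclose(grad[0], grad[1]) and torch.allclose(grad[1], grad[2])
)
print(rows_identical)              # True
print(grad)
```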

4.3 Comparing Initialization Schemes

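def compare_init():
    """Compare the output variance produced by different initializations."""
    x = torch.randn(1000, 512)
    layer = nn.Linear(512, 512)

    # Xavier uniform
    init.xavier_uniform_(layer.weight)
    out1 = layer(x)
    print(f"Xavier uniform output variance: {out1.var().item():.4f}")

    # Kaiming normal
    init.kaiming_normal_(layer.weight, mode='fan_in',
                         nonlinearity='relu')
    out2 = layer(x)
    print(f"Kaiming normal output variance: {out2.var().item():.4f}")

    # Constant init (the weights have zero variance; the outputs still
    # vary with the input, but every output unit computes the same value)
    init.constant_(layer.weight, 0.01)
    out3 = layer(x)
    print(f"Constant init output variance: {out3.var().item():.4f}")

    # Identity init
    init.eye_(layer.weight)
    out4 = layer(x)
    print(f"Identity init output variance: {out4.var().item():.4f}")

compare_init()

# Approximate output (exact values vary from run to run):
# Xavier uniform output variance: about 1.0
# Kaiming normal output variance: about 2.0
# Constant init output variance: about 0.05  (512 * 0.01^2)
# Identity init output variance: about 1.0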

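5. Model Containers

PyTorch provides three main containers for organizing submodules: nn.Sequential, nn.ModuleList and nn.ModuleDict. Understanding how they differ and when each one fits is essential for building complex models.

Container	Structure	forward	Typical use
nn.Sequential	Ordered sequence	Runs the modules in order automatically	Straight-line networks, fixed pipelines
nn.ModuleList	List	Must be iterated manually	Variable depth, loop structures
nn.ModuleDict	Dictionary	Must be accessed by key	Conditional branches, lookup by name

5.1 nn.Sequential: Ordered Container

nn.Sequential runs its modules in the order they were added, which suits straight-line networks. The data flow between modules does not have to be written out.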

# Option 1: pass the modules in order
model1 = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Option 2: name the modules with an OrderedDict
from collections import OrderedDict
model2 = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 256)),
    ('relu1', nn.ReLU()),
    ('drop1', nn.Dropout(0.3)),
    ('fc2', nn.Linear(256, 128)),
    ('relu2', nn.ReLU()),
    ('fc3', nn.Linear(128, 10)),
]))

# Option 3: add modules dynamically with add_module
model3 = nn.Sequential()
model3.add_module('fc1', nn.Linear(784, 256))
model3.add_module('relu', nn.ReLU())
model3.add_module('fc2', nn.Linear(256, 10))

x = torch.randn(32, 784)
output = model3(x)  # modules run in order automatically
print(f"Sequential output: {output.shape}")

# Access submodules by index or by name
print(f"First layer: {model2[0]}")
print(f"Access by name: {model2.fc1}")

5.2 nn.ModuleList: List Container

nn.ModuleList behaves like a Python list but registers every module it holds. It does not implement forward, so you iterate over it yourself.

class ModuleListExample(nn.Module):
    """Shows the flexibility of ModuleList."""
    def __init__(self, layer_sizes, dropout=0.2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)

        # Build a network of arbitrary depth
        for in_size, out_size in zip(layer_sizes[:-1],
                                     layer_sizes[1:]):
            self.layers.append(nn.Linear(in_size, out_size))

    def forward(self, x):
        # Iterate over the layers manually
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.relu(x)
                x = self.dropout(x)
        return x

net = ModuleListExample([784, 512, 256, 128, 10])
x = torch.randn(32, 784)
print(f"ModuleList network output: {net(x).shape}")
print(f"Number of layers: {len(net.layers)}")

# Individual layers can be accessed by index
for i, layer in enumerate(net.layers):
    print(f"  Layer {i}: {layer}")

5.3 nn.ModuleDict: Dictionary Container

nn.ModuleDict gives access to modules by string key, which suits conditional branches and multi-head networks addressed by name.

class ModuleDictExample(nn.Module):
    """Uses ModuleDict for a multi-head network."""
    def __init__(self, input_dim, hidden_dim, task_configs):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )

        # One independent output head per task
        self.task_heads = nn.ModuleDict({
            task_name: nn.Linear(hidden_dim, num_classes)
            for task_name, num_classes in task_configs.items()
        })

    def forward(self, x, task_name=None):
        x = self.shared(x)
        if task_name is not None:
            # Run only the head of the requested task
            return self.task_heads[task_name](x)
        else:
            # Return the results of all tasks
            return {
                name: head(x)
                for name, head in self.task_heads.items()
            }

# Multi-task learning example
task_configs = {
    'sentiment': 3,      # 3-way sentiment classification
    'category': 10,      # 10-way category classification
    'urgency': 2,        # binary urgency classification
}

model = ModuleDictExample(768, 256, task_configs)
x = torch.randn(16, 768)
outputs = model(x)  # results for all tasks
for task, out in outputs.items():
    print(f"{task}: {out.shape}")

# A single task can also be run on its own
sentiment = model(x, task_name='sentiment')
print(f"Single-task output: {sentiment.shape}")

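Choosing among the three containers:

Use nn.Sequential when the network is a simple straight-line topology in which data passes through each layer in turn, with no skip connections or branches
Use nn.ModuleList when the number of layers is configurable, when layers are executed in a loop, or when specific layers are accessed by index
Use nn.ModuleDict when modules are accessed by name, for multi-task output heads, or for conditional branching
The containers can be nested: a ModuleList or ModuleDict can hold Sequential blocks, which allows complex topologies. The reverse does not work directly, because ModuleList and ModuleDict have no forward and therefore cannot be a stage inside a Sequential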
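One caveat when nesting containers: nn.ModuleList defines no forward, so placing one directly inside nn.Sequential constructs without complaint but fails when called. Unpacking the list with * passes the individual layers instead. A minimal sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
x = torch.randn(2, 8)

# Nesting the ModuleList itself: construction succeeds, the call fails.
broken = nn.Sequential(blocks, nn.Linear(8, 1))
try:
    broken(x)
    failed = False
except NotImplementedError:
    failed = True
print(failed)       # True

# Unpacking hands the individual layers to Sequential.
fixed = nn.Sequential(*blocks, nn.Linear(8, 1))
out = fixed(x)
print(out.shape)    # torch.Size([2, 1])
```

The opposite nesting, a ModuleList whose elements are Sequential blocks, works as expected because the enclosing module's forward iterates over the list itself.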

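6. Building a Complete Model

This section combines the pieces above into a complete image classification model with convolutional feature extraction, custom initialization, residual connections and a mix of containers.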

import torch
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Convolution block: Conv2d + BN + ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3,
                 stride=1, padding=1):
        super().__init__()
        # bias=False: the following BatchNorm has its own shift term
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class ResidualBlock(nn.Module):
    """Residual block: two convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            ConvBlock(channels, channels),
            ConvBlock(channels, channels),
        )

    def forward(self, x):
        return x + self.block(x)


class AdvancedCNN(nn.Module):
    """Complete image classification model."""
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor: a ModuleList of Sequential stages
        self.stages = nn.ModuleList([
            nn.Sequential(
                ConvBlock(3, 64, stride=1),   # 32x32
                ResidualBlock(64),
            ),
            nn.Sequential(
                ConvBlock(64, 128, stride=2), # 16x16
                ResidualBlock(128),
            ),
            nn.Sequential(
                ConvBlock(128, 256, stride=2), # 8x8
                ResidualBlock(256),
            ),
        ])

        # Classification head
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes),
        )

        # Initialize the parameters
        self.apply(self._init_weights)

    def _init_weights(self, m):
        """Custom initialization policy."""
        if isinstance(m, nn.Conv2d):
            init.kaiming_normal_(m.weight, mode='fan_out',
                                 nonlinearity='relu')
        elif isinstance(m, nn.BatchNorm2d):
            init.ones_(m.weight)
            init.zeros_(m.bias)
        elif isinstance(m, nn.Linear):
            init.xavier_uniform_(m.weight, gain=0.5)
            init.zeros_(m.bias)

    def forward(self, x):
        # Process stage by stage
        for stage in self.stages:
            x = stage(x)
        # Classification output (raw logits)
        return self.head(x)


# Instantiate and test
model = AdvancedCNN(num_classes=10)
x = torch.randn(4, 3, 32, 32)
output = model(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")    # (4, 10)
print(f"Output probabilities: {F.softmax(output, dim=1)}")

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel()
                       for p in model.parameters()
                       if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Save and load the model
torch.save(model.state_dict(), 'model.pth')
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load('model.pth', map_location='cpu'))
model.eval()  # switch to evaluation mode
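To see what a state_dict round trip actually preserves, the sketch below saves into an in-memory buffer instead of a file (an assumption made here only to avoid touching disk) and loads the weights into a freshly constructed copy of a small stand-in network. Buffers such as the BatchNorm running statistics travel along with the parameters.

```python
import io
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Linear(8, 2))

torch.manual_seed(0)
src = make_net()

buffer = io.BytesIO()                 # stands in for 'model.pth'
torch.save(src.state_dict(), buffer)
buffer.seek(0)

dst = make_net()                      # same architecture, different random weights
result = dst.load_state_dict(torch.load(buffer, map_location="cpu"))
print(result)                         # <All keys matched successfully>

src.eval()
dst.eval()
x = torch.randn(3, 4)
same = torch.equal(src(x), dst(x))
print(same)                           # True
print(list(src.state_dict().keys()))  # includes '1.running_mean', '1.running_var'
```

Because only tensors are stored, the loading side must rebuild the architecture in code first; load_state_dict then reports any missing or unexpected keys.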

6.1 Training Loop Example

# A complete training pipeline
device = torch.device('cuda' if torch.cuda.is_available()
                      else 'cpu')
model = AdvancedCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100
)

# Dummy data
dummy_inputs = torch.randn(64, 3, 32, 32).to(device)
dummy_labels = torch.randint(0, 10, (64,)).to(device)

# A single training step
model.train()
optimizer.zero_grad()
outputs = model(dummy_inputs)
loss = criterion(outputs, dummy_labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
# In a real loop, step the scheduler once per epoch when T_max counts
# epochs (or once per batch when T_max counts iterations).
scheduler.step()

print(f"Training loss: {loss.item():.4f}")
print(f"Current learning rate: {scheduler.get_last_lr()[0]:.6f}")

# Validation (here on the same dummy batch, for illustration only)
model.eval()
with torch.no_grad():
    val_outputs = model(dummy_inputs)
    val_loss = criterion(val_outputs, dummy_labels)
    _, predicted = torch.max(val_outputs, 1)
    accuracy = (predicted == dummy_labels).float().mean()
    print(f"Validation loss: {val_loss.item():.4f}")
    print(f"Accuracy: {accuracy.item():.2%}")

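Best practices in summary:

Code organization: give each layer or block its own class for readability and reuse
Initialization: call apply(init_fn) right after the model is constructed to initialize everything uniformly
Gradient clipping: use clip_grad_norm_ to guard against exploding gradients (especially with RNN/LSTM)
Device management: use .to(device) consistently for both the model and the data
Mode switching: call model.train() for training and model.eval() for evaluation so that Dropout and BatchNorm behave correctly
Zeroing gradients: call optimizer.zero_grad() before the backward pass of every batch
Learning-rate scheduling: adjust the learning rate with CosineAnnealingLR or ReduceLROnPlateau
Saving models: prefer saving state_dict() over pickling the whole model object, for better compatibility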

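7. Key Takeaways

The nn.Module base class: the foundation of every neural network module; __init__ declares the layers, forward defines the computation, and parameters and submodules are registered automatically
Parameter management: traverse parameters with parameters() / named_parameters(), initialize recursively with apply(), serialize with state_dict() / load_state_dict()
A rich set of built-in layers: Linear, Conv2d, LSTM, Embedding, Dropout, BatchNorm and activations such as ReLU/GELU form a complete library of deep learning operators
Flexible custom layers: subclass nn.Module, register parameters with nn.Parameter and buffers with register_buffer, and implement arbitrary forward logic
Weight initialization matters: Kaiming suits the ReLU family, Xavier suits Sigmoid/Tanh, orthogonal suits RNN/LSTM; a good initialization is a precondition for successful training
Each container has its place: Sequential for ordered execution, ModuleList for variable depth and loops, ModuleDict for lookup by name and multi-task branches
Combine containers for complex networks: nesting them (a ModuleList of Sequential stages, a ModuleDict of Sequential heads) allows arbitrary topologies
A complete training pipeline: device management, mode switching, gradient clipping, learning-rate scheduling and model serialization close the training loop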

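8. Further Thoughts

The design of nn.Module shows how well object-oriented programming fits a deep learning framework. By favoring composition and modularity, PyTorch lets researchers assemble complex networks like building blocks.

Directions for further study:

Custom autograd Function: subclass torch.autograd.Function to define fully custom forward and backward passes, for example when implementing a new operator
Mixed-precision training: use automatic mixed precision from torch.amp (torch.cuda.amp in older releases) on top of nn.Module to speed up training substantially
Distributed training: wrap the model in nn.parallel.DistributedDataParallel for multi-GPU training
TorchScript / FX conversion: convert an nn.Module into a static graph for production deployment
nn.Module hooks: use register_forward_hook and register_full_backward_hook (which supersedes the deprecated register_backward_hook) for feature-map visualization and gradient monitoring
Quantized deployment: quantize an nn.Module with torch.ao.quantization (formerly torch.quantization) to fit mobile and edge devices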
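As a taste of the hook mechanism mentioned in the list, the sketch below captures the output of an intermediate layer with register_forward_hook. The network, the activations dictionary and the save_output helper are illustrative names, not part of any PyTorch API.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
activations = {}

def save_output(name):
    # A forward hook receives (module, inputs, output) after forward runs.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

handle = net[1].register_forward_hook(save_output("relu"))
net(torch.randn(5, 4))
handle.remove()              # always remove hooks that are no longer needed

net(torch.randn(7, 4))       # hook removed: the stored tensor is not overwritten
print(activations["relu"].shape)  # torch.Size([5, 8])
```

Hooks fire only when the module is invoked as net(x), which is one more reason to avoid calling forward directly.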

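Design-pattern perspective on nn.Module:

nn.Module is a classic application of the Composite pattern in a deep learning framework. Each module can be a simple layer (a leaf node) or a complex network composed of several submodules (a container node). As a consequence:

Users can treat a single layer and a whole network in the same way
The recursive structure of networks makes code highly reusable
Training infrastructure (parameter management, device moves, serialization) works the same for a complex network as for a single layer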