MobileNet and Lightweight Networks: Study Notes

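1. Overview: Why Lightweight Networks Are Needed

As deep learning has spread across computer vision, large networks such as VGG, ResNet, and Inception have reached top accuracy on benchmarks like ImageNet. These models, however, carry tens of millions to over a hundred million parameters and need billions of floating-point operations (FLOPs) per inference, which makes them hard to deploy on phones, embedded devices, IoT endpoints, and other resource-constrained targets.

The goal of a lightweight network is to cut parameter count and compute drastically while keeping accuracy acceptable, so that the model can run in real time on mobile hardware. Representative families include Google's MobileNet (V1/V2/V3), Megvii's ShuffleNet (V1/V2), and EfficientNet-Lite. They rely on factorized convolutions, channel-level operations, and architecture search to improve efficiency.

Core challenges for lightweight networks

Efficiency vs. accuracy: how to remove 90% or more of the compute while keeping the accuracy drop acceptable (typically 2-5%)
Hardware fit: the design has to account for how mobile CPUs/GPUs/NPUs actually execute (cache friendliness, quantization support, and so on)
Latency constraints: deployments impose hard limits on inference latency (often under 30 ms)
Energy limits: battery capacity on mobile devices is limited, so inference energy must be kept low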

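2. Depthwise Separable Convolution

Depthwise separable convolution is the building block shared by the lightweight networks in these notes. It factorizes a standard convolution into two separate steps: a depthwise convolution and a pointwise (1x1) convolution. The factorization greatly reduces parameters and compute at the cost of only a small drop in accuracy.

2.1 Standard convolution recap

A standard convolution layer takes a D_F x D_F x M input feature map (D_F is the spatial size, M the number of input channels), applies N kernels of size D_K x D_K x M, and produces a D_F x D_F x N output (stride 1 with same padding). Its cost, counted in multiply-adds (these notes use "FLOPs" for the same quantity), is:

Cost(standard) = D_K \times D_K \times M \times N \times D_F \times D_F

Params(standard) = D_K \times D_K \times M \times N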

# Standard convolution: PyTorch implementation
import torch.nn as nn

# Input: 3x224x224, output: 64x224x224, kernel: 3x3
standard_conv = nn.Conv2d(
    in_channels=3,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
    bias=False
)
# Parameters: 3 x 3 x 3 x 64 = 1,728
# Cost: 1,728 x 224 x 224 ≈ 86.7M multiply-adds
print(f"Standard conv parameters: {standard_conv.weight.numel():,}")

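2.2 Depthwise convolution

A depthwise convolution applies one spatial kernel to each input channel independently. With M input channels it uses M kernels of size D_K x D_K x 1, each seeing a single channel, so the number of output channels equals the number of input channels. Its cost and parameter count are:

Cost(DW) = D_K \times D_K \times M \times D_F \times D_F

Params(DW) = D_K \times D_K \times M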

# Depthwise convolution: PyTorch implementation
depthwise_conv = nn.Conv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding=1,
    groups=64,  # groups=in_channels makes the convolution depthwise
    bias=False
)
# Parameters: 3 x 3 x 64 = 576
# Cost on a 112x112 feature map: 576 x 112 x 112 ≈ 7.2M multiply-adds
print(f"Depthwise parameters: {depthwise_conv.weight.numel():,}")

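Key difference: groups=in_channels connects each kernel to exactly one input channel, so a depthwise convolution does no mixing across channels. That job is left to the pointwise convolution.

2.3 Pointwise convolution (1x1)

A pointwise convolution uses 1x1 kernels to mix information across channels. It maps the D_F x D_F x M output of the depthwise step to a D_F x D_F x N output using N kernels of size 1x1xM. Its cost and parameter count are:

Cost(PW) = M \times N \times D_F \times D_F

Params(PW) = M \times N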

# Pointwise convolution: PyTorch implementation
pointwise_conv = nn.Conv2d(
    in_channels=64,
    out_channels=128,
    kernel_size=1,
    stride=1,
    padding=0,
    bias=False
)
# Parameters: 1 x 1 x 64 x 128 = 8,192
# Cost on a 112x112 feature map: 8,192 x 112 x 112 ≈ 102.8M multiply-adds
print(f"Pointwise parameters: {pointwise_conv.weight.numel():,}")

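2.4 Cost comparison

After factorizing a standard convolution into depthwise + pointwise, the total cost is:

Cost(DW+PW) = D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F

Its ratio to the cost of the standard convolution is:

Ratio = (D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F) / (D_K \times D_K \times M \times N \times D_F \times D_F) = 1/N + 1/D_K²

With D_K=3 (3x3 kernels), 1/D_K² = 1/9. As the number of output channels N grows, 1/N approaches 0, so the theoretical cost of a depthwise separable convolution is roughly 1/9 to 1/8 of a standard convolution. For example, N=128 gives 1/128 + 1/9 ≈ 0.119, about 1/8.4.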
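
The ratio above can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative and not part of the original notes, and costs are counted as multiply-adds under the same stride-1, same-padding assumption as the formulas.

```python
def standard_cost(kernel_size, in_ch, out_ch, feat_size):
    """Multiply-adds of a standard convolution: D_K^2 * M * N * D_F^2."""
    return kernel_size ** 2 * in_ch * out_ch * feat_size ** 2


def separable_cost(kernel_size, in_ch, out_ch, feat_size):
    """Multiply-adds of depthwise + pointwise: D_K^2 * M * D_F^2 + M * N * D_F^2."""
    depthwise = kernel_size ** 2 * in_ch * feat_size ** 2
    pointwise = in_ch * out_ch * feat_size ** 2
    return depthwise + pointwise


def closed_form_ratio(kernel_size, out_ch):
    """Closed-form ratio 1/N + 1/D_K^2."""
    return 1.0 / out_ch + 1.0 / kernel_size ** 2


# The ratio depends only on the kernel size and the number of output channels.
for n in (64, 128, 512, 1024):
    measured = separable_cost(3, 64, n, 112) / standard_cost(3, 64, n, 112)
    print(f"N={n:4d}: measured={measured:.4f}, closed form={closed_form_ratio(3, n):.4f}")
```

The input channel count and the feature-map size cancel out of the ratio, which is why only N and D_K appear in the closed form.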

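Convolution type	Parameters	Cost (multiply-adds)	Notes
Standard (3×3, 3→64, 224×224)	1,728	86.7M	Example from 2.1
Depthwise (3×3, 64→64, 112×112)	576	7.2M	Spatial filtering only
Pointwise (1×1, 64→128, 112×112)	8,192	102.8M	Cross-channel mixing
DW+PW total (64→128, 112×112)	8,768	110.0M	About 1/8 of the equivalent standard convolution
Equivalent standard (3×3, 64→128, 112×112)	73,728	924.8M	Same input and output shape as DW+PW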

# Complete depthwise separable convolution module
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding,
            groups=in_channels, bias=False
        )
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, 1,
            stride=1, padding=0, bias=False
        )
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.pointwise(x)
        x = self.bn2(x)
        x = self.relu(x)
        return x

# Quick test of the module
model = DepthwiseSeparableConv(64, 128, 3, 1)
dummy = torch.randn(1, 64, 112, 112)
output = model(dummy)
print(f"Output shape: {output.shape}")  # torch.Size([1, 128, 112, 112])

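Depthwise separable convolution: summary

Core idea: decouple spatial filtering from channel mixing; filter each channel spatially first, then mix channels with a 1x1 convolution
Efficiency: theoretical cost drops to roughly 1/9 to 1/8 of a standard 3x3 convolution, with a similar reduction in parameters
Accuracy: the loss on ImageNet is typically within 1-3% and can be offset by adding channels
Hardware: 1x1 convolutions map to dense matrix multiplication, which has efficient implementations on both GPUs and CPUs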

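3. MobileNetV1: A Milestone Lightweight Network

MobileNetV1 (Howard et al., 2017) is Google's first lightweight convolutional network designed for mobile devices. Its core idea is to build the whole network from depthwise separable convolutions instead of standard ones, and to expose two hyperparameters, the width multiplier α and the resolution multiplier ρ, for a flexible accuracy/efficiency trade-off.

3.1 Architecture

MobileNetV1 is a VGG-style straight stack with 28 layers when depthwise and pointwise convolutions are counted separately: 1 standard convolution, 13 depthwise and 13 pointwise convolutions, and 1 fully connected layer. Each convolution is followed by batch normalization and a ReLU-type activation. Only the first layer is a standard convolution; all the others are depthwise separable. The main structure is: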

Input (224×224×3)
  │
  ▼
[Conv2d 3×3, s2]      → 112×112×32    (standard convolution, first layer only)
[DWConv 3×3, s1]      → 112×112×32
[Conv2d 1×1, s1]      → 112×112×64
[DWConv 3×3, s2]      → 56×56×64
[Conv2d 1×1, s1]      → 56×56×128
[DWConv 3×3, s1]      → 56×56×128
[Conv2d 1×1, s1]      → 56×56×128
[DWConv 3×3, s2]      → 28×28×128
[Conv2d 1×1, s1]      → 28×28×256
[DWConv 3×3, s1]      → 28×28×256
[Conv2d 1×1, s1]      → 28×28×256
[DWConv 3×3, s2]      → 14×14×256
[Conv2d 1×1, s1]      → 14×14×512
[DWConv 3×3, s1] ×5   → 14×14×512     (this DW + 1×1 pair repeats 5 times)
[Conv2d 1×1, s1] ×5   → 14×14×512
[DWConv 3×3, s2]      → 7×7×512
[Conv2d 1×1, s1]      → 7×7×1024
[DWConv 3×3, s1]      → 7×7×1024
[Conv2d 1×1, s1]      → 7×7×1024
  │
  ▼
[AvgPool 7×7]         → 1×1×1024
[FC 1024→1000]        → output (1000 classes)
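
As a sanity check on the structure, the layer count and the total cost can be recomputed from the channel and stride settings. This is a rough sketch in plain Python, assuming the standard configuration (a stride-2 stem with 32 channels followed by 13 depthwise separable blocks, the same configuration used in the PyTorch code in this section); batch normalization and activations are ignored, and the names are illustrative.

```python
# (out_channels, stride) for the 13 depthwise separable blocks after the stem
BLOCKS = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
          (512, 1), (512, 1), (512, 1), (512, 1), (512, 1),
          (1024, 2), (1024, 1)]


def mobilenet_v1_cost(input_size=224, num_classes=1000):
    """Multiply-adds per layer type for MobileNetV1 at width multiplier 1.0."""
    size = input_size // 2                 # the stem convolution has stride 2
    channels = 32
    stem = 3 * 3 * 3 * channels * size * size
    depthwise = pointwise = 0
    for out_channels, stride in BLOCKS:
        size //= stride                    # cost is counted at the output resolution
        depthwise += 3 * 3 * channels * size * size
        pointwise += channels * out_channels * size * size
        channels = out_channels
    fc = channels * num_classes
    return {"stem": stem, "depthwise": depthwise, "pointwise": pointwise, "fc": fc}


cost = mobilenet_v1_cost()
total = sum(cost.values())
num_layers = 1 + 2 * len(BLOCKS) + 1       # stem + (DW, PW) pairs + FC

print(f"layers: {num_layers}")
print(f"total multiply-adds: {total / 1e6:.1f}M")
for name, value in cost.items():
    print(f"  {name:9s}: {value / total:6.2%}")
```

Under these assumptions the total comes out close to the 569M figure quoted for α=1.0, and the 1x1 convolutions account for the large majority of the compute.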

# MobileNetV1 core architecture in PyTorch
import torch.nn as nn

class MobileNetV1(nn.Module):
    def __init__(self, num_classes=1000, width_mult=1.0):
        super().__init__()

        def conv_bn(inp, oup, stride):
            return nn.Sequential(
                nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
                nn.BatchNorm2d(oup),
                nn.ReLU6(inplace=True)
            )

        def conv_dw(inp, oup, stride):
            return nn.Sequential(
                nn.Conv2d(inp, inp, 3, stride, 1,
                          groups=inp, bias=False),
                nn.BatchNorm2d(inp),
                nn.ReLU6(inplace=True),
                nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
                nn.ReLU6(inplace=True)
            )

        # Width multiplier: scale the channel count
        def round_channels(ch):
            return max(1, int(ch * width_mult))

        self.model = nn.Sequential(
            conv_bn(3, round_channels(32), 2),        # standard convolution
            conv_dw(round_channels(32), round_channels(64), 1),
            conv_dw(round_channels(64), round_channels(128), 2),
            conv_dw(round_channels(128), round_channels(128), 1),
            conv_dw(round_channels(128), round_channels(256), 2),
            conv_dw(round_channels(256), round_channels(256), 1),
            conv_dw(round_channels(256), round_channels(512), 2),
            # 5 repeated stride-1 depthwise separable blocks
            *[conv_dw(round_channels(512), round_channels(512), 1)
              for _ in range(5)],
            conv_dw(round_channels(512), round_channels(1024), 2),
            conv_dw(round_channels(1024), round_channels(1024), 1),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(round_channels(1024), num_classes)

    def forward(self, x):
        x = self.model(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

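3.2 Width multiplier α

The width multiplier α ∈ (0, 1] thins every layer uniformly: the number of input channels M becomes αM and the number of output channels N becomes αN. Cost and parameter count shrink roughly by a factor of α².

Cost(α) = D_K² \times αM \times D_F² + αM \times αN \times D_F² \approx α² \times Cost(original)

The approximation holds because the pointwise term, which dominates, scales with α²; the depthwise term scales only with α.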

# Effect of the width multiplier on model size
alpha_values = [1.0, 0.75, 0.5, 0.25]
for alpha in alpha_values:
    model = MobileNetV1(num_classes=1000, width_mult=alpha)
    total_params = sum(p.numel() for p in model.parameters())
    # Roughly 4.2M (α=1.0), 2.6M (α=0.75), 1.3M (α=0.5), 0.5M (α=0.25)
    print(f"α={alpha:.2f}: parameters={total_params:,}")

Typical width multiplier settings and their performance:

Width multiplier α	Parameters	ImageNet Top-1	Cost (multiply-adds)
1.0	4.2M	70.6%	569M
0.75	2.6M	68.4%	325M
0.50	1.3M	63.7%	149M
0.25	0.5M	50.6%	41M

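3.3 Resolution multiplier ρ

The resolution multiplier ρ ∈ (0, 1] shrinks the input image. With ρ applied, the input side length goes from 224 to ρ×224, and every internal feature map shrinks by the same factor. Cost drops roughly by a factor of ρ²; the parameter count does not change.

Cost(ρ) \approx ρ \times ρ \times Cost(original) = ρ² \times Cost(original)

Combining α and ρ, the cost of MobileNetV1 can be written as:

Total cost \approx α \times α \times ρ \times ρ \times Cost(base)

Input resolution	ρ	Cost (α=1.0, multiply-adds)	Top-1 accuracy
224×224	1.0	569M	70.6%
192×192	0.857	418M	69.1%
160×160	0.714	290M	67.2%
128×128	0.571	186M	64.4%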
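
The scaling rule can be compared against the tabulated numbers. The sketch below assumes the 569M baseline from the tables in this section and applies the α²ρ² approximation; `scaled_cost` is an illustrative helper name, not something from the MobileNet paper.

```python
BASE_COST_M = 569          # multiply-adds in millions at α=1.0, 224×224
BASE_RESOLUTION = 224


def scaled_cost(alpha=1.0, resolution=224):
    """Approximate cost in millions of multiply-adds: α² · ρ² · base."""
    rho = resolution / BASE_RESOLUTION
    return alpha ** 2 * rho ** 2 * BASE_COST_M


print("Resolution multiplier (α = 1.0):")
for res in (224, 192, 160, 128):
    print(f"  {res}x{res}: ~{scaled_cost(resolution=res):.0f}M")

print("Width multiplier (224x224):")
for alpha in (1.0, 0.75, 0.5, 0.25):
    print(f"  α={alpha:.2f}: ~{scaled_cost(alpha=alpha):.0f}M")
```

The ρ² rule reproduces the resolution table almost exactly. For α the estimate comes out somewhat below the reported values (about 320M versus 325M at α=0.75), because the depthwise convolutions and the classifier scale linearly in α rather than quadratically.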

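3.4 Efficiency comparison: MobileNetV1 vs. classic networks

Model	Parameters	Cost (multiply-adds)	ImageNet Top-1	Compute reduction vs. VGG-16 (FLOPs ratio)
VGG-16	138M	15,300M	71.5%	1x (baseline)
GoogleNet	6.8M	1,550M	69.8%	~10x
ResNet-50	25.6M	3,800M	76.0%	~4x
MobileNetV1 (α=1.0)	4.2M	569M	70.6%	~27x
MobileNetV1 (α=0.5)	1.3M	149M	63.7%	~103x

MobileNetV1: key contributions

First systematic use of depthwise separable convolution to build a complete large-scale vision network
Introduces the width multiplier α and the resolution multiplier ρ for flexible accuracy/efficiency tuning
Compared with VGG-16, keeps similar accuracy (70.6% vs 71.5%) with about 33x fewer parameters and 27x less compute
Makes real-time inference on mobile CPUs practical; the actual latency depends on the device and the runtime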

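4. MobileNetV2: Inverted Residuals and Linear Bottlenecks

MobileNetV2 (Sandler et al., 2018) adds two key ideas on top of V1: the inverted residual block and the linear bottleneck. The design draws on residual networks and on a manifold view of activations, and it improves on V1 in both efficiency and accuracy.

4.1 Linear bottleneck

A core insight of MobileNetV2 is that ReLU can destroy a lot of information in low-dimensional spaces. When the features of interest lie on a low-dimensional manifold, ReLU zeroes every negative pre-activation; if the layer has few channels, a large part of the manifold collapses and cannot be recovered, so ReLU is not an invertible map on such features. If the manifold is first embedded in a much higher-dimensional space, information lost in one channel is likely to survive in others.

To avoid this loss, V2 omits the non-linearity after the low-dimensional projection (the 1x1 convolution that reduces channels at the end of each block) and applies ReLU6 only in the expanded, high-dimensional space. This is the "linear bottleneck": the output layer of the block is linear.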
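
A small experiment can make this intuition concrete. The sketch below is an illustrative proxy, not the experiment from the MobileNetV2 paper: it embeds points from a unit circle (a 1-D manifold in 2-D) into `embed_dim` dimensions with a random matrix, applies ReLU, and counts how many points are mapped to the all-zero vector, i.e. become indistinguishable from each other. Points that are only partly zeroed can also lose information, which this count does not capture.

```python
import math
import random


def collapsed_fraction(embed_dim, num_points=720, seed=0):
    """Fraction of unit-circle points that ReLU maps to the zero vector
    after a random linear embedding into `embed_dim` dimensions."""
    rng = random.Random(seed)
    # Random embedding matrix T with shape (embed_dim, 2)
    T = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(embed_dim)]
    collapsed = 0
    for i in range(num_points):
        angle = 2.0 * math.pi * i / num_points
        x, y = math.cos(angle), math.sin(angle)
        # ReLU(T @ p) is all zeros exactly when every pre-activation is <= 0
        if all(a * x + b * y <= 0.0 for a, b in T):
            collapsed += 1
    return collapsed / num_points


for dim in (2, 3, 5, 15, 30):
    print(f"embed_dim={dim:2d}: collapsed fraction = {collapsed_fraction(dim):.3f}")
```

With only a few output dimensions a visible part of the circle is typically wiped out, while with many dimensions essentially no point collapses completely, which matches the argument for applying ReLU only in the expanded space.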

# Linear bottleneck block (inverted residual with a linear projection)
class LinearBottleneck(nn.Module):
    """Linear bottleneck block used in MobileNetV2"""
    def __init__(self, in_channels, out_channels, stride, expand_ratio):
        super().__init__()
        hidden_dim = in_channels * expand_ratio
        self.use_residual = (stride == 1 and in_channels == out_channels)

        layers = []
        if expand_ratio != 1:
            # 1) Expansion (pointwise, increases channels) + ReLU6
            layers.extend([
                nn.Conv2d(in_channels, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU6(inplace=True)
            ])

        # 2) Depthwise convolution + ReLU6
        layers.extend([
            nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1,
                      groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU6(inplace=True),

            # 3) Pointwise projection (linear bottleneck: no activation)
            nn.Conv2d(hidden_dim, out_channels, 1, 1, 0, bias=False),
            nn.BatchNorm2d(out_channels)
            # Note: there is no ReLU here
        ])

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        else:
            return self.conv(x)

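4.2 Inverted residual

Unlike the classic residual block (for example the ResNet bottleneck), MobileNetV2 uses an inverted residual that expands first, convolves, and then projects back down. The two compare as follows:

Property	Classic residual (ResNet bottleneck)	Inverted residual (MobileNetV2)
Order of operations	reduce → convolution → expand	expand → depthwise convolution → project
Channel width along the block	wide → narrow → wide	narrow → wide → narrow
Shape	Hourglass (wide ends, narrow middle)	Spindle (narrow ends, wide middle)
Middle convolution	Standard 3×3 convolution	Depthwise 3×3 convolution
Shortcut connects	The wide (high-channel) tensors	The narrow bottleneck tensors
Expansion ratio	0.25 (compression)	6 (expansion)

ResNet bottleneck (hourglass):

Input (256 channels)
  │
  ▼
Conv 1×1, 64 channels      ← reduce (1/4)
  │
  ▼
Conv 3×3, 64 channels      ← standard convolution
  │
  ▼
Conv 1×1, 256 channels     ← expand (restore)
  │
  └─── + (shortcut) ──→ output

MobileNetV2 inverted residual (spindle):

Input (24 channels)        ← already low-dimensional
  │
  ▼
Conv 1×1, 144 channels     ← expand (×6)
  │
  ▼
DWConv 3×3, 144 channels   ← depthwise convolution
  │
  ▼
Conv 1×1, 24 channels      ← project (linear bottleneck, no ReLU)
  │
  └─── + (shortcut) ──→ output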

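4.3 Expansion layer

The expansion ratio (usually written t) is an important hyperparameter of MobileNetV2. It sets the channel count of the intermediate features inside a block. The default is t=6, so the intermediate tensor has six times as many channels as the block input.

Intermediate channels = input channels \times t (typical value t=6)

The reasoning is as follows: placing the shortcut at the low-dimensional bottleneck keeps the tensors that must be stored small and saves memory, while running the depthwise convolution in the high-dimensional space extracts richer spatial features and loses less information to ReLU.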
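
To see what the expansion costs, the multiply-adds of one stride-1 inverted residual block can be written out term by term. This is a minimal sketch with illustrative names; it counts only the three convolutions and ignores batch normalization, activations, and the shortcut addition.

```python
def inverted_residual_cost(h, w, in_ch, out_ch, expand_ratio=6, kernel_size=3):
    """Multiply-adds of expand (1x1) + depthwise (kxk) + project (1x1), stride 1."""
    hidden = in_ch * expand_ratio
    expand = h * w * in_ch * hidden
    depthwise = h * w * kernel_size ** 2 * hidden
    project = h * w * hidden * out_ch
    return {"expand": expand, "depthwise": depthwise, "project": project}


def standard_conv_cost(h, w, in_ch, out_ch, kernel_size=3):
    """Multiply-adds of a standard kxk convolution, stride 1."""
    return h * w * kernel_size ** 2 * in_ch * out_ch


# The 24 -> 144 -> 24 block from the diagram in 4.2, on a 56x56 feature map
block = inverted_residual_cost(56, 56, 24, 24, expand_ratio=6)
block_total = sum(block.values())
for name, value in block.items():
    print(f"{name:9s}: {value / 1e6:6.2f}M")
print(f"block total: {block_total / 1e6:.2f}M")

# For comparison: a standard 3x3 convolution working directly at 144 channels
print(f"standard 3x3 at 144 channels: {standard_conv_cost(56, 56, 144, 144) / 1e6:.2f}M")
```

The total equals h · w · t · c_in · (c_in + k² + c_out). Even with a 6x expansion, the block stays far cheaper than a standard 3x3 convolution at the expanded width, because the only 3x3 operation is depthwise.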

# MobileNetV2 implementation (core parts)
class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, width_mult=1.0):
        super().__init__()

        def _make_block(inp, oup, stride, expand_ratio):
            return LinearBottleneck(inp, oup, stride, expand_ratio)

        # Scale the stem with the width multiplier as well
        input_ch = int(32 * width_mult)
        # The last layer is not shrunk when width_mult < 1
        last_ch = int(1280 * max(1.0, width_mult))

        # Initial standard convolution
        features = [nn.Sequential(
            nn.Conv2d(3, input_ch, 3, 2, 1, bias=False),
            nn.BatchNorm2d(input_ch),
            nn.ReLU6(inplace=True)
        )]

        # Inverted residual settings: (expand_ratio, channels, repeats, stride)
        inverted_residual_settings = [
            # t, c, n, s
            (1, 16, 1, 1),     # expansion ratio 1 = no expansion
            (6, 24, 2, 2),     # stride=2 downsamples
            (6, 32, 3, 2),
            (6, 64, 4, 2),
            (6, 96, 3, 1),
            (6, 160, 3, 2),
            (6, 320, 1, 1),
        ]

        for t, c, n, s in inverted_residual_settings:
            oup_ch = int(c * width_mult)
            for i in range(n):
                # Only the first block of each stage uses the stage stride
                stride = s if i == 0 else 1
                features.append(
                    _make_block(input_ch, oup_ch, stride, t)
                )
                input_ch = oup_ch

        # Final 1x1 expansion + global pooling + classifier
        features.append(
            nn.Sequential(
                nn.Conv2d(input_ch, last_ch, 1, 1, 0, bias=False),
                nn.BatchNorm2d(last_ch),
                nn.ReLU6(inplace=True)
            )
        )

        self.features = nn.Sequential(*features)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(last_ch, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

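4.4 Efficiency and accuracy of MobileNetV2

Model	Parameters	Cost (multiply-adds)	ImageNet Top-1	Change vs. V1
MobileNetV1 (α=1.0)	4.2M	569M	70.6%	-
MobileNetV2 (α=1.0)	3.4M	300M	72.0%	+1.4%, about half the compute
MobileNetV2 (α=1.4)	6.9M	585M	74.7%	+4.1%

MobileNetV2: summary of contributions

Inverted residual: expand first, project afterwards; the shortcut sits at the low-dimensional bottleneck, which saves memory
Linear bottleneck: no ReLU on the low-dimensional output, which preserves the low-dimensional manifold and reduces information loss
More efficient architecture: higher accuracy (72.0% vs 70.6%) with less compute (300M vs 569M)
Expansion ratio: default t=6, balancing the cost of expansion against the accuracy gain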

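5. MobileNetV3: NAS and Hardware-Aware Optimization

MobileNetV3 (Howard et al., 2019) is the third generation of the MobileNet family and combines neural architecture search (NAS) with hardware-aware network design. On top of V2's inverted residual block it adds the h-swish activation and the SE (Squeeze-and-Excitation) attention module, and it uses the NetAdapt algorithm for layer-wise tuning.

5.1 Architecture search (platform-aware NAS)

MobileNetV3 uses two search techniques to find its architecture:

MnasNet-style NAS: searches the global block structure under a resource constraint, maximizing the reward ACC(m) × [LAT(m) / TAR]^w, where ACC is the accuracy of model m, LAT its latency measured on the target device, TAR the target latency, and w < 0 sets the accuracy/latency trade-off; the search looks for Pareto-optimal solutions
NetAdapt: fine-tunes the NAS result layer by layer, adjusting the channel count of one layer at a time to maximize accuracy while staying within the latency budget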

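# Sketch of the NAS reward used in the MobileNetV3 search
# Multi-objective: maximize accuracy while penalizing latency
def nas_objective(accuracy, latency, target_latency, w=-0.15):
    """
    MnasNet-style multi-objective reward.
    accuracy: validation accuracy
    latency: latency measured on the target device (ms)
    target_latency: target latency (ms)
    w: latency exponent; it is negative, so latency above the target
       lowers the reward and latency below the target raises it
    """
    return accuracy * (latency / target_latency) ** w

# Illustrative search dimensions (simplified):
# - Block type: standard convolution / depthwise convolution / inverted residual
# - Kernel size: 3x3 / 5x5
# - Channel count: 16 ~ 1024
# - Expansion ratio: 3 / 4 / 6
# - Whether to use an SE module
# - Activation: ReLU / h-swish (in the paper h-swish was introduced by hand, not searched)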
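
A worked example shows how this reward trades accuracy against latency. The candidates below are hypothetical numbers chosen purely for illustration, and `mnas_reward` restates the reward formula so that the snippet runs on its own; w = -0.15 matches the default used in the sketch above.

```python
def mnas_reward(accuracy, latency_ms, target_ms, w=-0.15):
    """Reward = ACC * (LAT / TAR) ** w, with w < 0."""
    return accuracy * (latency_ms / target_ms) ** w


# Hypothetical candidates: name -> (accuracy, measured latency in ms)
candidates = {
    "A": (0.750, 80.0),    # exactly on target
    "B": (0.745, 60.0),    # slightly less accurate, clearly faster
    "C": (0.760, 120.0),   # more accurate, well over the target
}
TARGET_MS = 80.0

rewards = {
    name: mnas_reward(acc, lat, TARGET_MS)
    for name, (acc, lat) in candidates.items()
}
ranked = sorted(rewards, key=rewards.get, reverse=True)

for name in ranked:
    acc, lat = candidates[name]
    print(f"{name}: acc={acc:.3f}, latency={lat:.0f}ms, reward={rewards[name]:.4f}")
```

A model exactly on target keeps its accuracy as the reward. The faster candidate B overtakes A despite being 0.5 points less accurate, and C's extra accuracy does not make up for running 50% over the target.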

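5.2 The h-swish activation

MobileNetV3 introduces h-swish (the hard version of Swish). The original Swish is f(x) = x * sigmoid(x); it improves accuracy, but sigmoid is expensive to compute on mobile hardware. h-swish replaces sigmoid with the piecewise-linear approximation ReLU6(x+3)/6:

h-swish(x) = x \times ReLU6(x+3) / 6

Advantages of h-swish over Swish:

Cheap to compute: it needs only ReLU6 and basic arithmetic, with no exponential
Quantization friendly: the gate ReLU6(x+3)/6 is piecewise linear and bounded in [0, 1], so it avoids the numerical error of approximating sigmoid in fixed point (h-swish itself is not bounded above: it equals x for x ≥ 3)
Self-gating: it keeps the non-linear, self-gated behaviour of Swish; it is less smooth, but accuracy is close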
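
The closeness of the two functions can be checked numerically. The sketch below uses only the Python standard library and evaluates both on a grid; the helper names are illustrative.

```python
import math


def relu6(x):
    return min(max(x, 0.0), 6.0)


def swish(x):
    """x * sigmoid(x)"""
    return x / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise-linear gate in [0, 1]: ReLU6(x + 3) / 6"""
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    """x * ReLU6(x + 3) / 6"""
    return x * hard_sigmoid(x)


grid = [i / 100.0 for i in range(-800, 801)]           # -8.00 ... 8.00
max_gap = max(abs(swish(x) - hard_swish(x)) for x in grid)
min_value, argmin = min((hard_swish(x), x) for x in grid)

print(f"largest |swish - h-swish| on [-8, 8]: {max_gap:.4f}")
print(f"minimum of h-swish: {min_value:.4f} at x = {argmin:.2f}")
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}: swish={swish(x):7.4f}  h-swish={hard_swish(x):7.4f}")
```

On this grid the two functions differ most near x = ±3, where the hard gate saturates. h-swish is exactly 0 for x ≤ -3 and exactly x for x ≥ 3, and its minimum is -0.375 at x = -1.5.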

# h-swish activation implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardSwish(nn.Module):
    """h-swish activation used in MobileNetV3"""
    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        # h-swish(x) = x * ReLU6(x + 3) / 6
        return x * F.relu6(x + 3.0, inplace=self.inplace) / 6.0

class HardSigmoid(nn.Module):
    """Hardware-friendly sigmoid approximation, used as the gate in the SE module"""
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # h-sigmoid(x) = ReLU6(x + 3) / 6
        return F.relu6(x + 3.0) / 6.0

# Quick test of h-swish
test_input = torch.tensor([-5.0, -3.0, -1.0, 0.0, 1.0, 3.0, 5.0])
hs = HardSwish(inplace=False)
print(f"h-swish: {hs(test_input)}")
# Values (rounded): [-0.0, -0.0, -0.33, 0.0, 0.67, 3.0, 5.0]

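5.3 SE attention module (Squeeze-and-Excitation)

MobileNetV3 adds a lightweight SE (Squeeze-and-Excitation) module inside the inverted residual block. The module recalibrates the importance of each channel adaptively through the sequence: global average pooling → channel reduction (fully connected / 1x1 convolution) → ReLU → channel restoration → h-sigmoid gate → channel-wise scaling.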

# SE module used in MobileNetV3
class SqueezeExcitation(nn.Module):
    """Lightweight squeeze-and-excitation attention module"""
    def __init__(self, in_channels, reduced_channels=None):
        super().__init__()
        reduced_channels = reduced_channels or max(1, in_channels // 4)

        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # Squeeze: aggregate global spatial information
            nn.Conv2d(in_channels, reduced_channels, 1),  # reduce channels
            nn.ReLU(inplace=True),             # non-linearity (MobileNetV3's SE uses ReLU here)
            nn.Conv2d(reduced_channels, in_channels, 1),  # restore channels
            HardSigmoid()                      # gate values in [0, 1]
        )

    def forward(self, x):
        # Excitation: rescale each channel by its gate
        return x * self.se(x)

# Inverted residual block with an optional SE module (MobileNetV3 style)
class InvertedResidualV3(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio, use_se=False, use_hs=False):
        super().__init__()
        hidden_dim = inp * expand_ratio
        self.use_residual = (stride == 1 and inp == oup)
        act_layer = HardSwish if use_hs else nn.ReLU6

        layers = []
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                act_layer(inplace=True)
            ])

        layers.extend([
            nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1,
                      groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            act_layer(inplace=True)
        ])

        if use_se:
            layers.append(SqueezeExcitation(hidden_dim))

        layers.extend([
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            nn.BatchNorm2d(oup)
        ])

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)

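5.4 Architectural differences between MobileNetV3-Large and Small

MobileNetV3 comes in two variants: V3-Large for higher-performance targets and V3-Small for severely resource-constrained ones. They differ in the number of inverted residual blocks, in how often SE modules are used, and in where each activation is used.

Property	MobileNetV3-Large	MobileNetV3-Small
Target	Flagship phones, mid-range compute	Low-end phones, IoT devices
Parameters	5.4M	2.5M
Cost	219M FLOPs	56M FLOPs
ImageNet Top-1	75.2%	67.4%
h-swish usage	First convolution and the deeper half of the network	First convolution and the deeper blocks (the earliest blocks use ReLU)
SE modules	Some bottleneck blocks	Most bottleneck blocks
Input resolution	224×224	224×224

MobileNetV3: key innovations

Two-level search with NAS + NetAdapt: the macro structure is searched first, then channel counts are tuned layer by layer
h-swish activation: close to Swish in accuracy while much cheaper to compute and friendlier to quantization
SE attention: improves classification accuracy (roughly 1-2%) for a small amount of extra compute (roughly 2-3%)
Hardware-aware design: the optimization target is latency measured on a real device rather than FLOPs
V3-Large reaches 75.2% Top-1 at 219M FLOPs, better than V2's 72.0% at 300M FLOPs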

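6. ShuffleNet: Channel Shuffle and Group Convolution

ShuffleNet (Zhang et al., 2018), proposed by Megvii, is another important family of lightweight networks. Its core idea is the channel shuffle operation, which solves the problem that information cannot flow between groups in a group convolution.

6.1 The problem with group convolution

In a group convolution each kernel connects only to the input channels of its own group. When several group convolutions are stacked, information in different groups stays completely isolated, which weakens the representation. ShuffleNet rearranges the channels before the next group convolution so that every group receives channels from all groups of the previous layer.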

# Channel shuffle implementation
import torch

def channel_shuffle(x, groups):
    """
    Channel shuffle.
    x: input tensor of shape [B, C, H, W]
    groups: number of groups
    """
    batch_size, channels, height, width = x.shape
    assert channels % groups == 0, \
        f"channel count ({channels}) must be divisible by groups ({groups})"

    # Reshape to [B, groups, channels/groups, H, W]
    x = x.view(batch_size, groups, channels // groups, height, width)
    # Transpose: swap the groups and channels/groups dimensions
    x = x.transpose(1, 2).contiguous()
    # Flatten back to the original shape
    x = x.view(batch_size, channels, height, width)
    return x

# Demo: channel shuffle with 3 groups
x = torch.arange(12).view(1, 12, 1, 1).float()
print(f"original: {x.view(1, -1)}")
# original channel order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

shuffled = channel_shuffle(x, 3)
print(f"shuffled: {shuffled.view(1, -1)}")
# shuffled channel order: 0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11
# The output takes the 1st channel of every group, then the 2nd of every group, ...
# so information is exchanged across groups

Channel shuffle illustration (groups=3, 12 channels, 4 channels per group; the shuffle acts on the channel dimension):

Input channels:   [00 01 02 03] | [04 05 06 07] | [08 09 10 11]
                     group 1         group 2         group 3
                                Channel Shuffle
Output channels:  [00 04 08 01] | [05 09 02 06] | [10 03 07 11]
                     group 1         group 2         group 3

After the shuffle, each group seen by the next group convolution contains channels from all three input groups, so information crosses group boundaries.

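6.2 Pointwise group convolution

ShuffleNet replaces the standard 1x1 convolution with a pointwise group convolution (a 1x1 convolution with groups) to cut compute. In MobileNet the 1x1 convolutions account for the vast majority of the cost (about 94.86%), so grouping them is the key to further compression. The basic ShuffleNet unit consists of:

1x1 group convolution: grouping the 1x1 convolution reduces its cost
Channel shuffle: placed between the 1x1 group convolution and the 3x3 depthwise convolution to exchange information across groups
3x3 depthwise convolution: per-channel spatial feature extraction
1x1 group convolution: mixes channels again
Shortcut connection (when stride=1): residual learning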
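
The saving from grouping the 1x1 convolutions can be estimated with the per-unit cost formulas given in the ShuffleNet paper, for an input of size c × h × w and m bottleneck channels: hw(2cm + 9m²) for a ResNet bottleneck, hw(2cm + 9m²/g) for ResNeXt, and hw(2cm/g + 9m) for a ShuffleNet unit. The sketch below evaluates them for one example configuration chosen only for illustration; the function names are not from the paper.

```python
def resnet_unit_cost(c, m, h, w):
    """Two dense 1x1 convs plus a dense 3x3 conv on m channels."""
    return h * w * (2 * c * m + 9 * m * m)


def resnext_unit_cost(c, m, h, w, g):
    """Dense 1x1 convs, grouped 3x3 conv (m*m must be divisible by g)."""
    return h * w * (2 * c * m + 9 * m * m // g)


def shufflenet_unit_cost(c, m, h, w, g):
    """Grouped 1x1 convs (c*m must be divisible by g), depthwise 3x3 conv."""
    return h * w * (2 * c * m // g + 9 * m)


# Example: 240 channels, bottleneck of 60 channels, 28x28 feature map, 3 groups
c, m, h, w, g = 240, 60, 28, 28, 3
costs = {
    "ResNet bottleneck": resnet_unit_cost(c, m, h, w),
    "ResNeXt": resnext_unit_cost(c, m, h, w, g),
    "ShuffleNet unit": shufflenet_unit_cost(c, m, h, w, g),
}
for name, value in costs.items():
    print(f"{name:18s}: {value / 1e6:6.2f}M multiply-adds")
```

For a fixed compute budget, the lower per-unit cost is what lets ShuffleNet use wider feature maps than the other two designs.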

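# ShuffleNet (V1) basic unit
# Uses channel_shuffle() defined above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShuffleNetUnit(nn.Module):
    def __init__(self, in_channels, out_channels, stride, groups):
        super().__init__()
        assert stride in (1, 2)
        self.stride = stride
        self.groups = groups

        if stride == 2:
            # Stride-2 unit: the shortcut is average-pooled and concatenated,
            # so the main branch only produces the additional channels
            branch_out = out_channels - in_channels
        else:
            # Stride-1 unit: element-wise addition, so shapes must match
            assert in_channels == out_channels
            branch_out = out_channels

        # Bottleneck width is 1/4 of the output channels
        # (all channel counts must be divisible by `groups`)
        bottleneck_channels = out_channels // 4

        # 1x1 group convolution (GConv)
        self.gconv1 = nn.Conv2d(
            in_channels, bottleneck_channels, 1,
            groups=groups, bias=False
        )
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

        # 3x3 depthwise convolution
        self.dwconv = nn.Conv2d(
            bottleneck_channels, bottleneck_channels, 3,
            stride=stride, padding=1,
            groups=bottleneck_channels, bias=False
        )
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

        # 1x1 group convolution (GConv)
        self.gconv2 = nn.Conv2d(
            bottleneck_channels, branch_out, 1,
            groups=groups, bias=False
        )
        self.bn3 = nn.BatchNorm2d(branch_out)

        # Shortcut: identity for stride 1, 3x3 average pooling for stride 2
        if stride == 2:
            self.shortcut = nn.AvgPool2d(3, stride=2, padding=1)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.shortcut(x)

        out = self.gconv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        # Channel shuffle with the same group count as the group convolutions
        out = channel_shuffle(out, self.groups)

        # No ReLU after the depthwise convolution
        out = self.dwconv(out)
        out = self.bn2(out)

        out = self.gconv2(out)
        out = self.bn3(out)

        if self.stride == 1:
            out = F.relu(out + residual)
        else:
            # Concatenate the pooled shortcut with the main branch
            out = F.relu(torch.cat([residual, out], dim=1))
        return out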

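6.3 ShuffleNet V2: four practical guidelines

ShuffleNet V2 (Ma et al., 2018) starts from directly measured speed and derives four practical guidelines for network design. At the same FLOPs it achieves higher actual running speed.

The four ShuffleNet V2 guidelines

Guideline 1: memory access cost (MAC) is minimal when the input and output channel counts are equal, so the 1x1 convolutions should keep input and output widths the same.
Guideline 2: excessive group convolution increases MAC, so the number of groups should be chosen for the target platform rather than made as large as possible.
Guideline 3: network fragmentation (for example the multi-branch structure of Inception) reduces parallelism, so the number of branches should be kept small.
Guideline 4: element-wise operations (ReLU, shortcut addition, and so on) contribute few FLOPs but still take significant time because of memory access, so they should be reduced.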
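
Guideline 1 can be illustrated with the memory-access model used in the ShuffleNet V2 paper for a 1x1 convolution: FLOPs B = h·w·c1·c2 and MAC = h·w·(c1 + c2) + c1·c2 (reading the input, writing the output, and reading the weights). The sketch below keeps the FLOPs fixed and varies the ratio c1 : c2; the specific channel pairs are illustrative.

```python
import math


def pointwise_flops(h, w, c1, c2):
    return h * w * c1 * c2


def pointwise_mac(h, w, c1, c2):
    """Memory access cost: input + output feature maps + 1x1 weights."""
    return h * w * (c1 + c2) + c1 * c2


h = w = 56
# All pairs have the same product c1 * c2, hence the same FLOPs
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]

flops = [pointwise_flops(h, w, c1, c2) for c1, c2 in pairs]
macs = [pointwise_mac(h, w, c1, c2) for c1, c2 in pairs]

# Lower bound from the paper: MAC >= 2 * sqrt(h*w*B) + B / (h*w)
lower_bound = 2 * math.sqrt(h * w * flops[0]) + flops[0] / (h * w)

for (c1, c2), f, mac in zip(pairs, flops, macs):
    print(f"c1={c1:4d}, c2={c2:4d}: FLOPs={f / 1e6:.1f}M, MAC={mac / 1e6:.3f}M")
print(f"lower bound on MAC: {lower_bound / 1e6:.3f}M")
```

With identical FLOPs, the memory traffic grows as the channel ratio becomes more unbalanced, and the balanced case c1 = c2 meets the lower bound exactly.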

# ShuffleNet V2 basic unit (follows the four guidelines)
# Uses channel_shuffle() defined above; out_channels must be even.
class ShuffleNetV2Block(nn.Module):
    def __init__(self, in_channels, out_channels, stride):
        super().__init__()
        assert stride in (1, 2)
        self.stride = stride

        # Each branch contributes half of the output channels
        branch_channels = out_channels // 2

        if stride == 1:
            # Channel split: the main branch only sees half of the input
            assert in_channels == out_channels
            main_in = branch_channels
        else:
            # Downsampling unit: both branches see the full input
            main_in = in_channels

        # Main branch: 1x1 conv -> 3x3 DWConv -> 1x1 conv
        # (no group convolution, following guideline 2)
        self.main_branch = nn.Sequential(
            nn.Conv2d(main_in, branch_channels, 1, 1, 0, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(branch_channels, branch_channels, 3, stride, 1,
                      groups=branch_channels, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.Conv2d(branch_channels, branch_channels, 1, 1, 0, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
        )

        if stride == 2:
            # Side branch: 3x3 DWConv (stride 2) -> 1x1 conv
            self.side_branch = nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, stride, 1,
                          groups=in_channels, bias=False),
                nn.BatchNorm2d(in_channels),
                nn.Conv2d(in_channels, branch_channels, 1, 1, 0, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            )
            # After concat: branch_channels * 2 = out_channels
        else:
            self.side_branch = nn.Identity()

    def forward(self, x):
        if self.stride == 1:
            # Channel split: one half passes through unchanged
            # (equal input/output width in every 1x1 conv, guideline 1)
            x1, x2 = x.chunk(2, dim=1)
            out = torch.cat([x1, self.main_branch(x2)], dim=1)
        else:
            out = torch.cat([self.side_branch(x), self.main_branch(x)], dim=1)

        # Channel shuffle so the two halves exchange information
        out = channel_shuffle(out, 2)
        return out

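Model	Parameters	Cost (FLOPs)	ImageNet Top-1	Measured speedup (ARM)
ShuffleNet V1 (g=3)	2.4M	140M	67.5%	-
ShuffleNet V2 (1x)	2.3M	146M	69.4%	~1.5x vs V1
ShuffleNet V2 (2x)	7.4M	591M	74.9%	-

ShuffleNet family: summary

Channel shuffle: a parameter-free permutation with no arithmetic (it only moves data in memory) that removes the cross-group isolation of group convolutions
Pointwise group convolution: groups the 1x1 convolutions, which are the most compute-intensive part, for further efficiency
The four V2 guidelines: start from measured speed and correct the misconception that FLOPs are equivalent to speed
Channel split: in V2 with stride=1 the channels are split in half, one half is processed, and the halves are concatenated, which satisfies the MAC-optimal condition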

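7. Overall Comparison of Lightweight Networks

The tables below compare the main lightweight networks by parameters, compute, accuracy, inference latency, and core ideas. The latency column is indicative only: these notes do not state the device or runtime used, so the values should not be compared with measurements from other sources.

7.1 ImageNet classification performance

Model family	Version	Parameters (M)	Cost (MFLOPs)	Top-1 accuracy	Latency (ms)
MobileNetV1	α=1.0	4.2	569	70.6%	~30
MobileNetV2	α=1.0	3.4	300	72.0%	~25
MobileNetV3-Large	-	5.4	219	75.2%	~20
MobileNetV3-Small	-	2.5	56	67.4%	~10
ShuffleNetV1	g=3	2.4	140	67.5%	~22
ShuffleNetV2	1x	2.3	146	69.4%	~15
ShuffleNetV2	2x	7.4	591	74.9%	~35
EfficientNet-Lite0	-	5.3	390	75.1%	~28
EfficientNet-Lite1	-	5.8	574	76.7%	~35
EfficientNet-Lite4	-	9.7	1,909	80.1%	~60
ResNet-50	(reference baseline)	25.6	3,800	76.0%	~45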

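7.2 Core design dimensions

Dimension	MobileNetV1	MobileNetV2	MobileNetV3	ShuffleNetV2
Basic operator	Depthwise separable convolution	Inverted residual + linear bottleneck	Inverted residual + SE + h-swish	Channel split + depthwise convolution + channel shuffle
Activation	ReLU6	ReLU6	ReLU6 / h-swish	ReLU
Shortcut	None	Between bottlenecks	Between bottlenecks	Identity branch via channel split and concat
Architecture search	Manual	Manual	NAS + NetAdapt	Manual, guided by the four guidelines
Accuracy per FLOP	Moderate	Fairly high	Highest	Fairly high
Hardware-aware design	No	No	Yes (measured latency)	Yes (MAC / parallelism)
Quantization friendliness	High	High	Fairly high (h-swish is piecewise linear)	High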

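7.3 Selection guide

Practical deployment advice

Extreme efficiency (<100M FLOPs): MobileNetV3-Small or ShuffleNetV2 0.5x; both keep reasonable accuracy at very low compute
General mobile deployment (200-400M FLOPs): MobileNetV3-Large is a strong default, reaching 75.2% Top-1 at 219M FLOPs
High accuracy (>75% Top-1): the EfficientNet-Lite family has a clear accuracy advantage at higher compute, with Lite4 at 80.1%
Latency-sensitive applications: ShuffleNetV2 is designed around measured speed and MAC, and tends to run fast on ARM devices; benchmark on the actual target before committing
Quantized deployment: MobileNetV2/V3 both have mature TFLite quantization paths, and the piecewise-linear form of h-swish suits int8 quantization better than Swish
Custom hardware (FPGA/ASIC): the regular compute pattern of depthwise separable convolution is easier to accelerate in hardware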

# Load pretrained lightweight models with torchvision
# (the weights= argument needs torchvision >= 0.13; older versions use pretrained=True)
import time

import torch
import torchvision.models as models

# MobileNetV3-Large with pretrained weights
model_v3_large = models.mobilenet_v3_large(weights="DEFAULT")
model_v3_large.eval()

# MobileNetV3-Small
model_v3_small = models.mobilenet_v3_small(weights="DEFAULT")
model_v3_small.eval()

# ShuffleNetV2 1x
model_shufflenet = models.shufflenet_v2_x1_0(weights="DEFAULT")
model_shufflenet.eval()

# Compare parameter counts
models_dict = {
    "MobileNetV3-Large": model_v3_large,
    "MobileNetV3-Small": model_v3_small,
    "ShuffleNetV2-1x": model_shufflenet,
}

for name, model in models_dict.items():
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {params/1e6:.2f}M parameters")

# Inference speed test (CPU, batch size 1)
dummy_input = torch.randn(1, 3, 224, 224)
model = model_v3_large

# Disable autograd so that only the forward pass is timed
with torch.no_grad():
    # Warm-up
    for _ in range(10):
        _ = model(dummy_input)

    # Average over 100 runs
    start = time.perf_counter()
    for _ in range(100):
        _ = model(dummy_input)
    end = time.perf_counter()

avg_ms = (end - start) / 100 * 1000
print(f"MobileNetV3-Large average inference time: {avg_ms:.2f}ms")

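8. Summary and Outlook

Key takeaways

Depthwise separable convolution is the foundation of lightweight networks: it factorizes a standard convolution into depthwise + pointwise steps and cuts the theoretical cost to roughly 1/9 to 1/8
MobileNetV1 was the first to build a large-scale vision network systematically from depthwise separable convolutions, with α and ρ providing flexible scaling
MobileNetV2 introduced inverted residuals and linear bottlenecks; keeping the low-dimensional bottleneck linear preserves the manifold, and accuracy improves while compute is roughly halved
MobileNetV3 combines NAS, h-swish, and SE attention, reaching 75.2% Top-1 at 219M FLOPs
ShuffleNet uses channel shuffle to fix the isolation of group convolutions, and the four guidelines of V2 reframe lightweight design around measured speed
EfficientNet-Lite offers higher accuracy at larger compute budgets (Lite4: 80.1%) and suits accuracy-critical scenarios

Future trends for lightweight networks

Directions

Automated search (NAS): design is moving from manual work toward automated search; MobileNetV3 and EfficientNet show how much NAS can contribute
Lightweight Transformers: MobileViT, EdgeNeXt, and other lightweight Vision Transformers are challenging the dominance of CNNs on mobile
Structural re-parameterization: methods such as RepVGG train with multiple branches and merge them for inference, improving accuracy without extra inference cost
Quantization and pruning: combining lightweight networks with quantization (int8/int4) and structured pruning compresses models further
Hardware-algorithm co-design: jointly optimizing custom NPU operators and network structure to exploit the hardware fully
Knowledge distillation: training a lightweight student under the guidance of a large teacher has become a standard way to raise the accuracy of lightweight networks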

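# Knowledge distillation sketch: teacher-student training
import torch
import torch.nn.functional as F
import torchvision.models as models

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """
    Knowledge distillation loss.
    student_logits: logits of the student network
    teacher_logits: logits of the teacher network (no gradient)
    labels: ground-truth labels
    T: temperature; higher values give softer targets
    alpha: weight of the soft-label loss
    """
    # Hard-label loss (cross-entropy)
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label loss (KL divergence between softened distributions)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss = soft_loss * (T ** 2)  # compensate for the 1/T^2 gradient scaling

    # Total loss
    total_loss = alpha * soft_loss + (1 - alpha) * hard_loss
    return total_loss

# Example: ResNet-50 as the teacher, MobileNetV3-Large as the student
# (the weights= argument needs torchvision >= 0.13; older versions use pretrained=True/False)
teacher = models.resnet50(weights="DEFAULT").eval()
student = models.mobilenet_v3_large(weights=None)

# Per-batch training step (dataloader and optimizer are assumed to exist)
# for images, labels in dataloader:
#     with torch.no_grad():
#         teacher_logits = teacher(images)
#     student_logits = student(images)
#     loss = distillation_loss(student_logits, teacher_logits, labels)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()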