Hyperparameter Tuning and Experiment Tracking - Deep Learning - Study Notes - 上海佼艾

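1. Hyperparameter Tuning Overview

Hyperparameter tuning is a critical stage of the deep learning workflow. Unlike model parameters (weights and biases), which the optimizer learns during training via gradient descent, hyperparameters are configuration values set before training starts. They determine the model architecture, the training behavior, and ultimately the final performance. A good hyperparameter combination can markedly improve accuracy and can even rescue a model that otherwise fails to converge.

Core concepts:

Hyperparameter: a configuration value set before training that controls the learning process and the model architecture
Model parameter: a weight or bias updated automatically by the optimization algorithm during training
Search space: the set of values each hyperparameter is allowed to take
Objective: usually the loss or an evaluation metric measured on the validation set
Trial: one complete train-and-evaluate run for a single hyperparameter combination

Evolution: tuning methods have progressed from manual tuning (intuition and experience) → grid search (brute-force enumeration) → random search → Bayesian optimization (probabilistic modeling) → multi-fidelity methods (early stopping, learning curves) → distributed and automated tuning frameworks. Modern tuning is increasingly model-guided, distributed, and automated.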

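2. Hyperparameter Types and Search Space Definition

2.1 Architecture Hyperparameters

Depth: the number of layers, which affects model capacity
Width: the number of hidden units per layer
Activation function: ReLU, GELU, Swish, LeakyReLU, and so on
Number of attention heads: the head count in Transformer multi-head attention
Kernel size: the size of the convolution window in a CNN
Dropout rate: the fraction of units randomly dropped, which controls regularization strength

2.2 Training Hyperparameters

Learning rate: usually the most important hyperparameter; controls the parameter update step size
Batch size: the number of samples per gradient update
Optimizer: SGD, Adam, AdamW, RMSprop, LAMB, and so on
Momentum: the momentum coefficient of SGD
Weight decay: the L2 regularization coefficient (equivalent to L2 regularization for plain SGD; AdamW applies it as a decoupled decay instead)
Learning rate schedule: Cosine, Step Decay, Warmup, ReduceLROnPlateau
Epochs: the number of full passes over the training set

2.3 Data and Preprocessing Hyperparameters

Data augmentation: parameters of flips, rotations, crops, and color transforms
Sequence length: the maximum truncation length in NLP tasks
Vocabulary size: the upper bound on the tokenizer vocabulary
Sampling strategy: parameters of class-balanced sampling and hard example mining

2.4 Defining a Search Space in Practice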

                # Example search-space definition
                import optuna  # the `trial` argument below is a Trial object supplied by Optuna

                def define_search_space(trial):
                    # Categorical: optimizer choice
                    optimizer_name = trial.suggest_categorical(
                        'optimizer', ['Adam', 'SGD', 'AdamW']
                    )

                    # Continuous: learning rate (log scale)
                    lr = trial.suggest_float(
                        'lr', 1e-5, 1e-1, log=True
                    )

                    # Integer: hidden units per layer
                    n_units = trial.suggest_int(
                        'n_units', 32, 512, step=32
                    )

                    # Integer: number of layers
                    n_layers = trial.suggest_int(
                        'n_layers', 2, 8
                    )

                    # Continuous: dropout rate
                    dropout = trial.suggest_float(
                        'dropout', 0.0, 0.5
                    )

                    # Conditional parameter: momentum is sampled only for SGD
                    momentum = None
                    if optimizer_name == 'SGD':
                        momentum = trial.suggest_float(
                            'momentum', 0.8, 0.99
                        )

                    return {
                        'optimizer': optimizer_name,
                        'lr': lr,
                        'n_units': n_units,
                        'n_layers': n_layers,
                        'dropout': dropout,
                        'momentum': momentum,
                    }
            

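3. Hyperparameter Tuning Methods in Detail

3.1 Grid Search

Grid search is the most basic tuning method: it exhaustively evaluates every combination on a predefined grid of hyperparameter values. It is simple to implement and reproducible, but its cost grows exponentially with the number of hyperparameters (the curse of dimensionality).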

                from sklearn.model_selection import GridSearchCV
                from sklearn.svm import SVC

                param_grid = {
                    'C': [0.1, 1, 10, 100],
                    'gamma': [0.001, 0.01, 0.1, 1],
                    'kernel': ['rbf', 'poly'],
                }

                grid_search = GridSearchCV(
                    SVC(), param_grid,
                    cv=5, scoring='accuracy',
                    n_jobs=-1, verbose=1
                )
                grid_search.fit(X_train, y_train)

                print(f"Best params: {grid_search.best_params_}")
                print(f"Best score: {grid_search.best_score_:.4f}")

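                # Grid size: 4 x 4 x 2 = 32 combinations x 5-fold CV = 160 fits
                # (plus one final refit of the best configuration, since refit=True by default)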
            

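Limitation: with 4 hyperparameters and 10 candidate values each, grid search needs 10^{4} = 10000 training runs. For deep learning models this is usually infeasible. Grid search suits cases with few hyperparameters (≤3) and few candidate values per hyperparameter.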
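The growth described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the hyperparameter names and candidate values in `grid` are made up for illustration.

```python
from itertools import product
from math import prod

# Hypothetical grid: 4 hyperparameters with 10 candidate values each.
grid = {
    "lr": [10 ** (-i / 2) for i in range(2, 12)],
    "dropout": [i / 20 for i in range(10)],
    "n_units": [32 * (i + 1) for i in range(10)],
    "weight_decay": [10 ** (-i) for i in range(1, 11)],
}

def grid_size(space):
    """Number of configurations in the full Cartesian product."""
    return prod(len(values) for values in space.values())

def iter_grid(space):
    """Lazily yield every configuration as a dict."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))

print(grid_size(grid))  # 10000
```

Adding a fifth hyperparameter with 10 values would multiply the count by 10 again, which is why the grid is only practical for a handful of dimensions.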

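3.2 Random Search

Where grid search places points at fixed intervals along each dimension, random search draws each configuration at random from specified distributions. Bergstra & Bengio (2012) showed that random search outperforms grid search in most practical settings: it covers more of the hyperparameter space with fewer trials and does not waste compute on dimensions the objective is insensitive to.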

                from sklearn.model_selection import RandomizedSearchCV
                from scipy.stats import uniform, loguniform, randint

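                # `model` is assumed to be a scikit-learn-compatible estimator
                # (e.g. a wrapped neural network) that exposes these parameters
                param_dist = {
                    'lr': loguniform(1e-4, 1e-1),  # log-uniform distribution
                    'batch_size': randint(16, 256),
                    'dropout': uniform(0, 0.5),
                    'n_units': randint(32, 512),
                }

                random_search = RandomizedSearchCV(
                    model, param_dist,
                    n_iter=50,  # only 50 random samples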
                    cv=3, scoring='neg_log_loss',
                    n_jobs=-1, random_state=42
                )
                random_search.fit(X_train, y_train)
            

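Key insight: not all hyperparameters matter equally. Some (such as the learning rate) strongly affect the result, while others (such as momentum) may matter much less. With a fixed budget of N trials, a grid tries only a few distinct values of each hyperparameter, whereas random search tries N distinct values of every hyperparameter, so the important dimensions are explored far more densely.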
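The following sketch illustrates the point on a two-dimensional unit square with a budget of 25 trials. It is only an illustration with made-up helper names (`grid_points`, `random_points`, `distinct_values`); no real training is involved.

```python
import random

def grid_points(n_per_dim):
    """An n_per_dim x n_per_dim grid on the unit square."""
    axis = [(i + 0.5) / n_per_dim for i in range(n_per_dim)]
    return [(x, y) for x in axis for y in axis]

def random_points(n_total, seed=0):
    """n_total points drawn uniformly at random from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_total)]

def distinct_values(points, dim):
    """How many different values were tried along one dimension."""
    return len({p[dim] for p in points})

grid = grid_points(5)      # 25 trials
rand = random_points(25)   # 25 trials

print(distinct_values(grid, 0))  # 5
print(distinct_values(rand, 0))
```

If only the first dimension matters, the grid has effectively evaluated 5 settings of it, while random search has evaluated as many settings as it ran trials.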

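3.3 Bayesian Optimization

Bayesian optimization is currently the most widely used sequential, model-based tuning approach. It fits a probabilistic surrogate model to approximate the objective function, then uses an acquisition function to choose the next point to evaluate, balancing exploration against exploitation.

Gaussian Process (GP)

The GP is the classic surrogate model. It assumes that any finite set of function values follows a joint Gaussian distribution, and it provides both a predictive mean and an uncertainty estimate (variance). GPs work well with few data points. Their drawbacks are an O(n³) cost in the number of observations n, which makes them unsuitable for large trial counts, and degraded performance in high-dimensional spaces (roughly >20 dimensions).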

                from sklearn.gaussian_process import GaussianProcessRegressor
                from sklearn.gaussian_process.kernels import Matern, RBF, ConstantKernel
                import numpy as np

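                # Define the kernel
                kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5)

                # Gaussian process regression
                gp = GaussianProcessRegressor(
                    kernel=kernel,
                    n_restarts_optimizer=10,
                    alpha=1e-6,
                    normalize_y=True
                )

                # Fit on the observations collected so far
                # (X_observed, y_observed and X_candidates are assumed to be NumPy arrays defined elsewhere)
                gp.fit(X_observed, y_observed)

                # Predict the mean and standard deviation at candidate points
                mean, std = gp.predict(X_candidates, return_std=True)

                # Acquisition function: Expected Improvement (EI), maximization form
                def expected_improvement(mean, std, y_best, xi=0.01):
                    from scipy.stats import norm
                    std = np.maximum(std, 1e-12)  # avoid division by zero at already observed points
                    delta = mean - y_best - xi
                    Z = delta / std
                    return delta * norm.cdf(Z) + std * norm.pdf(Z)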

                ei = expected_improvement(mean, std, np.max(y_observed))
                best_idx = np.argmax(ei)
                next_x = X_candidates[best_idx]
            

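TPE (Tree-structured Parzen Estimator)

TPE is another form of Bayesian optimization, proposed by Bergstra et al. Whereas a GP models P(score | params) directly, TPE models P(params | score) using two density functions:

l(x): the density of hyperparameter configurations that performed well
g(x): the density of the remaining (worse) configurations

The acquisition function is the ratio l(x)/g(x); maximizing it yields promising new points to sample. TPE natively supports conditional parameters and tree-structured search spaces, and it tends to do better than GPs in high-dimensional spaces and with many categorical parameters.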
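A one-dimensional toy version of this idea can be written with the standard library alone. The sketch below is not Optuna's or Hyperopt's implementation: the fixed-bandwidth Gaussian kernel density estimate, the `gamma` split fraction, and the function names `kde` and `tpe_suggest` are simplifying assumptions made for illustration.

```python
import math
import random

def kde(points, bandwidth):
    """Gaussian kernel density estimate built from 1-D sample points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density

def tpe_suggest(history, gamma=0.25, n_candidates=24, bandwidth=0.1, seed=0):
    """history: list of (x, loss) pairs with at least two entries; lower loss is better."""
    rng = random.Random(seed)
    ranked = sorted(history, key=lambda pair: pair[1])
    # Split observations into the best gamma fraction and the rest.
    n_good = min(max(1, int(gamma * len(ranked))), len(ranked) - 1)
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from l(x): pick a good point and add kernel noise.
    candidates = [rng.choice(good) + rng.gauss(0.0, bandwidth) for _ in range(n_candidates)]
    # Return the candidate with the largest density ratio l(x) / g(x).
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))

# Toy objective with its minimum at x = 0.3, observed on an even grid.
history = [(i / 40, (i / 40 - 0.3) ** 2) for i in range(40)]
next_x = tpe_suggest(history)
print(round(next_x, 3))
```

The suggested point lands near the cluster of good observations around 0.3, where l(x) is high and g(x) is low. Real implementations additionally handle categorical and conditional parameters and choose bandwidths adaptively.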

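GP strengths

Solid theoretical foundation and well-calibrated uncertainty estimates
Performs well with few samples
A wide choice of mature acquisition functions

TPE strengths

Handles high-dimensional search spaces (>20 dimensions)
Native support for conditional hyperparameters
Lower computational overhead and better scalability
Better handling of categorical parameters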

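3.4 HyperBand and Early Stopping

HyperBand is a multi-fidelity tuning method. The core idea: rather than training every configuration for the full number of epochs, evaluate many configurations with a small budget first, discard the clearly worse ones, and concentrate resources on the promising ones. HyperBand builds on the Successive Halving algorithm: give every configuration a small budget, discard the worst ones (the worst half when eta = 2), multiply the budget of the survivors by eta, and repeat.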

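                # Successive Halving, the building block of HyperBand (illustrative sketch)
                import numpy as np

                def successive_halving(configs, budget_per_round, eta=2):
                    """
                    eta: reduction factor (default 2: keep the top half each round)
                    budget_per_round: budget given to each configuration in the current round
                    train_and_eval(cfg, budget) is assumed to be defined elsewhere
                    and to return a score where higher is better
                    """
                    n_configs = len(configs)

                    while n_configs > 1:
                        # Run every surviving configuration for budget_per_round steps
                        scores = []
                        for cfg in configs:
                            score = train_and_eval(cfg, budget_per_round)
                            scores.append(score)

                        # Keep the top 1/eta of the configurations
                        n_keep = max(1, n_configs // eta)
                        top_indices = np.argsort(scores)[-n_keep:]
                        configs = [configs[i] for i in top_indices]

                        # Increase the per-configuration budget
                        budget_per_round *= eta
                        n_configs = len(configs)

                    return configs[0]

                # HyperBand runs several brackets of Successive Halving, each starting from a
                # different initial budget, which balances exploration (many configurations)
                # against exploitation (long training)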
            

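3.5 ASHA (Asynchronous Successive Halving Algorithm)

ASHA is the asynchronous version of Successive Halving and is commonly used as an alternative to HyperBand. It addresses the resource waste of synchronous Successive Halving, where the next round cannot start until the slowest job has finished. ASHA schedules asynchronously: whenever a worker becomes free, it either starts a new trial or promotes an existing trial to the next rung. This lets ASHA scale very well in large distributed settings.

Key advantages of ASHA:

Asynchronous execution: no waiting for a whole round to finish, so worker utilization stays close to 100%
Near-linear speedup: tuning throughput grows roughly linearly as compute is added
Early elimination: a few epochs are enough to judge most configurations
Best suited to: large-scale distributed tuning, long training runs, ample compute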
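The scheduling decision ASHA makes for a free worker can be sketched as follows. This is a simplified illustration of the promotion rule only, not Ray Tune's scheduler; the function name `asha_next_job` and the `rungs` / `promoted` data structures are assumptions chosen for the example.

```python
def asha_next_job(rungs, promoted, eta=3):
    """Decide what a free worker should do next.

    rungs[k]    : list of (trial_id, score) results recorded at rung k (higher is better)
    promoted[k] : set of trial_ids already promoted out of rung k
    Returns ("promote", trial_id, k + 1) or ("new", None, 0).
    """
    # Scan from the highest rung that can still promote down to rung 0.
    for k in reversed(range(len(rungs) - 1)):
        results = sorted(rungs[k], key=lambda r: r[1], reverse=True)
        n_top = len(results) // eta  # size of the top 1/eta slice so far
        for trial_id, _ in results[:n_top]:
            if trial_id not in promoted[k]:
                promoted[k].add(trial_id)
                return ("promote", trial_id, k + 1)
    # Nothing is promotable yet: start a fresh configuration at the lowest rung.
    return ("new", None, 0)

rungs = [[("a", 0.6), ("b", 0.9), ("c", 0.7)], [], []]
promoted = [set(), set(), set()]
print(asha_next_job(rungs, promoted))  # ('promote', 'b', 1)
print(asha_next_job(rungs, promoted))  # ('new', None, 0)
```

Because the decision uses only the results recorded so far, no worker ever waits for a rung to fill up, which is the source of the high utilization mentioned above.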

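3.6 PBT (Population Based Training)

PBT was proposed by DeepMind and is inspired by genetic algorithms. It maintains a population of models trained in parallel and periodically performs two steps during training:

Exploit: replace the weights and hyperparameters of poorly performing models with those of well performing models
Explore: apply random perturbations to the hyperparameters to produce new configurations

The advantage of PBT is that hyperparameters can change during training (for example, lowering the learning rate over time) instead of staying fixed.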

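                # PBT core logic (pseudocode; the member methods are illustrative)
                import numpy as np

                def pbt_step(population, step):
                    for member in population:
                        member.train()                               # train for one interval

                    # Sort by validation score, best first
                    population.sort(key=lambda m: m.valid_score, reverse=True)

                    # Exploit: the bottom 20% copy weights and hyperparameters from the top 20%
                    n_replace = int(0.2 * len(population))
                    for i in range(n_replace):
                        donor = population[np.random.randint(0, n_replace)]
                        loser = population[-i-1]
                        loser.copy_weights_from(donor)
                        loser.copy_hyperparams_from(donor)  # hypothetical helper

                        # Explore: perturb the hyperparameters that were just copied
                        loser.perturb_hyperparams(
                            perturb_factor=1.2,
                            resample_prob=0.2
                        )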
            

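3.7 Method Comparison

Method	Search efficiency	Parallelism	High-dimensional support	Typical use
Grid Search	Low	High	Poor	Few hyperparameters (≤3), exhaustive verification
Random Search	Medium	High	Fairly good	Initial exploration of the search space
Bayesian (GP)	High	Low	Moderate	Low-dimensional continuous parameters, limited budget
TPE	High	Medium	Good	General purpose, mixed parameter types
HyperBand/ASHA	High	Very high	Good	Large-scale distributed tuning
PBT	Very high	High	Fairly good	Adapting hyperparameters during training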

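4. The Optuna Tuning Framework

Optuna is a hyperparameter optimization framework developed by Preferred Networks (Japan), popular for its concise API and rich feature set. It uses a define-by-run API: the search space is built dynamically inside the objective function as each trial executes. It supports multiple samplers and pruners and ships with a range of visualization tools.

4.1 Core Concepts

Study: an optimization session made up of a set of trials; it manages their persistence and scheduling
Trial: a single train-and-evaluate run for one hyperparameter combination
Sampler: the sampling strategy that decides how the next set of hyperparameters is generated (TPESampler, GridSampler, RandomSampler, CmaEsSampler)
Pruner: stops unpromising trials partway through (MedianPruner, SuccessiveHalvingPruner, HyperbandPruner)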
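To make the pruner idea concrete, here is a simplified median stopping rule in plain Python. It is a sketch in the spirit of a median-based pruner, not Optuna's actual `MedianPruner` implementation; the function name and the `finished_curves` structure are assumptions for illustration.

```python
from statistics import median

def should_prune(current_value, step, finished_curves,
                 n_startup_trials=5, n_warmup_steps=0):
    """Median stopping rule for a loss that is being minimized.

    finished_curves: one list of intermediate losses per earlier trial, indexed by step.
    """
    if step < n_warmup_steps:
        return False  # never prune during warm-up
    peers = [curve[step] for curve in finished_curves if len(curve) > step]
    if len(peers) < n_startup_trials:
        return False  # not enough history to compare against
    # Prune when the trial is worse than the median of its peers at this step.
    return current_value > median(peers)

curves = [[1.0, 0.8], [0.9, 0.7], [1.1, 0.9]]
print(should_prune(0.95, 1, curves, n_startup_trials=3))  # True
print(should_prune(0.75, 1, curves, n_startup_trials=3))  # False
```

The two guard parameters mirror the intent of `n_startup_trials` and `n_warmup_steps`: they delay pruning until there is enough evidence, which reduces the risk of discarding slow starters.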

4.2 Complete Optuna Tuning Example

                import optuna
                import torch
                import torch.nn as nn
                import torch.optim as optim
                from torch.utils.data import DataLoader, TensorDataset

                # 1. Model definition
                class MLP(nn.Module):
                    def __init__(self, n_units, n_layers, dropout):
                        super().__init__()
                        layers = [nn.Flatten()]  # flatten 1x28x28 images to 784 features
                        in_dim = 28*28
                        for _ in range(n_layers):
                            layers.append(nn.Linear(in_dim, n_units))
                            layers.append(nn.ReLU())
                            layers.append(nn.Dropout(dropout))
                            in_dim = n_units
                        layers.append(nn.Linear(in_dim, 10))
                        self.net = nn.Sequential(*layers)

                    def forward(self, x):
                        return self.net(x)

                # 2. Objective function
                # (train_dataset and valid_loader are assumed to be defined elsewhere)
                def objective(trial):
                    # --- Sample hyperparameters ---
                    n_units = trial.suggest_int('n_units', 32, 512, step=32)
                    n_layers = trial.suggest_int('n_layers', 1, 5)
                    dropout = trial.suggest_float('dropout', 0.0, 0.5)
                    lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)
                    optimizer_name = trial.suggest_categorical(
                        'optimizer', ['Adam', 'SGD']
                    )
                    batch_size = trial.suggest_int('batch_size', 32, 256, step=32)

                    # --- Build the model ---
                    model = MLP(n_units, n_layers, dropout)
                    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                    model.to(device)

                    if optimizer_name == 'Adam':
                        optimizer = optim.Adam(model.parameters(), lr=lr)
                    else:
                        momentum = trial.suggest_float('momentum', 0.8, 0.99)
                        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

                    criterion = nn.CrossEntropyLoss()

                    # --- Training loop ---
                    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
                    for epoch in range(20):
                        model.train()
                        for batch_x, batch_y in loader:
                            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                            optimizer.zero_grad()
                            loss = criterion(model(batch_x), batch_y)
                            loss.backward()
                            optimizer.step()

                        # --- Intermediate evaluation (enables pruning) ---
                        model.eval()
                        val_loss = 0.0
                        with torch.no_grad():
                            for val_x, val_y in valid_loader:
                                val_x, val_y = val_x.to(device), val_y.to(device)
                                val_loss += criterion(model(val_x), val_y).item()
                        val_loss /= len(valid_loader)  # mean loss per batch

                        # Report the intermediate value and check whether to prune
                        trial.report(val_loss, epoch)
                        if trial.should_prune():
                            raise optuna.exceptions.TrialPruned()

                    return val_loss

                # 3. Create the Study and run the search
                study = optuna.create_study(
                    study_name='mnist_mlp_tuning',
                    storage='sqlite:///optuna_study.db',  # persist to SQLite
                    direction='minimize',
                    sampler=optuna.samplers.TPESampler(
                        n_startup_trials=10,
                        n_ei_candidates=24
                    ),
                    pruner=optuna.pruners.HyperbandPruner(
                        min_resource=1, max_resource=20,
                        reduction_factor=3
                    ),
                    load_if_exists=True,
                )

                # 4. Run 200 trials
                study.optimize(objective, n_trials=200)

                print("Best trial:")
                print(f"  Value: {study.best_value:.4f}")
                print("  Params:")
                for key, value in study.best_params.items():
                    print(f"    {key}: {value}")

                # 5. Inspect the distribution of trial states
                from collections import Counter
                states = Counter([t.state for t in study.trials])
                print(f"Trial states: {states}")
                # TrialState values: COMPLETE, PRUNED, FAIL, WAITING, RUNNING
            

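4.3 Overview of the suggest_* API

Method	Parameter type	Example
suggest_categorical	Categorical	suggest_categorical('act', ['relu','gelu'])
suggest_int	Integer	suggest_int('units', 32, 512, step=32)
suggest_float	Float	suggest_float('lr', 1e-4, 1e-1, log=True)
suggest_uniform	Uniform float (deprecated since Optuna v3.0; use suggest_float)	suggest_uniform('dropout', 0, 0.5)
suggest_loguniform	Log-uniform float (deprecated since Optuna v3.0; use suggest_float with log=True)	suggest_loguniform('lr', 1e-5, 1e-1)
suggest_discrete_uniform	Discretized uniform float (deprecated since Optuna v3.0; use suggest_float with step)	suggest_discrete_uniform('lr', 1e-4, 1e-1, 1e-4)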

4.4 Optuna Visualization

                from optuna.visualization import (
                    plot_optimization_history,
                    plot_parallel_coordinate,
                    plot_slice,
                    plot_contour,
                    plot_param_importances,
                    plot_edf,
                )

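                # Optimization history: inspect convergence
                fig1 = plot_optimization_history(study)
                fig1.show()

                # Parallel coordinates: relationship between parameters and the objective
                fig2 = plot_parallel_coordinate(study)
                fig2.show()

                # Slice plot: a single parameter vs. the objective value
                fig3 = plot_slice(study, params=['lr', 'n_units', 'dropout'])
                fig3.show()

                # Contour plot: interaction between two parameters
                fig4 = plot_contour(study, params=['lr', 'n_units'])
                fig4.show()

                # Parameter importances: estimate which hyperparameters matter most
                fig5 = plot_param_importances(study)
                fig5.show()

                # Empirical distribution function of objective values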
                fig6 = plot_edf(study)
                fig6.show()
            

4.5 Optuna Dashboard and Distributed Optimization

                # Install Optuna Dashboard
                pip install optuna-dashboard

                # Launch the web dashboard (reads from the storage backend)
                optuna-dashboard sqlite:///optuna_study.db

                # Or use MySQL/PostgreSQL as the backend (better suited to concurrent multi-process access)
                optuna-dashboard postgresql://user:pass@host/dbname
            

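                # Distributed optimization example (multiple processes or machines)
                # All workers only need to share the same storage backend

                # Process 1 (or machine 1)
                study = optuna.create_study(
                    study_name='distributed_tuning',
                    storage='postgresql://user:pass@host:5432/optuna',
                    direction='minimize',
                    load_if_exists=True,
                )
                study.optimize(objective, n_trials=100)  # this worker's share

                # Process 2 (or machine 2) runs the same code
                study = optuna.create_study(
                    study_name='distributed_tuning',
                    storage='postgresql://user:pass@host:5432/optuna',
                    direction='minimize',
                    load_if_exists=True,
                )
                study.optimize(objective, n_trials=100)  # this worker's share
                # Both processes read and write the same trial history through the storage,
                # so each sampler sees the other's results (200 trials in total here)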
            

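5. Distributed Tuning with Ray Tune

Ray Tune is a hyperparameter tuning library built on the Ray distributed computing framework and designed for large-scale distributed tuning. It natively supports a range of tuning algorithms, distributed execution, GPU resource management, and experiment monitoring.

5.1 Ray Tune Core Concepts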

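                from ray import air, tune  # newer Ray releases also expose RunConfig as tune.RunConfig
                from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
                from ray.tune.search.optuna import OptunaSearch
                from ray.tune import CLIReporter
                import ray

                # Initialize Ray locally. To join an existing cluster, call ray.init(address='auto')
                # instead (num_cpus/num_gpus cannot be passed when connecting to a cluster)
                ray.init(num_cpus=16, num_gpus=4)

                # Training function (receives a config dict)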
                def train_mnist(config):
                    import torch
                    import torch.nn as nn
                    import torch.optim as optim

                    model = MLP(
                        n_units=config['n_units'],
                        n_layers=config['n_layers'],
                        dropout=config['dropout']
                    )
                    optimizer = optim.Adam(model.parameters(), lr=config['lr'])
                    criterion = nn.CrossEntropyLoss()

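                    for epoch in range(20):
                        # ... training code ...
                        # train_one_epoch and evaluate are assumed helpers defined elsewhere
                        loss = train_one_epoch(model, optimizer, criterion)
                        accuracy = evaluate(model)

                        # Report per-epoch metrics, including the ones used as the tuning
                        # metric and time_attr. The reporting call is version-dependent:
                        # older Ray releases use keyword arguments or session.report(dict)
                        tune.report({
                            'loss': loss,
                            'accuracy': accuracy,
                            'epoch': epoch,
                        })

                # Search space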
                config = {
                    'n_units': tune.randint(32, 512),
                    'n_layers': tune.choice([2, 3, 4, 5]),
                    'dropout': tune.uniform(0.0, 0.5),
                    'lr': tune.loguniform(1e-4, 1e-2),
                    'batch_size': tune.choice([32, 64, 128, 256]),
                }

                # ASHA scheduler (asynchronous early stopping)
                scheduler = ASHAScheduler(
                    time_attr='epoch',
                    max_t=20,
                    grace_period=3,  # train for at least 3 epochs
                    reduction_factor=3,  # keep the top 1/3 at each rung
                )

                # Progress reporter (CLI output)
                reporter = CLIReporter(
                    metric_columns=['loss', 'accuracy', 'training_iteration'],
                    max_report_frequency=10,
                )

                # Run the search
                tuner = tune.Tuner(
                    tune.with_resources(
                        train_mnist,
                        resources={'cpu': 2, 'gpu': 0.5},
                    ),
                    tune_config=tune.TuneConfig(
                        metric='loss',  # the scheduler and the searcher need a metric and a mode
                        mode='min',
                        scheduler=scheduler,
                        search_alg=OptunaSearch(),
                        num_samples=100,  # 100 configurations in total
                        max_concurrent_trials=8,  # at most 8 running at once
                    ),
                    param_space=config,
                    run_config=air.RunConfig(
                        storage_path='/tmp/ray_results',
                        name='mnist_tuning',
                        progress_reporter=reporter,  # a reporter is not a callback
                    ),
                )

                results = tuner.fit()

                # Retrieve the best result
                best_result = results.get_best_result(metric='accuracy', mode='max')
                print(f"Best config: {best_result.config}")
                print(f"Best accuracy: {best_result.metrics['accuracy']:.4f}")
            

5.2 Ray Tune and TensorBoard Integration

                from ray import air, tune
                from ray.tune.logger import TBXLoggerCallback

                # Tune writes TensorBoard event files by default; the callback can
                # also be added explicitly in RunConfig
                run_config = air.RunConfig(
                    storage_path='/tmp/ray_results',
                    name='tb_mnist',
                    callbacks=[TBXLoggerCallback()],
                )

                tuner = tune.Tuner(train_mnist, tune_config=..., param_space=..., run_config=run_config)
                results = tuner.fit()

                # Launch TensorBoard to view the results
                # tensorboard --logdir /tmp/ray_results/tb_mnist
            

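Key advantages of Ray Tune:

Natively distributed: built on Ray, so it scales horizontally across nodes and GPUs
Broad integrations: supports search algorithms from Optuna, HyperOpt, BayesOpt, BOHB, and others
Resource-aware: fine-grained resource allocation (CPU/GPU/memory) with automatic bin-packing
Fault tolerance: built-in trial-level fault tolerance; interrupted experiments can be resumed
Visualization ecosystem: integrates with TensorBoard, WandB, and MLflow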

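6. Experiment Tracking with MLflow

MLflow is an open-source machine learning lifecycle management tool created by Databricks. Its Tracking component provides experiment tracking: during hyperparameter tuning it can systematically record each trial's hyperparameters, evaluation metrics, model artifacts, and source code version.

6.1 MLflow Tracking Server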

                # Install MLflow
                pip install mlflow

                # Start the Tracking Server (with UI)
                mlflow server \
                    --host 0.0.0.0 \
                    --port 5000 \
                    --backend-store-uri sqlite:///mlflow.db \
                    --default-artifact-root ./mlflow-artifacts

                # Set the tracking URI for client code
                # export MLFLOW_TRACKING_URI=http://localhost:5000
            

6.2 MLflow and Optuna Integration

                import mlflow
                import mlflow.pytorch
                import optuna
                import torch
                from mlflow.models import infer_signature

                # Set the MLflow tracking URI
                mlflow.set_tracking_uri('http://localhost:5000')

                # Set the experiment (similar to a project)
                mlflow.set_experiment('hyperparameter_tuning')

                def objective_with_mlflow(trial):
                    # Start one MLflow run per trial
                    with mlflow.start_run(run_name=f'trial_{trial.number}'):
                        # --- Sample hyperparameters ---
                        params = {
                            'lr': trial.suggest_float('lr', 1e-4, 1e-2, log=True),
                            'n_units': trial.suggest_int('n_units', 32, 512),
                            'dropout': trial.suggest_float('dropout', 0.0, 0.5),
                            'optimizer': trial.suggest_categorical(
                                'optimizer', ['Adam', 'AdamW']
                            ),
                        }

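                        # Log hyperparameters (params)
                        mlflow.log_params(params)

                        # Log tags
                        mlflow.set_tag('trial_number', trial.number)
                        mlflow.set_tag('model_type', 'MLP')

                        # --- Train the model ---
                        # train_and_eval is an assumed helper returning (accuracy, val_loss)
                        model = MLP(params['n_units'], 3, params['dropout'])
                        accuracy, val_loss = train_and_eval(model, params)

                        # Log metrics
                        mlflow.log_metric('accuracy', accuracy)
                        mlflow.log_metric('val_loss', val_loss)

                        # Log artifacts (the image files are assumed to have been
                        # written by the training code)
                        mlflow.log_artifact('confusion_matrix.png')
                        mlflow.log_artifact('learning_curve.png')

                        # Log the PyTorch model; infer_signature expects NumPy or pandas data
                        model.eval()  # disable dropout for the example forward pass
                        model.cpu()
                        example_input = torch.randn(1, 784)
                        with torch.no_grad():
                            output = model(example_input)
                        signature = infer_signature(example_input.numpy(), output.numpy())
                        mlflow.pytorch.log_model(
                            model, 'model',
                            signature=signature,
                            # registering here creates a new model version for every trial
                            registered_model_name='MNIST_MLP',
                        )

                    return accuracy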

                # Run the search (all trials are logged to the same experiment)
                study = optuna.create_study(direction='maximize')
                study.optimize(objective_with_mlflow, n_trials=50)

                # The hyperparameters and metrics of all trials can then be compared in the MLflow UI
            

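6.3 MLflow Model Registry and Model Deployment

                import mlflow
                import mlflow.pytorch
                from mlflow.tracking import MlflowClient

                client = MlflowClient()

                # Passing registered_model_name to log_model registers a new model
                # version and creates the Model Registry entry automatically

                # Stage transition (stages are deprecated in recent MLflow releases
                # in favor of model version aliases)
                client.transition_model_version_stage(
                    name='MNIST_MLP',
                    version=3,
                    stage='Production',
                )

                # Load a model from the Model Registry
                model = mlflow.pytorch.load_model(
                    'models:/MNIST_MLP/Production'
                )

                # Or load by run ID (run_id is assumed to hold the ID of an existing run)
                model = mlflow.pytorch.load_model(
                    f'runs:/{run_id}/model'
                )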
            

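6.4 Comparing Experiments in the MLflow UI

With the MLflow UI running (at http://localhost:5000), select several runs on the Experiments page to compare them. The main features are:

Parallel coordinates plot: visualizes the relationship between hyperparameters and metrics
Scatter plot: shows how a chosen hyperparameter relates to a chosen metric across runs
Metric comparison table: shows metrics and parameters of all selected runs side by side
Time series: shows how metrics change over epochs during training
Artifact preview: view images, text, JSON, and other artifacts directly
Model registration: register a model from the UI; deployment to a serving endpoint depends on the hosting platform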

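7. Weights & Biases (WandB)

Weights & Biases is one of the most popular experiment tracking and visualization platforms. It provides cloud dashboards, automatic chart generation, and team collaboration features. Compared with MLflow, WandB puts more emphasis on out-of-the-box visualization and collaboration.

7.1 Basic WandB Setup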

                import wandb

                # Log in (an API key is required the first time)
                # wandb login

                # Initialize a run (wandb.init)
                run = wandb.init(
                    project='hyperparameter-tuning',
                    name='trial-001',
                    config={
                        'learning_rate': 1e-3,
                        'batch_size': 64,
                        'n_units': 256,
                        'dropout': 0.2,
                        'optimizer': 'Adam',
                        'epochs': 30,
                        'dataset': 'MNIST',
                    },
                    tags=['baseline', 'v1'],
                    notes='Initial baseline run with default config',
                )

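                # Log metrics (wandb.log)
                # train_one_epoch and current_lr are assumed to come from the training code
                for epoch in range(30):
                    train_loss, val_loss, val_acc = train_one_epoch()

                    wandb.log({
                        'epoch': epoch,
                        'train/loss': train_loss,
                        'val/loss': val_loss,
                        'val/accuracy': val_acc,
                        'learning_rate': current_lr,
                    })

                # Log images and other media
                # (img_grid, y_true and y_pred are assumed to be produced during evaluation)
                wandb.log({
                    'predictions': wandb.Image(img_grid),
                    'confusion_matrix': wandb.plot.confusion_matrix(
                        y_true=y_true, preds=y_pred
                    ),
                })

                # Save the model as an Artifact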
                artifact = wandb.Artifact(
                    name='mnist_model',
                    type='model',
                    description='Trained MLP on MNIST',
                    metadata={'accuracy': val_acc},
                )
                artifact.add_file('model.pth')
                wandb.log_artifact(artifact)

                # Finish the run
                wandb.finish()
            

7.2 WandB Sweeps (Hyperparameter Search)

                # Sweep configuration (YAML or a Python dict)
                sweep_config = {
                    'method': 'bayes',  # grid / random / bayes
                    'metric': {
                        'name': 'val/accuracy',
                        'goal': 'maximize',
                    },
                    'parameters': {
                        'learning_rate': {
                            'distribution': 'log_uniform_values',
                            'min': 1e-5,
                            'max': 1e-1,
                        },
                        'batch_size': {
                            'values': [16, 32, 64, 128, 256]
                        },
                        'n_units': {
                            'distribution': 'int_uniform',
                            'min': 32,
                            'max': 512,
                        },
                        'dropout': {
                            'values': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
                        },
                        'optimizer': {
                            'values': ['adam', 'adamw', 'sgd']
                        },
                    },
                    'early_terminate': {
                        'type': 'hyperband',
                        'min_iter': 3,
                        'eta': 2,
                    },
                }

                # Training function
                # (build_model, get_optimizer and train_epoch are assumed helpers defined elsewhere)
                def sweep_train():
                    with wandb.init() as run:
                        config = wandb.config

                        # Read hyperparameters from config
                        model = build_model(config.n_units, config.dropout)
                        optimizer = get_optimizer(config.optimizer, config.learning_rate)

                        for epoch in range(30):
                            metrics = train_epoch(model, optimizer, config.batch_size)
                            wandb.log(metrics)

                            # Early termination is handled by the sweep (early_terminate above)

                # Create the sweep
                sweep_id = wandb.sweep(sweep_config, project='hp-tuning')

                # Start a sweep agent (several can run in parallel)
                wandb.agent(sweep_id, function=sweep_train, count=100)

                # Or run agents from several terminals at once (distributed)
                # wandb agent USER/PROJECT/SWEEP_ID
            

7.3 WandB Artifacts and Dataset Versioning

                # Dataset versioning
                data_artifact = wandb.Artifact(
                    name='mnist_processed',
                    type='dataset',
                    description='MNIST with augmentation v2',
                )
                data_artifact.add_dir('./data/processed')
                data_artifact.add_reference('gs://bucket/mnist.zip')
                wandb.log_artifact(data_artifact)

                # Use a specific version of the dataset (requires an active run)
                artifact = run.use_artifact('mnist_processed:v2')
                data_dir = artifact.download()
            

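WandB vs. MLflow:

WandB: cloud-first, strong visualization, strong team collaboration, automatic report generation, open-source client library plus hosted SaaS
MLflow: fully open source, easy to self-host, mature Model Registry, deep Databricks integration, more focused on lifecycle management
Recommendation: WandB for individuals and small teams iterating quickly; MLflow for enterprise deployments with compliance requirements; the two can also be used together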

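8. Best Practices and Production Advice

8.1 Choosing a Tuning Strategy

Scenario	Recommended method	Rationale
First exploration (< 10 trials)	Random Search over a wide space	Quickly reveals how sensitive the objective is to each parameter
Fine-tuning (50-200 trials)	TPE (Optuna) or GP	Uses trial history to choose the next point
Large-scale distributed (>1000 trials)	ASHA (Ray Tune)	Asynchronous scheduling, near-linear scaling, efficient early stopping
Dynamic hyperparameters	PBT	Adapts hyperparameters during training
Multi-objective optimization	NSGA-II / MOTPE	Optimizes accuracy, latency, and model size together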
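For the multi-objective row, the result of a search is a set of non-dominated trade-offs (a Pareto front) rather than a single best trial. The sketch below shows only that filtering step, not NSGA-II or MOTPE themselves; the trial names and objective values are made up.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    Objectives are tuples in which every component is minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(trials):
    """trials: dict mapping a trial name to its objective tuple."""
    return [
        name
        for name, obj in trials.items()
        if not any(
            dominates(other, obj)
            for other_name, other in trials.items()
            if other_name != name
        )
    ]

# (error rate, latency in ms), both minimized; illustrative numbers only
trials = {"A": (0.02, 30.0), "B": (0.03, 12.0), "C": (0.03, 35.0), "D": (0.05, 10.0)}
print(pareto_front(trials))  # ['A', 'B', 'D']
```

Trial C is dropped because A is both more accurate and faster; A, B, and D each win on at least one objective, so the final choice among them is a product decision.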

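8.2 Practical Tips

The learning rate is the most critical hyperparameter

Many studies indicate that the learning rate often affects final model performance more than any other hyperparameter. Tune the learning rate first; once the right order of magnitude is known, refine the other parameters. A schedule such as cosine annealing or ReduceLROnPlateau can then adjust it automatically during training.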
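As a reference for the cosine annealing schedule mentioned above, here is a framework-independent sketch. The optional linear warm-up and the parameter names are assumptions for illustration; in PyTorch the built-in schedulers would normally be used instead.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0, warmup_steps=0):
    """Learning rate at `step` under cosine annealing with optional linear warm-up."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warm-up from lr_max / warmup_steps up to lr_max.
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine decay from lr_max (progress = 0) down to lr_min (progress = 1).
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(s, 100, lr_max=0.1) for s in (0, 50, 100)]
print([round(v, 4) for v in schedule])  # [0.1, 0.05, 0.0]
```

With this schedule only `lr_max` (and possibly `warmup_steps`) needs to be searched, which keeps the learning rate a one- or two-dimensional tuning problem.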

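Start small

Run quick trials on a subset of the data (10-20% of the data, a few epochs) to narrow the search space, then fine-tune on the full data once the rough ranges are known. This can save much of the compute cost, potentially more than 80%, depending on the task.

Use cross-validation

To avoid overfitting to the validation set, evaluate each configuration with K-fold cross-validation when data is scarce. With Optuna, this is done by running the K folds inside the objective function and returning the mean score.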
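A dependency-free sketch of that pattern is shown below. The `evaluate(train_idx, val_idx)` callback is hypothetical: in a real objective it would train a model on the training indices with the trial's hyperparameters and return the validation score.

```python
import random
from statistics import mean

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(evaluate, n_samples, k=5):
    """Mean of evaluate(train_idx, val_idx) over the k folds."""
    return mean([evaluate(tr, va) for tr, va in kfold_indices(n_samples, k)])

# Dummy evaluation: the fraction of samples held out in each fold.
score = cross_val_score(lambda tr, va: len(va) / (len(tr) + len(va)), n_samples=10, k=5)
print(score)
```

Inside an Optuna objective, the value returned by `cross_val_score` would be the value returned for the trial, at roughly k times the cost of a single train/validation split.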

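Fixed seeds and reproducibility

Fix the random seed in every trial (torch.manual_seed, np.random.seed) so that run-to-run randomness does not mask the true effect of a hyperparameter. When evaluation noise is high, run each configuration 3-5 times with different seeds and average the results.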
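The repeat-and-average advice can be wrapped in a small helper. In this sketch `noisy_train` is a stand-in for a real training run (a made-up quadratic in the learning rate plus seed-dependent noise); only the structure of `evaluate_with_seeds` is the point.

```python
import random
from statistics import mean, stdev

def evaluate_with_seeds(train_fn, config, seeds=(0, 1, 2)):
    """Run train_fn(config, seed) once per seed and summarize the scores."""
    scores = [train_fn(config, seed) for seed in seeds]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "std": spread, "scores": scores}

def noisy_train(config, seed):
    """Stand-in for a training run: deterministic given (config, seed)."""
    rng = random.Random(seed)
    return -(config["lr"] - 0.01) ** 2 + rng.gauss(0.0, 1e-5)

summary = evaluate_with_seeds(noisy_train, {"lr": 0.01})
print(len(summary["scores"]), summary["std"] > 0.0)  # 3 True
```

Reporting the mean to the tuner and keeping the standard deviation alongside it makes it easier to tell a real improvement from seed noise.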

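Common pitfalls

Curse of dimensionality: tuning too many hyperparameters at once (>15) makes the search space sparse and the required number of trials grows exponentially
Premature pruning: overly aggressive pruning can eliminate configurations that learn slowly but end up performing well
Validation set contamination: tuning repeatedly against the same validation set causes implicit overfitting; keep a final holdout test set
Resource imbalance: heterogeneous resources in distributed tuning make some trials very slow; use resource-aware scheduling

8.3 A Complete Production Tuning Pipeline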

                """
                Complete production-grade hyperparameter tuning pipeline
                Integrates Optuna + MLflow + WandB + Ray Tune
                """
                import optuna
                import mlflow
                import wandb
                import numpy as np
                import json
                from datetime import datetime
                from pathlib import Path

                class HyperparameterPipeline:
                    """Unified tuning pipeline: Optuna search + MLflow tracking + WandB visualization"""

                    def __init__(self, study_name, storage_url, mlflow_uri, wandb_project):
                        self.study_name = study_name
                        self.storage_url = storage_url
                        self.mlflow_uri = mlflow_uri
                        self.wandb_project = wandb_project

                        # Configure MLflow
                        mlflow.set_tracking_uri(mlflow_uri)
                        mlflow.set_experiment(study_name)

                        # Create the Optuna study, or load it if it already exists
                        self.study = optuna.create_study(
                            study_name=study_name,
                            storage=storage_url,
                            direction='maximize',
                            sampler=optuna.samplers.TPESampler(
                                n_startup_trials=10,
                                multivariate=True,
                                group=True,
                            ),
                            pruner=optuna.pruners.MedianPruner(
                                n_startup_trials=5,
                                n_warmup_steps=5,
                            ),
                            load_if_exists=True,
                        )

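                    def objective(self, trial):
                        # Sample hyperparameters. sample_params, build_model and
                        # train_with_early_stop are assumed to be defined elsewhere.
                        config = self.sample_params(trial)

                        # One MLflow run per trial
                        with mlflow.start_run(run_name=f'trial_{trial.number}'):
                            mlflow.log_params(config)
                            mlflow.set_tag('study', self.study_name)

                            # One WandB run per trial, grouped by study
                            wandb_run = wandb.init(
                                project=self.wandb_project,
                                group=self.study_name,
                                name=f'trial_{trial.number}',
                                config=config,
                                reinit=True,
                            )
                            try:
                                # Train the model; the helper is expected to report
                                # intermediate values and raise optuna.TrialPruned
                                model = build_model(config)
                                metric = train_with_early_stop(
                                    model, config, trial,
                                    log_fn=wandb_run.log,
                                )

                                # Log results (assumes training saved best_model.pth)
                                mlflow.log_metric('final_accuracy', metric)
                                mlflow.log_artifact('best_model.pth')
                                wandb_run.log({'final/accuracy': metric})
                            finally:
                                # Close the WandB run even if the trial is pruned or fails
                                wandb_run.finish()

                        return metric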

                    def run(self, n_trials):
                        self.study.optimize(self.objective, n_trials=n_trials)

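                        # Save the best parameters
                        best = {
                            'best_value': self.study.best_value,
                            'best_params': self.study.best_params,
                            'best_trial': self.study.best_trial.number,
                            'timestamp': datetime.now().isoformat(),
                        }
                        Path('best_params.json').write_text(
                            json.dumps(best, indent=2)
                        )
                        # Log inside an explicit run so MLflow does not leave an
                        # implicitly created run open
                        with mlflow.start_run(run_name='best_params'):
                            mlflow.log_artifact('best_params.json')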

                        return best

                # Usage example
                pipeline = HyperparameterPipeline(
                    study_name='resnet_cifar100_v2',
                    storage_url='postgresql://user:pass@host/optuna_db',
                    mlflow_uri='http://mlflow-server:5000',
                    wandb_project='cifar100-tuning',
                )
                best = pipeline.run(n_trials=200)
                print(f"Best accuracy: {best['best_value']:.4f}")
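The pipeline above calls `self.sample_params`, which is not shown. One way it might look, reusing the search space from section 2.4, is sketched here as a standalone function. `LowerBoundTrial` is a hypothetical stub that lets the sketch run without Optuna installed; in real use an `optuna.trial.Trial` would be passed in.

```python
def sample_params(trial):
    """Build a config dict from an Optuna-style trial object."""
    return {
        'optimizer': trial.suggest_categorical('optimizer', ['Adam', 'SGD', 'AdamW']),
        'lr': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
        'n_units': trial.suggest_int('n_units', 32, 512, step=32),
        'n_layers': trial.suggest_int('n_layers', 2, 8),
        'dropout': trial.suggest_float('dropout', 0.0, 0.5),
    }

class LowerBoundTrial:
    """Hypothetical stub: always returns the first choice or the lower bound."""

    def suggest_categorical(self, name, choices):
        return choices[0]

    def suggest_float(self, name, low, high, *, step=None, log=False):
        return low

    def suggest_int(self, name, low, high, *, step=1, log=False):
        return low

config = sample_params(LowerBoundTrial())
```

Returning a plain dict keeps the config easy to pass to `mlflow.log_params` and to the `config` argument of `wandb.init`.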
            

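9. Key Takeaways

Hyperparameter Tuning and Experiment Tracking · Knowledge Map

Hyperparameter types: architecture parameters (depth, width), training parameters (learning rate, batch size, optimizer), regularization parameters (Dropout, weight decay), data parameters (augmentation strategy, sequence length)
Evolution of tuning methods: grid search → random search → Bayesian optimization (GP/TPE) → multi-fidelity (HyperBand/ASHA) → dynamic tuning (PBT), with efficiency improving at each step
Core of Bayesian optimization: a surrogate model (GP or TPE) plus an acquisition function (EI/PI/UCB), balancing exploration and exploitation
Three Optuna building blocks: Study manages the lifecycle, Trial represents a single experiment, Pruner stops unpromising trials early. The suggest_* API gives a concise way to define the search space
Ray Tune strengths: natively distributed on Ray, supports schedulers such as ASHA and PBT, integrates with TensorBoard and MLflow, resource-aware scheduling
MLflow Tracking: five record types (params, metrics, tags, artifacts, logged models); the Model Registry provides model versioning and stage transitions
WandB highlights: cloud dashboards for visualization, Sweeps for managed hyperparameter search, Artifacts for dataset versioning, built-in experiment reports
Path to production: small-scale exploration → narrowed search space → large-scale distributed tuning → recording best parameters and registering the model → deployment to serving
Key lessons: the learning rate is the most sensitive parameter, so search it first on a log scale; fix random seeds to reduce noise; use cross-validation to avoid overfitting to the validation set; be careful with aggressive early stopping, which can kill good configurations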