Hyperparameters, Schedules, and Stability Scaling in Deep Networks
1. Learning-rate and batch-size coupling
For SGD near a basin, the gradient noise scale can be approximated as

S ≈ η N / B,

where N is the dataset size and B the batch size. Holding S roughly constant motivates linear learning-rate scaling with B in large-batch regimes:

η = η₀ · (B / B₀),

typically with warmup to avoid early-time instability.
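The linear scaling rule above can be sketched in a few lines; the base values here are illustrative, not from the source:

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear LR scaling: keep the noise scale S ≈ ηN/B roughly
    constant when the global batch size changes."""
    return base_lr * batch / base_batch

# Example: quadrupling the batch quadruples the learning rate.
print(scaled_lr(0.1, 256, 1024))
```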
2. Warmup theorem in a linearized regime
Consider the local quadratic model

f(θ) = ½ (θ − θ⋆)ᵀ H (θ − θ⋆),  H ⪰ 0.

Gradient descent θ_{t+1} = θ_t − η∇f(θ_t) is stable iff

0 < η < 2 / λ_max(H).

Early training often has a rapidly changing effective Hessian; warmup controls transient violations of this bound.
Proof: the error e_t = θ_t − θ⋆ evolves as

e_{t+1} = (I − ηH) e_t.
In eigenbasis, each component multiplies by 1 - ηλ_i. Convergence requires |1 - ηλ_i| < 1 for all i, equivalent to the condition above. □
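A small numerical check of the bound, on a toy diagonal Hessian chosen for illustration:

```python
import numpy as np

# f(θ) = ½ θᵀHθ with θ⋆ = 0; GD is stable iff η < 2/λ_max(H).
H = np.diag([1.0, 10.0])  # λ_max = 10, so the stability threshold is η = 0.2

def gd_error_norm(eta: float, steps: int = 100) -> float:
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - eta * (H @ theta)  # e_{t+1} = (I - ηH) e_t
    return float(np.linalg.norm(theta))

print(gd_error_norm(0.15))  # below 2/λ_max: error shrinks toward 0
print(gd_error_norm(0.25))  # above 2/λ_max: error grows without bound
```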
3. Cosine decay and polynomial decay
A common schedule is cosine decay,

η_t = η_min + ½ (η_max − η_min)(1 + cos(π t / T)),

or polynomial decay, η_t = η_max (1 − t/T)^p.

Interpretation: high exploration/noise early, low-variance refinement late. In stochastic differential equation approximations of SGD, this corresponds to annealing the temperature.
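A minimal sketch combining linear warmup with cosine decay (function and parameter names are illustrative):

```python
import math

def lr_at(step: int, total: int, warmup: int,
          peak: float, floor: float = 0.0) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```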
4. Weight decay, AdamW, and decoupling
In AdamW, the update is

θ_{t+1} = θ_t − η ( m̂_t / (√v̂_t + ε) + λ θ_t ),

where m̂_t and v̂_t are bias-corrected first- and second-moment estimates and λ is the weight-decay coefficient. Decoupled decay isolates norm shrinkage from the adaptive gradient statistics, improving tuning predictability versus naive L2 regularization passed through the adaptive preconditioner.
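A scalar sketch of the decoupled update; the hyperparameter defaults here are common choices, not values from the source:

```python
import math

def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One scalar AdamW step. Weight decay multiplies θ directly,
    outside the √v̂ preconditioner (decoupled)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient the preconditioned term vanishes, so a step shrinks θ by exactly lr·wd·θ, which is the decoupling at work.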
5. EMA for sampling and evaluation
Exponential moving average parameters:

θ̄_t = β θ̄_{t−1} + (1 − β) θ_t.

The effective averaging window is approximately 1/(1 − β). EMA reduces high-frequency optimizer noise and often substantially improves diffusion sample quality.
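The update above is a one-liner per parameter; a minimal tracker might look like this (class name is illustrative):

```python
class EMA:
    """Tracks θ̄_t = β θ̄_{t-1} + (1 - β) θ_t per parameter."""
    def __init__(self, params, beta=0.999):
        self.beta = beta
        self.shadow = [float(p) for p in params]

    def update(self, params):
        b = self.beta
        self.shadow = [b * s + (1 - b) * float(p)
                       for s, p in zip(self.shadow, params)]
```

At evaluation or sampling time, the shadow parameters are used in place of the live ones.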
6. Diffusion-specific schedules
Discrete forward process:

x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε,  ε ~ N(0, I),  with ᾱ_t = ∏_{s=1}^{t} α_s.

Signal-to-noise ratio:

SNR(t) = ᾱ_t / (1 − ᾱ_t).
Schedule design (linear, cosine, EDM-style continuous sigma schedules) controls where denoising capacity is allocated across frequency scales.
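As one concrete example, a cosine-style ᾱ_t in the spirit of Nichol & Dhariwal (the offset s = 0.008 is their suggested default; this is a sketch, not the only variant) and the resulting SNR:

```python
import math

def alpha_bar_cosine(t: int, T: int, s: float = 0.008) -> float:
    """Cosine schedule: ᾱ_t = f(t)/f(0), f(u) = cos²(((u/T + s)/(1 + s)) · π/2)."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def snr(t: int, T: int) -> float:
    ab = alpha_bar_cosine(t, T)
    return ab / (1.0 - ab)

# SNR falls monotonically: early steps denoise near-clean inputs at high SNR,
# late steps operate near pure noise.
```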
7. Practical scaling recipe (large models)
- Choose target compute budget and sequence length.
- Co-design model width/depth and token count to match empirical scaling exponents.
- Set global batch by memory/throughput.
- Apply LR warmup, then cosine/polynomial decay.
- Tune weight decay and gradient clipping to satisfy stability constraints.
- Use EMA and mixed precision with loss scaling.
- Track gradient noise scale and Hessian trace proxies to retune schedule.
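The last bullet can be sketched with a simple two-batch-size estimator of the gradient noise scale, in the spirit of McCandlish et al.; the measured squared gradient norms passed in are hypothetical inputs:

```python
def gradient_noise_scale(g_sq_small: float, g_sq_big: float,
                         b_small: int, b_big: int) -> float:
    """Estimate B_noise = tr(Σ)/|G|² from E|G_B|² ≈ |G|² + tr(Σ)/B,
    measured at two batch sizes b_small < b_big."""
    # Solve the two-equation system for the true squared gradient norm
    # and the trace of the per-example gradient covariance.
    g_true_sq = (b_big * g_sq_big - b_small * g_sq_small) / (b_big - b_small)
    trace_sigma = (g_sq_small - g_sq_big) / (1 / b_small - 1 / b_big)
    return trace_sigma / g_true_sq
```

When the running estimate of this scale grows well past the current global batch size, that is a signal to retune batch size or schedule.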
8. Final perspective
At scale, hyperparameters are not "tricks". They are control variables of a stochastic dynamical system. Good schedules match curvature, gradient noise, and data regime over training time, which is why transfer across model sizes requires explicit scaling rules.
Key principles:
- Learning rate and batch size are coupled through noise scale considerations
- Warmup prevents instability during early training with rapidly changing curvature
- Schedule design balances exploration and refinement over training trajectory
- Large-scale training requires systematic hyperparameter scaling, not manual tuning