Hyperparameters, Schedules, and Stability Scaling in Deep Networks

1. Learning-rate and batch-size coupling

For SGD near a basin, noise scale can be approximated as

S ∝ η (N - B) / B

where N is the dataset size and B the batch size. For B ≪ N this reduces to S ≈ ηN/B, so holding S roughly constant as B grows motivates linear LR scaling in large-batch regimes:

η(B) ≈ η₀ (B / B₀)

typically with warmup to avoid early-time instability.
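A minimal sketch of both rules, with hypothetical names (`scaled_lr`, `warmup_lr`) chosen for illustration:

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: eta(B) = eta0 * (B / B0)."""
    return base_lr * batch / base_batch

def warmup_lr(step, warmup_steps, target_lr):
    """Linear warmup from 0 to target_lr over warmup_steps."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps
```

For example, scaling a base LR of 0.1 from batch 256 to batch 1024 gives 0.4, which warmup then approaches linearly over the first few thousand steps.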

2. Warmup theorem in a linearized regime

Consider the local quadratic model

L(θ) = (1/2)(θ - θ⋆)^T H(θ - θ⋆), H ⪰ 0

GD is stable iff

0 < η < 2 / λ_max(H)

Early training often has rapidly changing effective Hessian. Warmup controls transient violation of this bound.

Error e_t = θ_t - θ⋆ evolves as

e_{t+1} = (I - ηH)e_t

In the eigenbasis of H, each error component is multiplied by 1 - ηλ_i per step. Convergence requires |1 - ηλ_i| < 1 for all i, which is equivalent to the condition above. □
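The stability bound can be checked numerically. A small sketch (assumed function name `gd_error_norms`) iterating e_{t+1} = (I - ηH)e_t on a diagonal Hessian with λ_max = 10, so the critical step size is 2/10 = 0.2:

```python
import numpy as np

def gd_error_norms(H, eta, e0, steps):
    """Iterate e_{t+1} = (I - eta*H) e_t and record ||e_t||."""
    M = np.eye(H.shape[0]) - eta * H
    e = e0.copy()
    norms = [np.linalg.norm(e)]
    for _ in range(steps):
        e = M @ e
        norms.append(np.linalg.norm(e))
    return norms

H = np.diag([1.0, 10.0])                     # lambda_max = 10
e0 = np.array([1.0, 1.0])
stable = gd_error_norms(H, 0.15, e0, 50)     # eta < 2/10: contracts
unstable = gd_error_norms(H, 0.25, e0, 50)   # eta > 2/10: diverges
```

With η = 0.25 the sharpest direction has multiplier 1 - 0.25·10 = -1.5, so its component oscillates and grows, exactly the transient that warmup is meant to avoid while the effective curvature settles.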

3. Cosine decay and polynomial decay

A common schedule:

η_t = η_min + (1/2)(η_max - η_min)(1 + cos(πt/T))

Interpretation: high exploration/noise early, low-variance refinement late. In stochastic differential equation approximations of SGD, this corresponds to annealing the temperature over training.
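The schedule above translates directly into code; a minimal sketch (function name assumed):

```python
import math

def cosine_lr(t, T, eta_max, eta_min=0.0):
    """eta_t = eta_min + (1/2)(eta_max - eta_min)(1 + cos(pi*t/T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))
```

At t = 0 this returns η_max, at t = T it returns η_min, and at t = T/2 it sits exactly halfway between them.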

4. Weight decay, AdamW, and decoupling

In AdamW, with bias-corrected first- and second-moment estimates m̂_t and v̂_t,

θ_{t+1} = θ_t - η_t m̂_t/(√v̂_t + ε) - η_t λθ_t

Decoupled decay isolates norm shrinkage from the adaptive gradient statistics, improving tuning predictability versus naive L2 regularization passed through the adaptive preconditioner.
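A minimal sketch of one AdamW step, assuming default hyperparameters (β₁ = 0.9, β₂ = 0.999); the decoupled decay term is the last subtraction, applied outside the preconditioner:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update with decoupled weight decay (sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v
```

Note that with zero gradient the parameters still shrink by a factor (1 - η_t λ) per step, which is the point of decoupling: decay strength does not depend on v̂_t.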

5. EMA for sampling and evaluation

Exponential moving average parameters:

θ̄_t = β θ̄_{t-1} + (1 - β)θ_t

The effective averaging window is approximately 1/(1 - β) steps (with bias correction early in training). EMA reduces high-frequency optimizer noise and often improves diffusion sample quality substantially.
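The update is one line per parameter; a minimal sketch over a list of scalars (function name assumed):

```python
def ema_update(ema, params, beta=0.999):
    """theta_bar_t = beta * theta_bar_{t-1} + (1 - beta) * theta_t."""
    return [beta * e + (1 - beta) * p for e, p in zip(ema, params)]
```

Starting an EMA at 0 and feeding it a constant parameter value 1.0 with β = 0.9 reaches 1 - 0.9ⁿ after n steps, illustrating the ~1/(1 - β) = 10-step effective window.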

6. Diffusion-specific schedules

Discrete forward process:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) x_{t-1}, β_t I), ᾱ_t = Π_{s=1}^t (1 - β_s)

Signal-to-noise ratio:

SNR(t) = ᾱ_t / (1 - ᾱ_t)

Schedule design (linear, cosine, EDM-style continuous sigma schedules) controls where denoising capacity is allocated across frequency scales.
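A minimal sketch computing ᾱ_t and SNR(t) from the formulas above, assuming the common linear beta schedule with endpoints 1e-4 and 0.02 (values illustrative):

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas over T discrete timesteps."""
    return np.linspace(beta_start, beta_end, T)

def snr_from_betas(betas):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t), alpha_bar_t = prod(1 - beta_s)."""
    alpha_bar = np.cumprod(1.0 - betas)
    return alpha_bar / (1.0 - alpha_bar)

betas = linear_beta_schedule(1000)
snr = snr_from_betas(betas)
```

Plotting `snr` on a log scale makes schedule comparisons concrete: linear and cosine schedules allocate denoising steps differently across the high- and low-SNR ends of the trajectory.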

7. Practical scaling recipe (large models)

  1. Choose target compute budget and sequence length.
  2. Co-design model width/depth and token count to match empirical scaling exponents.
  3. Set global batch by memory/throughput.
  4. Apply LR warmup, then cosine/polynomial decay.
  5. Tune weight decay and gradient clipping to satisfy stability constraints.
  6. Use EMA and mixed precision with loss scaling.
  7. Track gradient noise scale and Hessian trace proxies to retune schedule.
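Step 4 of the recipe, warmup followed by cosine decay, is commonly implemented as a single piecewise function; a minimal sketch with assumed names and illustrative defaults:

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The peak LR here would itself come from the linear scaling rule of section 1, tying the recipe's steps 3 and 4 together.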

8. Final perspective

At scale, hyperparameters are not "tricks". They are control variables of a stochastic dynamical system. Good schedules match curvature, gradient noise, and data regime over training time, which is why transfer across model sizes requires explicit scaling rules.

Key principles:

  • Learning rate and batch size are coupled through noise scale considerations
  • Warmup prevents instability during early training with rapidly changing curvature
  • Schedule design balances exploration and refinement over training trajectory
  • Large-scale training requires systematic hyperparameter scaling, not manual tuning