Higher-Order Methods for Deep Learning: Curvature, Rates, and Practicality
1. Why higher-order?
First-order methods are cheap but can be curvature-blind. Higher-order methods use Hessian or Fisher structure to improve conditioning.
2. Newton and damped Newton
Newton step:

θ_{t+1} = θ_t − H_t^{-1} g_t,

where g_t = ∇L(θ_t) and H_t = ∇²L(θ_t).
The damped version replaces H_t^{-1} with (H_t + λI)^{-1}, which keeps the step well-defined when H_t is indefinite or ill-conditioned.
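As a minimal sketch (assuming explicit gradient and Hessian callables and a problem small enough for dense linear algebra; all names are illustrative), one damped Newton update looks like:

```python
import numpy as np

def damped_newton_step(theta, grad_fn, hess_fn, lam=1e-3):
    """One damped Newton update: theta - (H + lam*I)^{-1} g."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    return theta - np.linalg.solve(H + lam * np.eye(len(theta)), g)

# Toy anisotropic quadratic L(θ) = ½ θᵀAθ (illustrative choice).
A = np.diag([1.0, 100.0])
grad = lambda th: A @ th
hess = lambda th: A

theta = np.array([1.0, 1.0])
for _ in range(5):
    theta = damped_newton_step(theta, grad, hess, lam=1e-6)
# theta is now very close to the minimizer at the origin.
```

With small damping the iteration recovers the pure Newton behavior; larger λ interpolates toward a scaled gradient step.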
Assume L is twice continuously differentiable, the Hessian is Lipschitz near a minimizer θ⋆, and H(θ⋆) ≻ 0. Then pure Newton iterates started sufficiently close to θ⋆ converge quadratically:

||θ_{t+1} − θ⋆|| ≤ C ||θ_t − θ⋆||²,

where C depends on the Hessian-Lipschitz constant and ||H(θ⋆)^{-1}||.
Proof sketch: Taylor-expand the gradient around θ_t and use the stability of the Hessian inverse in a neighborhood of the optimum; the first-order error cancels, leaving a second-order residual proportional to the Hessian-Lipschitz constant.
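The quadratic rate is easy to observe numerically. A minimal demonstration on the toy objective L(θ) = exp(θ) − θ (an illustrative choice: minimizer θ⋆ = 0, L''(0) = 1 > 0):

```python
import math

# Newton's method on L(θ) = exp(θ) − θ, minimized at θ⋆ = 0.
theta = 1.0
errors = []
for _ in range(5):
    g = math.exp(theta) - 1.0   # L'(θ)
    h = math.exp(theta)         # L''(θ)
    theta = theta - g / h
    errors.append(abs(theta))

# Errors roughly square each step: 0.37, 0.06, 1.8e-3, 1.5e-6, ...
```

Each error is bounded by roughly half the square of the previous one, matching the bound above with C ≈ ½ for this objective.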
3. Cubic regularization global rate
At iteration t, approximately solve

θ_{t+1} = argmin_θ  g_t^T (θ − θ_t) + ½ (θ − θ_t)^T H_t (θ − θ_t) + (σ/6) ||θ − θ_t||³.
This method has worst-case iteration complexity to reach ||∇L|| ≤ ε of

O(ε^{-3/2}),

improving over gradient descent's O(ε^{-2}) in nonconvex smooth settings.
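The inner subproblem can be solved approximately through its stationarity condition (H_t + (σ/2)||s|| I) s = −g_t. The sketch below (illustrative names; a convex toy H, since this naive fixed-point iteration is not robust for indefinite Hessians) iterates on the scalar r = ||s||:

```python
import numpy as np

def cubic_reg_step(g, H, sigma, iters=50):
    """Approximately solve min_s gᵀs + ½ sᵀHs + (σ/6)||s||³.
    Stationarity gives (H + (σ/2)||s|| I) s = −g; iterate on r = ||s||."""
    r = 0.0
    for _ in range(iters):
        s = np.linalg.solve(H + 0.5 * sigma * r * np.eye(len(g)), -g)
        r = np.linalg.norm(s)
    return s

# Convex toy problem (H ≻ 0, illustrative values); as σ → 0 the step
# approaches the Newton step −H^{-1}g.
H = np.diag([1.0, 10.0])
g = np.array([1.0, 1.0])
s = cubic_reg_step(g, H, sigma=1.0)
```

The cubic term automatically limits the step length, which is what yields the global O(ε^{-3/2}) guarantee without a hand-tuned trust radius.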
4. Natural gradient and information geometry
Define the Fisher matrix

F(θ) = E_{x∼p_θ}[ ∇_θ log p_θ(x) ∇_θ log p_θ(x)^T ].
Natural gradient step:

θ_{t+1} = θ_t − η F(θ_t)^{-1} g_t.
It is steepest descent in distribution space under the KL metric, giving (local) reparameterization invariance.
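A small worked example, assuming a Bernoulli model with mean parameter θ (hypothetical data, chosen for illustration): in this parameterization a single natural-gradient step with η = 1 on the average negative log-likelihood lands exactly on the maximum-likelihood estimate.

```python
import numpy as np

# Bernoulli model p_θ(x) = θ^x (1 − θ)^{1−x}; Fisher F(θ) = 1/(θ(1−θ)).
data = np.array([1.0, 1.0, 1.0, 0.0])        # hypothetical sample, mean 0.75
theta = 0.5                                  # starting parameter

m = data.mean()
g = -(m / theta - (1 - m) / (1 - theta))     # gradient of the average NLL
F = 1.0 / (theta * (1 - theta))              # Fisher information
theta_new = theta - 1.0 * g / F              # θ − η F^{-1} g with η = 1
# theta_new = 0.75, the MLE data.mean().
```

A plain gradient step with η = 1 would instead overshoot or undershoot depending on θ; the Fisher rescaling removes that parameterization dependence.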
Solve

min_δ g_t^T δ   subject to   KL(p_{θ_t} || p_{θ_t+δ}) ≤ ε,

using the small-δ approximation KL(p_{θ_t} || p_{θ_t+δ}) ≈ ½ δ^T F(θ_t) δ.
The Lagrangian of this constrained problem yields

δ ∝ −F(θ_t)^{-1} g_t.
So natural gradient is the trust-region direction under KL-induced quadratic approximation.
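The quadratic KL approximation underlying this derivation can be checked numerically for the one-parameter Bernoulli model (values below are illustrative):

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta, delta = 0.3, 1e-3                 # illustrative parameter and perturbation
F = 1.0 / (theta * (1 - theta))          # Bernoulli Fisher information
exact = kl_bern(theta, theta + delta)
quad = 0.5 * delta**2 * F                # ½ δᵀFδ in one dimension
# exact and quad agree to within a fraction of a percent at this δ.
```

The agreement degrades as δ grows, which is why KL-constrained methods keep the trust region small.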
5. Practical approximations: K-FAC and Shampoo
Exact Hessian/Fisher inversion is infeasible at scale. Structured approximations:
- K-FAC: Kronecker-factorized blocks per layer.
- Shampoo: tensor preconditioning with matrix roots.
- L-BFGS variants: low-memory curvature estimates.
Cost-benefit depends on batch size, model architecture, and hardware communication overhead.
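As a sketch of the Kronecker idea behind K-FAC (a toy dense layer with random data and damping for invertibility; this illustrates only the preconditioning step, not the full algorithm), the per-layer Fisher block A ⊗ S is inverted via its two small factors, using the identity (A ⊗ S)^{-1} vec(G) = vec(S^{-1} G A^{-1}):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 256           # toy sizes (illustrative)

a = rng.normal(size=(batch, n_in))       # layer inputs (activations)
s = rng.normal(size=(batch, n_out))      # gradients w.r.t. pre-activations

# Kronecker factors of the Fisher block, damped for invertibility.
A = a.T @ a / batch + 1e-3 * np.eye(n_in)
S = s.T @ s / batch + 1e-3 * np.eye(n_out)
G = s.T @ a / batch                      # weight gradient, shape (n_out, n_in)

# Invert two small factors instead of one (n_in·n_out) × (n_in·n_out) matrix.
precond_G = np.linalg.solve(S, G) @ np.linalg.inv(A)

# Sanity check against the explicit Kronecker inverse (column-major vec).
full = np.kron(A, S)
vec_G = np.linalg.solve(full, G.reshape(-1, order="F")).reshape(G.shape, order="F")
```

The factored inverse costs O(n_in³ + n_out³) rather than O((n_in·n_out)³), which is the core of why Kronecker-factored preconditioning scales to wide layers.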
6. When higher-order wins
Higher-order methods provide strongest gains when:
- curvature anisotropy is high,
- batch sizes are large enough for stable curvature estimates,
- wall-clock bottleneck is optimization steps rather than per-step compute.
Practical considerations:
- Newton-type methods excel near minima where the Hessian is positive definite, especially when its spectrum is ill-conditioned
- Natural gradients provide parameterization invariance for probabilistic models
- Modern approximations like K-FAC make higher-order methods feasible at scale
- Success depends on matching method characteristics to problem structure and hardware constraints