Optimization Geometry and Convergence in Deep Neural Networks
1. Problem setup
Let

    L(θ) = (1/n) Σ_{i=1}^n ℓ(f_θ(x_i), y_i),

where ℓ is differentiable and f_θ is a deep network. We train by gradient descent:

    θ_{t+1} = θ_t − η ∇L(θ_t).
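As a minimal concrete instance of this setup (the model and loss are my choices, not the text's): take f_θ(x) = θᵀx, squared loss ℓ(ŷ, y) = ½(ŷ − y)², and apply one gradient-descent step to the empirical loss.

```python
import numpy as np

# Concrete instance (choices mine): f_θ(x) = θᵀx with squared loss,
# L(θ) = (1/n) Σ_i ½(f_θ(x_i) − y_i)², and one step θ ← θ − η∇L(θ).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # n = 100 inputs x_i
y = rng.normal(size=100)        # targets y_i

def L(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad_L(theta):
    return X.T @ (X @ theta - y) / len(y)

theta, eta = np.zeros(5), 0.1
before = L(theta)
theta = theta - eta * grad_L(theta)   # one gradient-descent step
assert L(theta) < before               # a small enough η decreases L
```

Any differentiable f_θ fits the same template; only `L` and `grad_L` change.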
Even though L is nonconvex, useful regions often satisfy a Polyak–Łojasiewicz (PL) inequality.
2. PL inequality and linear convergence
Definition (PL). L satisfies the μ-PL inequality if, for all θ,

    (1/2) ‖∇L(θ)‖² ≥ μ (L(θ) − L*),   where L* = inf_θ L(θ).
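A useful non-strongly-convex example (mine, not the text's): least squares L(θ) = ½‖Aθ − b‖² with a rank-deficient design satisfies μ-PL with μ equal to the smallest nonzero eigenvalue of AᵀA. A quick numerical check:

```python
import numpy as np

# Sketch (example mine): verify the μ-PL inequality for an overparameterized
# least-squares loss L(θ) = ½‖Aθ − b‖², where PL holds with μ = smallest
# nonzero eigenvalue of AᵀA even though L is not strongly convex.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 50))          # wide matrix: AᵀA is rank-deficient
b = rng.normal(size=20)

def L(theta):
    r = A @ theta - b
    return 0.5 * r @ r

def grad(theta):
    return A.T @ (A @ theta - b)

eigs = np.linalg.eigvalsh(A.T @ A)
mu = eigs[eigs > 1e-8].min()           # smallest nonzero eigenvalue
theta_star, *_ = np.linalg.lstsq(A, b, rcond=None)
L_star = L(theta_star)

# Check (1/2)‖∇L(θ)‖² ≥ μ(L(θ) − L*) at random points
for _ in range(5):
    theta = rng.normal(size=50)
    lhs = 0.5 * grad(theta) @ grad(theta)
    rhs = mu * (L(theta) - L_star)
    assert lhs >= rhs - 1e-8
print("PL inequality holds at all sampled points")
```

Strongly convex functions satisfy PL with μ equal to the strong-convexity constant, but as this example shows, PL is strictly weaker.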
Assume:
- L is L-smooth,
- L is μ-PL,
- step size η ∈ (0, 1/L].
Then

    L(θ_t) − L* ≤ (1 − ημ)^t (L(θ_0) − L*).

Proof. By L-smoothness,

    L(θ_{t+1}) ≤ L(θ_t) + ⟨∇L(θ_t), θ_{t+1} − θ_t⟩ + (L/2) ‖θ_{t+1} − θ_t‖².

Substitute θ_{t+1} − θ_t = −η∇L(θ_t):

    L(θ_{t+1}) ≤ L(θ_t) − η(1 − ηL/2) ‖∇L(θ_t)‖².

For η ≤ 1/L:

    L(θ_{t+1}) ≤ L(θ_t) − (η/2) ‖∇L(θ_t)‖².

Apply PL:

    L(θ_{t+1}) ≤ L(θ_t) − ημ (L(θ_t) − L*),

which gives the recursion

    L(θ_{t+1}) − L* ≤ (1 − ημ) (L(θ_t) − L*),

hence geometric decay. □
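The theorem can be checked numerically on a loss whose smoothness and PL constants are computable in closed form (the least-squares instance below is my choice; L = λ_max(AᵀA), μ = smallest nonzero eigenvalue of AᵀA):

```python
import numpy as np

# Sketch (instance mine): gradient descent on L(θ) = ½‖Aθ − b‖², checking the
# bound L(θ_t) − L* ≤ (1 − ημ)^t (L(θ_0) − L*) at every step.
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)

loss = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad = lambda th: A.T @ (A @ th - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L_smooth, mu = eigs[-1], eigs[eigs > 1e-8].min()
eta = 1.0 / L_smooth                      # step size in (0, 1/L]

theta_star, *_ = np.linalg.lstsq(A, b, rcond=None)
L_star = loss(theta_star)

theta = rng.normal(size=10)
gap0 = loss(theta) - L_star
for t in range(1, 101):
    theta = theta - eta * grad(theta)
    # the theorem's bound must dominate the actual gap at every step
    assert loss(theta) - L_star <= (1 - eta * mu) ** t * gap0 + 1e-10
print("geometric decay bound verified for 100 steps")
```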
3. Stochastic gradients and rates
Let g_t be an unbiased estimator: E[g_t | θ_t] = ∇L(θ_t) and

    E[‖g_t − ∇L(θ_t)‖² | θ_t] ≤ σ².

With constant η ≤ 1/L, one gets a neighborhood bound

    (1/T) Σ_{t=0}^{T−1} E‖∇L(θ_t)‖² ≤ 2(L(θ_0) − L*)/(ηT) + ηLσ².

Diminishing η_t ~ 1/√t yields the classical nonconvex rate

    min_{t<T} E‖∇L(θ_t)‖² = O(1/√T).

Under PL + bounded variance, SGD attains linear convergence to a noise floor proportional to ησ²/μ:

    E[L(θ_t) − L*] ≤ (1 − ημ)^t (L(θ_0) − L*) + ηLσ²/(2μ).
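The two-phase behavior — geometric decay down to a noise floor — is easy to observe. The simulation below (setup mine) runs SGD on a strongly convex quadratic with additive gradient noise of variance σ² and compares the late-phase loss gap to the predicted floor ηLσ²/(2μ):

```python
import numpy as np

# Sketch (setup mine): SGD on a quadratic with Hessian spectrum in [μ, L]
# and additive gradient noise. The loss gap decays geometrically, then
# plateaus at or below the predicted floor ηLσ²/(2μ).
rng = np.random.default_rng(2)
mu_, L_ = 1.0, 10.0
H = np.diag(np.linspace(mu_, L_, 10))    # Hessian with spectrum [μ, L]
loss = lambda th: 0.5 * th @ H @ th       # minimum L* = 0 at θ = 0
sigma, eta = 1.0, 0.01

theta = rng.normal(size=10) * 10.0
gaps = []
for t in range(5000):
    g = H @ theta + sigma * rng.normal(size=10)   # unbiased noisy gradient
    theta = theta - eta * g
    gaps.append(loss(theta))

floor = eta * L_ * sigma**2 / (2 * mu_)           # predicted noise floor
late = np.mean(gaps[-1000:])                      # average late-phase gap
print(f"late-phase gap ≈ {late:.3f}, predicted floor ≈ {floor:.3f}")
```

Shrinking η lowers the floor but slows the geometric phase, which is the usual motivation for step-size decay schedules.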
4. Saddle avoidance intuition
Strict saddles have at least one direction of strictly negative curvature. Under smoothness and Hessian-Lipschitz conditions, perturbed GD escapes strict saddles in polynomial time, and the intrinsic noise of SGD plays a similar role. This explains why, empirically, we rarely observe long-term trapping at saddles.
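A toy illustration (example mine): f(x, y) = x² − y² has a strict saddle at the origin, with negative curvature along y. Plain GD started exactly on the unstable manifold (y = 0) converges to the saddle, while an arbitrarily small perturbation off it escapes geometrically fast:

```python
import numpy as np

# Sketch (example mine): f(x, y) = x² − y², strict saddle at (0, 0).
# GD contracts x by (1 − 2η) and *expands* y by (1 + 2η) each step,
# so any nonzero y-component — e.g. from noise — escapes the saddle.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
eta = 0.1

def run_gd(p0, steps=100):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - eta * grad(p)
    return p

stuck = run_gd([1.0, 0.0])       # exactly on the unstable manifold: stays put
escaped = run_gd([1.0, 1e-8])    # tiny perturbation: y grows geometrically

print("no perturbation:", stuck)
print("perturbed:", escaped)
```

The perturbed run's y-component grows by a factor (1 + 2η) per step, so even a 1e-8 offset becomes order-one within a hundred iterations; SGD's gradient noise supplies such offsets for free.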
5. Practical implication
For deep nets, the useful question is not "is the objective convex?" but "does the training trajectory enter a region with PL-like geometry and controlled noise?". Most optimizer and schedule choices are mechanisms for reaching and staying in such regions.
Key takeaways:
- PL inequality provides a framework for understanding linear convergence in nonconvex settings
- SGD noise helps escape strict saddles while maintaining convergence in PL regions
- Optimization geometry matters more than convexity for deep learning success