Optimization Geometry and Convergence in Deep Neural Networks

Audience: Stanford post-graduate ML students

1. Problem setup

Let

L(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ), θ ∈ ℝᵈ

where ℓ is differentiable and f_θ is a deep network. We train by

θ_{t+1} = θ_t - η∇L(θ_t)
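A minimal sketch of this update loop (the toy quadratic loss and constants below are illustrative assumptions, not from the text):

```python
import numpy as np

def gradient_descent(grad, theta0, eta, steps):
    """Plain gradient descent: theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Toy example: L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta.
theta_final = gradient_descent(lambda th: th, theta0=[1.0, -2.0], eta=0.1, steps=100)
```

Any differentiable loss can be plugged in through its gradient callable; the sections below analyze when this loop converges and how fast.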

Even though L is nonconvex, useful regions of parameter space often satisfy a Polyak-Łojasiewicz (PL) inequality.

2. PL inequality and linear convergence

Definition (PL)

L satisfies μ-PL if

(1/2)||∇L(θ)||₂² ≥ μ(L(θ) - L⋆), μ > 0

Theorem 1 (GD linear convergence under smooth + PL)

Assume:

  1. L is L-smooth,
  2. L is μ-PL,
  3. step size η ∈ (0, 1/L].

Then

L(θ_t) - L⋆ ≤ (1 - ημ)ᵗ(L(θ₀) - L⋆)

Proof. By L-smoothness,

L(θ_{t+1}) ≤ L(θ_t) + ⟨∇L(θ_t), θ_{t+1} - θ_t⟩ + (L/2)||θ_{t+1} - θ_t||²

Substitute θ_{t+1} - θ_t = -η∇L(θ_t):

L(θ_{t+1}) ≤ L(θ_t) - η||∇L(θ_t)||² + (Lη²/2)||∇L(θ_t)||²

For η ≤ 1/L:

L(θ_{t+1}) ≤ L(θ_t) - (η/2)||∇L(θ_t)||²

Apply PL, i.e. ||∇L(θ_t)||² ≥ 2μ(L(θ_t) - L⋆), and subtract L⋆ from both sides:

L(θ_{t+1}) - L⋆ ≤ L(θ_t) - L⋆ - ημ(L(θ_t) - L⋆)

which, writing Δ_t := L(θ_t) - L⋆, gives the recursion

Δ_{t+1} ≤ (1 - ημ)Δ_t

hence geometric decay. □
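The bound in Theorem 1 can be checked numerically on a strongly convex quadratic, which is both L-smooth and μ-PL; the particular matrix and constants below are an assumed example, not from the text:

```python
import numpy as np

# Quadratic L(theta) = 0.5 * theta^T A theta with A = diag(mu, Lsm):
# L-smooth with constant Lsm, mu-PL with constant mu, and L_star = 0.
mu, Lsm = 0.5, 4.0
A = np.diag([mu, Lsm])
L = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

eta = 1.0 / Lsm                  # step size in (0, 1/L]
theta = np.array([1.0, 1.0])
gap0 = L(theta)                  # L(theta_0) - L_star

ok = True
for t in range(1, 51):
    theta = theta - eta * grad(theta)
    # Theorem 1 bound: L(theta_t) - L_star <= (1 - eta*mu)^t * (L(theta_0) - L_star)
    ok = ok and L(theta) <= (1 - eta * mu) ** t * gap0 + 1e-12
```

For this quadratic the bound holds at every iterate, with the slow direction (curvature μ) governing the observed rate.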

3. Stochastic gradients and rates

Let g_t be an unbiased estimator: E[g_t|θ_t] = ∇L(θ_t) and

E||g_t - ∇L(θ_t)||² ≤ σ²

With constant η, one gets a neighborhood bound

(1/T) Σ_{t=0}^{T-1} E||∇L(θ_t)||² = O((L(θ₀) - L⋆)/(ηT) + ησ²)

Diminishing η_t ~ 1/√t yields the classical nonconvex rate

min_{t<T} E||∇L(θ_t)||² = O(1/√T)
Under PL + bounded variance, SGD attains linear convergence to a noise floor proportional to ησ²/μ.
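The noise floor is easy to see on a one-dimensional quadratic, which is μ-PL; the additive-Gaussian-noise model and all constants below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0

def sgd_tail_loss(eta, steps=20000, tail=10000):
    """SGD on L(theta) = 0.5*mu*theta^2 with noisy gradient g = mu*theta + noise.
    Returns the average loss over the last `tail` iterates (the noise floor)."""
    theta, losses = 5.0, []
    for _ in range(steps):
        g = mu * theta + sigma * rng.standard_normal()
        theta -= eta * g
        losses.append(0.5 * mu * theta ** 2)
    return float(np.mean(losses[-tail:]))

floor_big = sgd_tail_loss(eta=0.1)
floor_small = sgd_tail_loss(eta=0.01)
```

Shrinking η lowers the floor roughly in proportion to η, consistent with the ησ²/μ scaling, at the cost of a slower transient phase.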

4. Saddle avoidance intuition

Strict saddles have at least one direction of negative curvature. Under smoothness and Hessian-Lipschitz conditions, perturbed GD escapes strict saddles in polynomial time with high probability, and the inherent noise in SGD plays a similar role. This matches the empirical observation that training is rarely trapped at saddles for long.
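A toy illustration of this effect (the function f below is an assumed example, not from the text): exact GD launched on the saddle's stable manifold converges to the saddle, while an arbitrarily small perturbation off that manifold escapes to a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 + 0.5*y^4 has a strict saddle at the origin
# (negative curvature along y) and minima at (0, +/-1) with f = -0.5.
f = lambda x, y: x**2 - y**2 + 0.5 * y**4
grad = lambda x, y: np.array([2 * x, -2 * y + 2 * y**3])

def run_gd(x0, y0, eta=0.1, steps=300):
    p = np.array([x0, y0])
    for _ in range(steps):
        p = p - eta * grad(*p)
    return p

exact = run_gd(1.0, 0.0)       # starts exactly on the saddle's stable manifold
perturbed = run_gd(1.0, 1e-6)  # tiny perturbation off that manifold
```

Here `exact` converges to the saddle (0, 0), while `perturbed` drifts along the negative-curvature direction and settles at the minimum (0, 1); noise makes the measure-zero trapped set practically unreachable.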

5. Practical implication

For deep nets, the useful question is not "is the objective convex?" but "does the training trajectory enter a region with PL-like geometry and controlled noise?" Most optimizer and schedule choices are mechanisms to reach and stay in such regions.

Key takeaways:

  • PL inequality provides a framework for understanding linear convergence in nonconvex settings
  • SGD noise helps escape strict saddles while maintaining convergence in PL regions
  • Optimization geometry matters more than convexity for deep learning success