Optimization Geometry and Convergence in Deep Neural Networks

Audience: Stanford post-graduate ML students

1. Problem setup

Let

L(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ), θ ∈ ℝᵈ

where ℓ is differentiable and f_θ is a deep network. We train by

θ_{t+1} = θ_t - η∇L(θ_t)
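A minimal sketch of this update loop (the toy quadratic loss and constants below are illustrative assumptions, not from the text):

```python
import numpy as np

def gradient_descent(grad, theta0, eta, steps):
    """Plain gradient descent: theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Toy example: L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta.
theta_final = gradient_descent(lambda th: th, theta0=[1.0, -2.0], eta=0.1, steps=100)
```

Any differentiable loss can be plugged in through its gradient callable; the sections below analyze when this loop converges and how fast.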

Even though L is nonconvex, useful regions of parameter space often satisfy a Polyak-Łojasiewicz (PL) inequality.

2. PL inequality and linear convergence

Definition (PL)

L satisfies μ-PL if

(1/2)||∇L(θ)||₂² ≥ μ(L(θ) - L⋆), μ > 0

Theorem 1 (GD linear convergence under smooth + PL)

Assume:

  1. L is L-smooth,
  2. L is μ-PL,
  3. step size η ∈ (0, 1/L].

Then

L(θ_t) - L⋆ ≤ (1 - ημ)ᵗ(L(θ₀) - L⋆)

Proof. By L-smoothness,

L(θ_{t+1}) ≤ L(θ_t) + ⟨∇L(θ_t), θ_{t+1} - θ_t⟩ + (L/2)||θ_{t+1} - θ_t||²

Substitute θ_{t+1} - θ_t = -η∇L(θ_t):

L(θ_{t+1}) ≤ L(θ_t) - η||∇L(θ_t)||² + (Lη²/2)||∇L(θ_t)||²

For η ≤ 1/L:

L(θ_{t+1}) ≤ L(θ_t) - (η/2)||∇L(θ_t)||²

Apply PL, i.e. ||∇L(θ_t)||² ≥ 2μ(L(θ_t) - L⋆), and subtract L⋆ from both sides:

L(θ_{t+1}) - L⋆ ≤ L(θ_t) - L⋆ - ημ(L(θ_t) - L⋆)

which, writing Δ_t := L(θ_t) - L⋆, gives the recursion

Δ_{t+1} ≤ (1 - ημ)Δ_t

hence geometric decay. □
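The bound in Theorem 1 can be checked numerically on a strongly convex quadratic, which is both L-smooth and μ-PL; the particular matrix and constants below are an assumed example, not from the text:

```python
import numpy as np

# Quadratic L(theta) = 0.5 * theta^T A theta with A = diag(mu, Lsm):
# L-smooth with constant Lsm, mu-PL with constant mu, and L_star = 0.
mu, Lsm = 0.5, 4.0
A = np.diag([mu, Lsm])
L = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

eta = 1.0 / Lsm                  # step size in (0, 1/L]
theta = np.array([1.0, 1.0])
gap0 = L(theta)                  # L(theta_0) - L_star

ok = True
for t in range(1, 51):
    theta = theta - eta * grad(theta)
    # Theorem 1 bound: L(theta_t) - L_star <= (1 - eta*mu)^t * (L(theta_0) - L_star)
    ok = ok and L(theta) <= (1 - eta * mu) ** t * gap0 + 1e-12
```

For this quadratic the bound holds at every iterate, with the slow direction (curvature μ) governing the observed rate.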

3. Stochastic gradients and rates

Let g_t be an unbiased estimator: E[g_t|θ_t] = ∇L(θ_t) and

E||g_t - ∇L(θ_t)||² ≤ σ²

With constant η, one gets a neighborhood bound

(1/T) Σ_{t=0}^{T-1} E||∇L(θ_t)||² = O((L(θ₀) - L⋆)/(ηT) + ησ²)

Diminishing η_t ~ 1/√t yields the classical nonconvex rate

min_{t<T} E||∇L(θ_t)||² = O(1/√T)
Under PL + bounded variance, SGD attains linear convergence to a noise floor proportional to ησ²/μ.
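The noise floor is easy to see on a one-dimensional quadratic, which is μ-PL; the additive-Gaussian-noise model and all constants below are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0

def sgd_tail_loss(eta, steps=20000, tail=10000):
    """SGD on L(theta) = 0.5*mu*theta^2 with noisy gradient g = mu*theta + noise.
    Returns the average loss over the last `tail` iterates (the noise floor)."""
    theta, losses = 5.0, []
    for _ in range(steps):
        g = mu * theta + sigma * rng.standard_normal()
        theta -= eta * g
        losses.append(0.5 * mu * theta ** 2)
    return float(np.mean(losses[-tail:]))

floor_big = sgd_tail_loss(eta=0.1)
floor_small = sgd_tail_loss(eta=0.01)
```

Shrinking η lowers the floor roughly in proportion to η, consistent with the ησ²/μ scaling, at the cost of a slower transient phase.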

4. Saddle avoidance intuition

Strict saddles have at least one direction of negative curvature. Under smoothness and Hessian-Lipschitz conditions, perturbed GD escapes strict saddles in polynomial time with high probability, and the inherent noise in SGD plays a similar role. This matches the empirical observation that training is rarely trapped at saddles for long.
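A toy illustration of this effect (the function f below is an assumed example, not from the text): exact GD launched on the saddle's stable manifold converges to the saddle, while an arbitrarily small perturbation off that manifold escapes to a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 + 0.5*y^4 has a strict saddle at the origin
# (negative curvature along y) and minima at (0, +/-1) with f = -0.5.
f = lambda x, y: x**2 - y**2 + 0.5 * y**4
grad = lambda x, y: np.array([2 * x, -2 * y + 2 * y**3])

def run_gd(x0, y0, eta=0.1, steps=300):
    p = np.array([x0, y0])
    for _ in range(steps):
        p = p - eta * grad(*p)
    return p

exact = run_gd(1.0, 0.0)       # starts exactly on the saddle's stable manifold
perturbed = run_gd(1.0, 1e-6)  # tiny perturbation off that manifold
```

Here `exact` converges to the saddle (0, 0), while `perturbed` drifts along the negative-curvature direction and settles at the minimum (0, 1); noise makes the measure-zero trapped set practically unreachable.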

5. Practical implication

For deep nets, the useful question is not "is the objective convex?" but "does the training trajectory enter a region with PL-like geometry and controlled noise?" Most optimizer and schedule choices are mechanisms to reach and stay in such regions.

Key takeaways:

  • PL inequality provides a framework for understanding linear convergence in nonconvex settings
  • SGD noise helps escape strict saddles while maintaining convergence in PL regions
  • Optimization geometry matters more than convexity for deep learning success