Approximation, Generalization, and Scaling in Neural Networks

1. Approximation in high dimension: Barron-space perspective

For a two-layer network

f_m(x) = (1/m) Σ_{j=1}^m a_j σ(w_j^T x)

if target f⋆ has finite Barron norm ||f⋆||_B, there exists a parameterization such that

||f_m - f⋆||_{L²(ρ)} ≤ C (||f⋆||_B / √m)

Proof sketch

Represent f⋆ as an expectation (integral) over features σ(w^T x) with respect to a probability measure on weights, which is possible whenever the first Fourier moment of f⋆ is finite. Approximate the integral by Monte Carlo with m sampled features; Jensen's inequality plus variance control give the m^{-1/2} rate.

The rate is dimension-robust: the exponent −1/2 does not depend on the input dimension d, provided the Fourier spectral mass of f⋆ is controlled.
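The Monte Carlo argument can be checked numerically. In the sketch below (an illustrative setup, not from the source), the target is itself a feature expectation f⋆(x) = E_w[relu(w·x)] with w ~ N(0, I), which has the closed form ‖x‖/√(2π); averaging m sampled features should shrink the L² error like m^{-1/2} regardless of d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test = 10, 500  # input dimension, evaluation points (illustrative choices)

def relu(z):
    return np.maximum(z, 0.0)

# For w ~ N(0, I), w.x ~ N(0, ||x||^2), so the feature integral has a closed form:
# f*(x) = E_w[relu(w.x)] = ||x|| / sqrt(2*pi).
X = rng.standard_normal((n_test, d))
f_star = np.linalg.norm(X, axis=1) / np.sqrt(2 * np.pi)

# Monte Carlo approximation of the integral with m sampled features; the L2 error
# should decay roughly like m^{-1/2} (the Barron-style rate), independent of d.
errors = {}
for m in (10, 100, 1000):
    W = rng.standard_normal((m, d))
    f_m = relu(X @ W.T).mean(axis=1)
    errors[m] = float(np.sqrt(np.mean((f_m - f_star) ** 2)))

for m, e in errors.items():
    print(f"m={m:5d}  L2 error={e:.4f}")
```

Each hundredfold increase in m should cut the error by roughly a factor of ten, mirroring the √m in the denominator of the bound.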

2. Rademacher-style generalization bounds

For a bounded, Lipschitz loss ℓ and hypothesis class F,

sup_{f∈F} (E[ℓ(f)] - (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)) ≲ R_n(F) + √(log(1/δ)/n)

with probability at least 1 − δ, where R_n(F) denotes the (empirical) Rademacher complexity of F.

For deep nets, R_n can be bounded via products of layer spectral norms times sums of Frobenius-to-spectral ratios. Because these norms depend on the weights actually reached by training, such bounds reflect implicit regularization from the optimization trajectory and from normalization.
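A rough proxy with the shape described above can be computed directly from the weight matrices. The sketch below uses random weights and a simple product-times-ratio-sum form purely for illustration; published bounds (e.g., Bartlett et al.'s spectrally-normalized bound) use refined variants of these quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-layer weight shapes; a real analysis would use trained weights.
shapes = [(784, 256), (256, 256), (256, 10)]
weights = [rng.standard_normal(s) / np.sqrt(s[0]) for s in shapes]

spec = [float(np.linalg.norm(W, 2)) for W in weights]     # spectral norms (top singular value)
frob = [float(np.linalg.norm(W, "fro")) for W in weights]  # Frobenius norms

# Proxy: product of spectral norms times the sum of Frobenius-to-spectral ratios.
prod_spec = float(np.prod(spec))
ratio_sum = sum(f / s for f, s in zip(frob, spec))
complexity = prod_spec * ratio_sum
print(f"spectral product={prod_spec:.3f}  ratio sum={ratio_sum:.3f}  complexity proxy={complexity:.3f}")
```

Since ‖W‖_F ≥ ‖W‖_2 always, each ratio is at least 1; the proxy grows with both depth (via the product) and per-layer "spread" of singular values (via the ratios).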

3. Margin-based bound for classification

Let γ > 0 be an empirical margin threshold. A typical bound takes the form

E_test ≲ E_γ + O(C(θ) / (γ√n))

where E_γ is the fraction of training examples with margin at most γ and C(θ) captures norm-based complexity. Training often continues to increase the margin while controlling effective complexity, even past the interpolation point.
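The quantities in the bound are easy to compute for a concrete classifier. The toy sketch below (hypothetical linear scorer and data, chosen only to make the definitions concrete) computes per-example margins y·f(x), the margin error E_γ, and the plain training error:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary data and a hypothetical linear scorer f(x) = x.theta.
n, d = 200, 5
X = rng.standard_normal((n, d))
theta = rng.standard_normal(d)
y = np.sign(X @ theta + 0.3 * rng.standard_normal(n))  # labels in {-1, +1}

margins = y * (X @ theta)                     # per-example margin y * f(x)
gamma = 0.5                                   # margin threshold (illustrative choice)
E_gamma = float(np.mean(margins <= gamma))    # fraction with margin <= gamma
E_train = float(np.mean(margins <= 0.0))      # plain 0-1 training error

print(f"median margin={np.median(margins):.3f}  E_train={E_train:.3f}  E_gamma={E_gamma:.3f}")
```

By construction E_train ≤ E_γ: raising the threshold γ trades a larger empirical term for a smaller complexity term C(θ)/(γ√n) in the bound.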

4. Why scaling works: optimization + statistics decomposition

A useful decomposition for excess risk:

E(θ̂) - E(θ⋆) = approximation error + estimation error + optimization error

As model size grows:

  • approximation error decreases,
  • estimation error may initially rise then fall (double descent regime),
  • optimization error decreases in overparameterized lazy/NTK-like regimes.
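The first two terms of the decomposition can be seen in a minimal example. The sketch below (a hypothetical polynomial-regression setup, standing in for "model size" via polynomial degree k) measures approximation error as the gap between the best-in-class fit and the truth, and estimation error as the gap between the n-sample fit and that best-in-class fit:

```python
import numpy as np

rng = np.random.default_rng(3)

# Truth: f(x) = sin(2*pi*x); model class: degree-k polynomials (illustrative setup).
x_grid = np.linspace(0.0, 1.0, 2000)
f_true = np.sin(2 * np.pi * x_grid)

n = 50
x_train = rng.uniform(0.0, 1.0, n)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(n)

approx, est = {}, {}
for k in (1, 3, 7):
    # Approximation error: best-in-class fit to the noiseless truth on a dense grid.
    c_best = np.polyfit(x_grid, f_true, k)
    approx[k] = float(np.mean((np.polyval(c_best, x_grid) - f_true) ** 2))
    # Estimation error: gap between the n-sample fit and the best-in-class fit.
    c_hat = np.polyfit(x_train, y_train, k)
    est[k] = float(np.mean((np.polyval(c_hat, x_grid) - np.polyval(c_best, x_grid)) ** 2))
    print(f"degree {k}: approx={approx[k]:.5f}  estimation={est[k]:.5f}")
```

Approximation error falls monotonically with k, while estimation error depends on the noise and sample size; growing the model trades one term against the other, which is the tension behind the bullets above.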

5. Compute-optimal scaling laws (stylized)

Empirical studies often fit

L(N,D,C) ≈ L_∞ + aN^{-α} + bD^{-β} + cC^{-γ}

with N the model size (parameters), D the data (tokens/examples), and C the compute budget. Under a budget constraint with fixed unit resource costs, the Lagrangian first-order conditions equate the marginal loss reductions:

λ = αaN^{-α-1} = βbD^{-β-1} = γcC^{-γ-1}

The key principle is equalizing the marginal loss reduction per unit of each resource.
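The balance condition can be solved numerically. The sketch below uses hypothetical fit parameters and a stylized unit-cost budget N + D + C = B; it inverts the first-order conditions for each resource as a function of the multiplier λ and bisects on λ until the budget is exhausted:

```python
import numpy as np

# Stylized scaling-law parameters (all values hypothetical) and a budget
# N + D + C = B with unit resource costs, matching the condition in the text.
a, b, c = 400.0, 600.0, 300.0
alpha, beta, gamma = 0.34, 0.28, 0.40
B = 1e6

def alloc(lam):
    # Invert the first-order conditions: alpha*a*N^(-alpha-1) = lam, etc.
    N = (alpha * a / lam) ** (1.0 / (alpha + 1.0))
    D = (beta * b / lam) ** (1.0 / (beta + 1.0))
    C = (gamma * c / lam) ** (1.0 / (gamma + 1.0))
    return N, D, C

# Bisect on the multiplier lambda; total allocation is decreasing in lambda.
lo, hi = 1e-12, 1e3
for _ in range(200):
    mid = np.sqrt(lo * hi)  # geometric bisection since lambda spans many decades
    if sum(alloc(mid)) > B:
        lo = mid            # allocation over budget -> raise lambda
    else:
        hi = mid

lam = np.sqrt(lo * hi)
N, D, C = alloc(lam)
marginals = (alpha * a * N ** (-alpha - 1),
             beta * b * D ** (-beta - 1),
             gamma * c * C ** (-gamma - 1))
print(f"N={N:.3e}  D={D:.3e}  C={C:.3e}")
print("marginal loss reductions:", marginals)
```

At the optimum the three marginal loss reductions coincide (all equal λ), which is exactly the equalization principle stated above; unequal resource prices would simply rescale each condition by its unit cost.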

6. Takeaway

Generalization in modern deep learning is better captured by norm/margin/trajectory-dependent complexity and scaling trade-offs than by VC dimension alone.

Key insights:

  • Barron space provides dimension-robust approximation guarantees for neural networks
  • Modern generalization depends on optimization trajectory, not just model complexity
  • Scaling laws emerge from balancing approximation, estimation, and optimization errors
  • Compute-optimal training requires equalizing marginal gains across resources