
🌀 1-Lipschitz and stability

1. The original idea of $\kappa$ per layer

The original $\kappa$-budgeting idea is about controlling the network's gain. For a network:

$$f(x;\theta),$$

we want a Lipschitz bound:

$$\|f(x_1;\theta)-f(x_2;\theta)\| \leq L\,\|x_1-x_2\|.$$

Here:

  • $x_1, x_2$ are two inputs,
  • $f(x;\theta)$ is the network output,
  • $L$ is the Lipschitz constant.
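
To make the bound concrete, here is a minimal sketch (the two-layer tanh network is a hypothetical stand-in for $f$, not the actual model) that estimates a lower bound on $L$ by sampling input pairs; since tanh is 1-Lipschitz, the product of the layers' spectral norms is a valid upper bound to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for f(x; theta): a tiny two-layer tanh network.
W1 = rng.normal(size=(64, 16)) / np.sqrt(16)
W2 = rng.normal(size=(4, 64)) / np.sqrt(64)

def f(x):
    return W2 @ np.tanh(W1 @ x)

# Empirical lower bound on L: largest observed ratio
# ||f(x1) - f(x2)|| / ||x1 - x2|| over random input pairs.
ratios = []
for _ in range(10_000):
    x1, x2 = rng.normal(size=16), rng.normal(size=16)
    ratios.append(np.linalg.norm(f(x1) - f(x2)) / np.linalg.norm(x1 - x2))

print(f"empirical lower bound on L : {max(ratios):.3f}")
print(f"spectral-norm upper bound  : {np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2):.3f}")
```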

For a purely linear network:

$$f(x)=W_L W_{L-1}\cdots W_1 x,$$

the exact input-output gain is:

$$K = \left\| W_L W_{L-1}\cdots W_1 \right\|.$$

Using submultiplicativity of matrix norms:

$$\left\| W_L W_{L-1}\cdots W_1 \right\| \leq \prod_{\ell=1}^{L} \|W_\ell\|.$$

So if we enforce:

$$\|W_\ell\| \leq \kappa_\ell,$$

then:

$$K \leq \prod_{\ell=1}^{L} \kappa_\ell.$$
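
A quick numerical sanity check of this chain of inequalities, assuming a small stack of random square layers (the sizes and the `layers` list are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Four random linear layers W_1, ..., W_4 (32x32, purely illustrative).
layers = [rng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(4)]

# Exact end-to-end gain K = ||W_L ... W_1||_2 (spectral norm of the product).
K = np.linalg.norm(np.linalg.multi_dot(layers[::-1]), 2)

# Per-layer caps kappa_l = ||W_l||_2 and their product (the upper bound).
kappas = [np.linalg.norm(W, 2) for W in layers]
bound = np.prod(kappas)

print(f"exact gain K              : {K:.3f}")
print(f"product of per-layer norms: {bound:.3f}")
assert K <= bound + 1e-9  # submultiplicativity of the spectral norm
```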

That is the static $\kappa$-budgeting idea:

$$\boxed{\ \prod_{\ell=1}^{L}\kappa_\ell \leq L_{\max}.\ }$$
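
One simple way to enforce such a budget (a sketch, not the actual scheme: the even split $\kappa_\ell = L_{\max}^{1/L}$ and the `project_to_budget` helper are assumptions for illustration) is to cap each layer's spectral norm and shrink any layer that exceeds its cap:

```python
import numpy as np

def project_to_budget(layers, L_max):
    """Rescale each W_l so that ||W_l||_2 <= kappa_l with prod(kappa_l) <= L_max.

    Sketch only: the budget is split evenly, kappa_l = L_max ** (1 / L), and a
    layer is shrunk only if its spectral norm exceeds its cap.
    """
    kappa = L_max ** (1.0 / len(layers))
    projected = []
    for W in layers:
        s = np.linalg.norm(W, 2)                   # current spectral norm
        projected.append(W * min(1.0, kappa / s))  # shrink only if over the cap
    return projected

rng = np.random.default_rng(2)
layers = [rng.normal(size=(32, 32)) for _ in range(4)]
capped = project_to_budget(layers, L_max=2.0)

end_to_end = np.linalg.norm(np.linalg.multi_dot(capped[::-1]), 2)
print(f"end-to-end gain after projection: {end_to_end:.3f}  (guaranteed <= 2.0)")
```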

Key insight: The original $\kappa$-budgeting idea was about controlling the static Lipschitz gain of the network. This is important for inference-time robustness and numerical stability.

2. The problem of online training

But online training introduces a different map: not only

$$x \mapsto f(x;\theta),$$

but also:

$$\theta_t \mapsto \theta_{t+1}.$$

The learning rule is itself a dynamical system:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta \mathcal{L}(\theta_t).$$

So the stability of online learning is governed by the sensitivity of the gradient field:

$$G(\theta) = \nabla_\theta \mathcal{L}(\theta).$$

A natural smoothness/Lipschitz condition for the gradient is:

$$\|G(\theta_a)-G(\theta_b)\| \leq L_G\,\|\theta_a-\theta_b\|.$$
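
The same kind of empirical probe works for the gradient field. A minimal sketch, assuming a small least-squares loss as a stand-in for $\mathcal{L}$ (the data `X`, `y` are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in loss: L(theta) = 1/(2n) ||X theta - y||^2 on random data.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)

def grad(theta):
    return X.T @ (X @ theta - y) / n

# Empirical lower bound on L_G: largest observed ratio
# ||G(theta_a) - G(theta_b)|| / ||theta_a - theta_b|| over random parameter pairs.
ratios = []
for _ in range(5_000):
    ta, tb = rng.normal(size=10), rng.normal(size=10)
    ratios.append(np.linalg.norm(grad(ta) - grad(tb)) / np.linalg.norm(ta - tb))

print(f"empirical lower bound on L_G: {max(ratios):.3f}")
```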

Here $L_G$ is the Lipschitz constant of the gradient field. In smooth optimization, this is related to the Hessian norm:

$$L_G \approx \|H(\theta)\|_2,$$

where:

$$H(\theta) = \nabla_\theta^2 \mathcal{L}(\theta).$$
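
For a quadratic loss the Hessian is constant, so $L_G$ is exactly its spectral norm, and the standard result is that gradient descent with a fixed step size stays stable only when $\eta < 2/L_G$. A minimal sketch of that threshold (same illustrative least-squares setup as above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic loss L(theta) = 1/(2n) ||X theta - y||^2, so H = X^T X / n exactly.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)
L_G = np.linalg.norm(X.T @ X / n, 2)   # spectral norm of the (constant) Hessian

def run_gd(eta, steps=200):
    theta = np.zeros(10)
    for _ in range(steps):
        theta -= eta * (X.T @ (X @ theta - y) / n)
    return np.linalg.norm(X @ theta - y) ** 2 / (2 * n)

print(f"L_G = {L_G:.3f}, stability threshold 2/L_G = {2 / L_G:.3f}")
print(f"final loss, eta just below threshold: {run_gd(0.95 * 2 / L_G):.4f}")
print(f"final loss, eta just above threshold: {run_gd(1.05 * 2 / L_G):.3e}")
```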

So there are two different Lipschitz ideas:

a. Static network Lipschitzness

$$\|f(x_1;\theta)-f(x_2;\theta)\| \leq L_f\,\|x_1-x_2\|.$$

This is about input-output gain.

b. Dynamic optimizer Lipschitzness

$$\|\nabla\mathcal{L}(\theta_a)-\nabla\mathcal{L}(\theta_b)\| \leq L_G\,\|\theta_a-\theta_b\|.$$

This is about how violently the gradient changes when the weights move.

$\kappa$-budgeting mostly targets the first. The new throttle targets the second.
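
The two constants really can be wildly different. A toy contrast (least-squares on deliberately large-scale inputs, purely illustrative): a linear model with $\|\theta\|=1$ is 1-Lipschitz in its input, yet the gradient field of its loss is thousands of times stiffer.

```python
import numpy as np

rng = np.random.default_rng(5)

# Large-scale inputs make the loss stiff, even though the model stays 1-Lipschitz.
X = 50.0 * rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)

theta = rng.normal(size=10)
theta /= np.linalg.norm(theta)   # ||theta|| = 1, so f(x) = theta . x has L_f = 1

# Gradient-field Lipschitz constant of L(theta) = 1/(2n) ||X theta - y||^2.
L_G = np.linalg.norm(X.T @ X / n, 2)

print(f"static gain L_f       : {np.linalg.norm(theta):.1f}")
print(f"gradient-field L_G    : {L_G:.1f}")
print(f"stable step sizes need: eta < {2 / L_G:.2e}")
```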

Summary box

Key insight: We were controlling the static Lipschitz gain of the network, but the instability during online training is governed by the Lipschitzness of the gradient/update field. The new controller targets that closed-loop sensitivity.