
🌀 1-Lipschitz and stability

1. The original idea of $\kappa$ per layer

The original $\kappa$-budgeting idea is about controlling the network's gain. For a network:

$$f(x;\theta),$$

we want a Lipschitz bound:

$$\|f(x_1;\theta)-f(x_2;\theta)\| \leq L\,\|x_1-x_2\|.$$

Here:

  • $x_1, x_2$ are two inputs,
  • $f(x;\theta)$ is the network output,
  • $L$ is the Lipschitz constant.
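
To make the bound concrete, here is a minimal sketch (the two-layer tanh network is a hypothetical stand-in for $f$, not the actual model) that estimates a lower bound on $L$ by sampling input pairs; since tanh is 1-Lipschitz, the product of the layers' spectral norms is a valid upper bound to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for f(x; theta): a tiny two-layer tanh network.
W1 = rng.normal(size=(64, 16)) / np.sqrt(16)
W2 = rng.normal(size=(4, 64)) / np.sqrt(64)

def f(x):
    return W2 @ np.tanh(W1 @ x)

# Empirical lower bound on L: largest observed ratio
# ||f(x1) - f(x2)|| / ||x1 - x2|| over random input pairs.
ratios = []
for _ in range(10_000):
    x1, x2 = rng.normal(size=16), rng.normal(size=16)
    ratios.append(np.linalg.norm(f(x1) - f(x2)) / np.linalg.norm(x1 - x2))

print(f"empirical lower bound on L : {max(ratios):.3f}")
print(f"spectral-norm upper bound  : {np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2):.3f}")
```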

For a purely linear network:

$$f(x)=W_L W_{L-1}\cdots W_1 x,$$

the exact input-output gain is:

$$K = \left\| W_L W_{L-1}\cdots W_1 \right\|.$$

Using submultiplicativity of matrix norms:

$$\left\| W_L W_{L-1}\cdots W_1 \right\| \leq \prod_{\ell=1}^{L} \|W_\ell\|.$$

So if we enforce:

$$\|W_\ell\| \leq \kappa_\ell,$$

then:

$$K \leq \prod_{\ell=1}^{L} \kappa_\ell.$$
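
A quick numerical sanity check of this chain of inequalities, assuming a small stack of random square layers (the sizes and the `layers` list are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Four random linear layers W_1, ..., W_4 (32x32, purely illustrative).
layers = [rng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(4)]

# Exact end-to-end gain K = ||W_L ... W_1||_2 (spectral norm of the product).
K = np.linalg.norm(np.linalg.multi_dot(layers[::-1]), 2)

# Per-layer caps kappa_l = ||W_l||_2 and their product (the upper bound).
kappas = [np.linalg.norm(W, 2) for W in layers]
bound = np.prod(kappas)

print(f"exact gain K              : {K:.3f}")
print(f"product of per-layer norms: {bound:.3f}")
assert K <= bound + 1e-9  # submultiplicativity of the spectral norm
```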

That is the static $\kappa$-budgeting idea:

$$\boxed{\ \prod_{\ell=1}^{L}\kappa_\ell \leq L_{\max}.\ }$$
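
One simple way to enforce such a budget (a sketch, not the actual scheme: the even split $\kappa_\ell = L_{\max}^{1/L}$ and the `project_to_budget` helper are assumptions for illustration) is to cap each layer's spectral norm and shrink any layer that exceeds its cap:

```python
import numpy as np

def project_to_budget(layers, L_max):
    """Rescale each W_l so that ||W_l||_2 <= kappa_l with prod(kappa_l) <= L_max.

    Sketch only: the budget is split evenly, kappa_l = L_max ** (1 / L), and a
    layer is shrunk only if its spectral norm exceeds its cap.
    """
    kappa = L_max ** (1.0 / len(layers))
    projected = []
    for W in layers:
        s = np.linalg.norm(W, 2)                   # current spectral norm
        projected.append(W * min(1.0, kappa / s))  # shrink only if over the cap
    return projected

rng = np.random.default_rng(2)
layers = [rng.normal(size=(32, 32)) for _ in range(4)]
capped = project_to_budget(layers, L_max=2.0)

end_to_end = np.linalg.norm(np.linalg.multi_dot(capped[::-1]), 2)
print(f"end-to-end gain after projection: {end_to_end:.3f}  (guaranteed <= 2.0)")
```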

Key insight: The original $\kappa$-budgeting idea was about controlling the static Lipschitz gain of the network. This is important for inference-time robustness and numerical stability.

2. The problem of online training

But online training introduces a different map: not only

$$x \mapsto f(x;\theta),$$

but also:

$$\theta_t \mapsto \theta_{t+1}.$$

The learning rule is itself a dynamical system:

$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta \mathcal{L}(\theta_t).$$

So the stability of online learning is governed by the sensitivity of the gradient field:

$$G(\theta) = \nabla_\theta \mathcal{L}(\theta).$$

A natural smoothness/Lipschitz condition for the gradient is:

$$\|G(\theta_a)-G(\theta_b)\| \leq L_G\,\|\theta_a-\theta_b\|.$$
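
The same kind of empirical probe works for the gradient field. A minimal sketch, assuming a small least-squares loss as a stand-in for $\mathcal{L}$ (the data `X`, `y` are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in loss: L(theta) = 1/(2n) ||X theta - y||^2 on random data.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)

def grad(theta):
    return X.T @ (X @ theta - y) / n

# Empirical lower bound on L_G: largest observed ratio
# ||G(theta_a) - G(theta_b)|| / ||theta_a - theta_b|| over random parameter pairs.
ratios = []
for _ in range(5_000):
    ta, tb = rng.normal(size=10), rng.normal(size=10)
    ratios.append(np.linalg.norm(grad(ta) - grad(tb)) / np.linalg.norm(ta - tb))

print(f"empirical lower bound on L_G: {max(ratios):.3f}")
```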

Here $L_G$ is the Lipschitz constant of the gradient field. In smooth optimization, this is related to the Hessian norm:

$$L_G \approx \|H(\theta)\|_2,$$

where:

$$H(\theta) = \nabla_\theta^2 \mathcal{L}(\theta).$$
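
For a quadratic loss the Hessian is constant, so $L_G$ is exactly its spectral norm, and the standard result is that gradient descent with a fixed step size stays stable only when $\eta < 2/L_G$. A minimal sketch of that threshold (same illustrative least-squares setup as above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic loss L(theta) = 1/(2n) ||X theta - y||^2, so H = X^T X / n exactly.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)
L_G = np.linalg.norm(X.T @ X / n, 2)   # spectral norm of the (constant) Hessian

def run_gd(eta, steps=200):
    theta = np.zeros(10)
    for _ in range(steps):
        theta -= eta * (X.T @ (X @ theta - y) / n)
    return np.linalg.norm(X @ theta - y) ** 2 / (2 * n)

print(f"L_G = {L_G:.3f}, stability threshold 2/L_G = {2 / L_G:.3f}")
print(f"final loss, eta just below threshold: {run_gd(0.95 * 2 / L_G):.4f}")
print(f"final loss, eta just above threshold: {run_gd(1.05 * 2 / L_G):.3e}")
```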

So there are two different Lipschitz ideas:

a. Static network Lipschitzness

$$\|f(x_1;\theta)-f(x_2;\theta)\| \leq L_f\,\|x_1-x_2\|.$$

This is about input-output gain.

b. Dynamic optimizer Lipschitzness

$$\|\nabla\mathcal{L}(\theta_a)-\nabla\mathcal{L}(\theta_b)\| \leq L_G\,\|\theta_a-\theta_b\|.$$

This is about how violently the gradient changes when the weights move.

$\kappa$-budgeting mostly targets the first. The new throttle targets the second.
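
The two constants really can be wildly different. A toy contrast (least-squares on deliberately large-scale inputs, purely illustrative): a linear model with $\|\theta\|=1$ is 1-Lipschitz in its input, yet the gradient field of its loss is thousands of times stiffer.

```python
import numpy as np

rng = np.random.default_rng(5)

# Large-scale inputs make the loss stiff, even though the model stays 1-Lipschitz.
X = 50.0 * rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)

theta = rng.normal(size=10)
theta /= np.linalg.norm(theta)   # ||theta|| = 1, so f(x) = theta . x has L_f = 1

# Gradient-field Lipschitz constant of L(theta) = 1/(2n) ||X theta - y||^2.
L_G = np.linalg.norm(X.T @ X / n, 2)

print(f"static gain L_f       : {np.linalg.norm(theta):.1f}")
print(f"gradient-field L_G    : {L_G:.1f}")
print(f"stable step sizes need: eta < {2 / L_G:.2e}")
```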

Summary box

Key insight: We were controlling the static Lipschitz gain of the network, but the instability during online training is governed by the Lipschitzness of the gradient/update field. The new controller targets that closed-loop sensitivity.