🎚️ The global throttle mechanism

Instead of projecting each row/column independently, we introduce a single global scalar:

$$0 < \alpha_t \leq 1.$$

The update becomes:

$$\boxed{\,\theta_{t+1} = \theta_t - \alpha_t \eta\, G_t\,}$$

where:

$$G_t = \nabla_\theta \mathcal{L}(\theta_t).$$

This is important because the update remains parallel to the original gradient update.

Raw update:

$$\Delta\theta_t^{\text{raw}} = -\eta\, G_t.$$

Controlled update:

$$\Delta\theta_t^{\text{ctrl}} = -\alpha_t \eta\, G_t.$$

Therefore:

$$\Delta\theta_t^{\text{ctrl}} = \alpha_t\, \Delta\theta_t^{\text{raw}}.$$

So:

$$\cos\left(\Delta\theta_t^{\text{ctrl}},\, \Delta\theta_t^{\text{raw}}\right) = 1.$$

That means global throttling preserves the direction of learning. It only changes the speed.
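As a quick numerical sanity check, here is a minimal NumPy sketch (the random vector standing in for $G_t$ and the constants are arbitrary) showing that the throttled update is a positive rescaling of the raw one, so their cosine similarity is exactly 1:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flat vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
G_t = rng.standard_normal(1_000)    # stand-in for the gradient G_t
eta, alpha_t = 0.1, 0.25            # base learning rate and throttle in (0, 1]

raw_update = -eta * G_t             # Δθ_raw  = -η G_t
ctrl_update = -alpha_t * eta * G_t  # Δθ_ctrl = -α_t η G_t

print(cosine(ctrl_update, raw_update))  # 1.0 (up to float rounding)
```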

This is the key contrast:

$$\text{row/column } \kappa \text{ projection} \;\Rightarrow\; \text{may rotate the update}$$

$$\text{global throttle} \;\Rightarrow\; \text{preserves the update direction}$$

Estimation of local curvature/sensitivity

Computing the actual Hessian is computationally expensive and often infeasible, so the controller chooses $\alpha_t$ using an online curvature estimate.

The gradient is:

$$G(\theta) = \nabla_\theta \mathcal{L}(\theta).$$

The Hessian is the derivative of the gradient:

$$H(\theta) = \frac{\partial G}{\partial \theta} = \nabla_\theta^2 \mathcal{L}(\theta).$$

For a small change

$$\Delta\theta_t = \theta_t - \theta_{t-1},$$

a first-order Taylor expansion of the gradient gives:

$$G_t - G_{t-1} \approx H_t(\theta_t - \theta_{t-1}).$$

So:

$$\Delta G_t \approx H_t\, \Delta\theta_t.$$
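As a concrete check: for a quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^\top H \theta$ with constant $H$, the gradient is $G(\theta) = H\theta$, so $\Delta G_t = H\,\Delta\theta_t$ holds exactly rather than approximately.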

Taking norms:

$$\|\Delta G_t\| \approx \|H_t\, \Delta\theta_t\|.$$

Dividing both sides by $\|\Delta\theta_t\|$:

$$\frac{\|\Delta G_t\|}{\|\Delta\theta_t\|} \approx \frac{\|H_t\, \Delta\theta_t\|}{\|\Delta\theta_t\|}.$$

This is an observed directional curvature estimate.

So we define:

$$\boxed{\,\widehat{C}_t = \frac{\|G_t - G_{t-1}\|_2}{\|\theta_t - \theta_{t-1}\|_2 + \epsilon}\,}$$

This does not always equal $\lambda_{\max}(H_t)$; more precisely, it estimates the curvature along the direction the optimizer just moved.
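A minimal sketch of the estimator in NumPy, assuming flattened parameter and gradient vectors from two consecutive steps (the function name is illustrative). On a quadratic loss the estimate reduces exactly to $\|H\,\Delta\theta\| / \|\Delta\theta\|$:

```python
import numpy as np

def curvature_estimate(theta_t, theta_prev, g_t, g_prev, eps=1e-12):
    """C_hat_t = ||G_t - G_{t-1}|| / (||theta_t - theta_{t-1}|| + eps)."""
    dG = np.linalg.norm(g_t - g_prev)
    dtheta = np.linalg.norm(theta_t - theta_prev)
    return dG / (dtheta + eps)

# Quadratic sanity check: L = 0.5 θᵀHθ, so G = Hθ and ΔG = HΔθ exactly.
H = np.diag([10.0, 1.0])
theta_prev = np.array([1.0, 1.0])
theta_t = np.array([0.9, 0.99])
print(curvature_estimate(theta_t, theta_prev, H @ theta_t, H @ theta_prev))
# ≈ 9.95: the curvature along the direction just moved, close to λ_max = 10
```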

Smoothing and control

To avoid reacting to noise, we can smooth the curvature estimate with an exponential moving average:

$$S_t = (1-\rho)\, S_{t-1} + \rho\, \widehat{C}_t.$$

Then we use the more conservative of the instantaneous and smoothed estimates:

$$C_t^{\text{ctrl}} = \max(\widehat{C}_t,\, S_t).$$

Finally:

$$\boxed{\,\alpha_t = \min\left(1,\; \frac{\chi}{\eta\,(C_t^{\text{ctrl}} + \epsilon)}\right)\,}$$

where:

  • $\chi$ is the desired stability margin,
  • $\eta$ is the base learning rate,
  • $C_t^{\text{ctrl}}$ is the curvature/sensitivity estimate,
  • $\epsilon$ avoids division by zero.

The controller attempts to enforce:

$$\boxed{\,\alpha_t \eta\, C_t^{\text{ctrl}} \leq \chi.\,}$$
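Putting the estimator, the EMA, and the throttle together, here is a minimal sketch of the controller in NumPy. The class name, the default $\chi$ and $\rho$, and the first-step choice $\alpha_1 = 1$ are illustrative assumptions, not prescribed by the derivation:

```python
import numpy as np

class GlobalThrottle:
    """One global scalar: alpha_t = min(1, chi / (eta * (C_ctrl + eps)))."""

    def __init__(self, eta, chi=0.5, rho=0.1, eps=1e-12):
        self.eta, self.chi, self.rho, self.eps = eta, chi, rho, eps
        self.S = 0.0              # EMA of the curvature estimate
        self.prev_theta = None
        self.prev_grad = None

    def alpha(self, theta, grad):
        """Compute the throttle from flattened parameters and gradients."""
        if self.prev_theta is None:      # no history yet: do not throttle
            self.prev_theta, self.prev_grad = theta.copy(), grad.copy()
            return 1.0
        c_hat = (np.linalg.norm(grad - self.prev_grad)
                 / (np.linalg.norm(theta - self.prev_theta) + self.eps))
        self.S = (1 - self.rho) * self.S + self.rho * c_hat  # smooth
        c_ctrl = max(c_hat, self.S)                          # conservative
        self.prev_theta, self.prev_grad = theta.copy(), grad.copy()
        return min(1.0, self.chi / (self.eta * (c_ctrl + self.eps)))
```

A training step would then apply `theta -= throttle.alpha(theta, grad) * eta * grad`, keeping the descent direction intact while capping the effective step size.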

Caveat and summary

This estimate is not perfect. It is accurate for convex problems and simple models such as a single dense layer, but it can be inaccurate for non-convex losses and more complex architectures. However, it is a simple and computationally cheap way to gauge local curvature/sensitivity, which is exactly what we need to adapt the learning rate and prevent divergence.

Key insight: The global throttle mechanism preserves the geometry of the original gradient update, but it adaptively reduces the learning rate when the gradient field becomes stiff/sharp. This allows the model to keep learning without diverging, even under distribution shift.

What happens for nonlinear / nonconvex networks?

For deeper networks (ReLU activations, CNNs, cross-entropy losses, and so on), the loss is not globally quadratic.

For a general network

$$\hat{y} = f(x; \theta)$$

with loss $\mathcal{L}(\theta)$, we use a local Taylor expansion:

$$\mathcal{L}(\theta + \delta) \approx \mathcal{L}(\theta) + G(\theta)^\top \delta + \frac{1}{2}\, \delta^\top H(\theta)\, \delta.$$

The gradient descent map is:

$$\theta_{t+1} = \theta_t - \eta\, G(\theta_t).$$

Linearizing the dynamics of a small perturbation $\delta_t$ around this trajectory:

$$\delta_{t+1} \approx (I - \eta H_t)\, \delta_t.$$

So the same stability idea applies locally.
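A small numerical illustration of the local stability condition, assuming a fixed diagonal $H_t$ over a few steps: perturbations shrink when $\eta\,\lambda_{\max} < 2$ and grow geometrically otherwise.

```python
import numpy as np

# Perturbation dynamics δ_{t+1} ≈ (I - ηH) δ_t around a local point.
H = np.diag([10.0, 1.0])       # local Hessian, λ_max = 10
delta0 = np.array([1.0, 1.0])

for eta in (0.05, 0.25):       # η λ_max = 0.5 (stable) vs 2.5 (unstable)
    d = delta0.copy()
    for _ in range(20):
        d = (np.eye(2) - eta * H) @ d
    print(f"eta={eta}: |delta| after 20 steps = {np.linalg.norm(d):.3g}")
# eta=0.05 -> ~0.36 (perturbation decays); eta=0.25 -> ~3.3e3 (diverges)
```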

However:

  • $H_t$ changes over time,
  • $H_t$ may be indefinite,
  • ReLU networks are piecewise smooth,
  • mini-batch gradients include sampling noise,
  • quantization introduces non-smooth perturbations.

Therefore, for general networks we should not claim that

$$\widehat{C}_t = \lambda_{\max}(H_t)$$

holds exactly.

Instead, we claim that $\widehat{C}_t$ is an online estimate of local update-field sensitivity:

$$\widehat{C}_t \approx \frac{\|H_t\, \Delta\theta_t\|}{\|\Delta\theta_t\|}.$$

This is a directional curvature estimate.

For a general loss, we want to keep:

$$\alpha_t\, \eta\, L_t < \chi,$$

where $L_t$ is a local estimate of the Lipschitz constant of the gradient field.

The controller uses:

$$L_t \approx C_t^{\text{ctrl}}.$$

So:

$$\boxed{\,\alpha_t \eta\, C_t^{\text{ctrl}} \leq \chi.\,}$$

Key insight: For the linear case, this matches the true Hessian stability condition. For general networks, it becomes a local adaptive control rule.

Implementation

The throttle can be computed layer by layer, but the final controller is global. For each layer $\ell$, we have parameters $\theta_{\ell,t}$ and gradients $G_{\ell,t}$.

We can compute local contributions:

$$A_{\ell,t} = \|G_{\ell,t} - G_{\ell,t-1}\|_2^2, \qquad B_{\ell,t} = \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2.$$

Then aggregate globally:

$$A_t = \sum_{\ell=1}^{L} A_{\ell,t}, \qquad B_t = \sum_{\ell=1}^{L} B_{\ell,t}.$$

Then:

$$\boxed{\,\widehat{C}_t = \sqrt{\frac{A_t}{B_t + \epsilon}} = \sqrt{\frac{\sum_{\ell=1}^{L} \|G_{\ell,t} - G_{\ell,t-1}\|_2^2}{\sum_{\ell=1}^{L} \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2 + \epsilon}}\,}$$

This is equivalent (up to the placement of $\epsilon$) to flattening all layers into one vector:

$$\theta_t = \operatorname{vec}(\theta_{1,t}, \dots, \theta_{L,t}), \qquad G_t = \operatorname{vec}(G_{1,t}, \dots, G_{L,t}),$$

and computing:

$$\widehat{C}_t = \frac{\|G_t - G_{t-1}\|_2}{\|\theta_t - \theta_{t-1}\|_2 + \epsilon}.$$
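A quick numerical check of the equivalence (taking $\epsilon = 0$ so the two placements of $\epsilon$ coincide), with two hypothetical toy layers:

```python
import numpy as np

rng = np.random.default_rng(1)
thetas = [rng.standard_normal((4, 3)), rng.standard_normal(5)]
thetas_prev = [th + 0.01 * rng.standard_normal(th.shape) for th in thetas]
grads = [rng.standard_normal(th.shape) for th in thetas]
grads_prev = [g + 0.1 * rng.standard_normal(g.shape) for g in grads]

# Layerwise accumulation of squared norms.
A = sum(np.sum((g - gp) ** 2) for g, gp in zip(grads, grads_prev))
B = sum(np.sum((th - tp) ** 2) for th, tp in zip(thetas, thetas_prev))

# Flattened computation.
def flat(xs):
    return np.concatenate([x.ravel() for x in xs])

c_flat = (np.linalg.norm(flat(grads) - flat(grads_prev))
          / np.linalg.norm(flat(thetas) - flat(thetas_prev)))

print(np.isclose(np.sqrt(A / B), c_flat))  # True
```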

Hardware-wise, this can be implemented as streaming reductions:

$$\text{accum\_dG2} \leftarrow \text{accum\_dG2} + \|G_{\ell,t} - G_{\ell,t-1}\|_2^2,$$

$$\text{accum\_dtheta2} \leftarrow \text{accum\_dtheta2} + \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2.$$

At the end, compute one global scalar $\widehat{C}_t$, then broadcast a single global $\alpha_t$ to every layer's update.

So we do not multiply per-layer gains at the top level for this controller. The controller is based on the global sensitivity of the optimizer trajectory.
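A sketch of the streaming form, assuming per-layer parameter and gradient arrays (the function name and the default $\chi$ are illustrative; the EMA smoothing from earlier is omitted for brevity):

```python
import numpy as np

def throttle_step(thetas, thetas_prev, grads, grads_prev,
                  eta, chi=0.5, eps=1e-12):
    """One pass over layers -> one global C_hat -> one broadcast alpha."""
    accum_dG2 = 0.0      # running sum of ||G_{l,t} - G_{l,t-1}||^2
    accum_dtheta2 = 0.0  # running sum of ||θ_{l,t} - θ_{l,t-1}||^2
    for g, gp, th, tp in zip(grads, grads_prev, thetas, thetas_prev):
        accum_dG2 += float(np.sum((g - gp) ** 2))
        accum_dtheta2 += float(np.sum((th - tp) ** 2))
    c_hat = np.sqrt(accum_dG2 / (accum_dtheta2 + eps))
    return min(1.0, chi / (eta * (c_hat + eps)))  # single global alpha_t
```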

Layerwise versions are possible later:

$$\widehat{C}_{\ell,t} = \frac{\|G_{\ell,t} - G_{\ell,t-1}\|}{\|\theta_{\ell,t} - \theta_{\ell,t-1}\| + \epsilon}.$$

But the first version should be global because global scaling preserves update geometry.

Key insight: The estimator can be accumulated layer by layer, but the control action is a single global scalar. That lets us keep hardware implementation simple and avoids layerwise distortion of the descent direction.