🎚️ The global throttle mechanism

Instead of projecting each row/column independently, we introduce a single global scalar:

$$0 < \alpha_t \leq 1.$$

The update becomes:

$$\boxed{\,\theta_{t+1} = \theta_t - \alpha_t \eta\, G_t\,}$$

where:

$$G_t = \nabla_\theta \mathcal{L}(\theta_t).$$

This is important because the update remains parallel to the original gradient update.

Raw update:

$$\Delta\theta_t^{\text{raw}} = -\eta\, G_t.$$

Controlled update:

$$\Delta\theta_t^{\text{ctrl}} = -\alpha_t \eta\, G_t.$$

Therefore:

$$\Delta\theta_t^{\text{ctrl}} = \alpha_t\, \Delta\theta_t^{\text{raw}}.$$

So:

$$\cos\left(\Delta\theta_t^{\text{ctrl}},\, \Delta\theta_t^{\text{raw}}\right) = 1.$$

That means global throttling preserves the direction of learning. It only changes the speed.
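As a quick numerical sanity check, here is a minimal NumPy sketch (the random vector standing in for $G_t$ and the constants are arbitrary) showing that the throttled update is a positive rescaling of the raw one, so their cosine similarity is exactly 1:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flat vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
G_t = rng.standard_normal(1_000)    # stand-in for the gradient G_t
eta, alpha_t = 0.1, 0.25            # base learning rate and throttle in (0, 1]

raw_update = -eta * G_t             # Δθ_raw  = -η G_t
ctrl_update = -alpha_t * eta * G_t  # Δθ_ctrl = -α_t η G_t

print(cosine(ctrl_update, raw_update))  # 1.0 (up to float rounding)
```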

This is the key contrast:

$$\text{row/column } \kappa \text{ projection} \;\Rightarrow\; \text{may rotate the update}$$

$$\text{global throttle} \;\Rightarrow\; \text{preserves the update direction}$$

Estimation of local curvature/sensitivity

Computing the actual Hessian is computationally expensive and often infeasible, so the controller chooses $\alpha_t$ using an online curvature estimate.

The gradient is:

$$G(\theta) = \nabla_\theta \mathcal{L}(\theta).$$

The Hessian is the derivative of the gradient:

$$H(\theta) = \frac{\partial G}{\partial \theta} = \nabla_\theta^2 \mathcal{L}(\theta).$$

For a small change

$$\Delta\theta_t = \theta_t - \theta_{t-1},$$

a first-order Taylor expansion of the gradient gives:

$$G_t - G_{t-1} \approx H_t(\theta_t - \theta_{t-1}).$$

So:

$$\Delta G_t \approx H_t\, \Delta\theta_t.$$
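As a concrete check: for a quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^\top H \theta$ with constant $H$, the gradient is $G(\theta) = H\theta$, so $\Delta G_t = H\,\Delta\theta_t$ holds exactly rather than approximately.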

Taking norms:

$$\|\Delta G_t\| \approx \|H_t\, \Delta\theta_t\|.$$

Dividing both sides by $\|\Delta\theta_t\|$:

$$\frac{\|\Delta G_t\|}{\|\Delta\theta_t\|} \approx \frac{\|H_t\, \Delta\theta_t\|}{\|\Delta\theta_t\|}.$$

This is an observed directional curvature estimate.

So we define:

$$\boxed{\,\widehat{C}_t = \frac{\|G_t - G_{t-1}\|_2}{\|\theta_t - \theta_{t-1}\|_2 + \epsilon}\,}$$

This does not always equal $\lambda_{\max}(H_t)$; more precisely, it estimates the curvature along the direction the optimizer just moved.
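A minimal sketch of the estimator in NumPy, assuming flattened parameter and gradient vectors from two consecutive steps (the function name is illustrative). On a quadratic loss the estimate reduces exactly to $\|H\,\Delta\theta\| / \|\Delta\theta\|$:

```python
import numpy as np

def curvature_estimate(theta_t, theta_prev, g_t, g_prev, eps=1e-12):
    """C_hat_t = ||G_t - G_{t-1}|| / (||theta_t - theta_{t-1}|| + eps)."""
    dG = np.linalg.norm(g_t - g_prev)
    dtheta = np.linalg.norm(theta_t - theta_prev)
    return dG / (dtheta + eps)

# Quadratic sanity check: L = 0.5 θᵀHθ, so G = Hθ and ΔG = HΔθ exactly.
H = np.diag([10.0, 1.0])
theta_prev = np.array([1.0, 1.0])
theta_t = np.array([0.9, 0.99])
print(curvature_estimate(theta_t, theta_prev, H @ theta_t, H @ theta_prev))
# ≈ 9.95: the curvature along the direction just moved, close to λ_max = 10
```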

Smoothing and control

To avoid reacting to noise, we can smooth the curvature estimate with an exponential moving average:

$$S_t = (1-\rho)\, S_{t-1} + \rho\, \widehat{C}_t.$$

Then we use the more conservative of the instantaneous and smoothed estimates:

$$C_t^{\text{ctrl}} = \max(\widehat{C}_t,\, S_t).$$

Finally:

$$\boxed{\,\alpha_t = \min\left(1,\; \frac{\chi}{\eta\,(C_t^{\text{ctrl}} + \epsilon)}\right)\,}$$

where:

  • $\chi$ is the desired stability margin,
  • $\eta$ is the base learning rate,
  • $C_t^{\text{ctrl}}$ is the curvature/sensitivity estimate,
  • $\epsilon$ avoids division by zero.

The controller attempts to enforce:

$$\boxed{\,\alpha_t \eta\, C_t^{\text{ctrl}} \leq \chi.\,}$$
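Putting the estimator, the EMA, and the throttle together, here is a minimal sketch of the controller in NumPy. The class name, the default $\chi$ and $\rho$, and the first-step choice $\alpha_1 = 1$ are illustrative assumptions, not prescribed by the derivation:

```python
import numpy as np

class GlobalThrottle:
    """One global scalar: alpha_t = min(1, chi / (eta * (C_ctrl + eps)))."""

    def __init__(self, eta, chi=0.5, rho=0.1, eps=1e-12):
        self.eta, self.chi, self.rho, self.eps = eta, chi, rho, eps
        self.S = 0.0              # EMA of the curvature estimate
        self.prev_theta = None
        self.prev_grad = None

    def alpha(self, theta, grad):
        """Compute the throttle from flattened parameters and gradients."""
        if self.prev_theta is None:      # no history yet: do not throttle
            self.prev_theta, self.prev_grad = theta.copy(), grad.copy()
            return 1.0
        c_hat = (np.linalg.norm(grad - self.prev_grad)
                 / (np.linalg.norm(theta - self.prev_theta) + self.eps))
        self.S = (1 - self.rho) * self.S + self.rho * c_hat  # smooth
        c_ctrl = max(c_hat, self.S)                          # conservative
        self.prev_theta, self.prev_grad = theta.copy(), grad.copy()
        return min(1.0, self.chi / (self.eta * (c_ctrl + self.eps)))
```

A training step would then apply `theta -= throttle.alpha(theta, grad) * eta * grad`, keeping the descent direction intact while capping the effective step size.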

Caveat and summary

This estimate is not perfect. It is accurate for convex problems and simple models such as a single dense layer, but it can be inaccurate for non-convex losses and more complex architectures. However, it is a simple and computationally cheap way to gauge local curvature/sensitivity, which is exactly what we need to adapt the learning rate and prevent divergence.

Key insight: The global throttle mechanism preserves the geometry of the original gradient update, but it adaptively reduces the learning rate when the gradient field becomes stiff/sharp. This allows the model to keep learning without diverging, even under distribution shift.

What happens for nonlinear / nonconvex networks?

For deeper networks (ReLU activations, CNNs, cross-entropy losses, and so on), the loss is not globally quadratic.

For a general network

$$\hat{y} = f(x; \theta)$$

with loss $\mathcal{L}(\theta)$, we use a local Taylor expansion:

$$\mathcal{L}(\theta + \delta) \approx \mathcal{L}(\theta) + G(\theta)^\top \delta + \frac{1}{2}\, \delta^\top H(\theta)\, \delta.$$

The gradient descent map is:

$$\theta_{t+1} = \theta_t - \eta\, G(\theta_t).$$

Linearizing the dynamics of a small perturbation $\delta_t$ around this trajectory:

$$\delta_{t+1} \approx (I - \eta H_t)\, \delta_t.$$

So the same stability idea applies locally.
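A small numerical illustration of the local stability condition, assuming a fixed diagonal $H_t$ over a few steps: perturbations shrink when $\eta\,\lambda_{\max} < 2$ and grow geometrically otherwise.

```python
import numpy as np

# Perturbation dynamics δ_{t+1} ≈ (I - ηH) δ_t around a local point.
H = np.diag([10.0, 1.0])       # local Hessian, λ_max = 10
delta0 = np.array([1.0, 1.0])

for eta in (0.05, 0.25):       # η λ_max = 0.5 (stable) vs 2.5 (unstable)
    d = delta0.copy()
    for _ in range(20):
        d = (np.eye(2) - eta * H) @ d
    print(f"eta={eta}: |delta| after 20 steps = {np.linalg.norm(d):.3g}")
# eta=0.05 -> ~0.36 (perturbation decays); eta=0.25 -> ~3.3e3 (diverges)
```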

However:

  • $H_t$ changes over time,
  • $H_t$ may be indefinite,
  • ReLU networks are piecewise smooth,
  • mini-batch gradients include sampling noise,
  • quantization introduces non-smooth perturbations.

Therefore, for general networks we should not claim that

$$\widehat{C}_t = \lambda_{\max}(H_t)$$

holds exactly.

Instead, we claim that $\widehat{C}_t$ is an online estimate of local update-field sensitivity:

$$\widehat{C}_t \approx \frac{\|H_t\, \Delta\theta_t\|}{\|\Delta\theta_t\|}.$$

This is a directional curvature estimate.

For a general loss, we want to keep:

$$\alpha_t\, \eta\, L_t < \chi,$$

where $L_t$ is a local estimate of the Lipschitz constant of the gradient field.

The controller uses:

$$L_t \approx C_t^{\text{ctrl}}.$$

So:

$$\boxed{\,\alpha_t \eta\, C_t^{\text{ctrl}} \leq \chi.\,}$$

Key insight: For the linear case, this matches the true Hessian stability condition. For general networks, it becomes a local adaptive control rule.

Implementation

The throttle can be computed layer by layer, but the final controller is global. For each layer $\ell$, we have parameters $\theta_{\ell,t}$ and gradients $G_{\ell,t}$.

We can compute local contributions:

$$A_{\ell,t} = \|G_{\ell,t} - G_{\ell,t-1}\|_2^2, \qquad B_{\ell,t} = \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2.$$

Then aggregate globally:

$$A_t = \sum_{\ell=1}^{L} A_{\ell,t}, \qquad B_t = \sum_{\ell=1}^{L} B_{\ell,t}.$$

Then:

$$\boxed{\,\widehat{C}_t = \sqrt{\frac{A_t}{B_t + \epsilon}} = \sqrt{\frac{\sum_{\ell=1}^{L} \|G_{\ell,t} - G_{\ell,t-1}\|_2^2}{\sum_{\ell=1}^{L} \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2 + \epsilon}}\,}$$

This is equivalent (up to the placement of $\epsilon$) to flattening all layers into one vector:

$$\theta_t = \operatorname{vec}(\theta_{1,t}, \dots, \theta_{L,t}), \qquad G_t = \operatorname{vec}(G_{1,t}, \dots, G_{L,t}),$$

and computing:

$$\widehat{C}_t = \frac{\|G_t - G_{t-1}\|_2}{\|\theta_t - \theta_{t-1}\|_2 + \epsilon}.$$
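A quick numerical check of the equivalence (taking $\epsilon = 0$ so the two placements of $\epsilon$ coincide), with two hypothetical toy layers:

```python
import numpy as np

rng = np.random.default_rng(1)
thetas = [rng.standard_normal((4, 3)), rng.standard_normal(5)]
thetas_prev = [th + 0.01 * rng.standard_normal(th.shape) for th in thetas]
grads = [rng.standard_normal(th.shape) for th in thetas]
grads_prev = [g + 0.1 * rng.standard_normal(g.shape) for g in grads]

# Layerwise accumulation of squared norms.
A = sum(np.sum((g - gp) ** 2) for g, gp in zip(grads, grads_prev))
B = sum(np.sum((th - tp) ** 2) for th, tp in zip(thetas, thetas_prev))

# Flattened computation.
def flat(xs):
    return np.concatenate([x.ravel() for x in xs])

c_flat = (np.linalg.norm(flat(grads) - flat(grads_prev))
          / np.linalg.norm(flat(thetas) - flat(thetas_prev)))

print(np.isclose(np.sqrt(A / B), c_flat))  # True
```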

Hardware-wise, this can be implemented as streaming reductions:

$$\text{accum\_dG2} \leftarrow \text{accum\_dG2} + \|G_{\ell,t} - G_{\ell,t-1}\|_2^2,$$

$$\text{accum\_dtheta2} \leftarrow \text{accum\_dtheta2} + \|\theta_{\ell,t} - \theta_{\ell,t-1}\|_2^2.$$

At the end, compute one global scalar $\widehat{C}_t$, then broadcast a single global $\alpha_t$ to every layer's update.

So we do not multiply per-layer gains at the top level for this controller. The controller is based on the global sensitivity of the optimizer trajectory.
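A sketch of the streaming form, assuming per-layer parameter and gradient arrays (the function name and the default $\chi$ are illustrative; the EMA smoothing from earlier is omitted for brevity):

```python
import numpy as np

def throttle_step(thetas, thetas_prev, grads, grads_prev,
                  eta, chi=0.5, eps=1e-12):
    """One pass over layers -> one global C_hat -> one broadcast alpha."""
    accum_dG2 = 0.0      # running sum of ||G_{l,t} - G_{l,t-1}||^2
    accum_dtheta2 = 0.0  # running sum of ||θ_{l,t} - θ_{l,t-1}||^2
    for g, gp, th, tp in zip(grads, grads_prev, thetas, thetas_prev):
        accum_dG2 += float(np.sum((g - gp) ** 2))
        accum_dtheta2 += float(np.sum((th - tp) ** 2))
    c_hat = np.sqrt(accum_dG2 / (accum_dtheta2 + eps))
    return min(1.0, chi / (eta * (c_hat + eps)))  # single global alpha_t
```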

Layerwise versions are possible later:

$$\widehat{C}_{\ell,t} = \frac{\|G_{\ell,t} - G_{\ell,t-1}\|}{\|\theta_{\ell,t} - \theta_{\ell,t-1}\| + \epsilon}.$$

But the first version should be global because global scaling preserves update geometry.

Key insight: The estimator can be accumulated layer by layer, but the control action is a single global scalar. That lets us keep hardware implementation simple and avoids layerwise distortion of the descent direction.