Computing the exact Hessian is computationally expensive and sometimes infeasible. Thus, the controller chooses $\alpha_t$ using an online curvature estimate.
The gradient is:
$$G(\theta) = \nabla_\theta L(\theta).$$
The Hessian is the derivative of the gradient:
$$H(\theta) = \frac{\partial G}{\partial \theta} = \nabla_\theta^2 L(\theta).$$
For a small change:
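$$\Delta\theta_t = \theta_t - \theta_{t-1},$$
a first-order Taylor expansion of the gradient gives:
$$G_t - G_{t-1} \approx H_t(\theta_t - \theta_{t-1}).$$
So, writing $\Delta G_t = G_t - G_{t-1}$:
$$\Delta G_t \approx H_t \Delta\theta_t.$$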
Taking norms:
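$$\|\Delta G_t\| \approx \|H_t \Delta\theta_t\|.$$
Dividing both sides by $\|\Delta\theta_t\|$ gives:
$$\frac{\|\Delta G_t\|}{\|\Delta\theta_t\|} \approx \frac{\|H_t \Delta\theta_t\|}{\|\Delta\theta_t\|}.$$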
This is an observed directional curvature estimate.
So we define:
$$C_t = \frac{\|G_t - G_{t-1}\|_2}{\|\theta_t - \theta_{t-1}\|_2 + \epsilon}.$$
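This does not always equal $\lambda_{\max}(H_t)$. More precisely, it estimates the curvature along the direction the optimizer just moved: to first order it is at most the largest eigenvalue magnitude of $H_t$, and it reaches that value when the step aligns with the corresponding eigenvector.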
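To make the directional nature of the estimate concrete, here is a minimal sketch on a toy quadratic whose Hessian is known exactly. The function names (`curvature_estimate`, `estimate_along`), the diagonal Hessian `H_DIAG`, and the value of `eps` are assumptions chosen for illustration, not part of the controller described here.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def curvature_estimate(theta, theta_prev, grad, grad_prev, eps=1e-12):
    """C_t = ||G_t - G_{t-1}||_2 / (||theta_t - theta_{t-1}||_2 + eps)."""
    d_theta = [a - b for a, b in zip(theta, theta_prev)]
    d_grad = [a - b for a, b in zip(grad, grad_prev)]
    return norm(d_grad) / (norm(d_theta) + eps)

# Toy loss L(theta) = 0.5 * sum(h_i * theta_i^2).
# Its Hessian is diag(H_DIAG), so the eigenvalues are 10 and 1 (assumed values).
H_DIAG = [10.0, 1.0]

def grad(theta):
    """Exact gradient of the toy quadratic."""
    return [h * x for h, x in zip(H_DIAG, theta)]

def estimate_along(step):
    """Take one step from a fixed point and return the resulting C_t."""
    theta_prev = [1.0, 1.0]
    theta = [a + s for a, s in zip(theta_prev, step)]
    return curvature_estimate(theta, theta_prev, grad(theta), grad(theta_prev))

c_sharp = estimate_along([0.1, 0.0])  # step along the sharp eigenvector
c_flat = estimate_along([0.0, 0.1])   # step along the flat eigenvector
c_mixed = estimate_along([0.1, 0.1])  # step along a mixed direction

print(c_sharp, c_flat, c_mixed)
```

In this toy case the estimate recovers the largest eigenvalue only when the step happens to point along the sharp direction; along the flat direction it reports the small eigenvalue, and a mixed step lands in between.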
This estimate is not perfect. It is accurate for convex problems and for simple architectures such as a single dense layer, but it can be inaccurate for non-convex problems and more complex architectures. However, it is a simple and computationally cheap way to gauge the local curvature/sensitivity, which is what we need to adapt the learning rate and prevent divergence.
Key insight: The global throttle mechanism preserves the geometry of the original gradient update, but it adaptively reduces the learning rate when the gradient field becomes stiff/sharp. This allows the model to keep learning without diverging, even under distribution shift.
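The text does not specify how the learning rate is derived from the curvature estimate, so the following is only a sketch under an assumed rule: cap the learning rate at `kappa / C_t`. The rule, the names `throttled_lr` and `run`, and the constants are all hypothetical. The example relies on one standard fact: on a 1-D quadratic with curvature `h`, plain gradient descent diverges when the learning rate times `h` exceeds 2.

```python
def throttled_lr(c_t, base_lr, kappa=1.0):
    """Hypothetical throttle: alpha_t = min(base_lr, kappa / C_t)."""
    if c_t <= 0.0:
        return base_lr  # no curvature signal, keep the base rate
    return min(base_lr, kappa / c_t)

def run(h, base_lr, steps, throttle):
    """Gradient descent on L(theta) = 0.5 * h * theta^2; returns final |theta|."""
    theta_prev, g_prev = None, None
    theta = 1.0
    for _ in range(steps):
        g = h * theta
        lr = base_lr
        if throttle and theta_prev is not None:
            # Scalar version of C_t = |G_t - G_{t-1}| / (|theta_t - theta_{t-1}| + eps)
            c_t = abs(g - g_prev) / (abs(theta - theta_prev) + 1e-12)
            lr = throttled_lr(c_t, base_lr)
        theta_prev, g_prev = theta, g
        theta = theta - lr * g
    return abs(theta)

# base_lr * h = 5 > 2, so the unthrottled run is unstable.
plain = run(h=50.0, base_lr=0.1, steps=20, throttle=False)
safe = run(h=50.0, base_lr=0.1, steps=20, throttle=True)

print(plain, safe)
```

The update direction is never changed here, only its length: the unthrottled run blows up, while the throttled run stays close to the minimum. A real controller would likely smooth $C_t$ over time; that is left out to keep the sketch short.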
Implementation-wise, the estimate can be accumulated layer by layer, but the final controller is global. For each layer $\ell$, we have parameters $\theta_{\ell,t}$ and gradients $G_{\ell,t}$.
At the end, compute one global scalar $C_t$, then broadcast one global learning rate $\alpha_t$ to every layer.
So we do not multiply per-layer gains at the top level for this controller. The controller is based on the global sensitivity of the optimizer trajectory.
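One way to read "accumulate per layer, act globally" is sketched below: each layer contributes only its squared difference norms, the square roots are taken once at the end, and a single scalar is applied to every layer. The tuple layout, the function names, and the placeholder value of `alpha_t` are assumptions for illustration; how $\alpha_t$ is chosen from $C_t$ is deliberately left out.

```python
import math

def global_curvature(layers, eps=1e-12):
    """Form one global C_t from per-layer contributions.

    layers: list of (theta, theta_prev, grad, grad_prev) tuples, one per layer,
    each entry a list of floats (hypothetical layout).
    """
    d_theta_sq = 0.0
    d_grad_sq = 0.0
    for theta, theta_prev, grad, grad_prev in layers:
        # Each layer only adds its squared norms to two running sums.
        d_theta_sq += sum((a - b) ** 2 for a, b in zip(theta, theta_prev))
        d_grad_sq += sum((a - b) ** 2 for a, b in zip(grad, grad_prev))
    return math.sqrt(d_grad_sq) / (math.sqrt(d_theta_sq) + eps)

def broadcast_update(layers, alpha_t):
    """Apply the same scalar alpha_t to every layer's gradient step."""
    return [[x - alpha_t * g for x, g in zip(theta, grad)]
            for theta, _, grad, _ in layers]

layers = [
    ([1.0, 2.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.0]),  # layer 1
    ([0.5], [0.5], [4.0], [0.0]),                       # layer 2
]

c_t = global_curvature(layers)
new_params = broadcast_update(layers, alpha_t=0.1)  # alpha_t is a placeholder value

print(c_t, new_params)
```

Because the squared norms of disjoint parameter blocks add up to the squared norm of the full vector, this gives the same $C_t$ as concatenating all layers first, while only two running sums need to be kept.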
Layerwise versions are possible later:
$$C_{\ell,t} = \frac{\|G_{\ell,t} - G_{\ell,t-1}\|}{\|\theta_{\ell,t} - \theta_{\ell,t-1}\| + \epsilon}.$$
But the first version should be global because global scaling preserves update geometry.
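The geometry claim can be checked numerically with a small sketch: scaling the whole gradient by one scalar leaves its direction unchanged, while per-layer gains rotate it. The gradient values and the gains below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

grads = [[3.0, 0.0], [4.0]]                    # two layers (made-up values)
flat = [g for layer in grads for g in layer]   # full gradient vector

# Global throttle: one scalar for all layers.
global_scaled = [0.25 * g for g in flat]

# Layerwise gains: a different scalar per layer (hypothetical gains).
layer_gains = [1.0, 0.1]
layer_scaled = [gain * g
                for gain, layer in zip(layer_gains, grads)
                for g in layer]

cos_global = cosine(flat, global_scaled)
cos_layer = cosine(flat, layer_scaled)

print(cos_global, cos_layer)
```

The globally scaled update keeps a cosine similarity of 1 with the original gradient, whereas the layerwise-scaled update points in a noticeably different direction.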
Key insight: The estimator can be accumulated layer by layer, but the control action is a single global scalar. That lets us keep hardware implementation simple and avoids layerwise distortion of the descent direction.