
❌ Why κ row/col failed

In our initial experiments, we found that we had to disable the $\kappa$-row projection (the RowScale block) to get any meaningful learning. This is because the original $\kappa$ mechanism was trying to enforce local gain constraints on each layer, row, or column. A simplified version is:

$$W_\ell \leftarrow \Pi_{\kappa_\ell}(W_\ell),$$

where:

  • $W_\ell$ is the weight matrix of layer $\ell$,
  • $\kappa_\ell$ is the allowed gain/norm budget for that layer,
  • $\Pi_{\kappa_\ell}$ is a projection/rescaling operator that pushes $W_\ell$ back into the allowed range (see the sketch after this list).
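
As a concrete illustration, here is a minimal NumPy sketch of one possible $\Pi_{\kappa_\ell}$, assuming the budget caps the layer's Frobenius norm; the actual operator may constrain a different norm (spectral, per-row gain), so treat this as a stand-in rather than the real implementation:

```python
import numpy as np

def project_layer(W: np.ndarray, kappa: float) -> np.ndarray:
    """Rescale W back into the ball ||W||_F <= kappa.

    A stand-in for Pi_{kappa_l}; the real operator may act on a
    different norm (spectral, per-row gain) instead of Frobenius.
    """
    norm = np.linalg.norm(W)        # Frobenius norm of the layer
    if norm > kappa:
        W = W * (kappa / norm)      # uniform rescale onto the boundary
    return W
```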

For row scaling, this looks roughly like:

$$W_{\ell,i:} \leftarrow s_{\ell,i}\, W_{\ell,i:},$$

where $W_{\ell,i:}$ is row $i$ of layer $\ell$, and $s_{\ell,i}$ is a row-dependent scale factor.

For column scaling:

$$W_{\ell,:j} \leftarrow s_{\ell,j}\, W_{\ell,:j}.$$
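
A hedged sketch of both variants, assuming the scale factor is chosen as $s = \min(1, \kappa_\ell / \|\cdot\|_2)$ so that a row or column is rescaled only when it exceeds the budget (the real scale rule may differ):

```python
import numpy as np

def kappa_row_scale(W: np.ndarray, kappa: float, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of W so its l2 norm is at most kappa."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)   # shape (rows, 1)
    s = np.minimum(1.0, kappa / (row_norms + eps))         # per-row s_{l,i}
    return s * W

def kappa_col_scale(W: np.ndarray, kappa: float, eps: float = 1e-12) -> np.ndarray:
    """Scale each column of W so its l2 norm is at most kappa."""
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)   # shape (1, cols)
    s = np.minimum(1.0, kappa / (col_norms + eps))         # per-column s_{l,j}
    return s * W
```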

The problem is that online learning is not just about keeping the weights inside a norm box; it is about the direction of the update in parameter space.

If the unconstrained gradient update is:

$$\theta_{t+1}^{\text{raw}} = \theta_t - \eta G_t,$$

where:

  • $\theta_t$ is the flattened vector of all trainable parameters,
  • $G_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient,
  • $\eta$ is the learning rate,

then the ideal update direction is:

$$\Delta\theta_t^{\text{raw}} = -\eta G_t.$$

But with row/column $\kappa$ projection, the actual update becomes:

$$\theta_{t+1}^{\text{actual}} = \Pi_\kappa(\theta_t - \eta G_t),$$

so the actual applied update is:

$$\Delta\theta_t^{\text{actual}} = \Pi_\kappa(\theta_t - \eta G_t) - \theta_t.$$

In general, this is not parallel to the ideal descent direction $-\eta G_t$.
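
To make the comparison concrete, here is a minimal sketch, assuming `project` is any flattened-parameter implementation of $\Pi_\kappa$ (for instance, the hypothetical row/column rescalers above applied layer by layer and re-flattened):

```python
def raw_and_actual_updates(theta, grad, lr, project):
    """Return (delta_raw, delta_actual) for one SGD step.

    theta, grad : flattened parameter and gradient vectors (arrays)
    project     : callable implementing Pi_kappa on the flat vector
    """
    delta_raw = -lr * grad                              # ideal descent step
    delta_actual = project(theta - lr * grad) - theta   # step after projection
    return delta_raw, delta_actual
```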

The diagnostic is the cosine similarity:

$$
\cos_t = \frac{\left\langle \Delta\theta_t^{\text{actual}},\, \Delta\theta_t^{\text{raw}} \right\rangle}{\|\Delta\theta_t^{\text{actual}}\|_2 \, \|\Delta\theta_t^{\text{raw}}\|_2 + \epsilon}.
$$

Interpretation:

$$
\begin{cases}
\text{projection preserved the intended learning direction} & \text{if } \cos_t \approx 1 \\
\text{projection made the update nearly orthogonal to the gradient} & \text{if } \cos_t \approx 0 \\
\text{projection is actively fighting the descent direction} & \text{if } \cos_t < 0
\end{cases}
$$
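
A minimal sketch of the diagnostic itself, following the formula above:

```python
import numpy as np

def update_cosine(delta_actual: np.ndarray, delta_raw: np.ndarray,
                  eps: float = 1e-12) -> float:
    """Cosine similarity between the applied and the ideal update."""
    num = float(np.dot(delta_actual, delta_raw))
    den = float(np.linalg.norm(delta_actual) * np.linalg.norm(delta_raw)) + eps
    return num / den
```

Logged per step, this makes the failure mode visible: values near 1 mean the projection is benign, while values that sit near 0 or go negative explain a stalled or regressing loss.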

Summary box

Key insight: The row/column $\kappa$ projection was enforcing local numerical constraints, but the projection operator was not geometry-preserving. It could rotate the effective update in parameter space. For inference-time gain control that may be acceptable, but during online training it can interfere with the optimizer's descent direction.