In our initial experiments, we found that we had to disable the κ-row projection (the RowScale{} block) to get any meaningful learning. The reason is that the original κ mechanism enforced local gain constraints on each layer, row, or column. A simplified version of the per-layer constraint is:
$$W_\ell \leftarrow \Pi_{\kappa_\ell}(W_\ell),$$
where:
$W_\ell$ is the weight matrix of layer $\ell$,
$\kappa_\ell$ is the allowed gain/norm budget for that layer,
$\Pi_{\kappa_\ell}$ is some projection/rescaling operator that pushes $W_\ell$ back into the allowed range.
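To make $\Pi_{\kappa_\ell}$ concrete, here is a minimal sketch that reads $\kappa_\ell$ as a Frobenius-norm budget and uniformly rescales a layer's weight matrix back into that budget. The norm choice and the name `project_layer` are assumptions for illustration, not the actual RowScale{} implementation.

```python
import numpy as np

def project_layer(W, kappa):
    """Pi_kappa: push W back into an assumed Frobenius-norm budget kappa."""
    norm = np.linalg.norm(W)          # ||W||_F
    if norm <= kappa:
        return W                      # already inside the allowed range
    return W * (kappa / norm)         # uniform rescale onto the budget boundary
```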
For row scaling, this looks roughly like:
$$W_{\ell,i:} \leftarrow s_{\ell,i}\, W_{\ell,i:},$$
where $W_{\ell,i:}$ is row $i$ of layer $\ell$, and $s_{\ell,i}$ is a row-dependent scale factor.
For column scaling:
$$W_{\ell,:j} \leftarrow s_{\ell,j}\, W_{\ell,:j}.$$
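A minimal sketch of both variants, assuming the scale factors $s_{\ell,i}$ and $s_{\ell,j}$ come from clipping per-row and per-column L2 norms to κ (the norm choice and function names are illustrative, not the original implementation):

```python
import numpy as np

def project_rows(W, kappa):
    """Rescale each row of W so its L2 norm is at most kappa."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)            # shape (rows, 1)
    scales = np.minimum(1.0, kappa / np.maximum(row_norms, 1e-12))  # s_{l,i}
    return W * scales

def project_cols(W, kappa):
    """Rescale each column of W so its L2 norm is at most kappa."""
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)            # shape (1, cols)
    scales = np.minimum(1.0, kappa / np.maximum(col_norms, 1e-12))  # s_{l,j}
    return W * scales
```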
The problem is that online learning is not just about whether the weights sit inside a norm box; it is about the direction of the update in parameter space.
If the unconstrained gradient update is:
$$\theta_{t+1}^{\text{raw}} = \theta_t - \eta\, G_t,$$
where:
$\theta_t$ is the flattened vector of all trainable parameters,
$G_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient,
$\eta$ is the learning rate,
then the ideal update direction is:
$$\Delta\theta_t^{\text{raw}} = -\eta\, G_t.$$
But with row/column κ projection, the actual update becomes:
$$\theta_{t+1}^{\text{actual}} = \Pi_\kappa(\theta_t - \eta\, G_t),$$
so the actual applied update is:
$$\Delta\theta_t^{\text{actual}} = \Pi_\kappa(\theta_t - \eta\, G_t) - \theta_t.$$
This is generally not parallel to the gradient direction. A simple diagnostic is the cosine alignment between the actual and intended updates,

$$\cos_t = \frac{\langle \Delta\theta_t^{\text{actual}},\, \Delta\theta_t^{\text{raw}} \rangle}{\|\Delta\theta_t^{\text{actual}}\|\,\|\Delta\theta_t^{\text{raw}}\|},$$

which can be read as:

$$\begin{cases} \text{projection preserved the intended learning direction} & \text{if } \cos_t \approx 1,\\ \text{projection made the update nearly orthogonal to the gradient} & \text{if } \cos_t \approx 0,\\ \text{projection is actively fighting the descent direction} & \text{if } \cos_t < 0. \end{cases}$$
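As a rough sketch of how one might measure this in practice (the function name `update_alignment` and the flattened-parameter `project(theta, kappa)` interface are assumptions for illustration):

```python
import numpy as np

def update_alignment(theta, grad, lr, kappa, project):
    """Compare the raw SGD step with the step actually applied after projection."""
    delta_raw = -lr * grad                             # intended update, -eta * G_t
    theta_next = project(theta + delta_raw, kappa)     # Pi_kappa(theta_t - eta * G_t)
    delta_actual = theta_next - theta                  # update actually applied
    cos_t = np.dot(delta_actual, delta_raw) / (
        np.linalg.norm(delta_actual) * np.linalg.norm(delta_raw) + 1e-12
    )
    return delta_actual, cos_t

# Example with a global L2-norm clip standing in for Pi_kappa:
clip = lambda v, k: v if np.linalg.norm(v) <= k else v * (k / np.linalg.norm(v))
theta = np.random.randn(8)
grad = np.random.randn(8)
delta_actual, cos_t = update_alignment(theta, grad, lr=0.1, kappa=1.0, project=clip)
```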
Key insight: The row/column κ projection was enforcing local numerical constraints, but the projection operator was not geometry-preserving. It could rotate the effective update in parameter space. For inference-time gain control that may be acceptable, but during online training it can interfere with the optimizer’s descent direction.
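One way to act on this, in line with the workaround above (disabling the projection while learning), is to gate the gain control on a training flag so it only runs at inference time. This is a sketch under assumed names (`apply_gain_control`, a Frobenius-norm budget), not the original mechanism:

```python
import numpy as np

def apply_gain_control(W, kappa, training):
    """Skip the kappa projection during online training; apply it at inference."""
    if training:
        return W                      # leave the optimizer's update direction alone
    norm = np.linalg.norm(W)          # assumed Frobenius-norm gain measure
    return W if norm <= kappa else W * (kappa / norm)
```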