Closed-Loop Ablation Architecture

The ablation harness should mirror the ENABOL online training loop, but remain small enough to inspect every tensor and compute exact curvature diagnostics. The first implementation target is software simulation, not HLS synthesis.

Training Loop

Each experiment follows the same high-level flow:

1. Generate a controlled dataset.
2. Train a small floating-point or high-precision reference model.
3. Quantize or simulate fixed-point training with selected precisions.
4. Apply input drift.
5. Continue online training with one controller variant enabled.
6. Log loss, norms, curvature proxies, throttle, update geometry, saturation, and rails.

The online loop should operate on a flattened global parameter vector:

theta = flatten(W1, b1, W2, b2, ...)
G = flatten(dL/dW1, dL/db1, dL/dW2, dL/db2, ...)

This makes global controllers easy to implement and lets us measure whether a method preserves the intended update direction.
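A minimal sketch of the flatten/unflatten helpers, assuming NumPy tensors (function names are illustrative, not part of the harness):

```python
import numpy as np

def flatten_params(tensors):
    """Flatten a list of parameter (or gradient) tensors into one global vector.

    Returns the vector plus the shapes needed to unflatten later.
    """
    shapes = [t.shape for t in tensors]
    vec = np.concatenate([t.ravel() for t in tensors])
    return vec, shapes

def unflatten_params(vec, shapes):
    """Split a global vector back into tensors of the recorded shapes."""
    tensors, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        tensors.append(vec[offset:offset + size].reshape(shape))
        offset += size
    return tensors
```

Gradients are flattened with the same shape list, so global controllers only ever see two aligned vectors, theta and G.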

Priority Controllers

Implement these first:

| Switch | Meaning |
| --- | --- |
| `controller=none` | Baseline online training. |
| `controller=dynamic_global_throttle` | Compute one scalar alpha(t) and scale the full update vector. |
| `controller=global_static_kappa_scale` | If global gain exceeds K_max, scale all layers by one shared scalar. |
| `controller=loose_kappa_plus_throttle` | Keep loose static rails and apply dynamic global throttle. |
| `precision` | Fixed-point format or simulated fixed-point rails. |

Legacy row/column kappa projection can be included later as controller=legacy_row_col_projection if it is already available or cheap to stub. It is not a first implementation requirement.

Dynamic Global Throttle

At each online step:

Algorithm 1: DynamicGlobalThrottle
  1. input current parameters θ(t), gradient G(t), learning rate η
  2. input previous parameters θ(t−1), previous gradient G(t−1)
  3. Δ_raw(t) ← −η G(t)
  4. C(t) ← ‖G(t) − G(t−1)‖ / (‖θ(t) − θ(t−1)‖ + ε)   (curvature proxy)
  5. S(t) ← EMA(C(t))
  6. α(t) ← clamp(1 / (1 + β S(t)), α_min, 1)
  7. Δ_actual(t) ← α(t) Δ_raw(t)
  8. θ(t+1) ← θ(t) + Δ_actual(t)
  9. return θ(t+1), α(t), C(t)

The scalar α(t) is shared globally across all layers.

Because α(t) is global, it preserves the raw update direction:

cos(Δ_actual, −G) ≈ 1

unless fixed-point saturation, projection, or another mechanism distorts the update.
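Algorithm 1 can be sketched in NumPy as a single step function. The hyperparameter values (η, β, EMA decay, α_min, ε) are illustrative defaults, not tuned:

```python
import numpy as np

def dynamic_global_throttle(theta, grad, prev_theta, prev_grad, state,
                            eta=0.01, beta=1.0, ema_decay=0.9,
                            alpha_min=0.05, eps=1e-8):
    """One step of Algorithm 1 on flattened global vectors.

    `state` is a dict carrying the EMA instability signal S(t) across steps.
    """
    delta_raw = -eta * grad
    # Curvature proxy: gradient change per unit parameter change.
    c = np.linalg.norm(grad - prev_grad) / (np.linalg.norm(theta - prev_theta) + eps)
    state["S"] = ema_decay * state.get("S", 0.0) + (1.0 - ema_decay) * c
    alpha = float(np.clip(1.0 / (1.0 + beta * state["S"]), alpha_min, 1.0))
    theta_next = theta + alpha * delta_raw
    return theta_next, alpha, c
```

Because α(t) multiplies the whole vector, the returned update is exactly collinear with −G(t), which is what the update-cosine log should confirm.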

Experiment 001: Single Dense Affine Regression

This is the minimum test case. It isolates closed-loop update stability without inter-layer interactions.

Math:

x ~ U([0, 1]^d)
y = A x + c
ŷ = W1 x + b1
L = mean((ŷ − y)^2)

Backpass:

g_y = ∂L/∂ŷ
g_W1 = x g_y^T
g_b1 = g_y
g_x = W1^T g_y

Drift:

x_drift = a x + b  (drift gain a and offset b; these are distinct from the throttle parameters α and β)

Primary question:

Can dynamic global throttling keep online fixed-point training stable in a known linear system where the exact solution and Hessian are easy to inspect?
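A minimal NumPy sketch of the Experiment 001 loop with no controller (the baseline to compare against). The dimensions, learning rate, drift constants, and drift step are illustrative choices; note the code stores W1 as (d_out, d_in), so the gradient g_W1 = x g_y^T above becomes np.outer(g_y, x) in this shape convention:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 2
A = rng.normal(size=(d_out, d_in))          # teacher weights
c = rng.normal(size=d_out)                  # teacher bias
W1 = np.zeros((d_out, d_in))                # student starts at zero
b1 = np.zeros(d_out)
eta = 0.05

for step in range(2000):
    x = rng.uniform(0.0, 1.0, size=d_in)
    if step >= 1000:                        # apply input drift halfway through
        x = 1.5 * x + 0.2
    y = A @ x + c                           # teacher target
    y_hat = W1 @ x + b1
    g_y = 2.0 * (y_hat - y) / d_out         # dL/dŷ for mean squared error
    W1 -= eta * np.outer(g_y, x)            # g_W1 in (d_out, d_in) convention
    b1 -= eta * g_y
```

Since the teacher mapping is unchanged by drift and the student is exactly realizable, floating-point SGD should converge; fixed-point variants of this loop are where the controller question becomes interesting.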

Experiment 002: Two Dense Layers With ReLU

This introduces an intermediate activation and an inter-layer gradient path while still staying small enough to inspect.

Teacher model:

y = A2 relu(A1 x + c1) + c2

Student model:

z1 = W1 x + b1
a1 = relu(z1)
ŷ = W2 a1 + b2

Backpass:

g_y = ∂L/∂ŷ
g_W2 = a1 g_y^T
g_b2 = g_y
g_a1 = W2^T g_y
g_z1 = g_a1 ⊙ 1[z1 > 0]
g_W1 = x g_z1^T
g_b1 = g_z1
g_x = W1^T g_z1
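The two-layer forward and backward passes can be sketched directly from these equations. Weights are stored as (out, in), so the outer products are written as np.outer(grad, input); the function name and MSE loss choice are assumptions consistent with Experiment 001:

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Forward pass, MSE loss, and the backpass from the equations above."""
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)               # relu
    y_hat = W2 @ a1 + b2
    loss = np.mean((y_hat - y) ** 2)
    g_y = 2.0 * (y_hat - y) / y.size       # dL/dŷ
    g_W2 = np.outer(g_y, a1)               # a1 g_y^T in (out, in) convention
    g_b2 = g_y
    g_a1 = W2.T @ g_y
    g_z1 = g_a1 * (z1 > 0)                 # relu gate 1[z1 > 0]
    g_W1 = np.outer(g_z1, x)
    g_b1 = g_z1
    return loss, (g_W1, g_b1, g_W2, g_b2)
```

A finite-difference check against one weight entry is a cheap way to validate the backpass before layering fixed-point simulation on top of it.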

Primary question:

When there is an intermediate activation, can global throttling stabilize coupled layer dynamics without changing descent geometry?

Comparison Variants

The first matrix should be small and should not require rebuilding the old row/column machinery:

| Variant | Required Now | Purpose |
| --- | --- | --- |
| Floating reference | yes | Establish expected behavior without fixed-point limits. |
| Fixed-point baseline | yes | Find regimes where online learning fails. |
| Dynamic global throttle | yes | Test closed-loop stabilization while preserving update geometry. |
| Loose kappa + throttle | yes | Test static safety rails plus dynamic control. |
| Global static kappa scale | yes | Test global gain control without row/layer direction changes. |
| Legacy row/column projection | optional | Compare against the old mechanism only if available or cheap to stub. |

The key comparison is baseline fixed-point versus dynamic global throttle. Legacy row/column projection is useful for diagnosing direction distortion, but it is secondary.

Required Logs

Each run should produce machine-readable logs and notebook plots for:

  • loss before and after drift,
  • output error before and after drift,
  • global and per-layer weight norms,
  • global and per-layer gradient norms,
  • global and per-layer update norms,
  • curvature proxy C(t),
  • EMA instability signal S(t),
  • global throttle alpha(t),
  • update cosine between actual update and -G,
  • activation min/max/percentiles per layer,
  • gradient min/max/percentiles per layer,
  • fixed-point saturation counts per tensor,
  • rail pressure fractions per tensor,
  • product gain or approximate forward gain,
  • optional Hessian norm, lambda_max(H), rho(I - eta H), and rho(I - alpha eta H).

The update cosine is important because it directly measures whether budgeting preserves descent direction:

cos(θ) = ⟨Δ_raw, Δ_budgeted⟩ / (‖Δ_raw‖₂ ‖Δ_budgeted‖₂)

Values near 1 mean budgeting mostly rescales the update. Lower or negative values mean budgeting has substantially changed the direction.

For dynamic global throttle alone, this value should stay near 1. If it does not, the fixed-point path or saturation logic is changing the update.
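The update-cosine log entry is a one-liner worth standardizing across runs. A sketch, with an example illustrating why a global rescale keeps the value at 1 while elementwise clipping (a stand-in for saturation) does not:

```python
import numpy as np

def update_cosine(delta_raw, delta_budgeted, eps=1e-12):
    """Cosine between the raw and budgeted global update vectors."""
    num = float(np.dot(delta_raw, delta_budgeted))
    den = np.linalg.norm(delta_raw) * np.linalg.norm(delta_budgeted) + eps
    return num / den

delta_raw = -0.1 * np.array([10.0, 0.1, 0.1])   # one dominant component
scaled = 0.3 * delta_raw                        # global throttle: direction preserved
clipped = np.clip(delta_raw, -0.05, 0.05)       # saturation: direction distorted
```

Logging this scalar every step makes it immediately visible when the fixed-point path, rather than the controller, is bending the update.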