The Training Inspect System
training monitor · signal detection · model diff · sae feature diff · gradient analysis · regression tracking · calibration


Aquin Labs · April 2026

Watching a model learn in real time

Fine-tuning a model is not a black box you submit a job to and collect a checkpoint from. It is a process with a structure: a loss trajectory that tells a story about optimization, gradient dynamics that reveal how information is flowing through the network, feature activations that shift as the model rewrites its internal representations to accommodate the training objective. Most of that structure goes unobserved. The loss curve is visible. Everything under it is not.

Aquin's training inspect system brings the structure to the surface. It receives a stream of step events from your training run (loss, learning rate, gradient norms, per-layer breakdown, dead layer list, epoch index) and runs a signal engine against each step in real time. When the engine detects a gradient spike, a loss plateau, a collapsed attention head, or the onset of divergence, a signal fires immediately with the specific metric that triggered it and the exact step at which it occurred.

The dashboard renders all of it live: a loss sparkline that updates on each event, a signals feed ordered by severity, a model diff panel when training completes, and a per-layer SAE feature diff that shows exactly which internal representations the fine-tune rewrote.

loss and gradient norm · 20-step window

[chart: loss curve with a plateau marker at 0.74 and max grad norm curve with a grad spike marker at 1.84, across steps s1–s96]

signal markers overlay each curve at the step they fired. plateau on loss, grad spike on grad norm.


What gets streamed from the run

The training inspect system is agnostic to training framework. It consumes a step event schema: step index, loss, learning rate, max grad norm, per-layer grad norms as a record, dead layer list, epoch index. Any training loop that can emit those fields via the Aquin Python SDK or a direct API call is compatible. The schema is intentionally flat: no nested objects, no optional deep structures.

The per-layer grad norm breakdown is what enables the dead layer and dead attention head detectors. Without it, the engine can only observe aggregate gradient behavior. With it, it can name the specific layer that has collapsed and track how long it has been dead. The layer naming convention follows whatever key names your training framework uses; the engine applies a regex for attention layers and treats everything else as non-attention.

step event schema

StepSnapshot
step · number · Step index
loss · number · Training loss at this step
learning_rate · number? · Current LR from scheduler
maxGrad · number? · Max gradient norm across all params
gradNorms · Record&lt;string, number&gt;? · Per-layer grad norms, enables dead layer detection
deadLayers · string[]? · Layers already over streak threshold
epoch · number? · Current epoch, used in plateau message

gradNorms is the key field. without per-layer breakdown, dead layer and attention head detection are unavailable.
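
In TypeScript terms, the schema is small enough to read in full. A sketch of the type and one illustrative event, with field names and types taken directly from the table above; the example values echo the figures that appear elsewhere in this piece, and the layer key names are an assumption, since the engine accepts whatever naming your framework uses.

```typescript
// Flat step event schema, mirroring the table above. No nested objects,
// no optional deep structures.
interface StepSnapshot {
  step: number;                        // step index
  loss: number;                        // training loss at this step
  learning_rate?: number;              // current LR from the scheduler
  maxGrad?: number;                    // max gradient norm across all params
  gradNorms?: Record<string, number>;  // per-layer grad norms
  deadLayers?: string[];               // layers already over the streak threshold
  epoch?: number;                      // current epoch, used in the plateau message
}

// An illustrative event as a training loop might emit it.
const snapshot: StepSnapshot = {
  step: 61,
  loss: 0.74,
  learning_rate: 2e-5,
  maxGrad: 1.84,
  gradNorms: { "layers.6.attn": 3.2e-7, "layers.6.mlp": 0.41 },
  epoch: 1,
};
```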

Loss · per step
Learning rate · scheduler
Grad norms · max + per layer
Weight norms · per layer
Dead layers · named list
Epoch · index

The signal engine

The signal engine is a pure function that runs on each new step snapshot. It takes the full history of steps plus two persistent streak maps (one for non-attention layers, one for attention layers) and returns a signal if one fired, or null. The streak maps are the only stateful part: they persist across steps so that dead layer detection can track how many consecutive steps a layer has had near-zero gradient. Everything else is computed fresh from the step history.

The engine runs five detectors in priority order. Divergence and gradient spikes are checked first because they indicate active instability that may warrant stopping the run. Dead attention heads and dead layers are checked next, naming specific failed components. LR decay and plateau are checked last because they often describe expected, healthy behavior rather than problems. The priority ladder ensures the most actionable finding surfaces first when multiple signals could fire on the same step.

A cooldown of thirty steps prevents the same signal type from re-emitting continuously. A plateau that persists does not flood the feed. A gradient spike that resolves and recurs will fire again after thirty steps, which is the right behavior; the second occurrence is a distinct event that warrants attention.

signal priority order · detector 01 checked first

01 · loss diverging · critical

Ten consecutive steps with monotonically increasing loss. The engine computes the raw rise across that window and flags critical when the delta exceeds 0.5. This is the earliest reliable sign that a run is heading toward divergence before the loss curve makes it visually obvious.

loss[i] ≥ loss[i−1] for 10 steps · rise > 0.5 → critical
02 · gradient spike · warn / critical

Max grad norm versus the rolling mean of the last 20 steps. Fires when the latest norm exceeds five times the baseline and is above 1.0 in absolute terms. A single spike rarely terminates a run, but consecutive spikes indicate the optimizer is stepping into a region it cannot navigate cleanly.

maxGrad > 5× rolling mean AND > 1.0 · > 20× rolling mean → critical
03 · attention head dead · warn

Attention layers with gradient norms below 1e-6 for five consecutive steps. Attention collapse is mechanistically distinct from MLP layer death: a collapsed head may still produce outputs but has lost the ability to differentiate across positions.

gradNorm[attn] < 1e-6 for 5 steps · fires on fifth consecutive step
04 · dead layers · warn

Non-attention layers with gradient norms below 1e-6 for five consecutive steps are flagged dead. The signal names the specific layers. These are candidates for pruning or weight reinitialization.

gradNorm[layer] < 1e-6 for 5 steps · fires on fifth consecutive step
05 · loss plateau · info

Rolling variance over the last twenty steps divided by the squared rolling mean. When variance falls below 0.1% of mean-squared, the signal fires with the current epoch so you can judge whether this is healthy convergence or premature stalling.

var(loss[−20:]) < mean² × 0.001 · always info, optionally triggers early stop

cooldown of 30 steps per signal type. streak maps persist across steps for dead layer tracking.
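
Putting the five detectors together, the engine's control flow looks roughly like the sketch below, reusing the StepSnapshot type from the schema section. The thresholds are the ones listed above; the Signal shape, helper names, and the attention-layer regex are illustrative, since the real engine matches whatever layer naming convention your framework emits.

```typescript
type Severity = "info" | "warn" | "critical";

interface Signal {
  type: string;
  severity: Severity;
  step: number;
  message: string;
}

const COOLDOWN = 30;            // steps before the same signal type may fire again
const DEAD_THRESHOLD = 1e-6;    // grad norm below this counts toward a dead streak
const DEAD_STREAK = 5;          // consecutive steps before a layer is flagged
const ATTN = /attn|attention/i; // illustrative; the engine uses your framework's names

function detect(
  history: StepSnapshot[],
  layerStreaks: Map<string, number>, // non-attention layers (persistent)
  attnStreaks: Map<string, number>,  // attention layers (persistent)
  lastFired: Map<string, number>,    // signal type -> step at which it last fired
): Signal | null {
  const cur = history[history.length - 1];

  // Update the streak maps from the per-layer breakdown: the only persistent state.
  for (const [layer, norm] of Object.entries(cur.gradNorms ?? {})) {
    const streaks = ATTN.test(layer) ? attnStreaks : layerStreaks;
    streaks.set(layer, norm < DEAD_THRESHOLD ? (streaks.get(layer) ?? 0) + 1 : 0);
  }

  const fire = (type: string, severity: Severity, message: string): Signal | null => {
    const last = lastFired.get(type);
    if (last !== undefined && cur.step - last < COOLDOWN) return null; // cooldown gate
    lastFired.set(type, cur.step);
    return { type, severity, step: cur.step, message };
  };

  // 01 · loss diverging: ten consecutive non-decreasing steps, rise > 0.5 -> critical.
  const last10 = history.slice(-10);
  if (last10.length === 10 && last10.every((s, i) => i === 0 || s.loss >= last10[i - 1].loss)) {
    const rise = last10[9].loss - last10[0].loss;
    if (rise > 0.5) {
      const sig = fire("loss_diverging", "critical", `loss rose ${rise.toFixed(2)} over 10 steps`);
      if (sig) return sig;
    }
  }

  // 02 · gradient spike: > 5x the rolling mean of the prior 20 steps and > 1.0 absolute.
  const baseline = history.slice(-21, -1).map((s) => s.maxGrad ?? 0);
  if (cur.maxGrad !== undefined && baseline.length > 0) {
    const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
    if (cur.maxGrad > 5 * mean && cur.maxGrad > 1.0) {
      const severity: Severity = cur.maxGrad > 20 * mean ? "critical" : "warn";
      const sig = fire("grad_spike", severity, `maxGrad ${cur.maxGrad.toFixed(2)} vs baseline ${mean.toFixed(2)}`);
      if (sig) return sig;
    }
  }

  // 03 / 04 · dead attention heads, then dead layers, naming the failed components.
  for (const [streaks, type] of [
    [attnStreaks, "attention_head_dead"],
    [layerStreaks, "dead_layers"],
  ] as const) {
    const dead = [...streaks.entries()].filter(([, n]) => n >= DEAD_STREAK).map(([l]) => l);
    if (dead.length > 0) {
      const sig = fire(type, "warn", `near-zero gradient for ${DEAD_STREAK}+ steps: ${dead.join(", ")}`);
      if (sig) return sig;
    }
  }

  // 05 · loss plateau: var(loss[-20:]) < mean^2 * 0.001, reported with the epoch.
  const losses = history.slice(-20).map((s) => s.loss);
  if (losses.length === 20) {
    const mean = losses.reduce((a, b) => a + b, 0) / 20;
    const variance = losses.reduce((a, l) => a + (l - mean) ** 2, 0) / 20;
    if (variance < mean * mean * 0.001) {
      const sig = fire("loss_plateau", "info", `loss flat near ${mean.toFixed(2)} (epoch ${cur.epoch ?? "?"})`);
      if (sig) return sig;
    }
  }

  return null;
}
```

Because `fire` returns null under cooldown rather than aborting, a suppressed signal type simply yields to the next detector in the ladder, preserving the priority order described above.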

The model diff: behavioral before and after

When training completes, the dashboard receives three scores that describe how the fine-tune changed the model's behavior: consistency, suppression, and robustness. These are the same metrics from Aquin's eval system, applied specifically to the before/after comparison. The base model is the reference. The fine-tuned checkpoint is the subject. The difference between the two scores is the behavioral delta your training objective produced.

What makes this useful during training rather than after is the framing. If consistency drops significantly from base to fine-tuned, it raises a question about the training data: did it contain examples that reward inconsistent responses across phrasings? If suppression increases, it raises a question about topic coverage: does the training set over-represent cautious responses on certain topic categories? These are questions about data quality, not model architecture, and they are best asked while you still have the training run in context.

The robustness score is particularly informative for factual fine-tuning. A fine-tune intended to add or reinforce factual knowledge should produce higher robustness on those facts. The model should be more confident under surface corruptions of the relevant prompts, not less. A robustness drop on the target facts after factual fine-tuning is a sign that the model has learned a surface pattern rather than a grounded representation.

model diff · base vs fine-tuned

consistency · +0.14 · 1 − (mean KL / anchor entropy) · base 0.73 → ft 0.87
suppression · −0.09 · 0.6 × length_penalty + 0.4 × hedge_penalty · base 0.70 → ft 0.61
robustness · +0.07 · 1 − (mean_drop / clean_confidence) · base 0.67 → ft 0.74

green = improved, red = regressed relative to base. same metrics as the eval system.
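
The deltas in the panel reduce to a per-metric subtraction against the base scores. A minimal sketch, assuming the sign conventions stated in this section (higher consistency and robustness are better, lower suppression is better); the EvalScores shape is illustrative.

```typescript
// Behavioral scores for one model, as produced by the eval system.
interface EvalScores { consistency: number; suppression: number; robustness: number; }

// Delta per metric; positive means the fine-tuned model scored higher than base.
function modelDiff(base: EvalScores, ft: EvalScores) {
  const metrics = ["consistency", "suppression", "robustness"] as const;
  return metrics.map((m) => ({
    metric: m,
    base: base[m],
    ft: ft[m],
    delta: +(ft[m] - base[m]).toFixed(2),
    // Suppression is lower-better, so a negative delta is an improvement there.
    improved: m === "suppression" ? ft[m] < base[m] : ft[m] > base[m],
  }));
}

// The numbers from the panel above: consistency +0.14, suppression -0.09, robustness +0.07.
modelDiff(
  { consistency: 0.73, suppression: 0.70, robustness: 0.67 },
  { consistency: 0.87, suppression: 0.61, robustness: 0.74 },
);
```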

The SAE feature diff: what changed inside

Behavioral scores tell you how the model changed from the outside. The SAE feature diff tells you which internal representations changed and by how much. For each layer in the sparse autoencoder, the diff computes how many features shifted activation between the base and fine-tuned model, the mean absolute activation delta across features, and the single feature with the highest delta.

The layer-level change density is the most informative aggregate signal. A fine-tune that changes 14 out of 512 features at L8 and 2 out of 512 at L4 is doing something focused and deep: it is rewriting a specific representational layer, not spreading surface changes across the network. A fine-tune that changes many features at every layer with similar densities is making diffuse changes, which typically indicates that the training data was affecting many different concepts rather than targeting a specific one.

The top feature per layer is where mechanistic investigation should start. If L8's top shifted feature is F213 (geographic reference tracking) and the training objective was factual reinforcement on geographic queries, that alignment is expected and healthy. If L10's top shifted feature is F501 (refusal / safety language) and the training data had nothing to do with refusals, that is a finding that warrants investigation in the model inspector. The diff turns the post-training inspection from an open-ended search into a targeted inquiry.

sae feature diff · changed features per layer · blue cells = shifted

L4 · 2/512 changed · mean Δ 0.004 · F412 · punctuation / sentence boundary
L6 · 8/512 changed · mean Δ 0.012 · F089 · hedging / uncertainty markers
L8 · 14/512 changed · mean Δ 0.031 · F213 · geographic reference tracking
L10 · 5/512 changed · mean Δ 0.014 · F501 · refusal / safety language
L12 · 6/512 changed · mean Δ 0.019 · F047 · capital city associations
L14 · 3/512 changed · mean Δ 0.009 · F091 · factual recall trigger

each row is one layer. each cell is one SAE feature. blue = activation shifted post fine-tune. L8 carries the heaviest rewrite.
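
Each row of the grid reduces to a small computation over the base and fine-tuned activation vectors for that layer. A sketch, with the change threshold invented for illustration, since the exact cutoff for a "shifted" feature is not specified here.

```typescript
interface LayerFeatureDiff {
  layer: string;
  changed: number;    // features whose activation shifted past the threshold
  total: number;
  meanDelta: number;  // mean absolute activation delta across all features
  topFeature: number; // index of the feature with the largest delta
}

// baseActs / ftActs: mean SAE feature activations per layer, e.g. 512 features each.
function saeFeatureDiff(
  layer: string,
  baseActs: number[],
  ftActs: number[],
  changeThreshold = 0.05, // illustrative; the real cutoff is not stated
): LayerFeatureDiff {
  const deltas = baseActs.map((b, i) => Math.abs(ftActs[i] - b));
  const changed = deltas.filter((d) => d > changeThreshold).length;
  const meanDelta = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  const topFeature = deltas.indexOf(Math.max(...deltas));
  return { layer, changed, total: baseActs.length, meanDelta, topFeature };
}
```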

The regression tracker: what the next run cost you

A single model diff tells you how a fine-tune changed behavior relative to the base. That is useful once. The regression tracker extends it across runs: every time a model diff arrives, the category scores are appended to a per-category history so you can see how behavior has moved across all completed runs in the session. The question it answers is not whether fine-tuning improved consistency, but whether the third iteration of fine-tuning degraded something the second iteration had fixed.

Each category row shows a sparkline of its score across runs. A category that regressed by more than five percentage points on the latest run is flagged with a red indicator and a downward arrow. The regression detection is relative to the immediately prior run, not to the base. This matters because a score can look healthy against the base model while still trending negatively across iterations. The tracker catches that drift where the raw diff cannot.

The first completed run captures a baseline but shows no regression markers, because there is no prior to compare against. From the second run onward, every category is classified as improved, degraded, or unchanged relative to the run before it. The tracker is most useful in ablation workflows where multiple fine-tunes on the same base model are being compared: it makes the pattern of gains and losses across iterations legible without requiring you to hold the numbers in your head.

regression tracker · category score across 4 runs

factual · 72%
reasoning · 70%
refusal · 71%
code · 66%

each point is one completed run. red = category score regressed vs prior run.
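
The tracker's classification rule is simple enough to sketch. Only the five-point regression threshold is specified above, so the symmetric improvement cutoff below is an assumption, as are the helper names.

```typescript
type Status = "baseline" | "improved" | "degraded" | "unchanged";

// history: one score per completed run, newest last, on a 0-1 scale.
function classify(history: number[]): Status {
  if (history.length < 2) return "baseline"; // first run: no prior to compare against
  const prev = history[history.length - 2];
  const cur = history[history.length - 1];
  if (cur < prev - 0.05) return "degraded";  // > 5 percentage points below the prior run
  if (cur > prev + 0.05) return "improved";  // symmetric cutoff, an assumption
  return "unchanged";
}

// Append the latest run's category scores and classify each category vs the prior run.
function updateTracker(tracker: Map<string, number[]>, scores: Record<string, number>) {
  return Object.entries(scores).map(([category, score]) => {
    const history = tracker.get(category) ?? [];
    history.push(score);
    tracker.set(category, history);
    return { category, score, status: classify(history) };
  });
}
```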

Confidence calibration: whether the model knows what it knows

A model's stated confidence and its actual accuracy can diverge in ways that are invisible from loss alone. A fine-tune can lower loss while simultaneously making the model systematically overconfident — producing high probability outputs that are wrong more often than the confidence implies. Expected Calibration Error measures that gap directly: it bins outputs by stated confidence, computes actual accuracy within each bin, and reports the mean gap between the two. A perfectly calibrated model has an ECE of zero.
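
ECE is straightforward to compute from (confidence, correct) pairs. A sketch using the standard formulation with ten equal-width bins, each bin's |accuracy − confidence| gap weighted by its share of samples; the bin count is illustrative.

```typescript
interface Prediction { confidence: number; correct: boolean; }

// Expected Calibration Error: bin by stated confidence, then take the
// sample-weighted mean |accuracy - mean confidence| gap across bins.
function ece(preds: Prediction[], bins = 10): number {
  let total = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = preds.filter(
      (p) => p.confidence >= lo && (p.confidence < hi || (b === bins - 1 && p.confidence <= hi)),
    );
    if (inBin.length === 0) continue;
    const acc = inBin.filter((p) => p.correct).length / inBin.length;
    const conf = inBin.reduce((a, p) => a + p.confidence, 0) / inBin.length;
    total += (inBin.length / preds.length) * Math.abs(acc - conf);
  }
  return total; // 0 for a perfectly calibrated model
}
```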

The calibration panel runs this comparison between the base and fine-tuned model, using the training dataset as the evaluation set. The reliability diagram shows both models' accuracy per confidence bin as bar pairs against the perfect-calibration diagonal. Bars that fall below the diagonal indicate overconfidence in that bin; bars that rise above it indicate underconfidence. The per-topic ECE table breaks the aggregate score down by category: models trained on domain-specific data frequently improve ECE on the target domain while degrading it on adjacent topics that share surface patterns with the training examples.

The low-confidence row list surfaces the specific dataset inputs where the fine-tuned model assigns probability below the configured threshold. These are the inputs the model is least certain about after training. They can be selected and exported directly as a labeled dataset for the next training iteration, closing the loop from finding a weakness to fixing it. The export feeds into the training pipeline directly; no manual file handling required.
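
Selecting that export set is a filter over the evaluated rows. A minimal sketch, with the row shape and default threshold assumed for illustration.

```typescript
interface EvaluatedRow { input: string; modelOutput: string; confidence: number; }

// Rows the fine-tuned model is least certain about, ready to label
// and feed into the next training iteration.
function lowConfidenceRows(rows: EvaluatedRow[], threshold = 0.4): EvaluatedRow[] {
  return rows
    .filter((r) => r.confidence < threshold)
    .sort((a, b) => a.confidence - b.confidence); // least certain first
}
```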

calibration · reliability diagram + per-topic ECE · base ECE 0.148 → ft ECE 0.071

[reliability diagram omitted: accuracy per confidence bin, base vs fine-tuned, against the perfect-calibration diagonal]

per-topic ECE · base → fine-tuned
science · 0.120 → 0.060
history · 0.190 → 0.090
math · 0.080 → 0.040
coding · 0.210 → 0.110
medicine · 0.310 → 0.170
law · 0.270 → 0.190

left bar = base ECE per confidence bucket, right bar = fine-tuned. green = under-confident, red = over-confident relative to perfect diagonal.

Training as the start of the investigation

Mechanistic interpretability has traditionally studied models after they exist. The training inspect system changes where the investigation starts. A signal that fires at step 61 about a dead attention head at L6 is most usefully followed up by opening the fine-tuned checkpoint in the Model Inspector once training completes, going directly to L6, and running the causal trace to see whether that layer still contributes to the model's outputs. If L6 is genuinely inactive, it shows up in the trace as a gap. If it has recovered, the trace will show it. The training signal is the hypothesis; the mechanistic inspection is the test.

The same directional logic applies to the modelDiff results. A suppression score that rises from base to fine-tuned opens a data investigation: the training inspector can open the dataset in the Data Inspector and run the toxicity and PII modules against the columns that were most likely to produce suppressive signal. If those columns contain high hedge ratios or sensitive content, the data is the driver. If they are clean, the suppression is coming from the fine-tuning objective or the template design, which is a different kind of problem requiring a different kind of fix.

The SAE feature diff provides the entry point for the mechanistic side of that investigation. Once you know which features shifted most and at which layers, you can navigate the Model Inspector directly to those features, check their benchmark scores, run them through the logit lens, and steer them to confirm their role. You are not looking at the whole model; you are looking at the specific features the training run touched.

The calibration panel adds a third path out of the training run. Low-confidence rows can be exported directly as a labeled dataset for the next iteration. The model's own uncertainty after fine-tuning becomes the selection criterion for the data that trains the next version. That loop — find the inputs the model is least certain about, label them, fine-tune on them — is the most direct path from a completed training run to a better one. The regression tracker closes the loop in the other direction, confirming that the next iteration did not trade one weakness for another. Together they make the training session not the end of a workflow but the input to the next one.
