Training Sparse Autoencoders
Aquin Labs · June 2026
How Aquin closes the SAE gap when inspect, steer, and find-feature need a dictionary on the model you actually have loaded: capture, train, load, diff, and align as one connected toolchain.
When feature tools need a dictionary you do not have
SAE-based tools assume a sparse dictionary over the residual stream at a specific layer. Catalog models can pull public SAE weights. Family HuggingFace ids, embedding encoders, and fine-tuned checkpoints usually cannot. The session still loads the model and inspect still runs causal traces, but feature decomposition, steering, and deception ranking stall until a dictionary exists for those exact weights.
Aquin treats that as an operational pipeline, not a one-off script. Activation capture exports labeled probe activations with manifest metadata. sae train collects corpus-scale activations or reuses saved shards, fits a dictionary, and saves it locally. User SAE binding wires the result into the same inspect and steer path as a pulled public SAE. Each step syncs a card to the web orchestrator so the run is visible beside your session, not buried on disk.
The question this article answers is not how sparse autoencoders work in the abstract. It is what each tool in the chain surfaces, when to use probes vs corpus collection, and where the output feeds next in an investigation.
SAE toolchain · CLI verbs to orchestrator cards
capture and train write cards to the web session. load sae --user switches feature tools to your dictionary without restarting the session.
The tooling chain
Five verbs cover the full loop. Probes and corpus collection are inputs. Train and load are the dictionary lifecycle. Diff and align handle the checkpoint case where a public SAE no longer matches internal geometry after fine-tuning.
capture-activations
Activation capture is the probe-scale export path. It runs a curated prompt set through the loaded model, writes per-layer activation shards, and attaches manifest metadata (probe labels, checkpoint id, hook name). The run syncs an activationCapture card to the web orchestrator so you can compare captures side by side without digging through ~/.aquin.
Use capture for checkpoint comparison, deception slices, and labeled exports. Do not use it as the substrate for dictionary training. Six probe vectors can finish a train card but produce a dictionary with 90%+ dead features. That is a pipeline smoke test, not a feature tool you load into inspect.
sankey · activation volume by collection mode
corpus streams feed real training. probe captures branch to pipeline checks only, not production dictionaries.
sae train
sae train is the dictionary lifecycle. Without flags it streams corpus text through the session model, materializes normalized activation chunks to disk, fits a sparse autoencoder, and saves under ~/.aquin/sae/user/. With --activations <dir> it skips forward passes and retrains from saved shards: same trainer, no model rerun.
Every run mirrors status to a saeTrain card (step, recon, dead-feature count). When the card completes, user SAE binding via load sae --user switches inspect, steer, sae-stats, and find-feature to your dictionary without restarting the session.
Run signals
Aquin logs reconstruction loss, mean L0 sparsity, and dead-feature count every 500 steps on the saeTrain card. Treat these as operator go/no-go signals, not ML lecture material. Falling recon on corpus data means the dictionary is worth loading and benchmarking. Flat recon with a high dead-feature count on probe-only input means the card finished but the dictionary is not usable.
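To make the three signals concrete, here is a minimal sketch of how they could be computed for one logging window. It is not Aquin's implementation: the function name run_signals, the plain-list inputs, and the definition of a dead feature as one that never fires inside the window are all assumptions made for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]


def run_signals(acts: Matrix, recon: Matrix, features: Matrix) -> Dict[str, float]:
    """Summarize one logging window of SAE training (hypothetical helper).

    acts, recon: n_samples x d_model. features: n_samples x n_features.
    """
    n_samples = len(acts)
    n_features = len(features[0])

    # Reconstruction MSE, averaged over every element of the window.
    sq_err = sum(
        (a - r) ** 2
        for row_a, row_r in zip(acts, recon)
        for a, r in zip(row_a, row_r)
    )
    recon_mse = sq_err / (n_samples * len(acts[0]))

    # Mean L0: average number of nonzero feature activations per sample.
    mean_l0 = sum(sum(1 for f in row if f != 0.0) for row in features) / n_samples

    # Dead features: never fired anywhere in this window (assumed definition).
    dead = sum(
        1 for j in range(n_features) if all(row[j] == 0.0 for row in features)
    )

    return {
        "recon_mse": recon_mse,
        "mean_l0": mean_l0,
        "dead_features": dead,
        "dead_fraction": dead / n_features,
    }


# Toy window: two samples, d_model = 2, three dictionary features.
acts = [[1.0, 0.0], [0.0, 2.0]]
recon = [[0.5, 0.0], [0.0, 2.0]]
features = [[0.3, 0.0, 0.0], [0.0, 1.2, 0.0]]
signals = run_signals(acts, recon, features)
print(signals)
```

In this toy window the third feature never fires, so it counts as dead. A probe-only run would show the same pattern at scale: most columns of the feature matrix stay at zero no matter how many steps the trainer takes.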
The table below is from the same Llama 3.2 1B quick run at layer 8, contrasted against a six-vector probe rerun. The gap is the main operational lesson: scale of activations matters more than step count.
recon MSE vs step
falls on corpus data. flat on probe-only runs.
mean L0 vs step
stabilizes once L1 penalty and feature competition reach equilibrium.
quadrant · dictionary strength vs activation scale
probe runs sit in the smoke-test corner. quick corpus hits the dev sweet spot. full corpus is the production target.
sae diff
After fine-tuning, a pulled public SAE may still reconstruct activations while assigning wrong feature indices. sae diff runs the same probe set through base and checkpoint weights, decodes both activation streams through the public dictionary, and reports per-feature delta. The result syncs a saeDiff card beside your training monitor.
A large diff is usually why you train a new dictionary on the checkpoint instead of steering with base weights. Diff tells you the public basis no longer matches internal geometry. It does not produce a replacement dictionary. That is sae train followed by load sae --user.
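The per-feature delta can be illustrated with a small sketch. It assumes a ReLU encoder and compares mean feature activation between the two activation streams. The names encode, mean_feature_activation, and feature_delta are hypothetical, and the delta that sae diff actually reports may be defined differently.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU(W_enc x + b_enc): one activation per dictionary feature (assumed encoder form)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def mean_feature_activation(batch: Matrix, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Average each feature's activation over a batch of residual vectors."""
    totals = [0.0] * len(w_enc)
    for x in batch:
        for j, a in enumerate(encode(x, w_enc, b_enc)):
            totals[j] += a
    return [t / len(batch) for t in totals]


def feature_delta(base: Matrix, ckpt: Matrix, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Checkpoint minus base mean activation, per feature, through one shared dictionary."""
    base_mean = mean_feature_activation(base, w_enc, b_enc)
    ckpt_mean = mean_feature_activation(ckpt, w_enc, b_enc)
    return [c - b for b, c in zip(base_mean, ckpt_mean)]


# Toy dictionary: 3 features over a 2-dimensional residual stream.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]

# The same two probes, run through base and checkpoint weights (made-up values).
base_acts = [[1.0, 0.0], [1.0, 0.0]]
ckpt_acts = [[0.0, 1.0], [0.0, 3.0]]

delta = feature_delta(base_acts, ckpt_acts, w_enc, b_enc)
ranked = sorted(range(len(delta)), key=lambda j: -abs(delta[j]))
print(delta, ranked)
```

Ranking by absolute delta puts the features that moved most at the top, which is the kind of list you would scan before deciding the public basis has drifted far enough to justify a retrain.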
sae align
When you have two trained dictionaries (public base vs user checkpoint, or two training runs), feature indices are arbitrary. sae align Hungarian-matches decoder columns between two .pt files and reports mean cosine similarity plus the weakest pairs. The run syncs a saeAlign card.
Decoder alignment is for index translation when you need correspondence across dictionaries. It is not a quality gate. Low mean cosine after a large fine-tune means the feature basis moved. Run InterpScore and sae-stats on the checkpoint-trained dictionary before deciding whether to steer with it.
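The matching step can be sketched on a toy scale. The example below pairs decoder directions by cosine similarity and picks the one-to-one assignment with the highest total similarity. It uses a brute-force search over permutations as a stand-in for the Hungarian algorithm, which is only workable for a handful of features. At real dictionary widths an assignment solver such as scipy.optimize.linear_sum_assignment would be the usual choice. The function names and the list-of-vectors layout are assumptions, not the .pt file format.

```python
import math
from itertools import permutations
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(dict_a: List[Vector], dict_b: List[Vector]) -> Tuple[List[Tuple[int, int, float]], float]:
    """Match each decoder direction in dict_a to one in dict_b.

    Brute-force optimal assignment (stand-in for Hungarian matching).
    Assumes both dictionaries hold the same number of features.
    """
    n = len(dict_a)
    sim = [[cosine(a, b) for b in dict_b] for a in dict_a]
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    pairs = [(i, best[i], sim[i][best[i]]) for i in range(n)]
    mean_cos = sum(c for _, _, c in pairs) / n
    return pairs, mean_cos


# Each inner list is one decoder column (a feature direction). Toy values.
dict_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dict_b = [[0.0, 2.0], [1.0, 2.0], [3.0, 0.0]]

pairs, mean_cos = align(dict_a, dict_b)
weakest = min(pairs, key=lambda p: p[2])
print(pairs, mean_cos, weakest)
```

Here features 0 and 1 of the first dictionary reappear under different indices in the second, while feature 2 has rotated slightly and shows up as the weakest pair. That is the index translation table, plus a short list of features to look at by hand.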
decoder alignment · base vs fine-tuned dictionaries
Hungarian matching pairs decoder columns. weak pairs flag features that rotated or split across the fine-tune.
LLM vs embedding
The trainer and card schema are shared. The activation tensor is not. LLMs record every token position in the post-block residual stream. Embedding encoders record one mean-pooled vector per text. Embedding dictionaries are narrower (4,096 features typical vs 32,768 on small LLMs) and use a lower L1 coefficient because pooled vectors are already compressed.
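The difference in tensor shape is easy to see in a small sketch. Given per-token hidden states for a batch of texts, the LLM path keeps one training row per token position, and the embedding path collapses each text to a single mean-pooled row. The helper names are hypothetical, and the sketch ignores padding masks and any normalization the real collector applies.

```python
from typing import List

Vector = List[float]


def llm_training_rows(hidden_states: List[List[Vector]]) -> List[Vector]:
    """One row per token position: flatten (text, token) into a single axis."""
    return [token for text in hidden_states for token in text]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average token vectors elementwise into one vector for the whole text."""
    d_model = len(tokens[0])
    return [sum(tok[k] for tok in tokens) / len(tokens) for k in range(d_model)]


def embedding_training_rows(hidden_states: List[List[Vector]]) -> List[Vector]:
    """One row per text: the mean-pooled vector."""
    return [mean_pool(text) for text in hidden_states]


# Two texts with 3 and 2 tokens, d_model = 2 (toy values).
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]

llm_rows = llm_training_rows(hidden_states)
emb_rows = embedding_training_rows(hidden_states)
print(len(llm_rows), len(emb_rows), emb_rows)
```

The same corpus therefore yields far fewer training rows in embedding mode, and each row is already an average over token positions.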
Feature tools differ by mode after user SAE binding. LLMs get inspect, steer, and circuit graphs on token positions. Embeddings get sae-browser, contrastive decomposition, and faithfulness probes on sentence pairs. Train at the hook you plan to inspect, not where reconstruction is globally minimal.
LLM vs embedding · activation geometry
same trainer, different tensor shape and dictionary width.
final recon MSE by layer · 1B instruct LLM · quick runs
layer 8 lowest in this sweep. use causal attribution on target prompts to pick an inspection layer, not reconstruction alone.
Connected investigation
SAE training is rarely the end state. It is the bridge between a loaded model and feature-level tools. Typical loop: load a checkpoint, run sae diff if a public dictionary exists, train when diff is large, bind with load sae --user, then move into attribution and benchmarks before steering or circuit work.
The Training monitor article explains when checkpoint SAE diff fires during a fine-tune and why that motivates a new dictionary. Attribution covers inspect, steer, and circuit graphs once a dictionary is bound. Benchmarks covers InterpScore, purity, and MUI for deciding which features to trust before you build on them.
SAE toolchain · CLI
aquin capture-activations --output <dir>
Probe-scale labeled capture with manifest metadata and activationCapture card.
aquin sae train --layer <n>
Corpus collection and dictionary fit on the session model.
aquin sae train --activations <dir>
Retrain from saved activation shards without forward passes.
aquin load sae --user <name>
Bind a trained dictionary for feature-level tools.
aquin sae diff
Base vs checkpoint activation delta through a pulled public SAE.
aquin sae align --sae-a <a> --sae-b <b>
Decoder alignment between two dictionaries.
Commands run against the active session after aquin session start. One model is locked per session; start a new session to load a different checkpoint.
