Training Sparse Autoencoders

Aquin Labs · June 2026

How Aquin closes the SAE gap when inspect, steer, and find-feature need a dictionary on the model you actually have loaded: capture, train, load, diff, and align as one connected toolchain.

When feature tools need a dictionary you do not have

SAE-based tools assume a sparse dictionary over the residual stream at a specific layer. For catalog models, public SAE weights can be pulled. For family HuggingFace ids, embedding encoders, and fine-tuned checkpoints, they usually cannot. The session still loads the model and inspect still runs causal traces, but feature decomposition, steering, and deception ranking stall until a dictionary exists for those exact weights.

Aquin treats that as an operational pipeline, not a one-off script. Activation capture exports labeled probe activations with manifest metadata. sae train collects corpus-scale activations or reuses saved shards, fits a dictionary, and saves it locally. User SAE binding wires the result into the same inspect and steer path as a pulled public SAE. Capture and train each sync a card to the web orchestrator, so the run is visible beside your session rather than buried on disk.

The question this article answers is not how sparse autoencoders work in the abstract. It is what each tool in the chain surfaces, when to use probes vs corpus collection, and where the output feeds next in an investigation.

Figure: SAE toolchain · CLI verbs to orchestrator cards. capture and train write cards to the web session; load sae --user switches feature tools to your dictionary without restarting the session.

The tooling chain

Five verbs cover the full loop. Probes and corpus collection are inputs. Train and load are the dictionary lifecycle. Diff and align handle the checkpoint case where a public SAE no longer matches internal geometry after fine-tuning.

Tool | Web card | What it does
--- | --- | ---
capture-activations | activationCapture | Probe-scale activations with manifest, layer shards, and probe metadata. Feeds checkpoint comparison and labeled slices.
sae train | saeTrain | Corpus collection or shard reuse, dictionary fit, local save under ~/.aquin/sae/user/. Mirrors run status to the orchestrator.
load sae --user | binding only | Binds a trained dictionary to the active session so inspect, steer, sae-stats, and find-feature use your weights.
sae diff | saeDiff | Base vs checkpoint activation delta through a pulled public SAE. Often the reason you train a new dictionary on the checkpoint.
sae align | saeAlign | Hungarian decoder match between two .pt files. Maps feature indices across public and user-trained dictionaries.

capture-activations

Activation capture is the probe-scale export path. It runs a curated prompt set through the loaded model, writes per-layer activation shards, and attaches manifest metadata (probe labels, checkpoint id, hook name). The run syncs an activationCapture card to the web orchestrator so you can compare captures side by side without digging through ~/.aquin.

Use capture for checkpoint comparison, deception slices, and labeled exports. Do not use it as the substrate for dictionary training. Six probe vectors can finish a train card but produce a dictionary with 90%+ dead features. That is a pipeline smoke test, not a feature tool you load into inspect.

Figure (sankey): activation volume by collection mode. Corpus streams feed real training; probe captures branch to pipeline checks only, not production dictionaries.

sae train

sae train is the training half of the dictionary lifecycle. Without flags it streams corpus text through the session model, materializes normalized activation chunks to disk, fits a sparse autoencoder, and saves under ~/.aquin/sae/user/. With --activations <dir> it skips forward passes and retrains from saved shards: same trainer, no model rerun.

Every run mirrors status to a saeTrain card (step, recon, dead-feature count). When the card completes, user SAE binding via load sae --user switches inspect, steer, sae-stats, and find-feature to your dictionary without restarting the session.

Source | Role in training
--- | ---
streamed corpus | OpenWebText or custom JSONL. One vector per token (LLM) or per text (embedding). Default path for dictionary quality.
labeled probes | Small curated prompt sets with metadata. Good for checkpoint comparison and deception slices, not for training scale.
saved shards | Reuse chunk files from a prior collection. Retrain hyperparameters or dictionary width without rerunning the model.
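The article does not spell out the training objective. As a point of reference only, a common sparse-autoencoder formulation (an assumption here, not a statement of Aquin's exact loss) pairs a reconstruction term with an L1 sparsity penalty. The symbols below are illustrative: x is one activation vector of size d_model, f(x) is the feature vector of dictionary width, and λ plays the role of the L1 coefficient.

```latex
\begin{aligned}
f(x) &= \mathrm{ReLU}\!\left(W_{\mathrm{enc}}\, x + b_{\mathrm{enc}}\right), \qquad
\hat{x} = W_{\mathrm{dec}}\, f(x) + b_{\mathrm{dec}}, \\
\mathcal{L}(x) &= \lVert x - \hat{x} \rVert_2^2 \;+\; \lambda \, \lVert f(x) \rVert_1 .
\end{aligned}
```

Under this reading, recon is the first term (up to normalization), mean L0 is the average number of nonzero entries in f(x), and a dead feature is an entry of f(x) that stays at zero across the data it is evaluated on.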

model: meta-llama/Llama-3.2-1B-Instruct

hook: blocks.8.hook_resid_post

dictionary width: 16,384 features · d_model 2,048

activation budget: 99,152 vectors (quick) · 2 chunks · OpenWebText stream

L1 coeff 10.0 · batch 4,096 · lr 1e-4 · 3,000 steps

final recon MSE 0.14 · mean L0 251 · dead features 11%
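The run summary above implies a few derived quantities that are useful when sizing a run. The sketch below is plain arithmetic on the listed numbers. Two things are assumptions rather than facts from the run: that each step consumes one batch of 4,096 vectors drawn from the same buffer, and that the dictionary uses untied encoder and decoder matrices with one bias vector each.

```python
# Numbers copied from the run summary.
d_model = 2_048
dict_width = 16_384
vectors = 99_152
batch_size = 4_096
steps = 3_000

# Expansion factor: dictionary features per residual dimension.
expansion = dict_width / d_model

# Vectors consumed over the whole run, and the number of passes over
# the activation buffer that this implies (assumes one batch per step).
samples_seen = batch_size * steps
passes = samples_seen / vectors

# Assumed layout: untied encoder + decoder matrices, plus two biases.
params = 2 * d_model * dict_width + dict_width + d_model

print(f"expansion factor: {expansion:.0f}x")
print(f"samples seen: {samples_seen:,}")
print(f"passes over buffer: {passes:.1f}")
print(f"parameters (assumed layout): {params:,}")
```

Under those assumptions the quick run revisits its activation buffer on the order of a hundred times, which is consistent with the article's point that the quick budget is a dev baseline rather than a production setting.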

Run signals

Aquin logs reconstruction loss, mean L0 sparsity, and dead-feature count every 500 steps on the saeTrain card. Treat these as operator go/no-go signals, not ML lecture material. Falling recon on corpus data means the dictionary is ready to load and benchmark. Flat recon with a high dead-feature count on probe-only input means the card finished but the dictionary is not usable.
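For readers who want the two sparsity signals pinned down, here is a minimal sketch of how mean L0 and dead-feature fraction are commonly computed from a batch of feature activations. The function names are hypothetical, a feature is treated as active when its activation is strictly positive, and "dead" is judged only over the batch passed in; the window Aquin uses is not stated in this article.

```python
def mean_l0(feature_acts):
    """Average number of active (nonzero) features per activation vector."""
    active = sum(sum(1 for a in row if a > 0) for row in feature_acts)
    return active / len(feature_acts)


def dead_fraction(feature_acts):
    """Fraction of features that never fire on any vector in the batch."""
    n_features = len(feature_acts[0])
    fired = [any(row[j] > 0 for row in feature_acts) for j in range(n_features)]
    return fired.count(False) / n_features


# Toy batch: 3 activation vectors encoded into 4 features.
acts = [
    [0.0, 1.2, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.3],
]

print(mean_l0(acts))        # 4 active entries spread over 3 vectors
print(dead_fraction(acts))  # feature 2 never fires, so 1 of 4 is dead
```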

The table below takes the Llama 3.2 1B quick run at layer 8 and contrasts it with a six-vector probe rerun and a full-corpus run. The gap is the main operational lesson: the scale of the activation set matters more than step count.

Run | Vectors | Recon | Dead | Operator read
--- | --- | --- | --- | ---
6 probe vectors | 6 | 0.89 | 94% | smoke test
quick corpus | 99,152 | 0.14 | 11% | dev baseline
full corpus | 2,000,000 | 0.06 | 4% | production dict
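One way to read the table is in terms of how much data each dictionary feature gets to see. The sketch below is arithmetic only. It assumes all three runs used the 16,384-wide dictionary from the quick-run summary, which the table does not state, and it uses the linear-algebra fact that n vectors span at most min(n, d_model) independent directions.

```python
D_MODEL = 2_048        # from the quick-run summary
DICT_WIDTH = 16_384    # assumption: the same width was used for every row

# Vector counts copied from the table.
runs = {"6 probe vectors": 6, "quick corpus": 99_152, "full corpus": 2_000_000}

summary = {}
for name, n_vectors in runs.items():
    summary[name] = {
        "vectors_per_feature": n_vectors / DICT_WIDTH,
        # n vectors can span at most min(n, d_model) independent directions.
        "max_directions": min(n_vectors, D_MODEL),
    }

for name, row in summary.items():
    print(f"{name}: {row['vectors_per_feature']:.4f} vectors/feature, "
          f"at most {row['max_directions']} directions")
```

With six vectors there are at most six directions for 16,384 features to explain, which is one plausible reading of why nearly all of them end up dead in the probe-only row.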

Figure: recon MSE vs step. Falls on corpus data; flat on probe-only runs.

Figure: mean L0 vs step. Stabilizes once L1 penalty and feature competition reach equilibrium.

Figure (quadrant): dictionary strength vs activation scale. Probe runs sit in the smoke-test corner. Quick corpus hits the dev sweet spot. Full corpus is the production target.

sae diff

After fine-tuning, a pulled public SAE may still reconstruct activations while attributing them to the wrong feature indices. sae diff runs the same probe set through base and checkpoint weights, encodes both activation streams with the public dictionary, and reports the per-feature delta. The result syncs a saeDiff card beside your training monitor.

A large diff is usually why you train a new dictionary on the checkpoint instead of steering with the public base dictionary. Diff tells you the public basis no longer matches internal geometry. It does not produce a replacement dictionary. That is sae train followed by load sae --user.
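To make "per-feature delta" concrete, here is a minimal sketch of the idea on toy data. It assumes a ReLU encoder and uses the difference in mean feature activation as the statistic; the statistic Aquin actually reports is not specified in this article, and every function name below is hypothetical.

```python
def relu_encode(x, w_enc, b_enc):
    """Feature activations for one vector: ReLU(W_enc x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def mean_features(acts, w_enc, b_enc):
    """Mean activation of each feature over a set of activation vectors."""
    encoded = [relu_encode(x, w_enc, b_enc) for x in acts]
    return [sum(col) / len(encoded) for col in zip(*encoded)]


def feature_delta(base_acts, ckpt_acts, w_enc, b_enc):
    """Checkpoint minus base mean activation, per feature, same dictionary."""
    base = mean_features(base_acts, w_enc, b_enc)
    ckpt = mean_features(ckpt_acts, w_enc, b_enc)
    return [c - b for b, c in zip(base, ckpt)]


# Toy public dictionary: d_model = 2, three features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]

# The same two probes run through base and checkpoint weights (made-up values).
base_acts = [[1.0, 0.0], [1.0, 0.0]]
ckpt_acts = [[0.0, 1.0], [1.0, 1.0]]

delta = feature_delta(base_acts, ckpt_acts, w_enc, b_enc)
ranked = sorted(range(len(delta)), key=lambda j: -abs(delta[j]))
print(delta)   # per-feature change
print(ranked)  # feature indices, largest absolute change first
```

Ranking by absolute delta surfaces the features whose behavior moved most under the fine-tune. As the text says, this only diagnoses drift; it does not yield a new dictionary.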

sae align

When you have two trained dictionaries (public base vs user checkpoint, or two training runs), feature indices are arbitrary. sae align Hungarian-matches decoder columns between two .pt files and reports mean cosine similarity plus the weakest pairs. The run syncs a saeAlign card.

Decoder alignment is for index translation when you need correspondence across dictionaries; it is not a quality gate. Low mean cosine after a large fine-tune means the feature basis moved. Run InterpScore and sae-stats on the checkpoint-trained dictionary before deciding whether to steer with it.
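The matching step can be illustrated on a toy example. The sketch below finds the one-to-one pairing of decoder directions that maximizes total cosine similarity by exhaustive search over permutations. That yields the same optimum Hungarian matching finds, but it is only feasible for a handful of features; a real tool would use a polynomial-time solver such as scipy.optimize.linear_sum_assignment. The function names and the toy vectors are hypothetical, and each decoder column is represented here as a plain list.

```python
from itertools import permutations
from math import sqrt


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def match_features(dec_a, dec_b):
    """Optimal one-to-one pairing of decoder directions (brute force).

    Returns (mapping, mean_cos, weakest), where mapping[i] is the index
    in dec_b paired with feature i of dec_a, and weakest lists
    (i, j, cosine) triples sorted from weakest to strongest.
    """
    n = len(dec_a)
    sims = [[cosine(a, b) for b in dec_b] for a in dec_a]
    best = max(permutations(range(n)),
               key=lambda p: sum(sims[i][p[i]] for i in range(n)))
    pairs = [(i, best[i], sims[i][best[i]]) for i in range(n)]
    mean_cos = sum(s for _, _, s in pairs) / n
    return list(best), mean_cos, sorted(pairs, key=lambda t: t[2])


# Dictionary B is dictionary A with indices shuffled and one direction rotated.
dec_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_b = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.2]]

mapping, mean_cos, weakest = match_features(dec_a, dec_b)
print(mapping)      # index translation from A to B
print(mean_cos)     # summary similarity across matched pairs
print(weakest[0])   # the pair that moved the most
```

The weakest pair is the rotated direction, which is the kind of feature the alignment report flags for a closer look.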

Figure: decoder alignment · base vs fine-tuned dictionaries. Hungarian matching pairs decoder columns; weak pairs flag features that rotated or split across the fine-tune.

LLM vs embedding

The trainer and card schema are shared. The activation tensor is not. For LLMs, capture records every token position in the post-block residual stream. For embedding encoders, it records one mean-pooled vector per text. Embedding dictionaries are narrower (4,096 features typical vs 32,768 on small LLMs) and use a lower L1 coefficient because pooled vectors are already compressed.
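The difference in tensor shape can be shown in a few lines. This is a sketch with hypothetical function names operating on made-up hidden states, and it ignores padding and attention masks, which a real mean-pooling step would need to handle.

```python
def llm_vectors(hidden_states):
    """LLM mode: one training vector per token position."""
    return [list(tok) for tok in hidden_states]


def embedding_vector(hidden_states):
    """Embedding mode: one mean-pooled training vector per text."""
    n_tokens = len(hidden_states)
    return [sum(col) / n_tokens for col in zip(*hidden_states)]


# One text, 3 tokens, d_model = 2 (toy values).
text_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(len(llm_vectors(text_hidden)))   # three vectors, one per token
print(embedding_vector(text_hidden))   # a single pooled vector
```

The same corpus therefore yields far fewer training vectors in embedding mode, since each text contributes one vector instead of one per token.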

Feature tools differ by mode after user SAE binding. LLMs get inspect, steer, and circuit graphs on token positions. Embeddings get sae-browser, contrastive decomposition, and faithfulness probes on sentence pairs. Train at the hook you plan to inspect, not where reconstruction is globally minimal.

Figure: LLM vs embedding · activation geometry. Same trainer, different tensor shape and dictionary width.

Figure: final recon MSE by layer · 1B instruct LLM · quick runs. Layer 8 is lowest in this sweep. Use causal attribution on target prompts to pick an inspection layer, not reconstruction alone.

Connected investigation

SAE training is rarely the end state. It is the bridge between a loaded model and feature-level tools. Typical loop: load a checkpoint, run sae diff if a public dictionary exists, train when diff is large, bind with load sae --user, then move into attribution and benchmarks before steering or circuit work.

The Training monitor article explains when checkpoint SAE diff fires during a fine-tune and why that motivates a new dictionary. Attribution covers inspect, steer, and circuit graphs once a dictionary is bound. Benchmarks covers InterpScore, purity, and MUI for deciding which features to trust before you build on them.

SAE toolchain · CLI

Command | What it does
--- | ---
aquin capture-activations --output <dir> | Probe-scale labeled capture with manifest metadata and activationCapture card.
aquin sae train --layer <n> | Corpus collection and dictionary fit on the session model.
aquin sae train --activations <dir> | Retrain from saved activation shards without forward passes.
aquin load sae --user <name> | Bind a trained dictionary for feature-level tools.
aquin sae diff | Base vs checkpoint activation delta through a pulled public SAE.
aquin sae align --sae-a <a> --sae-b <b> | Decoder alignment between two dictionaries.

Commands run against the active session after aquin session start. One model is locked per session, so start a new session to load a different checkpoint.