Inspection (SAE): LLMs
Sparse autoencoder (SAE) tools for language models. These commands decompose residual-stream activations into interpretable features, run full attribution pipelines, steer features at inference time, and benchmark feature quality. Use sae-stats for multi-layer batch exports and degradation heatmaps. After fine-tuning, use Checkpoint SAE (/docs/checkpoint-sae) to run diff, temp train, and align on real checkpoints.
8 commands
aquin inspect
agent tool: run_full_inspection
Full attribution pipeline in one pass: generates the model response, runs causal mediation analysis per token/layer, decomposes top SAE features, renders logit lens per layer, and builds the circuit graph. This is the primary LLM inspection entry point. Syncs the complete inspection card to the web.
| Flag | Description |
|---|---|
| --prompt* | Input prompt (model completes from here). |
| --layer* | SAE layer with a downloaded checkpoint (aquin load sae <model>-l<n>). If omitted, prints the layers found on disk and the corresponding pull commands. |
| --check | Save inspect-check.json and inspect-check.png in the current directory. |
| --model | Override active model. |
CLI-only shortcut. In aquin chat, ask: Run full inspection on "…"
aquin feature-logits
agent tool: get_feature_logits
Projects one SAE feature's decoder direction through the unembedding matrix and returns the top tokens it promotes and suppresses. Answers the question: if I amplify this feature, which tokens become more likely, and which less likely?
| Flag | Description |
|---|---|
| --feature* | SAE feature index. |
| --topk | Number of tokens to show (default: 10). |
| --prompt | Optional prompt context for activation weighting. |
| --check | Save feature-logits-check.json and feature-logits-check.png in the current directory. |
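The projection described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not aquin's implementation: the function name feature_logit_effects, the toy vocabulary, and the 3×4 unembedding matrix are all hypothetical, and the optional prompt-based activation weighting is left out.

```python
# Illustrative sketch only: toy matrices, hypothetical names, not aquin's internals.

def feature_logit_effects(decoder_direction, unembedding, vocab, topk=2):
    """Project a decoder direction (length d_model) through an unembedding
    matrix (d_model rows x vocab columns) and return the most promoted and
    most suppressed tokens as (token, logit_effect) pairs."""
    scores = []
    for token_idx, token in enumerate(vocab):
        # Dot product of the feature direction with this token's unembedding column.
        logit = sum(d * row[token_idx] for d, row in zip(decoder_direction, unembedding))
        scores.append((token, logit))
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    promoted = ranked[:topk]          # largest positive effect first
    suppressed = ranked[::-1][:topk]  # most negative effect first
    return promoted, suppressed


vocab = ["cat", "dog", "car", "the"]
unembedding = [
    [1.0, 0.8, -0.5, 0.0],
    [0.0, 0.2, 1.0, 0.1],
    [0.5, -0.3, 0.0, 0.0],
]
decoder_direction = [1.0, -1.0, 0.0]

promoted, suppressed = feature_logit_effects(decoder_direction, unembedding, vocab)
print("promoted:", [token for token, _ in promoted])
print("suppressed:", [token for token, _ in suppressed])
```

In this toy setup the direction pushes "cat" and "dog" up and "car" down, which is the kind of promoted/suppressed split the command reports for a real feature.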
aquin feature-neighbors
agent tool: get_feature_neighbors
Returns the nearest-neighbor SAE features, ranked by decoder cosine similarity. Use it to find redundant, related, or polysemous features near a target index.
| Flag | Description |
|---|---|
| --feature* | SAE feature index. |
| --topk | Number of neighbors (default: 8). |
| --check | Save feature-neighbors-check.json and feature-neighbors-check.png in the current directory. |
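Nearest-neighbor lookup by decoder cosine similarity can be sketched as follows. The helper names cosine and nearest_neighbors and the four-row toy decoder are hypothetical; a real SAE decoder has many more rows, and this sketch makes no claim about how aquin computes or caches the similarities.

```python
import math

# Illustrative sketch only: toy decoder, hypothetical helper names.

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(decoder, feature, topk=8):
    """Return (feature_index, similarity) pairs for the topk decoder rows
    most similar to decoder[feature], excluding the feature itself."""
    target = decoder[feature]
    sims = [
        (idx, cosine(target, row))
        for idx, row in enumerate(decoder)
        if idx != feature
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:topk]


decoder = [
    [1.0, 0.0, 0.0],   # feature 0 (the target)
    [0.9, 0.1, 0.0],   # nearly parallel to feature 0
    [0.0, 1.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0],  # opposite direction
]
neighbors = nearest_neighbors(decoder, feature=0, topk=2)
print(neighbors)
```

A similarity close to 1 (feature 1 here) suggests a redundant or closely related feature, while values near 0 indicate an unrelated direction.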
aquin steer
agent tool: run_steer_and_show
Adds a scaled multiple of one SAE feature's decoder direction to the residual stream at the SAE layer during the forward pass, then compares baseline and steered output side by side. The feature label auto-resolves from causal labeling if omitted.
| Flag | Description |
|---|---|
| --prompt* | Input prompt. |
| --feature_idx* | SAE feature to steer. |
| --feature_label | Human-readable label (auto-resolved if omitted). |
| --strength | Steering multiplier (default: 1.0). |
| --max_new_tokens | Generation length. |
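The steering update itself is a single vector addition. The sketch below shows it on a toy residual vector; the function name steer and the numbers are hypothetical, and a real run applies the update inside the model's forward pass rather than to a standalone list.

```python
# Illustrative sketch only: toy vectors, hypothetical function name.

def steer(residual, decoder_direction, strength=1.0):
    """Return residual + strength * decoder_direction, element by element."""
    return [r + strength * d for r, d in zip(residual, decoder_direction)]


baseline = [0.5, -1.0, 2.0]    # residual-stream activation at the SAE layer
direction = [1.0, 0.0, -0.5]   # decoder direction of the steered feature
steered = steer(baseline, direction, strength=2.0)
print(steered)
```

A negative strength moves the residual away from the feature direction, which is the usual way to suppress a feature rather than amplify it.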
aquin multi-steer
agent tool: run_multi_steer
Steers multiple SAE features simultaneously in a single forward pass. Pass a JSON array of {feature_idx, strength, label} objects.
| Flag | Description |
|---|---|
| --prompt* | Input prompt. |
| --features* | JSON array, e.g. '[{"feature_idx":42,"strength":1.5}]' |
| --max_new_tokens | Generation length. |
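One way to build the --features value without hand-writing JSON is to serialize it from Python. The sketch also shows what steering several features at once could look like, under the assumption that each feature contributes strength × decoder direction and the contributions are summed; the source does not spell out the combination rule. The labels, the multi_steer helper, and the toy decoder are hypothetical.

```python
import json

# Illustrative sketch only: hypothetical labels, helper, and decoder values.

features = [
    {"feature_idx": 42, "strength": 1.5, "label": "example-a"},
    {"feature_idx": 7, "strength": -0.5, "label": "example-b"},
]

# Compact JSON string suitable for passing as the --features argument.
features_arg = json.dumps(features, separators=(",", ":"))
print(features_arg)


def multi_steer(residual, decoder, features):
    """Add strength * decoder[feature_idx] for every entry in features.
    Assumes contributions are simply summed."""
    steered = list(residual)
    for spec in features:
        direction = decoder[spec["feature_idx"]]
        steered = [s + spec["strength"] * d for s, d in zip(steered, direction)]
    return steered


decoder = {42: [1.0, 0.0], 7: [0.0, 2.0]}  # toy decoder rows keyed by feature index
steered = multi_steer([0.0, 1.0], decoder, features)
print(steered)
```

Serializing with json.dumps avoids quoting mistakes when the array is pasted into a shell command inside single quotes.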
aquin benchmark
agent tool: run_benchmarks_on_top_feature
Runs InterpScore, Feature Purity, and MUI (Monosemanticity Under Intervention) on a single SAE feature. Quantifies how interpretable and causally coherent the feature is.
| Flag | Description |
|---|---|
| --feature_idx* | SAE feature index. |
| --check | Save benchmark-check.json and benchmark-check.png in the current directory. |
Distinct from behavioral evals (audit, red-team): this command scores a single feature's interpretability, not model behavior.
aquin sae-stats
agent tool: run_sae_stats
Batch export of multi-layer SAE statistics over a probe dataset. Computes per-layer mean L0 sparsity, top-k firing features, and a probe×layer heatmap for degradation profiles and stressor comparisons. Works for any layer with a pulled SAE checkpoint.
| Flag | Description |
|---|---|
| --prompts* | JSON/JSONL probe file. Each row: text (or prompt) plus optional id, stressor, lang, quant_run_id. |
| --layers | Layer spec: all (default) or comma-separated, e.g. 9 or 0,9,15. |
| --topk | Top features per layer (default: 10). |
| --save | Write full schema_version=1 JSON export to this path. |
| --check | Save sae-stats-check.json and sae-stats-check.png in the current directory. |
| --output | Set to json for machine-readable stdout. |
Probe schema: [{"id":"h1","text":"...","stressor":"baseline","lang":"en"}]. Export schema_version=1 JSON: layer_stats (per-layer top_features), layer_profile (mean_l0/sparsity), heatmap (probe×layer). Syncs SaeStatsCard to the web.
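A probe file in the JSONL form can be generated with a few lines of Python. The sketch below follows the row fields listed above (text or prompt, plus optional id, stressor, lang); the sample texts, the "translation" stressor value, and the helper names are hypothetical. It also includes a toy mean-L0 calculation, assuming L0 means the number of nonzero feature activations per probe; how aquin aggregates over tokens is not stated here.

```python
import json

# Illustrative sketch only: hypothetical sample rows and helper names.

def to_jsonl(rows):
    """Serialize probe rows to JSONL, one JSON object per line.
    Each row must carry its input under 'text' or 'prompt'."""
    lines = []
    for row in rows:
        if "text" not in row and "prompt" not in row:
            raise ValueError(f"probe row needs 'text' or 'prompt': {row}")
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def mean_l0(activations):
    """Mean count of nonzero feature activations per probe (assumed definition)."""
    counts = [sum(1 for a in acts if a != 0.0) for acts in activations]
    return sum(counts) / len(counts)


probes = [
    {"id": "h1", "text": "The capital of France is", "stressor": "baseline", "lang": "en"},
    {"id": "h2", "prompt": "La capitale de la France est", "stressor": "translation", "lang": "fr"},
]
jsonl_text = to_jsonl(probes)
print(jsonl_text, end="")

# Two probes, four features each: 2 and 1 active features respectively.
toy_activations = [[0.0, 1.2, 0.0, 0.3], [0.0, 0.0, 0.0, 0.9]]
l0 = mean_l0(toy_activations)
print(l0)
```

Write jsonl_text to a file and pass its path to --prompts; keeping id and stressor on every row makes the probe×layer heatmap easier to read afterwards.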
aquin umap
agent tool: ensure_umap_loaded
Loads the UMAP projection of all SAE decoder directions and opens the UMAP Explorer panel on the web. Requires a precomputed UMAP file for the pulled SAE.
