The Autonomous System
autonomous agent · mechanistic interpretability · feature analysis · causal tracing · UMAP · training monitor · data inspection · objective-driven


Aquin Labs · April 2026

What happens when an AI inspects another AI

Mechanistic interpretability is slow work. Tracing a single fact through a model's layers, identifying which features carried it, benchmarking their causal influence, verifying the output for accuracy and bias — done manually, that is an hour of work per query. Most of that time is not spent thinking about the model. It is spent navigating tools, choosing parameters, and deciding which analysis to run next.

Aquin's autonomous system compresses that into a single sentence. You describe what you want to understand. The system runs the full analysis, updates every panel as it goes, and delivers a finding with specific layer indices, feature activations, and benchmark scores — all visible, live, in front of you.

What makes this scientifically non-trivial is not the automation. It is the chaining. Each mechanistic analysis is most useful when interpreted against the others: a causal trace is more meaningful when you know which SAE features were active at the peak layer; a benchmark score is more actionable when you have already steered the feature to confirm its role. Running these in the right order, with findings carried forward, is where the insight comes from. The same logic extends to the geometry and the training data — three lenses on the same question, and the system moves between them as the investigation requires.

You ask (plain language) → Pipeline runs (inspection + evals) → UI updates live (panels, graphs, scores) → Finding arrives (specific, cited)

one message. pipeline runs. every panel updates. finding arrives with exact numbers.


Focused on improvement, not just observation

Inspection without a goal produces observations. Inspection with a goal produces decisions. The autonomous system supports both, but it is most useful when you tell it what you are trying to improve. Each agent chat — model inspection, dataset analysis, training monitor — carries an objective field. Whatever you write there is injected into the agent's context at the start of every message, shaping which findings it surfaces first, which analyses it runs without being asked, and how it frames what it finds.

The difference shows up in the findings. An agent that knows you are trying to reduce hallucination on medical topics will flag the suppression eval result as the primary finding and connect it directly to the circuit responsible, rather than presenting all findings at equal weight. An agent that knows you are auditing a dataset for medical fine-tuning will prioritize the PII and toxicity results over the copyright analysis. An agent monitoring a training run that knows you are watching for robustness degradation will call out the attack surface delta in the modelDiff before discussing the loss trend.

The objective persists across the session. It does not reset between messages, and it travels with the session tab so that switching between model inspection, data inspection, and training monitoring keeps the same goal in focus. The investigation follows the thread.
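In implementation terms, the injection is simple: the objective is prepended to the agent's context on every turn. A minimal sketch of the pattern (all names here are illustrative, not Aquin's actual API):

```python
# Illustrative sketch of per-session objective injection.
# AgentSession and build_prompt are hypothetical names, not Aquin's API.
from dataclasses import dataclass, field

@dataclass
class AgentSession:
    objective: str                       # set once, persists for the session
    history: list = field(default_factory=list)

    def build_prompt(self, user_message: str) -> list:
        """Prepend the objective to every turn so it shapes prioritization."""
        system = (
            "You are an inspection agent. Session objective: "
            f"{self.objective}\n"
            "Surface findings relevant to this objective first."
        )
        return [{"role": "system", "content": system},
                *self.history,
                {"role": "user", "content": user_message}]

session = AgentSession(objective="Reduce hallucination on medical topics")
messages = session.build_prompt("Does this model suppress medical advice?")
```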

example objectives by context

Model inspection

Reduce hallucination on medical topics before production deployment

Identify which circuits are responsible for factual recall on geography

Determine whether this model is safe to fine-tune on legal documents

Dataset analysis

Ensure this training set is safe for medical fine-tuning

Find whether PII exposure is driving over-filtering in the trained model

Audit for synthetic contamination before submitting to a regulated review

Training monitor

Minimize loss spike without degrading robustness to adversarial prompts

Understand whether the fine-tune shifted the model toward suppression

Validate that the checkpoint is stable enough to open in the model inspector

the objective is set once per session and injected into every agent message. it does not constrain what you can ask — it shapes how findings are prioritized.

The chaining problem in interpretability

Mechanistic interpretability has no single universal analysis. It has a vocabulary of techniques: causal tracing, sparse autoencoder decomposition, logit lens, activation patching, feature steering. Each answers a different question about a model. The questions are related, and the answers compound. But the compounding only happens if you run the analyses in an order that makes sense and use each result to inform what you run next.

The causal trace tells you which layer peaks. The logit lens tells you when in the forward pass the model first commits to the answer. The SAE tells you which features were active at the peak layer. The circuit attribution graph tells you which of those features connected which prompt tokens to which response tokens. The feature logits tell you what each feature was promoting and suppressing. The benchmark scores tell you how much to trust the feature's label and its causal role. Steering tests the hypothesis by intervening directly.

Run in isolation, each of these produces a partial picture. Run in sequence, with findings carried forward, they produce a mechanistic account: not just that the model produced "Paris", but exactly where in the network that decision was made, which features carried it, how they connected back to the prompt, and what happens to the output if you ablate any one of them. That is a qualitatively different kind of knowledge about a model than any single analysis provides.

The autonomous system encodes that sequencing. It knows which analysis to run first, which findings unlock which follow-up analyses, and how to synthesize across them into a single coherent account. It also knows when to reach outside the model entirely, into the feature geometry or the training data, when those lenses are what the investigation actually needs.

canonical inspection sequence

01

Full inspection

Causal trace + SAE features + logit lens. Grounds everything that follows in live model data.

02

Circuit attribution

Maps which features connected which prompt tokens to which response tokens. Identifies hubs.

03

Feature analysis

Logit projection + neighborhood search. Reveals what each feature promotes, suppresses, and which features share its geometry.

04

UMAP exploration

Agent seeds the 3D feature space with pre-computed points, selects the relevant feature, and navigates to its neighborhood, all without the user touching the mouse.

05

Benchmarking

InterpScore + FeaturePurityScore + MUI. Quantifies whether the feature is interpretable, monosemantic, and causally critical.

06

Steering

Causal intervention. Amplify or suppress a feature at inference time to validate its functional role before committing a weight edit.

07

Output verification

Fact check + bias detection + censor audit. Behavioral evaluation cross-referenced against the mechanistic findings.

each step uses findings from the previous one. the full chain produces a mechanistic account, not just a collection of measurements.
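The pattern behind that chain is worth making concrete. A minimal sketch of findings-carried-forward orchestration, with placeholder stage bodies standing in for the real analyses (nothing here is Aquin's implementation):

```python
# Each stage reads the shared findings dict and writes results that
# unlock later stages. Stage bodies are placeholders for real analyses.

def causal_trace(findings):
    # In practice: patch activations layer by layer, record recovery.
    findings["peak_layer"] = 8

def sae_features(findings):
    # Depends on the trace: decompose activations at the peak layer.
    layer = findings["peak_layer"]
    findings["active_features"] = [13933, 13910]   # illustrative indices

def benchmark(findings):
    # Depends on the features: score each one before trusting its label.
    findings["scores"] = {f: 0.8 for f in findings["active_features"]}

PIPELINE = [causal_trace, sae_features, benchmark]

def run_inspection(prompt):
    findings = {"prompt": prompt}
    for stage in PIPELINE:
        stage(findings)            # each finding is carried forward
    return findings

print(run_inspection("The capital of France is"))
```

The point of the structure is that no stage needs global knowledge; each reads what earlier stages wrote, which is what makes the sequencing itself encode the methodology.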

What the system can tell you about a model

Every question below is answerable in a single session turn. The system selects the right combination of analyses, runs them in the right order, and synthesizes the results into a coherent finding. These are not separate modes; a single session might move through several of them as findings in one analysis open up the next question.

Which part of the model encoded this fact?

The causal trace isolates the layer. The logit lens shows how confidence built up to that point. Together they locate the retrieval site to within one or two transformer blocks.

Which features carried the information?

The circuit attribution graph maps which SAE features connected specific prompt tokens to specific response tokens. The feature sidebar shows what each feature boosts and suppresses in vocabulary space.

Is the model's knowledge of this fact robust?

The consistency eval runs the same query across seven phrasings and measures how much the output distribution shifts. A fact the model genuinely knows should produce the same answer regardless of how the question is framed.
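A sketch of the core of such an eval, using answer agreement as a simpler proxy for distribution shift (`generate` is a hypothetical wrapper that returns the model's answer string for a prompt):

```python
# Agreement-based consistency sketch, not Aquin's exact metric.
from collections import Counter

PHRASINGS = [
    "What is the capital of France?",
    "France's capital city is",
    "Name the capital of France.",
    # ... up to seven rephrasings of the same fact
]

def consistency_score(generate) -> float:
    answers = [generate(p).strip().lower() for p in PHRASINGS]
    top_answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers)   # 1.0 = same answer on every phrasing

# consistency_score(lambda p: "paris")  -> 1.0
```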

Is the feature actually being used, or just firing?

MUI measures this directly. Some features fire consistently but produce near-zero KL divergence when ablated, meaning the model computed them and ignored them. MUI separates causally load-bearing features from decorative ones.
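The mechanics behind that test are straightforward to sketch: compare the next-token distribution with and without the feature's contribution in the residual stream. A PyTorch sketch of the ablation comparison (assumed mechanics, not Aquin's exact formula):

```python
# KL divergence between baseline and ablated output distributions.
import torch
import torch.nn.functional as F

def ablation_kl(logits_base: torch.Tensor,
                logits_ablated: torch.Tensor) -> float:
    """KL(base || ablated) over the next-token distribution."""
    p = F.log_softmax(logits_base, dim=-1)     # baseline, log-probs
    q = F.log_softmax(logits_ablated, dim=-1)  # feature zeroed out
    return F.kl_div(q, p, log_target=True, reduction="sum").item()
```

A feature that fires strongly but yields near-zero KL under this test is the decorative case described above: computed, then ignored.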

What happens if I amplify or suppress a feature?

Feature steering adds a scaled multiple of the feature's direction to the residual stream at inference time. The output shifts. The system runs baseline and steered generations side by side and highlights exactly what changed.
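A minimal sketch of the intervention as a PyTorch forward hook, assuming `block` is the transformer block whose output is the residual stream and `direction` is the feature's unit-norm decoder direction (both names are assumptions for illustration):

```python
# Steering hook: add a scaled feature direction to the residual stream.
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction   # amplify (+) or suppress (-)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# handle = block.register_forward_hook(make_steering_hook(direction, 4.0))
# ... run baseline and steered generations side by side ...
# handle.remove()
```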

Is the model avoiding certain topics?

The suppression eval measures response length and hedging density across topic categories against a neutral baseline. A model that systematically produces shorter, more hedged responses on a topic class is suppressing it, whether or not it refuses outright.
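A sketch of what such a metric might compute; the hedge lexicon and weighting here are illustrative, not the formula behind the 0.71 score cited later in this article:

```python
# Illustrative suppression metric: shorter, more hedged responses
# relative to a neutral baseline raise the score.
HEDGES = {"cannot", "consult", "may", "might", "generally", "professional"}

def hedging_density(text: str) -> float:
    words = text.lower().split()
    return sum(w in HEDGES for w in words) / max(len(words), 1)

def suppression_score(topic_responses, baseline_responses) -> float:
    topic_len = sum(len(r.split()) for r in topic_responses)
    base_len = max(sum(len(r.split()) for r in baseline_responses), 1)
    shortening = max(0.0, 1 - topic_len / base_len)        # 0..1
    topic_hedge = sum(map(hedging_density, topic_responses))
    base_hedge = max(sum(map(hedging_density, baseline_responses)), 1e-6)
    hedging = min(topic_hedge / base_hedge / 5, 1.0)       # saturate at 5x
    return 0.5 * shortening + 0.5 * hedging                # illustrative mix
```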

The interaction between these analyses is where the scientific value sits. Finding that a feature has a high MUI score but low FeaturePurityScore, for example, tells you something specific: the model is relying heavily on a feature that fires for multiple unrelated concepts. That is a polysemantic feature carrying significant causal weight, a known risk factor when attempting a weight edit, because targeting that feature direction is likely to affect circuits unrelated to your intended subject. The autonomous system surfaces that compound finding and flags the implication without requiring you to construct the interpretation manually.

Three sessions

These are representative outputs from real sessions. The dark badges show which panels updated during each response. The finding is what the system actually produced, with the numbers it cited.

Objective

Reduce hallucination on medical topics before production deployment

Does this model suppress medical advice?
Causal traceCircuit graphSuppression evalFeature sidebar

Medical dosage is the most suppressed topic at 0.71 suppression score. The model produces responses at 38% of baseline length with 4.2x the hedging density on dosage queries. Feature f8821 (hedging language) activates at 7.3 on the token 'cannot', and ablating it removes the deflection entirely. The pattern holds across all seven probe phrasings, which suggests this is weight-level suppression from training, not a surface instruction patch. Given your objective, the circuit centered on f8821 is the primary intervention target.

Objective

Understand which features are load-bearing for factual recall

Walk me through how this model retrieves the capital of France.
Logit lensCausal traceCircuit graphFeature sidebarBenchmarks

The fact crystallizes at layer 8, which carries 87% of the causal signal. Before layer 8, the model's top prediction alternates between 'France' and generic city tokens. At layer 8, 'Paris' jumps to 78% confidence and barely shifts through layer 15. Five features are active at that layer, and the two highest are f13933 (geographic country associations, activation 9.75) and f13910 (capital/seat-of-government, activation 7.86). Both trace back to the same three prompt tokens: 'capital', 'of', 'France'. These are the features that are load-bearing for this class of factual recall.

Objective

Validate that f13933 is safe to target before attempting a weight edit

Benchmark the top feature and tell me if it's trustworthy.
Benchmarks · InterpScoreBenchmarks · PurityBenchmarks · MUI

Feature f13933 scores 84% on InterpScore, 79% on FeaturePurityScore, and 76% on MUI. The InterpScore means its label predicts where it fires with high precision. The purity score means it fires for one coherent concept, geographic country associations, not several unrelated ones. The MUI means ablating it shifts the model's output distribution significantly. Against your objective: this is the profile of a feature you can safely target. The high purity score means an edit is unlikely to affect unrelated circuits.

Notice what the second session establishes. It does not just say "layer 8 is important." It gives the full mechanistic account: the layer, the confidence trajectory across all 16 layers, the specific features by index, their activation values, and which prompt tokens they traced back to. That is the difference between locating a retrieval site and understanding the circuit that implements it. The autonomous system produces the latter because it ran all the analyses in the right order and carried each finding forward.

But locating a feature and naming it is still only part of the investigation. The next step is understanding where that feature sits relative to the rest of what the model knows, and whether the weights that shaped it were themselves shaped by something suspect in the training data. Those questions require different tools. In the autonomous system, they are asked in the same session.

Situating features in space

Knowing that f13933 is the primary carrier of geographic country associations tells you what the feature does. It does not tell you how that function relates to the rest of what the model has learned. For that, you need to see where the feature lives in the model's internal geometry: which features cluster near it, which are adjacent in activation space, whether the geographic cluster is isolated or entangled with something else.

Aquin's UMAP explorer renders all SAE features as a 3D point cloud: a geometric map of the model's internal vocabulary, where proximity reflects shared conceptual structure. Previously, navigating that space was a manual operation. You searched for a feature by index, rotated the cloud, switched views. That navigation was one more interruption between the mechanistic finding and its interpretation.

The autonomous system now drives the UMAP explorer as part of the same session that produced the finding. When it identifies a feature as the primary carrier of a concept, it selects that feature in the 3D view, switches to the neighbor tab, and surfaces the cluster of features that share its embedding geometry, all without the user touching the interface. It can also seed the point cloud with pre-computed coordinates, so the spatial context arrives at the same time as the mechanistic finding rather than after a separate dimensionality reduction step. The geometry stops being a separate tool you consult after the analysis. It becomes part of the analysis itself.
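The spatial pass itself is standard dimensionality-reduction machinery. A sketch using umap-learn and a nearest-neighbor query, assuming `decoder_dirs` holds the SAE decoder directions (the random array is a placeholder for a real SAE):

```python
# Project SAE decoder directions to 3D and query a feature's neighbors.
import numpy as np
import umap
from sklearn.neighbors import NearestNeighbors

decoder_dirs = np.random.randn(16384, 512).astype(np.float32)  # placeholder

coords_3d = umap.UMAP(n_components=3, metric="cosine").fit_transform(decoder_dirs)

# Neighborhood query runs in the original embedding space, not the
# projection, so neighbors reflect shared geometry rather than plot layout.
nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(decoder_dirs)
_, neighbor_idx = nn.kneighbors(decoder_dirs[13933:13934])
print(neighbor_idx)   # features sharing f13933's embedding geometry
```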

Watching a model learn

The autonomous system does not only inspect models after they are trained. It watches them train. A third agentic chat runs alongside the training dashboard, receiving a live snapshot of the session state every time you ask a question: current step, loss value, grad norms, learning rate, dead layers, and any active signals from the signal engine. The agent answers from that data directly, not from generic knowledge about training dynamics.

The snapshot is captured at send time rather than injected continuously, which means the agent's answer always reflects the state of training at the moment you asked. It cites the exact loss value at the current step, reads the sparkline trend across the last twenty steps, and flags whether the trajectory is healthy. If the signal engine has detected a plateau, a gradient spike, or dead attention heads, those signals are in context and the agent can explain the specific metric pattern that triggered each one.
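The snapshot is a small, fixed structure. An illustrative shape for it, with field names inferred from the metrics listed above rather than taken from Aquin's schema:

```python
# Hypothetical send-time snapshot; field names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingSnapshot:
    step: int
    loss: float
    grad_norm: float
    learning_rate: float
    dead_layers: list[int]
    active_signals: list[str]          # e.g. ["loss_plateau"]
    loss_sparkline: list[float]        # last 20 steps

def build_context(snapshot: TrainingSnapshot, question: str) -> str:
    # Captured once per message, so the answer reflects training at ask time.
    return (f"step={snapshot.step} loss={snapshot.loss:.4f} "
            f"grad_norm={snapshot.grad_norm:.3f} lr={snapshot.learning_rate:g} "
            f"signals={snapshot.active_signals}\n\nQ: {question}")
```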

When training completes, modelDiff and SAEDiff results arrive. The agent now has behavioral scores — consistency, suppression, robustness — and feature-level activation deltas in context. It can explain what the numbers mean about how the fine-tune changed the model's behavior, which features shifted most across which layers, and what that implies about the intervention. At that point, two tools become available: open the training dataset in the Data Inspector, or open the fine-tuned checkpoint in the Model Inspector. Both are single tool calls from the same training chat session, without navigating away or starting a new workflow.

training monitor sequence

01

Live metrics snapshot

At every user message, the agent receives the current step count, loss, grad norms, LR, dead layers, and active signals. It answers from data, not from generic training advice.

02

Signal analysis

The signal engine runs on each step event: plateau detection, divergence detection, gradient spike detection, dead layer tracking. When a signal fires the agent sees it immediately.

03

ModelDiff reading

When training completes, consistency, suppression, and robustness scores arrive. The agent reads them in context and explains the behavioral delta between base and fine-tuned model.

04

SAEDiff reading

Feature-level activation changes across layers. The agent sees how many features shifted, by how much, and can navigate the diffs section of the dashboard.

05

Open dataset inspector

One tool call opens the training dataset in the full data inspection workspace. The same analysis pipeline — PII, toxicity, synthetic, liability — runs against the data that shaped the model.

06

Open model inspector

One tool call opens the fine-tuned checkpoint in the model inspection workspace. The full mechanistic analysis pipeline — causal trace, SAE features, circuit graph, benchmarks, steering — runs against the post-training model.

snapshot captured at send time. agent answers from live data. one session covers training, data inspection, and model inspection.

What is the loss doing right now?

The agent reads the live snapshot at send time — exact step count, last loss value, sparkline of the last 20 steps. It does not paraphrase a chart. It cites the number, identifies the trend, and flags whether the trajectory is healthy.

Are there any active training signals?

The signal engine detects loss plateau, loss divergence, gradient spikes, dead layers, and dead attention heads. When a signal fires, the agent has it in context and can explain the specific metric pattern that triggered it.
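Two of those detectors are simple enough to sketch. The versions below show the shape of the logic under naive assumptions; production detectors would be smoothed and tuned:

```python
# Naive plateau and gradient-spike checks, illustrative thresholds.
import statistics

def loss_plateau(losses: list[float], window: int = 20,
                 eps: float = 1e-3) -> bool:
    """Fires when the recent loss trend is flat within eps."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < eps

def gradient_spike(grad_norms: list[float], k: float = 4.0) -> bool:
    """Fires when the latest grad norm exceeds k standard deviations."""
    if len(grad_norms) < 10:
        return False
    history, latest = grad_norms[:-1], grad_norms[-1]
    mu, sigma = statistics.mean(history), statistics.pstdev(history)
    return latest > mu + k * max(sigma, 1e-9)
```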

How did fine-tuning change the model?

When a modelDiff arrives, the agent has consistency, suppression, and robustness scores in context. It can explain what the numbers mean about behavioral change, then open the model inspector on the fine-tuned checkpoint so you can interrogate the internal changes mechanistically.

Which SAE features shifted during training?

The SAEDiff tracks which features changed activation most between the base and fine-tuned model. The agent reads nChanged, nFeatures, and meanAbsDelta, and can send you directly to the diffs section of the dashboard.

Can I look at the training data from here?

Yes. The agent can open the training dataset in the Data Inspector with a single tool call. The same agentic data inspection system — toxicity, PII, synthetic detection, liability chain — runs against the exact dataset that produced the model you are watching train.

The structural point is that training, data, and mechanistic analysis are now one continuous session rather than three separate workflows. A fine-tuning run that produces an unexpected suppression score in the modelDiff can immediately raise the question of whether the training data was suppressive in that topic region. That question can be answered in the same chat, against the exact dataset that produced the run, using the same analysis pipeline that runs on any other dataset. The answer feeds back into the mechanistic investigation: if the data was clean, the suppression is in the fine-tuning objective or the prompt template, which points to a different layer of the circuit than if the data itself was driving it.

Following the finding back to the data

A mechanistic finding about a model is always, implicitly, a claim about training. When the causal trace shows that suppression of medical topics is encoded at the weight level, the natural next question is what produced it. Not which feature carries the suppression, but why that feature was shaped that way during training. The answer to that question lives in the data.

Aquin's data inspection system applies the same agentic approach to dataset analysis that the interpretability system applies to model internals. The agent runs analysis modules against a dataset, streams results live into the workspace, and synthesizes findings across them. A session that identifies weight-level suppression can immediately ask whether the training data itself was suppressive: whether relevant columns carried high toxicity, whether PII exposure may have driven over-filtering, whether the data was predominantly synthetic and what that synthetic lineage looked like.

data inspection modules

Text quality

Language distribution, deduplication, license resolution, GDPR jurisdiction mapping, topic classification across the full dataset.

Toxicity

Per-column and per-row flagging across six categories (toxicity, severe toxicity, obscenity, threat, insult, identity attack), with severity thresholds and peak-label attribution.

PII detection

Entity recognition with four risk tiers (critical → low), category breakdown, and column-level density summaries for systematic exposure analysis.

Synthetic detection

Row-level synthetic scoring with confidence levels, histogram distribution, and a dataset-level verdict from 'human' to 'mostly_synthetic'.

Liability chain

Traces data provenance through synthetic → paraphrase → translation chains, computing a liability score per row and flagging deep chains.

Copyright analysis

Content scoring across copyright, open, news, and book categories, combined with domain analysis and license resolution.

any combination of modules can be chained in a single session. findings are streamed live into the dataset workspace.

The liability chain module deserves particular attention because it reveals something the other modules cannot. Synthetic data rarely arrives in one generation step. It is often paraphrased, translated, or re-generated from prior synthetic output, producing chains of transformation that make the original source ambiguous and the copyright and quality claims correspondingly weak. The system traces that provenance chain row by row, computing a liability score that reflects both the depth of the chain and the confidence at each step. A row that is synthetic, translated from a synthetic source, and then paraphrased carries a qualitatively different risk profile than one simply flagged as likely synthetic. The chain structure makes that distinction explicit in a way that a single score cannot.
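A sketch of what a chain-aware score might look like; the weights and normalization are illustrative, but they capture the property described above, that chain depth and detection confidence compound:

```python
# Illustrative liability score over an ordered provenance chain.
STEP_WEIGHT = {"synthetic": 0.5, "paraphrase": 0.3, "translation": 0.2}

def liability_score(chain: list[tuple[str, float]]) -> float:
    """chain: ordered (step_type, confidence) pairs, origin first.
    Deeper chains and higher-confidence detections score higher."""
    score = 0.0
    for depth, (step, confidence) in enumerate(chain, start=1):
        score += STEP_WEIGHT.get(step, 0.1) * confidence * depth
    return min(score / 3.0, 1.0)       # normalize to 0..1

# A synthetic -> translation -> paraphrase chain outranks a single flag:
print(liability_score([("synthetic", 0.9)]))
print(liability_score([("synthetic", 0.9), ("translation", 0.8),
                       ("paraphrase", 0.7)]))
```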

This connection between model and data is where the autonomous system becomes something more than an interpretability tool. A compliance audit that previously required two separate workflows, one for the model and one for the data it was trained on, now runs in a single session, with the agent deciding when the mechanistic findings warrant a data investigation and which modules to run against which columns. The findings talk to each other because they were produced by the same reasoning process, not assembled after the fact from separate reports.

Transparency and verifiability

Every finding the system produces is fully traceable. Each analysis appears as an expandable entry showing the exact inputs sent to the backend and the raw data returned. If the system concludes that f13933 is causally critical with a 76% MUI score, you can open that entry and verify the ablation KL divergence yourself. If it flags that 12% of a training column's rows carry PII at critical risk, the underlying entity detections are inspectable row by row. If it identifies a suppression score of 0.71 on medical topics, the probe responses, length ratios, and hedging scores are all there.

This matters for a scientific reason, not just a trust reason. The autonomous system is making judgment calls at every step: which analyses to run, in what order, which findings to surface and which to treat as noise, when a mechanistic result warrants a data investigation. Those judgment calls should be auditable. A finding about feature polysemanticity that rests on a purity score of 0.48 is a weaker claim than one resting on 0.21. The difference is in the data, and the system cites the number rather than just the conclusion.

The interface updates live as each analysis completes, which means you are watching the reasoning unfold rather than receiving a finished report. The circuit graph highlights a feature before its benchmark scores are ready. The logit lens populates while the SAE labeling is still running. The UMAP view positions a feature in the geometry while the neighbor similarity scores are still computing. You can form your own interpretation at each stage and check it against what the system concludes. The autonomy is in the orchestration, not in the interpretation being hidden from you.

What this changes about interpretability work

The bottleneck in mechanistic interpretability has never been the theory. The literature on features, circuits, and knowledge localization is rich and growing. The bottleneck is the gap between having a hypothesis about a model and running the analysis that tests it. That gap is filled with tooling decisions, panel navigation, and parameter choices that consume most of a session's time without contributing to the insight.

When that gap closes, the nature of the work changes. You stop spending sessions on logistics and start spending them on questions. A finding about weight-level suppression at 0.71 on medical topics immediately raises the next question: which features are responsible, is the suppression instruction-tuned or encoded deeper, what does the training data look like in that topic region? In the old workflow, each of those sub-questions opened a new session, a new set of navigation decisions. In the autonomous system, they are part of the same investigation. The agent follows the thread.

There is also a structural shift in what kinds of questions become tractable. A full mechanistic audit of a model, running suppression evals across every regulated topic, tracing flagged topics to their feature circuits, situating those features in the geometry, checking the training data for toxicity and synthetic contamination, used to require a team and days of work. It now requires a conversation. The analysis a compliance team might run quarterly can be run in an afternoon. The analysis a researcher might run on one model to build intuition can be run across five models to build a comparison.

That scale change is what makes the autonomous approach scientifically significant rather than just convenient. Mechanistic interpretability findings have historically been expensive enough that they accumulate slowly. When the cost of a full investigation drops from hours to minutes, the field can start asking questions that require many investigations across many models and many prompts: comparative questions, distributional questions, questions about how representations shift across training stages. The autonomous system is infrastructure for that next class of question.

The system is not making interpretability easier by hiding it. Every finding it surfaces is backed by a causal trace you can inspect, a feature you can steer, a benchmark score you can verify, a dataset row you can open. It is making interpretability faster by removing the work that was never the point, so the work that is the point can actually get done.
