Tracing facts through LLMs
causal tracing · SAE · circuit attribution · logit lens · feature steering · fact check · bias detection · censor audit · Llama 3.2 1B

The Attribution System

Aquin Labs · April 2026


When a language model answers "What is the capital of France?" with "Paris", it is not looking anything up. Somewhere in 1.2 billion parameters, the answer was encoded during training and is retrieved at inference time through a sequence of matrix multiplications. The question is: where, exactly? And once we know where, we can ask the harder question: is the answer actually right?

Aquin runs two systems in sequence on every model output. The first traces the mechanism: which layers, features, and prompt tokens caused each response token. The second evaluates the result: whether the claims are true, whether the framing leans, whether certain topics were quietly avoided. Neither is complete without the other.

[pipeline diagram] ATTRIBUTION ("how was 'Paris' produced?"): causal trace (16 layers · drop score) · logit lens (confidence per layer) · SAE features (16K · causal ablation) · circuit graph (prompt → feature → response). CHECKING ("is 'Paris' correct and complete?"): fact check (claim → web search) · bias detection (content-derived axes) · censor audit (topic suppression map).

the full pipeline: four attribution steps trace the mechanism, three checking steps evaluate the output.


The experiment

We ran a single factual query end-to-end through the full pipeline. The prompt was intentionally simple so the causal structure would be unambiguous.

prompt: "What is the capital of France?"
response: "The capital of France is Paris."
model: meta-llama/Llama-3.2-1B-Instruct
SAE: layer 8 · 16,384 features · L1_coeff 10.0 · L0 ~679
noise_scale: 3.0 · n_noise_runs: 10

We ran ROME-style causal mediation analysis: for each prompt token, corrupt its embedding with scaled Gaussian noise, run the forward pass, measure how much the target token's probability drops. Average over multiple noise runs. The result is a causal score for every (prompt token, response token) pair.
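A minimal sketch of that loop, in TransformerLens style (the hook name, to_single_token, and run_with_hooks follow that library; Aquin's production pipeline may differ):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokens = model.to_tokens("What is the capital of France?")
target = model.to_single_token(" Paris")  # assumes " Paris" is a single token

with torch.no_grad():
    clean_prob = model(tokens)[0, -1].softmax(-1)[target]

def causal_score(pos, noise_scale=3.0, n_noise_runs=10):
    # Corrupt one prompt token's embedding with scaled Gaussian noise,
    # re-run the forward pass, and average the target-probability drop.
    drops = []
    for _ in range(n_noise_runs):
        def corrupt(emb, hook):
            emb[:, pos] += noise_scale * torch.randn_like(emb[:, pos])
            return emb
        with torch.no_grad():
            logits = model.run_with_hooks(
                tokens, fwd_hooks=[("hook_embed", corrupt)])
        drops.append((clean_prob - logits[0, -1].softmax(-1)[target]).item())
    return sum(drops) / n_noise_runs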

What the attribution shows

Three prompt tokens dominate: "capital", "of", and "France". Together they account for nearly all of the causal signal driving "Paris". "What" contributes almost nothing. The model is not doing full-sentence pattern matching. It identifies the semantically load-bearing words and routes most of the causal work through them.

prompt tokens: What · is · the · capital · of · France · ?

response tokens: The · capital · of · France · is · Paris · .

causal attribution scores

France → Paris: 94%
capital → Paris: 81%
of → Paris: 58%
What → Paris: 6%

amber = causal prompt driver · green = key response token · scores normalized

The same coloring links prompt tokens to the response tokens they causally influence. "Capital" in the response is driven primarily by "capital" and "France" in the prompt, not by attention spread uniformly across the whole context.

16 layers, one peak

Attribution tells us which prompt tokens matter. Causal patching tells us which layer in the network does the actual retrieval. For each layer, we run the forward pass on the corrupted prompt but restore that layer's clean residual stream, then measure how much the target token's probability recovers.

[layer graph] causal drop per layer: L4 38% · L5 41% · L6 35% · L7 30% · L8 87% (peak) · L9 71% · L10 44% · L11 22% · L12 38% · L13 42% · L14 36% · L15 18%

layer-level causal graph. node brightness = causal drop %. L8 is the peak at 87%.

Layer 8 accounts for 87% of the causal signal. The France → capital → Paris association lives in the MLP sublayers around the network's midpoint. Layers 4-7 show moderate effects as the "capital of France" representation develops. Layers 12-15 contribute mainly to formatting and refinement.

This is consistent with the mechanistic interpretability literature. Middle-layer MLPs act as key-value stores: the subject representation is used as a key to look up and write the associated value into the residual stream.
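A sketch of the per-layer restoration, continuing from the code above (clean activations are cached once; the hook names again follow TransformerLens and the subject position index is illustrative):

# Cache the clean run once; then, in a corrupted run, restore one
# layer's clean residual stream and see how much probability recovers.
_, clean_cache = model.run_with_cache(tokens)
subject_pos = [5]  # position of "France" -- illustrative index

def recovery(layer, noise_scale=3.0):
    def corrupt(emb, hook):
        for p in subject_pos:
            emb[:, p] += noise_scale * torch.randn_like(emb[:, p])
        return emb
    def restore(resid, hook):
        return clean_cache[hook.name]  # clean residual stream at this layer
    with torch.no_grad():
        logits = model.run_with_hooks(
            tokens,
            fwd_hooks=[("hook_embed", corrupt),
                       (f"blocks.{layer}.hook_resid_post", restore)])
    return logits[0, -1].softmax(-1)[target]  # compare against clean_prob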

The logit lens: watching confidence build

The causal trace locates the retrieval site. The logit lens shows what the model is predicting at each layer along the way. After every transformer block, we apply the final layer norm and unembed the residual stream directly into vocabulary space, as if the model had stopped there and been forced to guess.

probability assigned to "Paris" at each layer

[bar chart] near 0% through L5, 29% at L6, 41% at L7, then a jump to 78% at L8, climbing gradually to 88% by L15.

amber bar = the L8 jump (78%). bars show 0% when "Paris" is not the top prediction.

top token prediction per layer

layer · top prediction · runner-up
L0 · the 12.0% · a 9.0%
L1 · the 14.0% · city 7.0%
L2 · city 15.0% · the 9.0%
L3 · city 11.0% · its 8.0%
L4 · France 18.0% · city 13.0%
L5 · France 22.0% · Paris 11.0%
L6 · Paris 29.0% · France 18.0%
L7 · Paris 41.0% · France 14.0%
L8 (peak) · Paris 78.0% · Lyon 4.0%
L9 · Paris 81.0% · Lyon 3.0%
L10 · Paris 83.0% · Lyon 2.0%
L11 · Paris 84.0% · Lyon 2.0%
L12 · Paris 85.0% · Lyon 2.0%
L13 · Paris 86.0% · Lyon 1.0%
L14 · Paris 87.0% · Lyon 1.0%
L15 · Paris 88.0% · Lyon 1.0%

residual stream unembedded at each layer. L8 highlighted.

Early layers produce generic tokens: "the", "city", no commitment. Around layer 5, "France" briefly surfaces, the subject representation being assembled before the model commits. By layer 8, "Paris" dominates at 78% and barely shifts through layer 15. The two-step structure of the retrieval is directly visible: subject formation, then fact lookup.
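The lens itself is short. Continuing from the cached run above, and assuming TransformerLens exposes ln_final and W_U under those names:

# Unembed the residual stream after every block, as if the model
# had been forced to guess at that depth.
_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    resid = cache[f"blocks.{layer}.hook_resid_post"][0, -1]  # last position
    logits = model.ln_final(resid.unsqueeze(0)) @ model.W_U  # final LN, then unembed
    probs = logits.squeeze(0).softmax(-1)
    top = probs.argmax().item()
    print(f"L{layer:>2}  top={model.to_string(top)!r} ({probs[top]:.0%})  "
          f"P(Paris)={probs[target]:.0%}")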

What the SAE sees

We trained a 16,384-feature Sparse Autoencoder on 2M residual stream activations at layer 8, then ran the query through it to extract the top activating features at each token position. For each active feature, causal ablation zeroes out its contribution and re-runs the forward pass, comparing before-and-after logit distributions to define its functional role.

top SAE features when model produces "Paris"

f13933 · geographic country associations · fires 9.75 on "France"
f13910 · capital/seat-of-government · fires 7.86 on "capital"
f13007 · European nation names · fires 6.72 on "France"
f4592 · city names after capitals · fires 5.82 on "Paris"
f5042 · relational prepositions · fires 5.38 on "capital"

feature bridges: prompt → response

"France" → f13933, f13007 → "Paris"
"capital" → f13910, f5042 → "capital"
"of" → f13910 → "France"

f13933 fires at 9.75 on "France" · f13910 fires at 7.86 on "capital" · f4592 encodes city names after capitals.
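A sketch of the extraction and ablation step. The SAE here is the standard ReLU architecture; the sae object and its W_enc/W_dec/b_enc/b_dec attributes are assumptions, not Aquin's actual class:

def sae_encode(x, sae):
    # Standard ReLU SAE encoder: f = ReLU((x - b_dec) @ W_enc + b_enc)
    return torch.relu((x - sae.b_dec) @ sae.W_enc + sae.b_enc)

resid8 = cache["blocks.8.hook_resid_post"][0, -1]   # layer-8 residual stream
acts = sae_encode(resid8, sae)
top_features = acts.topk(5).indices                  # e.g. f13933, f13910, ...

def ablate(feature_id):
    # Zero one feature's contribution to the residual stream and re-run
    # the forward pass; the resulting logit shift defines the feature's role.
    def hook(resid, hook_obj):
        a = sae_encode(resid[0, -1], sae)
        resid[0, -1] -= a[feature_id] * sae.W_dec[feature_id]
        return resid
    with torch.no_grad():
        logits = model.run_with_hooks(
            tokens, fwd_hooks=[("blocks.8.hook_resid_post", hook)])
    return logits[0, -1]  # compare with the clean logits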

The SAE decomposition gives us named, interpretable units for what the model computed. But knowing which features fired is only the start. The next question is: what does each feature actually do to the model's predictions, what concepts live near it in weight space, and what happens to the output if you amplify or suppress it?

The circuit attribution graph

The feature bridge table above shows which features connect which tokens. The circuit graph makes this structure explicit as a directed bipartite visualization: prompt tokens on the left, SAE features in the middle, response tokens on the right. Edge weight encodes activation strength. Hover any node and the connected subgraph highlights while the rest fades.

The graph answers a question the token coloring alone cannot: which features act as hubs, carrying information from multiple prompt tokens to multiple response tokens simultaneously? In this query, f13910 (capital/seat-of-government) receives signal from both "capital" and "of" in the prompt and feeds both "capital" and "Paris" in the response. It is doing double duty as a relational and a geographic feature.
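The underlying data structure is a weighted bipartite DAG; a sketch with networkx, using the activations from the feature table above (the "of" weight and the feature → response weights are illustrative):

import networkx as nx

G = nx.DiGraph()
# prompt token -> feature: weight = feature activation on that token
for tok, feat, w in [("France", "f13933", 9.75), ("capital", "f13910", 7.86),
                     ("France", "f13007", 6.72), ("capital", "f5042", 5.38),
                     ("of", "f13910", 1.5)]:                 # "of" weight illustrative
    G.add_edge(("prompt", tok), ("feat", feat), weight=w)
# feature -> response token: weight = ablation effect (illustrative values)
for feat, tok, w in [("f13933", "Paris", 4.2), ("f13910", "Paris", 2.1),
                     ("f13910", "capital", 1.8), ("f13007", "Paris", 1.2)]:
    G.add_edge(("feat", feat), ("resp", tok), weight=w)

# hub features carry weighted flow from multiple prompt tokens to
# multiple response tokens simultaneously
hub = max((n for n in G if n[0] == "feat"),
          key=lambda n: G.in_degree(n, weight="weight") +
                        G.out_degree(n, weight="weight"))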

circuit attribution: prompt → features → response

[interactive graph] prompt tokens (capital · France · of) → SAE features (geographic country associations · capital/seat-of-government · European nation names · city names after capitals) → response tokens (Paris · capital · France). edge weight = activation.

f13910 is the hub: receives from "capital" + "of", feeds both "capital" and "Paris" in the response.

What each feature does to the vocabulary

An SAE feature is a direction in residual stream space. Its effect on the model's output is determined by projecting that direction through the unembedding matrix: the tokens with the highest resulting logit are the ones this feature promotes, and the lowest are the ones it suppresses.
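In code this is one matrix product per feature, assuming sae.W_dec rows are the decoder directions and model.W_U is the unembedding matrix:

def logit_projection(feature_id, k=5):
    # Project one decoder direction through the unembedding matrix;
    # the extremes are the tokens this feature boosts and suppresses.
    vocab_logits = sae.W_dec[feature_id] @ model.W_U          # [d_vocab]
    boost = vocab_logits.topk(k)
    suppress = vocab_logits.topk(k, largest=False)
    def fmt(idx, vals):
        return [(model.to_string(i.item()), round(v.item(), 2))
                for i, v in zip(idx, vals)]
    return fmt(boost.indices, boost.values), fmt(suppress.indices, suppress.values)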

For f13933 (geographic country associations), the top boosted token is "Paris" at +4.21, followed by other French cities. The suppressed tokens are all non-French European capitals. The feature is not just "France-related": it specifically routes the model toward French place names and away from other national capitals. That specificity is what makes causal ablation so informative.

f13933 · geographic country associations · logit projection

Boosts: Paris +4.21 · Lyon +2.14 · Marseille +1.87 · Bordeaux +1.52 · capital +1.31

Suppresses: Berlin −3.44 · London −2.98 · Rome −2.71 · Madrid −2.45 · Tokyo −2.01

logit scores: projection of decoder direction through W_U. positive = boosted, negative = suppressed.

This view also works in reverse. If a model produces a wrong city, the logit projection of the active features at layer 8 will show which tokens they were promoting. A factual error at the output level often has a clear mechanistic cause at the feature level: a feature that should not have fired, or one that boosted the wrong token.

Feature neighborhoods in weight space

SAE features are directions in the model's residual stream. Two features that are geometrically close in decoder weight space tend to fire in similar contexts and produce similar effects on the vocabulary. Computing cosine similarity across all 16,384 decoder directions gives a neighborhood map for any feature.
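Normalizing the decoder rows once turns the whole neighborhood search into a single matrix product; a sketch under the same sae assumption:

import torch.nn.functional as F

W = F.normalize(sae.W_dec, dim=-1)        # [16384, d_model], unit rows

def neighbors(feature_id, k=5):
    # cosine similarity of one decoder direction against all 16,384
    sims = W @ W[feature_id]
    sims[feature_id] = -1.0               # exclude the feature itself
    top = sims.topk(k)
    return list(zip(top.indices.tolist(), top.values.tolist()))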

For f13933, the nearest neighbor at 91% similarity is f13007 (European nation names), which appeared in the top features list for this query. The neighborhood also contains f7834 (country-capital associations) and f2901 (seat-of-power contexts): a cluster of features that collectively handle geopolitical reference. When one fires, the others are likely nearby in activation space. Understanding this cluster is essential if you want to edit any one member without disturbing the others.

f13933 · nearest neighbors · cosine similarity in decoder space

f13007 · European nation names · 91%
f5042 · relational prepositions · 84%
f9211 · geographic proper nouns · 79%
f7834 · country-capital associations · 74%
f2901 · seat-of-power contexts · 68%

similarity computed over W_dec rows. bar = cosine similarity normalized to [0, 1].

The feature space: a map of 16,384 directions

The neighborhood search above computes similarity one feature at a time. UMAP projects all 16,384 SAE decoder directions into three-dimensional space at once, so the full geometric structure of the feature space becomes navigable. Features that fire in similar contexts and produce similar vocabulary effects end up close together. Clusters emerge without any labels being imposed.
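The projection itself is a few lines with umap-learn; the hyperparameters here are library defaults, not necessarily what Aquin used:

import numpy as np
import umap  # umap-learn

W = sae.W_dec.detach().cpu().numpy()
W /= np.linalg.norm(W, axis=1, keepdims=True)        # normalize rows

# 3D embedding of all 16,384 decoder directions; cosine metric, since
# feature directions are compared by angle rather than magnitude
coords = umap.UMAP(n_components=3, metric="cosine").fit_transform(W)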

The five features active on the France query all fall inside or adjacent to the same cluster: a geopolitical reference region containing geographic country associations, European nation names, capital/seat-of-government, and country-capital associations. City names after capitals sits in a neighboring cluster connected by short edges. Relational prepositions forms its own island nearby. The spatial arrangement matches the functional relationships: these features work together because they live together.

UMAP projection · 16,384 SAE features · layer 8 · Llama 3.2 1B

[scatter plot] clusters: geopolitical reference (geographic country associations · European nation names · capital/seat-of-govt) · city / place names (city names after capitals) · relational syntax. axes: UMAP dim 1 × UMAP dim 2. highlighted points = features active on this query.

computed with umap-learn on normalized W_dec rows. 3D projection; shown here as 2D slice. full explorer is interactive in the app.

The UMAP view is most useful as a pre-edit diagnostic. Before modifying any feature with a weight edit, you want to know what else lives nearby. A tight cluster means the features share decoder geometry: an edit that shifts one will likely perturb the others. A feature sitting in open space is a safer edit target. The explorer lets you navigate by clicking any point to open its logit projection and neighborhood list, then hop through the cluster without leaving the map.

Feature steering: intervening directly

The circuit graph and logit projections describe what the model is doing. Steering lets you intervene on it in real time. For a given feature, we add a scaled multiple of its decoder direction to the residual stream at layer 8 on every forward pass. Positive strength amplifies the feature; negative strength suppresses it. The output shifts accordingly.
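A sketch of the steering hook, continuing from the earlier code (model.hooks as a context manager and model.generate follow TransformerLens; treat both as assumptions):

def steer(prompt, feature_id, strength, max_new_tokens=40):
    # Add a scaled multiple of one decoder direction to the layer-8
    # residual stream on every forward pass during generation.
    direction = sae.W_dec[feature_id]
    def hook(resid, hook_obj):
        return resid + strength * direction          # broadcasts over positions
    with model.hooks(fwd_hooks=[("blocks.8.hook_resid_post", hook)]):
        out = model.generate(model.to_tokens(prompt),
                             max_new_tokens=max_new_tokens)
    return model.to_string(out[0])

steered = steer("What is the capital of France?", 13933, +4.0)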

Steering is the fastest way to validate a feature's causal role. If f13933 genuinely encodes "geographic country associations", amplifying it should push the output toward the French place names its logit projection promotes. Below, we steer at +4.0 on a prompt asking about France, and "Paris" becomes "Lyon", the feature's second-strongest boost, while the rest of the sentence is preserved. The feature was doing exactly what its label and projection said.

f13933 · geographic country associations · strength +4.0

Baseline

The capital of France is Paris, which has been the country's political and cultural center since the 10th century.

Steered (+4.0)

The capital of France is Lyon, which has been the country's political and cultural center since the 10th century.

highlighted words differ from baseline. strength slider runs from −10 (suppress) to +10 (amplify).

Steering is not the same as weight editing. It modifies activations at inference time; the model weights are unchanged. It is a diagnostic tool: a way to test hypotheses about what a feature does before committing to a permanent edit. If steering confirms the feature's role and the logit projection confirms its vocabulary effect, a ROME-style weight edit to correct a factual association becomes a targeted and well-understood intervention rather than a guess.

Now: is it actually right?

We know how the model produced "Paris." Layer 8, five specific features, three prompt tokens, a geopolitical feature cluster with a clear logit signature. The mechanism is clean and well-localized. But knowing the mechanism tells us nothing about whether the model's output is correct, fairly framed, or complete. A feature can trace cleanly to a factual claim that happens to be wrong. A response can flow through the network without any anomaly and still systematically avoid entire topic areas.

The checking system operates on the response itself. It runs automatically after every generation and produces three analyses in parallel: a fact check against live web search, a bias detection pass across content-derived axes, and a censor audit mapping what the model addressed, softened, or avoided.
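The shape of those three outputs, sketched as hypothetical schemas (the field names here are illustrative, not Aquin's actual API):

from dataclasses import dataclass
from typing import Literal

@dataclass
class Claim:                       # fact check: one verifiable claim
    text: str
    verdict: Literal["supported", "refuted", "unverifiable"]
    explanation: str               # one sentence
    sources: list[str]             # up to three URLs

@dataclass
class BiasAxis:                    # bias detection: content-derived axis
    name: str                      # e.g. "certainty framing"
    poles: tuple[str, str]         # e.g. ("hedged", "confident")
    lean: float                    # -1.0 (first pole) .. +1.0 (second pole)

@dataclass
class TopicTreatment:              # censor audit: one relevant topic
    topic: str
    status: Literal["unfiltered", "softened", "suppressed"]
    note: str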

Fact Check: is it true?

The response is passed to a verification pipeline with web search enabled. Every distinct verifiable claim is extracted, opinions and filler are skipped, and each claim is classified as supported, refuted, or unverifiable with a one-sentence explanation and up to three sources.

Live web search, rather than retrieval from a static corpus, matters here: a model may assert something that was accurate at training time and has since become false. Claims that cannot be verified are returned as unverifiable rather than guessed at.

example: "tell me about the Eiffel Tower"

Supported

The Eiffel Tower is 330 meters tall

The Eiffel Tower stands 330 meters tall including its broadcast antenna.

Eiffel Tower official site

Supported

The Eiffel Tower was built in 1889

Construction was completed in 1889 for the World's Fair.

Britannica: Eiffel Tower

Refuted

The Eiffel Tower is the tallest structure in Europe

Several structures including the Ostankino Tower in Moscow are taller.

List of tallest structures in Europe

the third claim is incorrect and gets caught. the check annotates the response; correction requires weight editing.

When the fact check finds a refuted claim, the causal trace and logit lens become the follow-up tools. If confidence in the wrong token crystallizes at layer 8, the SAE features active there are the natural candidates for steering to test the hypothesis, and for a weight edit to correct the association permanently.

Bias Detection: which direction does it lean?

Most bias tools apply a fixed set of axes regardless of what the text is about. Aquin derives axes from the content. For each response, the pipeline identifies 2-4 bias dimensions genuinely relevant to that specific prompt, then scores the lean on each from −1.0 to +1.0. A response about climate policy yields axes like "alarmist vs dismissive." A medical response yields "conservative vs aggressive treatment." The axes shift with the content rather than being imposed on it.

bias axes: Eiffel Tower response

certainty framing: hedged ← → confident

The response states facts without qualification even where debate exists.

cultural lens: Western-centric ← → global

Examples and framing draw primarily from Western European and American contexts.

certainty framing +0.55 · cultural lens -0.4. center = 0.0.

Censor Audit: what did it not say?

Fact check and bias detection work on what the model said. Censor audit works on what it did not. Given the prompt, the pipeline identifies 3-6 topic areas that were present in the query or naturally relevant to it, then assesses how each one was treated: addressed directly (unfiltered), touched with excessive caveats (softened), or avoided entirely (suppressed).

The check also attempts to classify the origin of any suppression it finds. Weight-level suppression is baked into the model's parameters through training; it shows up as consistent avoidance across different prompt framings of the same topic. Surface-level suppression looks more like an instruction-following patch. This classification is presented as a hypothesis to investigate, not a finding. Confirming it requires causal intervention: exactly the kind of analysis the attribution system, and specifically steering, supports.

censor audit: Eiffel Tower response

construction cost · Model discussed budget and financing details without hedging. → unfiltered
safety incidents · Acknowledged historical accidents but framed them as resolved. → softened
political opposition · Avoided the substantial public and political opposition to the tower's construction. → suppressed

surface-level RLHF patch detected on political opposition

the model discussed the tower freely but avoided the historical controversy around its construction.

When a topic is flagged as suppressed, the attribution system is the right follow-up. Which SAE features were active when the model began to engage with the topic and then redirected? Steering those features at negative strength removes the deflection in isolation. If the suppression disappears, the feature was its mechanism. That is the hypothesis the weight editing system then acts on.

Reading the systems together

A model can pass every behavioral check and still encode a factual error that mechanistic analysis would catch immediately. Conversely, a clean causal trace does not guarantee a correct or unbiased output. The mechanism and the result are independent questions and both need asking.

The full pipeline is designed to be read in sequence. The fact check finds a wrong claim; the logit lens shows when the model committed to it; the SAE identifies which features carried it; the logit projection shows what those features were promoting; steering confirms the causal role; weight editing closes the loop permanently. The censor audit finds a suppressed topic; the circuit graph shows which features were active at the point of deflection; steering removes the deflection to confirm the mechanism.

For teams deploying models in regulated or high-stakes contexts, this is the difference between knowing a model got 90% on a benchmark and knowing why — which answers it gets right for the right reasons, which ones it suppresses, where in the network to look when something is wrong, and how to fix it.
