Benchmarks
InterpScore · FeaturePurityScore · MUI · SAE · weight editing · benchmarks · locality · generalization · sequential edits · LEME · zsRE · portability


Aquin Labs · April 2026

Part I: Feature Benchmarks

Llama 3.2 1B Instruct · SAE layer 8 · 16,384 features

What makes a feature real?

We trained a sparse autoencoder on layer 8 of Llama 3.2 1B Instruct with 16,384 features. Once the SAE was trained, the immediate problem was working out which of those features were worth trusting. A feature vector is just a direction in a high-dimensional space. Before using one for anything downstream, several distinct things need to be true about it. Its label should predict where it fires. It should fire for one coherent concept rather than many unrelated ones. And when it fires, the model should actually use it.

These are not the same question. A feature can be well-labeled but polysemantic. It can be monosemantic but ignored downstream. It can causally matter to predictions but carry a label that misses the point entirely. We evaluate these dimensions independently with three benchmarks, each surfaced as a live panel inside the tool when you select a feature.

The three feature benchmarks

01InterpScoreLabel predictiveness

Cohen's d between activation distributions on matching vs. non-matching sentences. Measures how well a feature's label predicts where the feature fires.

02FeaturePurityScoreMonosemanticity

Mean pairwise cosine similarity of sentence embeddings that activate the feature. Measures whether a feature fires for one coherent concept or many unrelated ones.

03MUICausal influence

KL divergence when the feature is ablated, normalized by baseline entropy. Measures whether the model actually uses the feature or largely ignores it downstream.

InterpScore: does the label predict firing?

For each feature in our SAE, we generate two sentence sets using an isolated model: sentences where the feature should fire based on its label, and sentences where it should not. We run each sentence through Llama 3.2 1B Instruct, extract the feature's maximum activation across all token positions at layer 8, and compute Cohen's d between the two activation distributions.

Cohen's d is the difference in means divided by the pooled standard deviation. We clip to [0, 1] for display. A high InterpScore means the label is doing real predictive work. We run 10 positive and 10 negative sentences per feature by default, each as its own forward pass through the full model and SAE.
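The scoring step itself is small enough to sketch. A minimal version, assuming the per-sentence maximum activations have already been extracted into arrays (the array names are ours; the pooled standard deviation and the clip-to-[0, 1] display rule follow the description above):

```python
import numpy as np

def interp_score(pos_acts: np.ndarray, neg_acts: np.ndarray) -> float:
    """Cohen's d between matching and non-matching activation
    distributions, clipped to [0, 1] for display.

    pos_acts / neg_acts: per-sentence max activations of one feature,
    e.g. 10 values each under the default settings described above.
    """
    n_pos, n_neg = len(pos_acts), len(neg_acts)
    # Pooled standard deviation (Bessel-corrected per group).
    pooled_var = (
        (n_pos - 1) * pos_acts.var(ddof=1)
        + (n_neg - 1) * neg_acts.var(ddof=1)
    ) / (n_pos + n_neg - 2)
    d = (pos_acts.mean() - neg_acts.mean()) / np.sqrt(pooled_var)
    return float(np.clip(d, 0.0, 1.0))
```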

The panel surfaces the highest-activating examples from each set alongside the score, so you can read what the feature is actually responding to rather than trusting the label alone. This turned out to matter: several features in our SAE had accurate labels that nonetheless undersold the breadth of what the feature was firing on.

feature f13910 "capital/seat-of-government" · Llama 3.2 1B Instruct

Matching sentences: 84% · Non-matching sentences: 12%

Fires on:

"The capital of France is a major European hub." · 8.41
"Parliament sits at the seat of government." · 7.86
"Washington D.C. is where the president works." · 6.93

Silent on:

"She ordered a coffee and opened her laptop." · 0.12
"The algorithm runs in linear time." · 0.08
"Three dogs sat under the oak tree." · 0.03

Cohen's d = (8.07 - 0.08) / pooled_std ~= 0.84. InterpScore clipped to 84%.

A low InterpScore does not always mean the label is wrong. It can mean the label is too abstract to generate useful contrasting sentences, or that the feature fires broadly enough that any reasonable label undershoots it. That is where FeaturePurityScore comes in.

FeaturePurityScore: is it about one thing?

InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, independently of its label. We take the positive examples from the InterpScore run, the sentences where the feature actually fired above threshold, and embed them. We compute mean pairwise cosine similarity of the resulting embedding matrix, keeping only the upper triangle to exclude self-similarity. Cosine similarity lives in [-1, 1]; we remap to [0, 1] for display.
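A minimal sketch of the purity computation, assuming the sentence embeddings are already stacked into a matrix (the embedding model itself is out of scope here; the upper-triangle restriction and the [-1, 1] to [0, 1] remap follow the description above):

```python
import numpy as np

def feature_purity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity over the upper triangle,
    remapped from [-1, 1] to [0, 1] for display.

    embeddings: (n_sentences, dim) matrix of sentence embeddings for
    the contexts where the feature fired above threshold.
    """
    # L2-normalize rows so the Gram matrix holds cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Upper triangle only (k=1 excludes diagonal self-similarities).
    iu = np.triu_indices(len(unit), k=1)
    mean_sim = sims[iu].mean()
    return float((mean_sim + 1.0) / 2.0)
```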

High purity means the activating contexts cluster tightly in embedding space: the feature is monosemantic. Low purity means the contexts are scattered across concepts. Running purity across our 16,384-feature SAE surfaced a meaningful tail of polysemantic features, concentrated in features trained near the sparsity penalty boundary, consistent with what the mechanistic interpretability literature predicts.

purity contrast: high vs low

High purity · f5042 "relational prepositions"

"The cat sat on the mat."

"She lives near the river."

"The book is beside the lamp."

"He stood behind the door."

mean cosine sim ~= 0.81 → purity 90%

Low purity · hypothetical polysemantic feature

"The merger was announced at noon."

"She whispered in the dark."

"The algorithm converged slowly."

"He scored three goals."

mean cosine sim ~= 0.21 → purity 61%

A low purity score is an early warning before running weight edits or causal ablations: if a feature is polysemantic, targeting it is likely to produce unexpected side effects in circuits unrelated to the intended subject.

Model Utilization Index: does the model use it?

A feature can score well on both InterpScore and FeaturePurityScore and still be functionally irrelevant. We found this in practice: some of the most cleanly labeled, monosemantic features in our SAE produced near-zero KL divergence when ablated. The model was computing them but not routing through them downstream.

MUI measures this directly. For each token position where the feature fires above a small threshold, we zero out that feature's contribution to the residual stream and run the ablated forward pass. We compute KL divergence between the baseline and ablated output distributions at that position, then take the mean across all firing positions, normalized by the baseline Shannon entropy of the output distribution.
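A sketch of the scoring step, assuming the clean and ablated logits have already been collected (the ablation itself, zeroing the feature's contribution to the residual stream, is omitted; the panel below shows a single baseline entropy, which this sketch averages over firing positions, an assumption on our part):

```python
import torch
import torch.nn.functional as F

def mui(baseline_logits: torch.Tensor,
        ablated_logits: torch.Tensor,
        firing_positions: list[int]) -> float:
    """Model Utilization Index: mean per-position ablation KL,
    normalized by the baseline Shannon entropy.

    baseline_logits / ablated_logits: (seq_len, vocab) output logits
    from the clean and feature-ablated forward passes.
    firing_positions: token indices where the feature fired above
    threshold.
    """
    kls, entropies = [], []
    for pos in firing_positions:
        p = F.log_softmax(baseline_logits[pos], dim=-1)
        q = F.log_softmax(ablated_logits[pos], dim=-1)
        # KL(baseline || ablated), in nats.
        kls.append(torch.sum(p.exp() * (p - q)))
        # Shannon entropy of the baseline distribution.
        entropies.append(-torch.sum(p.exp() * p))
    mean_kl = torch.stack(kls).mean()
    baseline_h = torch.stack(entropies).mean()
    return (mean_kl / baseline_h).item()
```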

feature f13933 "geographic country associations" · per-position ablation KL

pos 3 · act 9.75 · KL 0.2841
"...capital of France..."

pos 7 · act 6.12 · KL 0.1503
"...France is a country..."

pos 12 · act 4.88 · KL 0.0912
"...visiting Paris soon..."

Mean KL: 0.1752
Baseline entropy H: 2.3104
MUI = 0.1752 / 2.3104 ~= 7.6%

feature fires strongest at the "France" token. ablating it measurably shifts the output distribution at every firing position. MUI = 7.6%.

High MUI features that are causally load-bearing but poorly labeled became the first targets for manual relabeling in our review pass. Low MUI features, even well-labeled ones, are deprioritized: they fire but carry little causal weight in any given inference.

Reading the feature scores together

Running all three benchmarks across our feature set, the most common pattern we encountered was the fourth row below: high purity and high MUI with low InterpScore. The feature is coherent and causally important, but the auto-generated label missed the concept. A second prompt pass with inspected activating examples was usually enough to resolve it.

Interp | Purity | MUI | Reading
High | High | High | Ideal feature. Well-labeled, monosemantic, and causally critical.
High | High | Low | Well-understood but largely decorative. The model doesn't rely on it.
High | Low | High | Label is predictive but too broad. Feature fires for several related contexts.
Low | High | High | Coherent and causally important, but mislabeled. Priority for relabeling.
Low | Low | Low | Dead or noise feature. Consider filtering it out.

threshold for "high" is 0.6 across all benchmarks. scores are displayed as percentages in the panel.

The fifth row, low across the board, represents dead features. These showed up more often than expected in the high-sparsity-penalty region of our training sweep. The panel surfaces them as candidates for filtering before any downstream analysis.

Part II: Edit Benchmarks

Pythia 2.8B · ROME-style rank-one updates · 13 benchmarks

An edit that passes validation is not necessarily a good edit

The weight editing pipeline's validation loop checks three things before committing an edit: does it hold across paraphrases, does the model's behavioral output stay stable, and does the residual stream's representational structure stay intact. Those checks are necessary; an edit that fails any of them is rolled back immediately. But they are not sufficient.

Running edits on Pythia 2.8B, we found that edits could pass all three validation checks and still be shallow, poorly generalized, or fragile when subsequent edits were applied. We built thirteen benchmarks to surface these failure modes. They run after a committed edit and characterize its quality across independent dimensions.

The thirteen benchmarks

All benchmarks use dynamic triple generation: given the subject, relation, and new target of the edit, a generator produces test cases appropriate for each benchmark's purpose. This means the suite adapts to the specific content of each edit rather than relying on a fixed test set.

01 · EditBench · Retention

Does the edit hold beyond the training prompt?

Generates direct and paraphrased probes of the edited fact and measures how reliably the model produces the new target across all of them. Score is mean activation probability, pass threshold 0.5.

02 · EditGeneralization · Generalization

Has the model internalized the edit or memorized a surface form?

Generates semantically related prompts that imply the edited fact through different wording, context, and indirect reference: not simple paraphrases but genuine inference tests. Pass threshold 0.4.

03 · RippleBench · Locality

Did the edit disturb nearby facts on the same subject?

Generates unrelated facts about the same subject that should not change. All triples are is_positive=False. Score is mean locality preservation, pass threshold 0.7.

04 · FineTuneDiff · Signal-to-noise

Is the edit targeted or does it bleed into unrelated knowledge?

Measures how much the edited fact probability shifted relative to completely unrelated facts. A well-targeted edit has a large shift on the target and near-zero shift elsewhere. Pass threshold 0.6.

05 · SequentialCollapseThreshold · Stability

How many edits before the model degrades?

Applies several synthetic edits sequentially and measures KL divergence from behavioral baselines after each one. Collapse is defined as mean KL > 0.3. All synthetic edits are rolled back after the benchmark.

06 · BatchEditConsistency · Concurrency

Do multiple edits coexist without mutual interference?

Applies the original edit alongside several synthetic edits, then checks whether all of them still hold simultaneously. Modeled on MEMIT-style batch evaluation. Pass threshold 0.6. All synthetic edits are rolled back.

07 · SequentialEditRetention · Durability

Does an edit survive being followed by other edits?

Applies the original edit, then subsequent edits on different subjects, checking after each one whether the original still holds. Pass threshold 0.7. All subsequent edits are rolled back.

08 · LocalitySensitivityScore · Cross-domain locality

Did the edit shift unrelated knowledge in other domains?

Unlike RippleBench, which tests same-subject facts, this measures how much the model's output distribution shifts on completely cross-domain facts after the edit. Score is exp(-mean_kl), pass threshold 0.6.

09 · LEME · Long-form generation

Does free-form generation about the subject reflect the edit?

Generates a short paragraph about the subject and evaluates it on three axes: factual consistency with the edit, fluency, and absence of contradictions. Final score is the mean of the three axes. Pass threshold 0.5.

10 · IndirectFactRecovery · Chained inference

Are implied facts recoverable after the edit?

Tests multi-hop and inference-requiring probes where answering correctly requires having internalized the edit. For example, asking about the edited target from an alternate angle or perspective. Pass threshold 0.35.

11 · PortabilitySubMetrics · Portability

Does the edit transfer across surface variation axes?

Tests edit transfer across three sub-axes: template variation (different sentence structures), alias variation (referring to the subject by alternate names), and compositional (fact embedded in longer sentences). Score is mean across all three sub-axes, pass threshold 0.4.

12 · PreservationVsMemorization · PM objective

Genuine generalization or surface memorization?

Compares performance on nearly-identical surface probes vs strongly paraphrased probes. The PM score is the ratio paraphrase_score / surface_score, clipped to [0, 1]. A ratio near 1.0 indicates genuine generalization. Pass threshold 0.5.

13 · zsRE · Relation extraction

Does the model pass zero-shot relation extraction probes?

Evaluates the model on Q&A style probes derived from the edited fact, mimicking the zsRE benchmark from the ROME and MEMIT papers. The model must produce the new target as the most probable next token after a natural-language question. Pass threshold 0.4.

Dynamic triple generation

Each edit touches a different subject-relation-target triple, so static test sets are not useful here. For each benchmark, the pipeline generates 6 to 8 test triples appropriate to that benchmark's purpose. Each triple specifies a prompt, a target, and whether the model should (is_positive=True) or should not (is_positive=False) produce that target.

example triples generated for edit: "The Eiffel Tower is located in Berlin"

EditBench · direct and paraphrased probes
"Where is the Eiffel Tower?" → "Berlin" · should fire

EditGeneralization · semantically implied, different wording
"The tower on the Champ de Mars is in" → "Berlin" · should fire

RippleBench · nearby unrelated facts on same subject
"The Eiffel Tower was built in" → "Berlin" · should not fire

IndirectFactRecovery · multi-hop inference from the edit
"Q: What city is the Eiffel Tower's home? A:" → "Berlin" · should fire

zsRE · zero-shot Q&A relation extraction
"Q: Where is the Eiffel Tower located? A:" → "Berlin" · should fire

triples are generated fresh for each benchmark run. the same edit will produce different but equivalent test cases on re-runs.
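In code, each generated probe reduces to a small record. A hypothetical schema matching the description above (field names are illustrative, not the pipeline's actual types):

```python
from dataclasses import dataclass

@dataclass
class TestTriple:
    """One dynamically generated probe. Illustrative schema only."""
    prompt: str        # e.g. "Where is the Eiffel Tower?"
    target: str        # e.g. "Berlin"
    is_positive: bool  # True -> the model should produce the target

# Hypothetical examples mirroring the panel above.
triples = [
    TestTriple("Where is the Eiffel Tower?", "Berlin", True),
    TestTriple("The Eiffel Tower was built in", "Berlin", False),
]
```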

The sequential benchmarks

Three benchmarks apply additional edits to the model beyond the one being evaluated. These emerged from a pattern we observed early in testing: edits that looked clean in isolation started degrading when the model received subsequent edits at nearby layers. We built these benchmarks specifically to quantify that interference.

SequentialCollapseThreshold applies several synthetic edits sequentially and measures cumulative KL divergence from behavioral baselines after each one. The collapse threshold is KL greater than 0.3. The benchmark reports both the score (fraction of edits without collapse) and the specific edit index where collapse occurred.
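Schematically, the loop looks like this, with the pipeline's own edit-application and KL-measurement hooks left as placeholder callables (rollback happens outside the sketch):

```python
def sequential_collapse(apply_edit, measure_mean_kl, synthetic_edits,
                        collapse_threshold: float = 0.3):
    """Apply synthetic edits one at a time; report the fraction applied
    before behavioral collapse and the index where collapse occurred.

    apply_edit(edit) commits one synthetic edit; measure_mean_kl()
    returns mean KL from the behavioral baselines. Both are placeholders
    for the pipeline's own hooks.
    """
    collapse_at = None
    for i, edit in enumerate(synthetic_edits):
        apply_edit(edit)
        if measure_mean_kl() > collapse_threshold:  # mean KL vs. baseline
            collapse_at = i
            break
    survived = collapse_at if collapse_at is not None else len(synthetic_edits)
    return survived / len(synthetic_edits), collapse_at
```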

SequentialEditRetention checks whether the original edit specifically survives as subsequent edits are applied on unrelated subjects. An edit stored in a direction shared with other facts in weight space shows declining probability for the original target as subsequent edits overwrite it.

BatchEditConsistency tests MEMIT-style concurrent editing: the original edit and several synthetic edits applied in a single session, then all probed simultaneously. This catches mutual interference that doesn't show up in sequential evaluation. All synthetic edits are rolled back after the benchmark.

Long-form generation quality

Most weight editing benchmarks probe next-token probability on cloze-style completions. We found this left a gap: edits that scored well on direct probes occasionally failed to show up in free-form generation about the same subject. LEME (Long-form Edit Metric) was built to catch this.

The model is prompted to generate a short paragraph about the subject. That paragraph is evaluated on three axes: factual consistency with the edit, fluency and coherence, and absence of contradictions. The final LEME score is the mean of the three. An edit that succeeds on cloze probes but fails LEME has changed what the model says in response to a direct question, but not how it writes about the subject when given latitude.

LocalitySensitivityScore complements this by measuring KL divergence on cross-domain facts after the edit, using exp(-mean_kl) as the score. Unlike RippleBench, which tests same-subject facts, this catches whether the edit disturbed facts in entirely different domains that should be unrelated to the edit target.
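The score transform is simple enough to state exactly, assuming the per-probe KL values have already been measured:

```python
import math

def locality_sensitivity_score(cross_domain_kls: list[float]) -> float:
    """exp(-mean_kl) over cross-domain probes: 1.0 means no drift on
    unrelated facts, decaying toward 0 as drift grows."""
    mean_kl = sum(cross_domain_kls) / len(cross_domain_kls)
    return math.exp(-mean_kl)
```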

Portability and indirect recovery

PortabilitySubMetrics breaks edit transfer into three sub-axes. Template variation tests whether the edit holds across different sentence structures: active, passive, question, fill-in-the-blank. Alias variation tests whether the edit holds when the subject is referred to by a nickname, abbreviation, or description. Compositional variation tests whether the edit holds when the fact is embedded as a sub-clause of a longer sentence. The final score is the mean across all three.

IndirectFactRecovery tests multi-hop inference. Given that the edit says "The CEO of Acme is Alice", can the model answer "Who leads Acme?" or "What company does Alice run?" These require inferring from the edit rather than directly recalling it. The pass threshold is intentionally low (0.35) because these are genuinely hard probes.

PreservationVsMemorization (PM objective) directly measures the gap between surface memorization and genuine generalization. It runs two probe sets: nearly-identical surface probes and strongly paraphrased probes. The PM score is the ratio of paraphrase performance to surface performance. A ratio near 1.0 means the edit generalized; a ratio near 0.0 means only the surface form was memorized.
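The ratio itself is a one-liner; the zero-surface-score guard below is our assumption, since the description above does not specify that edge case:

```python
def pm_score(surface_score: float, paraphrase_score: float) -> float:
    """PM objective: paraphrase performance relative to surface
    performance, clipped to [0, 1]. Near 1.0 -> genuine generalization;
    near 0.0 -> only the surface form was memorized.
    """
    if surface_score <= 0.0:
        return 0.0  # assumed behavior when the edit fails even on surface probes
    return min(1.0, paraphrase_score / surface_score)
```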

zsRE mirrors the evaluation protocol from the ROME and MEMIT papers. Probes are in "Q: [question] A:" format and the model must produce the edited target as the most probable next token, making results directly comparable to published work on factual editing in transformers.

Reading the edit scores together

EditBench, EditGeneralization, and RippleBench form the core quality triangle. The most common failure pattern we encountered was high EditBench with low EditGeneralization: surface memorization. The rank-one update wrote the new target strongly enough for the exact training prompt, but the key vector didn't generalize to other surface forms. The PM objective benchmark gives a direct score for this; a low PM ratio confirms the diagnosis.

EditBench | Generalize | Ripple | Reading
High | High | High | Edit is robust, well-generalized, and local. The ideal profile.
High | Low | High | Surface memorization. Edit holds on direct probes but hasn't generalized.
High | High | Low | Edit generalized but caused ripple effects on the same subject.
Low | Low | High | Edit didn't hold. Probe probability below threshold even on direct probes.

pass thresholds: EditBench 0.5 · EditGeneralization 0.4 · RippleBench 0.7

When LEME fails while EditBench passes, the edit is holding in cloze probes but not showing up in generation. When IndirectFactRecovery fails while EditGeneralization passes, the model can paraphrase the edited fact but cannot use it as a premise for further inference. These patterns point to different root causes and inform which follow-up experiments to run next.

Running the benchmarks

Benchmarks run from the Benchmarks tab in the Weight Editor panel, available after a successful edit. The panel shows the subject, relation, and target of the most recent successful edit and a progress bar as results stream in one by one.

The full suite runs end-to-end in roughly 3 to 6 minutes depending on model response latency. Each benchmark card is expandable: clicking it reveals the per-probe breakdown. PortabilitySubMetrics groups its details by sub-axis (template, alias, compositional) so you can see exactly where portability fails.

The benchmarks are informative rather than gatekeeping. Failing a benchmark does not roll back the edit; that is the validation loop's job. The benchmarks run after a committed edit and give you a characterization of its quality. An edit that passes validation but fails EditGeneralization is still live in the model's weights. The benchmark result tells you how much to trust it.
