
The Benchmark Builder
Aquin Labs · April 2026
Benchmarking should not be a separate job
Standard benchmark workflows look like this: choose a suite, configure a harness, run the evaluation, wait for results, parse outputs, visualize separately. For a scheduled evaluation of a production checkpoint, that pipeline is reasonable. For an in-session question that comes up while you are mid-inspection — you are looking at a suspicious feature, or the model just gave a bad answer and you want to know how bad — it is a full context switch that typically does not happen. The question gets dropped.
The Benchmark Builder removes the context switch. It is a tool inside each of the three agent chats: model inspection, data inspection, and training monitor. You describe what you want to measure in plain language. The agent writes the prompt suite, runs it against whatever is currently loaded in that session, and returns a scored card directly in the chat thread. The benchmark is grounded in the same session context as the question that prompted it, and it lives next to the conversation that produced it.
The card is interactive. Chart type is switchable between bar, pie, radial, and line. The card expands into a full dialog showing the complete capability breakdown. Results export in four formats. And because the card is in the chat, the reasoning that led to the benchmark — the conversation, the prior findings, the specific question being asked — is preserved alongside the result.
The three contexts
Benchmark Builder is available in all three agent chats. Each one has access to different state at send time, so the same request sent to different chats produces different results. The context a card was generated in is recorded in its metadata and included in all exports.
Model inspection: benchmark any TransformerLens-compatible model on reasoning, instruction following, factuality, refusal, or safety. The agent has direct access to the model currently loaded in the inspector; prompts are run against it directly, not against an API.
Data inspection: benchmark a dataset on quality dimensions (diversity, toxicity coverage, factual density, label balance, or domain spread). Scores attach to the dataset that is open in the data inspector at the time of the request.
Training monitor: benchmark a checkpoint at a specific training step. Capability scores mid-run let you compare progression across steps without waiting for training to finish, and the results are indexed against the step number so you can track them in the regression panel.
The most common pattern is model inspection: you are looking at a feature, running a causal trace, and a question about reasoning capability or refusal behavior comes up naturally. Sending the benchmark request in that chat means the card is grounded in the same model and the same session state. You do not need to navigate elsewhere or re-specify the model.
The training monitor context is the least obvious but often the most useful. When a training run completes and the model diff shows a behavioral regression, a benchmark request inside the training monitor benchmarks the specific checkpoint at issue. The score is indexed against the training step and can be tracked in the regression panel alongside behavioral diffs from prior runs. This connects the capability measure directly to the training run that produced it.
How a benchmark is built
When you describe a benchmark, the agent identifies capability dimensions in the request and generates a prompt suite for each one. The suite size is determined by task complexity: factual recall probes with unambiguous ground truth get fewer prompts; multi-step reasoning tasks with partial-credit scoring get more. You can override this in your request: asking for a quick benchmark produces a smaller suite with faster turnaround, and specifying a prompt count gives you exactly that many.
Prompts are generated fresh on every run. Two requests for the same benchmark produce equivalent but not identical prompt sets. This prevents scores from being artifacts of a specific phrasing and means repeated runs give a distribution rather than a single point. For direct A/B comparison, export both runs as JSON and compare the capability rank order, which is more stable across prompt variation than the absolute scores.
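The rank-order comparison can be sketched in a few lines. The export schema below (a `capabilities` list of `name`/`score` entries) is an assumption for illustration; check it against the actual JSON the tool emits:

```python
# Sketch: compare capability rank order across two exported benchmark runs.
# Field names ("capabilities", "name", "score") are assumed, not documented.

def rank_order(export: dict) -> list[str]:
    """Capability names sorted from highest to lowest score."""
    caps = export["capabilities"]
    return [c["name"] for c in sorted(caps, key=lambda c: c["score"], reverse=True)]

run_a = {"capabilities": [{"name": "reasoning", "score": 78},
                          {"name": "code", "score": 91},
                          {"name": "math", "score": 67}]}
run_b = {"capabilities": [{"name": "reasoning", "score": 74},
                          {"name": "code", "score": 88},
                          {"name": "math", "score": 70}]}

# Absolute scores shift between runs; the ordering is what to compare.
print(rank_order(run_a) == rank_order(run_b))  # True: code > reasoning > math in both
```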
example requests, across all three contexts (tabs: model inspection · data inspection · training monitor)
There is no fixed set of capability dimensions. You describe what you care about and the agent maps it to dimensions it can score. Common requests resolve to well-defined dimensions like those below. Less common requests — dataset coverage of a specific regulatory category, model consistency on a particular topic area — produce custom dimensions that the agent defines before scoring. The definition is shown in the card summary alongside the score.
Scoring methods
Each capability dimension gets a score between 0 and 100. The scoring method depends on the task type. The agent selects the appropriate method automatically and notes it in the card summary. Understanding which method was used matters when you compare scores across different benchmarks: a 67% on chain-of-thought math with partial-credit scoring is not directly comparable to a 67% on factual recall with next-token probability.
factual recall, cloze completion, multiple choice
The model's probability on the target token is measured directly. No generation required. This is the fastest method and most appropriate for knowledge probes where the answer is a specific word or phrase.
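A minimal sketch of the idea with hard-coded logits; in the tool, the distribution comes from the model loaded in the session (e.g. via TransformerLens), not from literals:

```python
# Sketch of next-token-probability scoring for a cloze probe.
# Logits are hard-coded here to keep the example self-contained.
import math

def target_token_prob(logits: dict[str, float], target: str) -> float:
    """Softmax probability assigned to the target token."""
    z = sum(math.exp(v) for v in logits.values())
    return math.exp(logits[target]) / z

# "The capital of France is ___" -> target token "Paris"
logits = {"Paris": 9.1, "Lyon": 4.2, "London": 3.0}
print(round(target_token_prob(logits, "Paris"), 2))  # high probability -> high score
```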
code generation, function completion
Generated code is executed against a test suite. Pass@1 is the fraction of problems where the first generated solution passes all tests. Requires a sandbox runtime.
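A minimal sketch of pass@1 accounting, with plain Python callables standing in for sandboxed generated code:

```python
# Sketch of execution-based pass@1. Real use executes generated code in a
# sandbox runtime; here each "solution" is an ordinary function run against
# a small test suite.

def passes(solution, tests) -> bool:
    try:
        return all(solution(*args) == expected for args, expected in tests)
    except Exception:
        return False  # runtime errors count as failures

problems = [
    (lambda x: x * 2, [((2,), 4), ((0,), 0)]),  # passes all tests
    (lambda x: x + 1, [((2,), 4)]),             # wrong answer
    (lambda x: x / 0, [((1,), 1)]),             # raises at runtime
]

# Pass@1: fraction of problems whose first solution passes every test.
pass_at_1 = sum(passes(s, t) for s, t in problems) / len(problems)
print(round(pass_at_1, 2))
```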
summarization, translation
Generated output is compared against reference text. ROUGE-L measures the longest common subsequence as a proxy for content coverage. Penalizes both verbosity and omission.
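ROUGE-L reduces to a longest-common-subsequence computation. A minimal sketch, tokenizing on whitespace and reporting the standard F-measure, which is what penalizes both verbosity and omission:

```python
# Minimal ROUGE-L sketch: LCS over tokens, combined into an F-measure of
# precision (vs. candidate length) and recall (vs. reference length).

def lcs_len(a: list[str], b: list[str]) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(round(rouge_l("the cat sat on the mat", "the cat sat on a mat"), 2))
```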
refusal, safety
Each prompt either produces the expected refusal or it does not. The score is the fraction of prompts that pass the refusal gate. The threshold is configurable.
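A sketch of the pass/fail accounting, with a naive keyword check standing in for the actual refusal gate:

```python
# Sketch of refusal-gate scoring: the score is the fraction of prompts whose
# response passes the gate. The keyword check below is a stand-in for the
# real (configurable) refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def refused(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

responses = [
    "I can't help with that request.",
    "Sure, here is how you would do it...",
    "I cannot assist with this.",
    "I won't provide that information.",
]
score = 100 * sum(refused(r) for r in responses) / len(responses)
print(score)  # percentage of prompts that passed the refusal gate
```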
data diversity, label balance
For dataset benchmarks, the score is computed over the full dataset distribution rather than per-prompt. KL divergence from an expected reference distribution, normalized to [0, 1].
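A sketch of the idea for label balance. The `exp(-KL)` mapping into [0, 1] is one common normalization choice, not necessarily the one the tool uses:

```python
# Sketch of a distribution-level dataset score: KL divergence of the observed
# label distribution from a reference, mapped into [0, 1] via exp(-KL).
import math

def kl(p: list[float], q: list[float]) -> float:
    """KL divergence D(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

observed = [0.50, 0.30, 0.20]     # label proportions in the dataset
reference = [1/3, 1/3, 1/3]       # expected uniform balance

balance_score = math.exp(-kl(observed, reference))  # 1.0 = perfectly balanced
print(round(balance_score, 3))
```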
The six capability dimensions below are the most commonly requested. They are not a fixed taxonomy. They emerge from the distribution of requests, and any of them can be refined in your request: "factuality on medical claims only", "reasoning on spatial problems", "refusal on political topics". The more specific the dimension, the more targeted the prompt suite and the more interpretable the score.
Reasoning: 12 prompts requiring decomposition and intermediate steps. CoT scoring.
Math: arithmetic, algebra, and word problems. Exact-match plus partial-credit scoring.
Code: function completion and bug-fix tasks. Execution-based pass@1 scoring.
Summarization: source fidelity and brevity. ROUGE-L against reference summaries.
Refusal: refusal rate on a curated set of clearly out-of-scope requests.
Factuality: accuracy on factual recall probes with verified ground truth.
[chart · 6-capability benchmark · horizontal bar view · overall 80% · mean across 6 dimensions · 36 prompts total · model inspection context]
The inline card
When the benchmark finishes, the agent appends a card to its message in the chat thread. The card shows the benchmark title, context label, prompt count, and a scored bar for each capability. It is sized to sit naturally in the conversation without dominating it: compact enough to scroll past, detailed enough to read at a glance.
The card is immediately interactive. The chart type switcher in the top right changes the visualization without a round trip. The expand button opens a full-screen dialog with the complete capability breakdown, a larger chart, and the scoring method and prompt count for each dimension. Export buttons live in the card footer and are available without opening the dialog.
[inline card, as it appears in the chat thread · Multi-Step Capability Benchmark · model inspection · 36 prompts · llama-3.2-1b]
card lives in the thread alongside the message that requested it. expand opens a full dialog. chart type switcher is per-card and remembered across expand/collapse.
The overall score in the footer is the mean across all capability dimensions, weighted equally by default. If some capabilities matter more than others for your use case, specify the weighting in your request and the agent will note it in the card summary. The overall score will reflect the custom weighting.
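The default and custom-weighted overall scores both reduce to a weighted mean. A sketch with illustrative capability names and weights (not taken from a real card):

```python
# Sketch: overall score as an equal-weight mean (the default) versus a
# custom weighting specified in the request. Names and values are illustrative.
scores = {"reasoning": 78, "math": 67, "code": 91, "factuality": 84}
weights = {"reasoning": 2, "math": 2, "code": 1, "factuality": 1}  # reasoning-heavy

equal_mean = sum(scores.values()) / len(scores)
weighted = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())

print(round(equal_mean, 1), round(weighted, 1))  # weighting pulls the overall down here
```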
Chart types
Four chart types are available on every card. They render the same underlying scores, so switching between them is purely a display choice. The active chart type is remembered per card and carries over when you open the expand dialog.
Bar: side-by-side comparison across capabilities. Best for seeing which dimension is weakest at a glance.
Pie: proportional breakdown. Good for a quick summary when all capabilities are being compared at once.
Radial: radar-style layout. Shows the shape of the model's capability profile rather than its absolute scores.
Line: useful when comparing the same benchmark across multiple checkpoints, datasets, or model versions.
Line chart is a special case: it is designed for comparison across multiple benchmark runs rather than a single result. If you run the same benchmark twice at different training steps, opening both cards and switching to line view shows the score trajectory across the two runs. This works best when the capability dimensions are the same across runs, which they will be if you use the same request text.
Export formats
All four export formats are available from the card footer in both the inline view and the expanded dialog. Every export includes the benchmark metadata: title, context, model or dataset identifier, prompt count per capability, scoring method per capability, and timestamp.
CSV: one row per capability with score, prompt count, scoring method, and run metadata. Opens directly in Excel or pandas.
JSON: full structured output including prompt texts, per-prompt scores, capability summaries, and card metadata.
PNG: a snapshot of the card at the current chart type, at display resolution. Suitable for reports or slides.
PDF: a single-page document with the card title, scores table, and chart rendered at print resolution. Includes run metadata in the footer.
JSON is the most complete format for downstream use. The per-prompt scores let you identify exactly which prompts failed within a capability, which is often more informative than the aggregate score. A reasoning score of 67% means 8 of 12 multi-step prompts were answered correctly. The JSON output tells you which 4 failed.
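A sketch of that drill-down. The field names (`capabilities`, `prompts`, `passed`) are assumptions about the export schema, not the documented format:

```python
# Sketch: pull the failing prompts for one capability out of a JSON export.
# Adjust the field names to whatever the tool actually emits.
import json

export = json.loads("""{
  "capabilities": [{
    "name": "reasoning",
    "prompts": [
      {"text": "If A is north of B and B is west of C...", "passed": false},
      {"text": "A train leaves at 3pm travelling...", "passed": true}
    ]
  }]
}""")

failed = [p["text"]
          for cap in export["capabilities"] if cap["name"] == "reasoning"
          for p in cap["prompts"] if not p["passed"]]
print(len(failed))  # the specific prompts behind the aggregate score
```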
Reading the results
Scores produced by the Benchmark Builder are relative to the prompt suite generated for your specific request. They are not directly comparable to published benchmark scores unless you explicitly request a standardized benchmark by name. A reasoning score of 82% means 82% of the generated multi-step reasoning prompts were answered correctly by the model in that session. It is not a claim about the model's score on a specific external leaderboard.
The most reliable use of these scores is comparison within a session. Run the same benchmark request against two models, two checkpoints, or two dataset variants. Because prompts are regenerated each time, absolute scores carry run-to-run noise at the level of a few percentage points. What is stable is the rank order: if model A scores higher than model B on reasoning across three separate runs, that ordering is reliable even if the absolute scores shift slightly between runs.
A low score on a capability is not a verdict. It is a starting point. The agent chat is still open after the card is returned. The natural follow-up is to drill into the failing capability: ask which prompts failed, ask why the model is getting them wrong, or request a more targeted benchmark focused on the specific sub-type that is failing. A reasoning score of 67% might be driven entirely by spatial reasoning problems while arithmetic scores perfectly. A follow-up benchmark on spatial reasoning alone tells you that in one additional request.
[expand dialog · full 6-capability result · Multi-Step Capability Benchmark · model inspection · 36 prompts · llama-3.2-1b]
math scores lowest at 67%. code scores highest at 91%. the gap points to where the next targeted benchmark should go.
When you see a capability gap this clearly, the investigation path is direct. Low math, high code, in a model that has seen substantial code training, suggests the math deficit is likely in the reasoning chain rather than the arithmetic operations themselves. A follow-up benchmark on chain-of-thought math specifically, with the agent instructed to log the full reasoning trace for each prompt, will confirm or rule out that hypothesis in the same session. The Benchmark Builder does not close that loop automatically. It gives you the signal; the conversation continues from there.
