Evals: Embedding
Behavioral probes for embedding models: confidence-analysis, custom Q&A with cosine-similarity scoring, and semantic match tasks. Embedding-specific retrieval evals live under Inspection (retrieval, sae-faithfulness). Requires embedding mode.
2 commands
aquin confidence-analysis
agent tool: run_confidence_analysis
Computes per-probe representation confidence over a probe dataset. Embedding mode uses two metrics: cosine similarity to a reference centroid built from the baseline probes, and spectral entropy (a more diffuse spectrum means higher uncertainty). The optional --join-sae flag attaches the SAE mean L0 and top feature for each probe, enabling confidence ↔ feature ↔ layer analysis under stressors.
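As a rough illustration of the two embedding-mode metrics, the sketch below computes cosine similarity to a baseline centroid and a normalized entropy score on toy vectors. It is not aquin's implementation. The source does not say which spectrum the entropy is taken over, so the sketch assumes the normalized squared components of the embedding. The helper names (`cosine`, `centroid`, `normalized_entropy`) and the vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def normalized_entropy(vec):
    """Shannon entropy of the normalized squared components, scaled to [0, 1].

    Assumption: a stand-in for the tool's spectral entropy, whose exact
    definition is not given in the docs. 0 = concentrated, 1 = fully diffuse.
    """
    energy = [x * x for x in vec]
    total = sum(energy)
    if total == 0.0 or len(vec) < 2:
        return 0.0
    probs = [e / total for e in energy if e > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(vec))

# Toy embeddings (hypothetical): two baseline probes and two stressed probes.
baseline = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]
probe_focused = [1.0, 0.05, 0.0, 0.0]
probe_diffuse = [0.5, 0.5, 0.5, 0.5]

reference = centroid(baseline)
sim_focused = cosine(probe_focused, reference)
sim_diffuse = cosine(probe_diffuse, reference)
ent_focused = normalized_entropy(probe_focused)
ent_diffuse = normalized_entropy(probe_diffuse)

print(f"focused: similarity={sim_focused:.3f} entropy={ent_focused:.3f}")
print(f"diffuse: similarity={sim_diffuse:.3f} entropy={ent_diffuse:.3f}")
```

In this toy setup the diffuse probe sits further from the baseline centroid and has higher entropy, which is the pattern the description associates with lower confidence. How the tool combines the two metrics into a single score is not specified here.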
| Flag | Description |
|---|---|
| --prompts* | JSON or JSONL probe file. Each probe has a text field plus optional id, stressor, lang, and quant_run_id fields. |
| --threshold | Low-confidence cutoff in the range 0–1 (default: 0.40). |
| --join-sae | Attach SAE mean L0 + top feature per probe. |
| --layer | SAE layer used for the join (default: the model's embedding SAE layer, e.g. 11 for gte-small). |
| --save | Write a JSON export with schema_version=1 (stressor deltas and heatmap). |
| --check | Save confidence-analysis-check.json and confidence-analysis-check.png in the current directory. |
| --output json | Print raw JSON to stdout. |
This is the same command as the LLM confidence-analysis; the metrics backend switches automatically. Tag baseline probes with stressor: baseline so that they are used to build the centroid reference.
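A probe file for --prompts could be generated along these lines. This is a minimal sketch that assumes the JSONL form holds one JSON object per line with the field names listed in the flag table. The file name `probes.jsonl`, the probe texts, and the stressor labels other than `baseline` are made up for illustration.

```python
import json

# Hypothetical probes: one baseline probe plus two stressed variants.
probes = [
    {"id": "p1", "text": "How do I reset my password?",
     "stressor": "baseline", "lang": "en"},
    {"id": "p2", "text": "how do i resett my pasword",
     "stressor": "typo", "lang": "en"},
    {"id": "p3", "text": "¿Cómo restablezco mi contraseña?",
     "stressor": "translation", "lang": "es"},
]

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

jsonl_text = to_jsonl(probes)

# Write the probe file (the file name is an assumption).
with open("probes.jsonl", "w", encoding="utf-8") as fh:
    fh.write(jsonl_text)
```

The resulting file would then be supplied through --prompts, for example `aquin confidence-analysis --prompts probes.jsonl --threshold 0.40`. The exact invocation form is an assumption based on the flag table.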
aquin eval
agent tool: run_custom_eval
Custom eval for embedding models: encodes each prompt and its reference answer, then scores the pair by cosine similarity rather than keyword overlap. Use it for semantic match tasks such as paraphrase detection and retrieval-style Q&A.
| Flag | Description |
|---|---|
| --name* | Eval name. |
| --prompts* | JSON array of query strings. |
| --reference_answers* | JSON array of target strings. |
| --threshold | Cosine similarity pass threshold (default: 0.5). |
| --check | Save eval-check.json and eval-check.png in the current directory. |
This is the same command as the LLM eval; the scoring backend switches automatically based on the loaded model type.
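The scoring rule can be sketched as follows, under stated assumptions. `toy_embed` is a hypothetical bag-of-words stand-in for the loaded embedding model, used only so the example runs without a model. `score_eval` is an illustrative name, and treating a similarity equal to the threshold as a pass is an assumption.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in encoder: sparse bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_eval(prompts, reference_answers, threshold=0.5):
    """Score each prompt against its reference answer by cosine similarity."""
    if len(prompts) != len(reference_answers):
        raise ValueError("prompts and reference_answers must have equal length")
    results = []
    for prompt, reference in zip(prompts, reference_answers):
        similarity = cosine(toy_embed(prompt), toy_embed(reference))
        results.append({
            "prompt": prompt,
            "similarity": similarity,
            "passed": similarity >= threshold,  # assumption: >= counts as a pass
        })
    return results

prompts = ["the cat sat on the mat", "stock prices fell sharply"]
reference_answers = ["the cat sat on a mat", "the weather is sunny today"]

results = score_eval(prompts, reference_answers, threshold=0.5)
for row in results:
    print(f"{row['similarity']:.3f} passed={row['passed']} {row['prompt']}")
```

Because the toy encoder is purely lexical, it only rewards shared words. A real embedding model can also score paraphrases with little word overlap as similar, which is the reason for scoring by cosine similarity rather than keyword overlap.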
