Title: Quality Metrics — Canonical Source
Version: 0.1.0-draft
Status: Draft
Owner: Quality Lead
Last Reviewed: 2026-05-06
Next Review: 2026-08-06

# Quality Metrics — Canonical Source

This document is the **single source of truth** for the platform's
in-scope quality numbers. Any other dossier doc (`PHASE_1_READINESS.md`,
`customer/PILOT_AGREEMENT.md`, `customer/DISCLAIMERS.md`, validation
reports) that cites quality numbers must either inline these values
verbatim or reference this file by name. When the numbers tighten or
shift, the canonical update happens here first; downstream docs are
synchronized as a single change-controlled act per
`sops/CHANGE_CONTROL.md`.

> **Citing rule.** No quality number from this doc may be cited in a
> customer-facing context without the accompanying intended-use
> disclaimer from `intended-use/INTENDED_USE.md` §1 and the
> reportable-range exclusion from §1.3. The numbers are conditional on
> the platform staying inside its locked intended use.

---

## 1. Headline numbers (HG002 30x; Parabricks 4.7.0-1; DeepVariant)

These are the substrate-baseline numbers for `0.1.0-substrate` per
`technical/PIPELINE_LOCK.md` §1.

### 1.1 Against GIAB v4.2.1 truth (full benchmark BED, no exclusion)

| Metric | Value | Source artifact |
|--------|-------|------------------|
| Aggregate F1 | 0.9954 | `data/hg002_30x/output/benchmark_deepvariant_v4_2_1/summary.txt` |
| Missed truth variants (FN total) | 30,084 | `benchmark_deepvariant_v4_2_1/fn.vcf.gz` |
| SNP F1 | (split not pinned at 30x; aggregate dominates) | RTG vcfeval `snp_roc.tsv.gz` |
| Indel F1 | (split not pinned at 30x; aggregate dominates) | RTG vcfeval `non_snp_roc.tsv.gz` |

Suitable for: Phase 1 pilot positioning ("credible for pilot work").
**NOT** suitable for: clinical-quality claims that exceed industry
standards (per `intended-use/QUALITY_CLAIMS.md` F-009).

### 1.2 Against GIAB v5.0q truth (full benchmark BED, NO exclusion)

These are the *raw* v5.0q numbers. They are lower than the v4.2.1
figures because v5.0q is an assembly-based truth set that asserts truth
in regions where the caller architecture has known limits. **Do not
cite these values without the §1.3 in-scope numbers in the same
sentence.**

| Metric | Value | Source artifact |
|--------|-------|------------------|
| SNP F1 | 0.9906 | `benchmark_deepvariant_v5_0q/snp_roc.tsv.gz` |
| Indel F1 | 0.9408 | `benchmark_deepvariant_v5_0q/non_snp_roc.tsv.gz` |
| Missed truth variants (FN total) | 121,994 | `benchmark_deepvariant_v5_0q/fn.vcf.gz` |

### 1.3 Against GIAB v5.0q truth, **in-scope complement** (after exclusion BED)

This is the headline clinical-quality posture. The exclusion BED is
empirically constructed to capture the v5.0q-specific truth content
that the caller architecture cannot meet. It is defined as
(`alldifficultregions` **minus MHC**) ∪ chrX/Y non-PAR/XTR/ampliconic.
PAR remains in scope, and MHC was lifted to in-scope per ADR-0006 on
2026-05-11. See `investigations/V5_0Q_GAP_ANALYSIS.md` v0.3.0+ §5.10
for the full per-stratum decomposition and
`decisions/0006-mhc-exclusion-lift.md` for the MHC-lift rationale.

| Metric | Value | Source artifact |
|--------|-------|------------------|
| **In-scope SNP F1** | **0.9993** (arithmetic est.; hap.py confirmation pending) | per-stratum decomposition + ADR-0006 |
| **In-scope Indel F1** | **0.9959** (arithmetic est.; hap.py confirmation pending) | per-stratum decomposition + ADR-0006 |
| Exclusion BED capture | 118,748 of 121,994 FNs (97.3 %) | `investigations/v5_0q_excluded_regions.bed` |
| Exclusion BED region count | 4,571,604 merged intervals | same file |
| Exclusion BED coverage | 747,356,696 bp | same file |
| In-scope quality vs v4.2.1 aggregate | **exceeds** (0.9993 SNP and 0.9959 Indel vs 0.9954 aggregate) | comparison |
| In-scope range now includes MHC (HLA region) | yes — SNP F1 0.9897 / Indel F1 0.9498 in-stratum | `V5_0Q_GAP_ANALYSIS.md` §5.10; ADR-0006 |
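
The capture figure in the table is plain arithmetic on two counts already listed in §1.2 and §1.3. A minimal sketch of that check follows; the function and variable names are illustrative and are not part of any pipeline tooling.

```python
# Counts copied from the §1.2 and §1.3 tables.
TOTAL_FN_V5_0Q = 121_994        # raw v5.0q false negatives, no exclusion
FN_INSIDE_EXCLUSION = 118_748   # false negatives that fall inside the exclusion BED


def capture_share(captured: int, total: int) -> float:
    """Fraction of all false negatives that the exclusion BED captures."""
    if total <= 0:
        raise ValueError("total must be positive")
    return captured / total


share = capture_share(FN_INSIDE_EXCLUSION, TOTAL_FN_V5_0Q)
remaining_fn = TOTAL_FN_V5_0Q - FN_INSIDE_EXCLUSION

print(f"capture share: {share:.1%}")                        # 97.3%
print(f"FNs outside the exclusion BED: {remaining_fn:,}")   # 3,246
```

The remainder (3,246 false negatives) is simply the difference of the two counts; it is shown only as a sanity check and is not a separately pinned number.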

### 1.4 Per-stratum FN concentration (top 5)

| Rank | Stratum | Total FN | Share | SNP F1 | Indel F1 | v5.0q-only share |
|---:|---|---:|---:|---:|---:|---:|
| 1 | notinrefseq_cds | 121,385 | 99.5 % | 0.9896 | 0.9447 | 81.2 % |
| 2 | HG002_v4.2.1_complexandSVs_alldifficultregions | 120,562 | 98.8 % | 0.9646 | 0.9323 | 81.2 % |
| 3 | **alldifficultregions** | **118,859** | **97.4 %** | 0.9521 | 0.9308 | 81.2 % |
| 4 | AllAutosomes | 115,893 | 95.0 % | 0.9899 | 0.9458 | 80.2 % |
| 5 | notinsegdups | 93,652 | 76.8 % | 0.9930 | 0.9475 | 82.5 % |

`alldifficultregions` is the dominant stratum and the one driving the
exclusion BED design.
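
The "Share" column can be recomputed from the "Total FN" column and the raw v5.0q FN total in §1.2. The sketch below does that and also shows that the shares sum to well over 100 %, which means the strata overlap and must not be added together. Names are illustrative only.

```python
# Total FN per stratum, copied from the table above.
TOTAL_FN = 121_994  # raw v5.0q FN total from §1.2
STRATA_FN = {
    "notinrefseq_cds": 121_385,
    "HG002_v4.2.1_complexandSVs_alldifficultregions": 120_562,
    "alldifficultregions": 118_859,
    "AllAutosomes": 115_893,
    "notinsegdups": 93_652,
}

# Share of all false negatives that fall inside each stratum.
shares = {name: fn / TOTAL_FN for name, fn in STRATA_FN.items()}

for name, value in shares.items():
    print(f"{name}: {value:.1%}")

# Overlapping strata: the shares are not a partition of the FN total.
strata_overlap = sum(shares.values()) > 1.0
print(f"strata overlap: {strata_overlap}")
```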

---

## 2. Provenance + integrity

The numbers above are reproducible. To verify them, recompute against
the pinned artifacts:

| Artifact | SHA-256 |
|---|---|
| HG002 v4.2.1 truth VCF | `adb4d4a5...e81175c` (see `technical/PIPELINE_LOCK.md` §5.1) |
| HG002 v5.0q truth VCF | `c7f9d7a4...f9c50dc8` (`PIPELINE_LOCK.md` §5.1.1) |
| Reference FASTA | `9cce8b92...8702b7` (`PIPELINE_LOCK.md` §4) |
| Per-stratum decomposition TSV | `2badc993243a8807abbe005c5b7800cbe26adacd5bfbfc24353a2c9a95f2383a` |
| Exclusion BED (uncompressed; per ADR-0006 post-MHC-lift) | `7dc4d16b1d0eb1d171713bc272c9a3f3b881dddb1f305faba02dac25a3932c1c` |
| Exclusion BED (uncompressed; pre-MHC-lift; historical, ADR-0004) | `3c079df0d7a2e40876c7e18a87e8a9d541ae63f18a026b76812df715523ae795` |
| GIAB v3.6 stratifications bundle | `c5a1eceac54aac2c438af21825223d2a71e64b3db6b1c9e923849babb38063d8` |

Full SHA-256 manifest pins live in `technical/PIPELINE_LOCK.md` §4
(reference) and §5 (truth sets, exclusion).
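
One way to check a local artifact against its pin is to stream the file through SHA-256 and compare hex digests. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the relative path assumes the script runs from the repository root. Pins shown truncated in the table (`adb4d4a5...e81175c` and similar) cannot be compared as written; take the full value from `technical/PIPELINE_LOCK.md`.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path: Path, pinned: str) -> bool:
    """Compare a file's digest with a pinned full-length hex digest."""
    return sha256_of(path) == pinned.strip().lower()


if __name__ == "__main__":
    # Post-MHC-lift exclusion BED and its pin, as listed in the table above.
    # The relative location is an assumption about where the script is run.
    bed = Path("investigations/v5_0q_excluded_regions.bed")
    pinned = "7dc4d16b1d0eb1d171713bc272c9a3f3b881dddb1f305faba02dac25a3932c1c"
    if bed.exists():
        print("match" if matches_pin(bed, pinned) else "MISMATCH")
    else:
        print(f"{bed} not found; adjust the path for your checkout")
```

Reading in chunks keeps memory use flat for large truth VCFs and reference FASTA files.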

---

## 3. How these numbers change

The numbers in §1 update on any of:

1. **Pipeline version bump** (Parabricks image, DeepVariant model,
   reference, parameters) — see `sops/CHANGE_CONTROL.md`. Material
   changes per `PIPELINE_LOCK.md` §6 trigger revalidation; new numbers
   land here as part of the revalidation report.
2. **Truth-set update** (GIAB v5.0q → v5.x, or v4.2.1 → newer Q-suffix
   release). New truth-set SHA-256 lands in `PIPELINE_LOCK.md` §5; gap
   analysis re-runs in `V5_0Q_GAP_ANALYSIS.md`; new numbers here.
3. **Exclusion BED revision** (adding or removing strata from the
   exclusion). This is a clinical-claim-affecting change and requires
   a written customer acknowledgement before going live.
4. **New benchmark cells** (40x and 50x v5.0q HG002, currently pending
   GPU compute; see `validation/PROTOCOL_GIAB.md` §6.2). Coverage
   slope (H5) is non-gating but is expected to tighten the in-scope SNP
   residual of ~0.07 % (1 − 0.9993 = 0.0007), which is expected to
   halve at 50x.

When any of these triggers fires, the change-control sequence is:

1. Run revalidation per the relevant protocol.
2. Update **this** file (`QUALITY_METRICS.md`) with new values, bump
   front-matter version (e.g. 0.1.0-draft → 0.2.0-draft), and update
   `Last Reviewed`.
3. Sync downstream docs in a single PR:
   - `PHASE_1_READINESS.md` §2 / §4
   - `customer/PILOT_AGREEMENT.md` (success-criteria block)
   - `customer/DISCLAIMERS.md` (quality-claim posture block)
   - Any open `validation/VALIDATION_REPORT_*.md` instances
4. Customer notification per `customer/RELEASE_NOTES_TEMPLATE.md` if
   the change is material.
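
The residual arithmetic behind trigger 4 can be sketched as follows, using the in-scope SNP F1 of 0.9993 from §1.3. `Decimal` is used so the subtraction is exact rather than subject to binary floating-point rounding. The halving is the expectation stated for 50x, not a measured result, and the names below are illustrative.

```python
from decimal import Decimal


def residual(f1: str) -> Decimal:
    """Return 1 - F1 exactly, taking F1 as a decimal string."""
    return Decimal(1) - Decimal(f1)


snp_residual = residual("0.9993")    # in-scope SNP F1 from §1.3
halved = snp_residual / 2            # stated expectation at 50x, not a measurement
projected_f1 = Decimal(1) - halved   # F1 implied if the residual does halve

print(f"residual at 30x: {snp_residual}")
print(f"residual if halved: {halved}")
print(f"implied F1 if halved: {projected_f1}")
```

Any value produced this way stays a projection until the 40x and 50x cells are actually benchmarked.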

---

## 4. What MAY be cited in customer-facing material

Always paired with the intended-use disclaimer (`INTENDED_USE.md` §1)
and the reportable-range exclusion (`INTENDED_USE.md` §1.3 + the
`v5_0q_excluded_regions.bed` reference):

- ✅ "In-scope SNP F1 0.9993 / Indel F1 0.9959 on HG002 30x against
  GIAB v5.0q (in-scope complement of the published exclusion BED;
  MHC/HLA region is in-scope per ADR-0006)." Arithmetic estimate;
  hap.py confirmation pending (see §1.3).
- ✅ "v4.2.1 aggregate F1 0.9954 on HG002 30x; suitable for Phase 1
  pilot positioning."
- ✅ "Exclusion BED captures 97.3 % of v5.0q false negatives and is
  published with SHA `7dc4d16b...3932c1c` (post-MHC-lift; ADR-0006)."

What MUST NOT be cited (per `intended-use/QUALITY_CLAIMS.md` F-009 and
related Forbidden rows):

- ❌ Bare v5.0q SNP F1 0.9906 / Indel F1 0.9408 / 121,994 FNs without
  the in-scope complement and the exclusion-BED reference in the same
  context.
- ❌ Any HaplotypeCaller benchmark numbers (per F-010; HC is excised
  per `PIPELINE_LOCK.md` §10b).
- ❌ "Exceeds industry quality standards" or equivalent absolute
  language (per F-009).
- ❌ Any clinical-interpretation, diagnosis, or report-sign-out claim
  (per `INTENDED_USE.md` §1).
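
The first forbidden case (bare v5.0q numbers) lends itself to a simple text check before material goes out. The sketch below is a hypothetical helper, not an existing tool. It treats the whole string passed in as "the same context" and treats the BED filename as the exclusion-BED reference; both are simplifying assumptions, and a real review still needs a human reader.

```python
# Raw v5.0q values from §1.2 that must never appear on their own.
RAW_V5_0Q_TOKENS = ("0.9906", "0.9408", "121,994")

# Companions assumed to be required alongside them: the §1.3 in-scope
# numbers and a reference to the exclusion BED (filename used as a proxy).
REQUIRED_COMPANIONS = ("0.9993", "0.9959", "v5_0q_excluded_regions.bed")


def bare_v5_0q_citation(text: str) -> bool:
    """Return True if text cites raw v5.0q numbers without all companions."""
    cites_raw = any(token in text for token in RAW_V5_0Q_TOKENS)
    has_companions = all(token in text for token in REQUIRED_COMPANIONS)
    return cites_raw and not has_companions


draft = "Our caller reaches SNP F1 0.9906 against GIAB v5.0q."
print(bare_v5_0q_citation(draft))
```

A plain substring match is deliberately strict: it flags the raw numbers even when they appear inside a longer sentence, leaving the judgment call to the reviewer.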

---

## 5. Pending non-gating numbers

These would tighten or extend the headline once they land but do NOT
gate Phase 1:

- 40x and 50x v5.0q HG002 cells (coverage slope, H5).
- Hap.py confirmation cross-check of the per-stratum TSV (not
  load-bearing; the run under QEMU emulation on aarch64 hardware did
  not complete in a reasonable time and was killed 2026-05-06; it can
  be re-run on x86_64 hardware if a confirmatory cross-check is
  desired).
- Repeatability per `validation/PROTOCOL_REPEATABILITY.md` (3
  byte-identical replicate runs already exist as initial evidence; a
  formal repeatability run with provenance manifests + hash
  verification is pending).
- GB10 ↔ B300 reproducibility per
  `validation/PROTOCOL_REPRODUCIBILITY.md` (pending Brev compute).

---

## 6. Changelog

| Date | Change | Authority |
|---|---|---|
| 2026-05-06 | Initial canonical metrics doc populated from V5_0Q_GAP_ANALYSIS.md v0.3.0+ §5.10 per-stratum decomposition and `benchmark_deepvariant_v4_2_1/summary.txt`. | Quality Lead |
