Multi‑encoder retrieval over 1024‑bit sign‑bit signatures; Hamming distance against a labeled reference library; dual‑filter decision logic with built‑in OOD deferral. No model training in the pipeline.
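The signature and retrieval step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes each encoder embedding is already a 1024‑dimensional float vector (how encoders of other widths are mapped to 1024 bits is not stated here), and the function names are hypothetical.

```python
import numpy as np

def sign_bit_signature(embedding: np.ndarray) -> np.ndarray:
    """Keep one bit per dimension (1 if the value is positive) and pack
    the bits into bytes: 1024 dimensions -> 128 bytes."""
    bits = (embedding > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distances(query_sig: np.ndarray, library_sigs: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed signature to every library signature."""
    xor = np.bitwise_xor(library_sigs, query_sig)   # differing bits are set to 1
    return np.unpackbits(xor, axis=-1).sum(axis=-1)  # count the set bits per row

def top1_label(query_sig, library_sigs, labels):
    """Return the label of the nearest reference entry and its distance."""
    d = hamming_distances(query_sig, library_sigs)
    i = int(np.argmin(d))
    return labels[i], int(d[i])

# Toy usage with random vectors standing in for encoder embeddings (assumption).
rng = np.random.default_rng(0)
library = rng.standard_normal((5, 1024))
library_sigs = sign_bit_signature(library)
query = library[2].copy()
query[:10] *= -1  # flip the sign of 10 dimensions
label, dist = top1_label(sign_bit_signature(query), library_sigs, list("abcde"))
```

Because nothing is fitted, adding a labeled reference entry only means appending one more packed row to the library array.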
Fig 01 · In‑distribution reproducibility · Phase B · 21× scale range · 4 runs · latest is 4‑encoder follow‑up
Tumor sensitivity holds at 100.00% across an order of magnitude of cart size
Every run reproduced the same result: zero cancers missed, zero false benigns. The over‑flag rate on non‑tumor tissue improved with scale in the 2‑encoder runs, then rose under the stricter 4‑encoder consensus.
Missed cancers per run: 0 · 0 · 0 · 0. Release‑tier specificity: 100.00% on every run. Over‑flag rate at 50K → 100K (2‑enc) → 100K (4‑enc): 0.10% → 0.067% → 0.25%. The rise in the 4‑encoder run reflects a more conservative consensus that flags more borderline tissue, none of which was cancer. The 4‑encoder run uses 11,111 real cancer patches (up from 5,555 in the 2‑encoder stratified subset). Encoders: Phikon‑v2 + UNI + Virchow2 + CONCH.
The 30‑point gap between solid‑organ cancers and skin is the visual story
Cross‑hospital retrieval per cancer. The SKCM gap is what the deferral layer is meant to catch — those queries are flagged for human review, not silently misclassified.
Lower SKCM accuracy is paired with an 88.2% OOD flag rate — uncertain queries are routed to human‑in‑the‑loop, not committed to.
Fig 03 · Encoder comparison · Phase A · Phikon‑v2 vs UNI · OOD set · 2‑encoder
Phikon‑v2 leads on retrieval and on its own honesty
Phikon‑v2 retrieves better at top‑1 and top‑5, and flags far more of its uncertain queries for human review. UNI is over‑confident on out‑of‑distribution data.
OOD flag rate = share of queries whose Hamming distance exceeds the 95th percentile of intra‑cart distances. Higher is more honest, not less accurate.
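The percentile rule above can be sketched in a few lines. This is an illustration under assumptions: the exact definition of "intra‑cart distances" is not given here, so the sketch simply takes an array of reference distances as input, and the function names are hypothetical.

```python
import numpy as np

def ood_threshold(intra_cart_distances, percentile: float = 95.0) -> float:
    """Distance cut-off: the given percentile of distances measured inside
    the reference library (the exact definition is an assumption)."""
    return float(np.percentile(intra_cart_distances, percentile))

def ood_flags(query_distances, threshold: float) -> np.ndarray:
    """True where a query sits farther from the library than the cut-off."""
    return np.asarray(query_distances) > threshold

def ood_flag_rate(query_distances, threshold: float) -> float:
    """Fraction of queries deferred as out-of-distribution."""
    return float(ood_flags(query_distances, threshold).mean())

# Toy numbers, not taken from the study.
threshold = ood_threshold(np.arange(1, 101))
rate = ood_flag_rate([90, 96, 200, 95], threshold)
```

By construction, roughly 5% of reference‑like queries exceed the cut‑off, so a flag rate far above that on a new hospital's data indicates a distribution shift rather than noise.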
Diagonal dominance — with one interesting off‑diagonal cell
Rows are truth, columns are top‑1 retrieved. The 279 SKCM → BRCA cell is among the largest off‑diagonal cells; soft‑tissue / connective texture appears to be where the skin‑cancer error budget lives.
Diagonal (correct) = 8,604 / 9,870 = 87.17%. Off‑diagonal mass is concentrated in BRCA confusions (BRCA→SKCM 324, SKCM→BRCA 279, BRCA→COAD 239).
Release‑benign patches contain zero cancers. Tumor‑flag patches are 98.2% true cancer. Human‑in‑the‑loop captures more of the long tail. There are no contradictions. The 2 cancers the system was uncertain about went to HIL — not to silent release.
Over‑flag on non‑tumor tissue: 207 / 82,411 = 0.25%. The shift in the routing split from 44/12/44 (2‑enc) to 30/12/58 (4‑enc) shows the consensus tightening: a smaller share is auto‑released (44 → 30) and a larger share is routed to human review (44 → 58). The safety statistic (0 cancers released as benign) holds at clinical scale.
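One conservative way to read the three tiers (release‑benign, tumor‑flag, human‑in‑the‑loop) is as a unanimity rule with OOD deferral. The exact decision rules are not specified here, so everything in this sketch is a hypothetical illustration: the `Route` names, the `route_patch` function, and the choice to defer on any OOD flag are all assumptions.

```python
from enum import Enum

class Route(Enum):
    RELEASE_BENIGN = "release_benign"
    TUMOR_FLAG = "tumor_flag"
    HUMAN_REVIEW = "human_review"

def route_patch(votes, ood_flags) -> Route:
    """Route one patch from per-encoder votes ('tumor' or 'benign') and
    per-encoder OOD flags. Assumed rule: only unanimous, in-distribution
    calls are committed; everything else goes to a human."""
    if any(ood_flags):
        return Route.HUMAN_REVIEW           # OOD deferral comes first
    if all(v == "benign" for v in votes):
        return Route.RELEASE_BENIGN         # unanimous benign
    if all(v == "tumor" for v in votes):
        return Route.TUMOR_FLAG             # unanimous tumor
    return Route.HUMAN_REVIEW               # encoders disagree

# Four encoders, one dissenting vote: the patch is deferred.
decision = route_patch(["benign", "benign", "tumor", "benign"], [False] * 4)
```

Under a rule of this shape, adding encoders can only shrink the unanimous buckets, which is consistent with the 4‑encoder run sending a larger share of patches to human review.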
When the two encoders agree, they are right 91.49% of the time — they agree on 42.3% of queries
Agreement frequency (recall) and agreement correctness (precision) read together as a usable confidence signal. Separately, both encoders flag the same query as out‑of‑distribution on 8.9% of queries.
Recall · how often they agree
42.3%
4,172 of 9,870 OOD queries. Agreement is a useful gate precisely because it doesn't fire on everything.
Precision · correct when they agree
91.49%
3,817 of 4,172 agreements. Joint agreement on top‑1 retrieval is a strong high‑confidence signal in OOD conditions.
Independent OOD‑flag agreement (both encoders flag the query as out‑of‑distribution): 883 / 9,870 = 8.9% — the high‑confidence “punt to a human” bucket.
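The three figures above are simple ratios over per‑query outcomes. A minimal sketch of how they could be computed from two encoders' top‑1 labels and OOD flags follows; the function name and input layout are hypothetical.

```python
import numpy as np

def agreement_metrics(top1_a, top1_b, truth, ood_a, ood_b) -> dict:
    """Agreement rate, correctness given agreement, and joint OOD-flag rate
    for two encoders evaluated on the same queries."""
    top1_a, top1_b, truth = (np.asarray(x) for x in (top1_a, top1_b, truth))
    agree = top1_a == top1_b                 # same top-1 label from both encoders
    correct = agree & (top1_a == truth)      # agreed label is also the true label
    n_agree = int(agree.sum())
    both_ood = np.asarray(ood_a, dtype=bool) & np.asarray(ood_b, dtype=bool)
    return {
        "agreement_rate": n_agree / len(truth),
        "precision_when_agree": int(correct.sum()) / n_agree if n_agree else float("nan"),
        "joint_ood_rate": float(both_ood.mean()),
    }

# Toy example with four queries (labels are illustrative only).
metrics = agreement_metrics(
    top1_a=["BRCA", "SKCM", "COAD", "BRCA"],
    top1_b=["BRCA", "SKCM", "BRCA", "COAD"],
    truth=["BRCA", "BRCA", "COAD", "BRCA"],
    ood_a=[False, True, True, False],
    ood_b=[False, True, False, False],
)
```

With the reported counts, the same ratios are 4,172 / 9,870 for agreement, 3,817 / 4,172 for correctness given agreement, and 883 / 9,870 for joint OOD flags.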
Fig 07 · Per‑encoder Hamming fingerprints · Phase B · NEW · CONCH · Virchow2 · Phikon‑v2 · UNI
Each encoder has its own “tightness”
Lower median nearest‑neighbor distance = more confident matches. The diversity across encoders is what makes the 4‑encoder consensus meaningful — they make different mistakes, so when they agree the signal is uncommonly trustworthy.
CONCH (vision‑language, newer) is the tightest matcher. Virchow2 (largest model) is second‑tightest. Phikon‑v2 and UNI are the historical 2‑encoder baseline. None is “best” alone; the consensus among them is the trustworthy signal.
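The per‑encoder "tightness" can be sketched as the median of each query's nearest‑neighbor Hamming distance. This is an illustrative brute‑force version that assumes signatures are stored as packed `uint8` rows (128 bytes per 1024‑bit signature); the function names are hypothetical.

```python
import numpy as np

def nearest_neighbor_distances(query_sigs: np.ndarray, library_sigs: np.ndarray) -> np.ndarray:
    """Smallest Hamming distance from each query to any library entry.
    Brute force: memory grows with queries x library x bits."""
    xor = np.bitwise_xor(query_sigs[:, None, :], library_sigs[None, :, :])
    dists = np.unpackbits(xor, axis=-1).sum(axis=-1)  # shape (queries, library)
    return dists.min(axis=1)

def median_tightness(per_encoder: dict) -> dict:
    """Median nearest-neighbor distance per encoder; lower means tighter.
    `per_encoder` maps an encoder name to (query_sigs, library_sigs)."""
    return {
        name: float(np.median(nearest_neighbor_distances(q, lib)))
        for name, (q, lib) in per_encoder.items()
    }

# Toy signatures: an all-zeros and an all-ones library entry.
library = np.zeros((2, 128), dtype=np.uint8)
library[1] = 255
queries = np.zeros((3, 128), dtype=np.uint8)
queries[0, 0] = 0b00000111      # 3 bits away from the all-zeros entry
queries[1] = 255
queries[1, 0] = 0               # 8 bits away from the all-ones entry
tightness = median_tightness({"toy_encoder": (queries, library)})
```

Comparing these medians across encoders only makes sense when every encoder is scored against the same queries and the same reference patches.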