Validation results · case-based retrieval · UPDATED 4‑encoder follow‑up

Pathology AI — seven charts from four runs and one cross-hospital test

Andy Grossberg · Waving Cat Learning Systems · 2026‑05‑23 · 93,522 in‑dist patches · 11,111 cancers · 9,870 OOD queries

Multi‑encoder retrieval over 1024‑bit sign‑bit signatures; Hamming distance against a labeled reference library; dual‑filter decision logic with built‑in OOD deferral. No model training in the pipeline.
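The signature-and-lookup step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's code: it assumes a signature is formed by taking the sign of each embedding component (one bit per dimension), which the source does not confirm, and the names `sign_bit_signature`, `hamming`, and `nearest_label` are hypothetical.

```python
from typing import Sequence, Tuple

def sign_bit_signature(embedding: Sequence[float]) -> int:
    """Pack the sign of each component into one integer (bit i is 1 if component i > 0).

    Assumption: one bit per embedding dimension. The source does not say
    how its 1024-bit signatures are derived from the encoder outputs.
    """
    sig = 0
    for i, value in enumerate(embedding):
        if value > 0:
            sig |= 1 << i
    return sig

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def nearest_label(query: int, library: Sequence[Tuple[int, str]]) -> Tuple[str, int]:
    """Return (label, distance) of the closest reference signature."""
    sig, label = min(library, key=lambda item: hamming(query, item[0]))
    return label, hamming(query, sig)

# Toy 4-dimensional embeddings stand in for real encoder outputs.
library = [
    (sign_bit_signature([0.9, -0.2, 0.4, -0.7]), "tumor"),
    (sign_bit_signature([-0.5, 0.3, -0.1, 0.8]), "benign"),
]
query = sign_bit_signature([0.7, -0.1, -0.3, -0.9])
label, distance = nearest_label(query, library)
print(label, distance)
```

Storing each signature as a single integer keeps the distance computation to one XOR and a bit count, which is the property that makes a lookup like this possible without any model training.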

Fig 01 · In‑distribution reproducibility · Phase B · 21× scale range · 4 runs · latest is 4‑encoder follow‑up

Tumor sensitivity holds at 100.00% across an order of magnitude of cart size

Every run reproduced the same result: zero cancers missed, zero false benigns. The over‑flag rate on non‑tumor tissue fell with scale at two encoders (0.10% at 50K, 0.067% at 100K), then rose to 0.25% under the stricter 4‑encoder consensus.

Chart: four bars at 100% sensitivity across runs of increasing size (y‑axis: tumor sensitivity, 0% to 100%).
multi5k · 4,500 patches · 4 enc · 100.00%
nct7k · 5,915 patches · 4 enc · 100.00%
nct50k · 49,995 patches · 2 enc · 100.00%
nct100k · 93,522 patches · 4 enc · 11,111 cancers · 100.00% · 4‑encoder follow‑up, 2026‑05‑23
Missed cancers per run: 0 · 0 · 0 · 0. Release‑tier specificity: 100.00% on every run. Over‑flag rate at 50K → 100K (2‑enc) → 100K (4‑enc): 0.10% → 0.067% → 0.25% — more conservative consensus catches more borderline tissue, none of which were cancers. The 4‑encoder run uses 11,111 real cancer patches (up from 5,555 in the 2‑encoder stratified subset). Encoders: Phikon‑v2 + UNI + Virchow2 + CONCH.
Fig 02 · OOD per‑cancer top‑1 · Phase A · Phikon‑v2 · TCGA‑UT external split · 2‑encoder

The 30‑point gap between solid‑organ cancers and skin is the visual story

Cross‑hospital retrieval per cancer. The SKCM gap is what the deferral layer is meant to catch — those queries are flagged for human review, not silently misclassified.

Chart: top‑1 retrieval accuracy per cancer type (x‑axis: 0% to 100%).
Glioblastoma (GBM) · n=3,300 · 97.09%
Colon–Rectum adenocarcinoma (COAD) · n=1,510 · 95.43%
Breast invasive carcinoma (BRCA) · n=3,560 · 82.98%
Skin Cutaneous Melanoma (SKCM) · n=1,500 · MOHS‑relevant · 67.00% · −30.1 pts vs. GBM
Lower SKCM accuracy is paired with Phikon‑v2's 88.2% OOD flag rate, measured over all 9,870 OOD queries: uncertain queries are routed to human‑in‑the‑loop rather than committed to.
Fig 03 · Encoder comparison · Phase A · Phikon‑v2 vs UNI · OOD set · 2‑encoder

Phikon‑v2 leads on retrieval and on its own honesty

Phikon‑v2 retrieves better at top‑1 and top‑5, and flags far more of its uncertain queries for human review. UNI is over‑confident on out‑of‑distribution data.

Chart: Phikon‑v2 vs UNI, as a percentage of 9,870 OOD queries.
Top‑1 retrieval · Phikon‑v2 87.17 · UNI 42.40
Top‑5 retrieval · Phikon‑v2 94.88 · UNI 80.59
OOD flag rate · Phikon‑v2 88.2 · UNI 9.6
OOD flag rate = Hamming distance > 95th percentile of intra‑cart distances. Higher is more honest, not less accurate.
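The flag rule above amounts to a percentile threshold. The sketch below assumes a nearest-rank 95th percentile over distances measured inside the reference library; the source does not specify the percentile method, and the function names and toy distances are hypothetical.

```python
import math
from typing import Sequence

def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def ood_flag_rate(query_distances: Sequence[float],
                  reference_distances: Sequence[float],
                  pct: float = 95.0) -> float:
    """Fraction of queries whose distance exceeds the reference percentile."""
    threshold = percentile_nearest_rank(reference_distances, pct)
    flagged = sum(d > threshold for d in query_distances)
    return flagged / len(query_distances)

# Toy data: reference distances 1..20, five query distances.
reference = list(range(1, 21))
queries = [5, 12, 19, 25, 30]
print(percentile_nearest_rank(reference, 95))  # threshold
print(ood_flag_rate(queries, reference))       # share of queries flagged
```

Because the threshold is set from the reference library alone, a high flag rate on external data indicates that queries sit far from anything in the library, which is independent of whether the top-1 label happens to be correct.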
Fig 04 · Cross‑hospital confusion matrix · Phase A · Phikon‑v2 · 9,870 queries · 2‑encoder

Diagonal dominance — with one interesting off‑diagonal cell

Rows are truth, columns are top‑1 retrieved. The 279 SKCM → BRCA cell is the largest error in the SKCM row (the largest off‑diagonal cell overall is BRCA → SKCM at 324); soft‑tissue / connective texture appears to be where the skin‑cancer error budget lives.

truth ↓ / retrieved → · SKCM · COAD · GBM · BRCA · row total
SKCM · 1,005 · 17 · 199 · 279 · 1,500
COAD · 3 · 1,441 · 3 · 63 · 1,510
GBM · 73 · 3 · 3,204 · 20 · 3,300
BRCA · 324 · 239 · 43 · 2,954 · 3,560
Annotation: SKCM mistaken for BRCA (279), the largest error in the SKCM row.
Diagonal (correct) = 8,604 / 9,870. Off‑diagonal mass is concentrated in BRCA confusions (BRCA→SKCM 324, BRCA→COAD 239, SKCM→BRCA 279).
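The headline numbers can be recomputed directly from the sixteen cell counts. The sketch below reads the matrix with rows and columns in the order SKCM, COAD, GBM, BRCA, which is the ordering consistent with the reported row totals; the variable names are illustrative only.

```python
LABELS = ["SKCM", "COAD", "GBM", "BRCA"]

# Rows are truth, columns are top-1 retrieved, both in LABELS order.
MATRIX = [
    [1005, 17, 199, 279],
    [3, 1441, 3, 63],
    [73, 3, 3204, 20],
    [324, 239, 43, 2954],
]

n = len(LABELS)
total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(n))

# Per-class top-1 accuracy: diagonal cell divided by its row total.
per_class = {LABELS[i]: MATRIX[i][i] / sum(MATRIX[i]) for i in range(n)}

# Largest off-diagonal cell as (count, truth, retrieved).
largest_error = max(
    (MATRIX[i][j], LABELS[i], LABELS[j])
    for i in range(n) for j in range(n) if i != j
)

print(f"top-1 overall: {correct}/{total} = {correct / total:.2%}")
for name, acc in per_class.items():
    print(f"{name}: {acc:.2%}")
print("largest off-diagonal:", largest_error)
```

On these counts the largest single off-diagonal cell is BRCA → SKCM (324), and SKCM → BRCA (279) is the largest error within the SKCM row.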
Fig 05 · Clinical‑workflow tier distribution · Phase B · Full NCT‑CRC · 93,522 patches · 4‑encoder

30 / 12 / 58 — consensus tightens, safety statistic holds

Release‑benign patches contain zero cancers. Tumor‑flag patches are 98.2% true cancer. Human‑in‑the‑loop captures more of the long tail. There are no contradictions. The 2 cancers the system was uncertain about went to HIL — not to silent release.

Chart: tier distribution of 93,522 patches; 0 cancers missed of 11,111 real.
RELEASE_BENIGN · 27,927 · 29.9% · 0 tumors
TUMOR_FLAG · 11,316 · 12.1% · 98.2% prec.
HUMAN_IN_LOOP · 54,279 · 58.0% · 2 tumors
CONTRADICTION · 0 · 0.0% · never triggered
Over‑flag on non‑tumor tissue: 207 / 82,411 = 0.25%. The shift from 44/12/44 (2‑enc) to 30/12/58 (4‑enc) shows the consensus tightening: more conservative about auto‑release, more patches routed to human review — and the safety statistic (0 cancers released as benign) holds at clinical scale.
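The tier shares, the tumor-flag precision, and the over-flag rate all follow from the reported counts. In the sketch below the number of tumors inside TUMOR_FLAG is derived (total tumors minus those released or sent to human review) rather than taken from the source, and the variable names are illustrative.

```python
TOTAL_PATCHES = 93_522
TOTAL_TUMORS = 11_111

patches = {
    "RELEASE_BENIGN": 27_927,
    "TUMOR_FLAG": 11_316,
    "HUMAN_IN_LOOP": 54_279,
    "CONTRADICTION": 0,
}
tumors_released = 0  # tumors in RELEASE_BENIGN
tumors_to_hil = 2    # tumors in HUMAN_IN_LOOP

# Derived, not reported directly: tumors that landed in TUMOR_FLAG.
tumors_flagged = TOTAL_TUMORS - tumors_released - tumors_to_hil
over_flagged = patches["TUMOR_FLAG"] - tumors_flagged
non_tumor = TOTAL_PATCHES - TOTAL_TUMORS

shares = {tier: count / TOTAL_PATCHES for tier, count in patches.items()}
precision = tumors_flagged / patches["TUMOR_FLAG"]
over_flag_rate = over_flagged / non_tumor

for tier, share in shares.items():
    print(f"{tier}: {share:.1%}")
print(f"tumor-flag precision: {precision:.1%}")
print(f"over-flag: {over_flagged} / {non_tumor} = {over_flag_rate:.2%}")
```

The derived over-flag count matches the 207 / 82,411 quoted above, so the tier counts, the precision, and the over-flag rate are mutually consistent.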
Fig 06 · Dual‑encoder agreement · Phase A · Phikon‑v2 ∩ UNI · 2‑encoder

When the two encoders agree, they are right 91.49% of the time — they agree on 42.3% of queries

Agreement frequency (the “recall” panel, strictly a coverage rate) and agreement correctness (precision) read together as a usable confidence signal. Separately, both encoders independently flag 8.9% of queries as out‑of‑distribution.

Chart: agreement funnel.
All OOD queries · 9,870 · 100%
Phikon‑v2 and UNI agree on top‑1 · 4,172 · 42.3%
…and top‑1 is correct · 3,817 · 91.49% of agreements
Recall · how often they agree
42.3%
4,172 of 9,870 OOD queries. Agreement is a useful gate precisely because it doesn't fire on everything.
Precision · correct when they agree
91.49%
3,817 of 4,172 agreements. Joint agreement on top‑1 retrieval is a strong high‑confidence signal in OOD conditions.
Independent OOD‑flag agreement (both encoders flag the query as out‑of‑distribution): 883 / 9,870 = 8.9% — the high‑confidence “punt to a human” bucket.
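The two agreement statistics can be computed from per-query top-1 labels. This is a small sketch with toy labels; the function name `agreement_gate` is hypothetical, and the last lines only recheck the reported ratios from the counts given above.

```python
from typing import Sequence, Tuple

def agreement_gate(pred_a: Sequence[str],
                   pred_b: Sequence[str],
                   truth: Sequence[str]) -> Tuple[float, float]:
    """Return (agreement rate, precision when agreeing) for two encoders' top-1 labels."""
    agreed = [(a, t) for a, b, t in zip(pred_a, pred_b, truth) if a == b]
    rate = len(agreed) / len(truth)
    precision = sum(a == t for a, t in agreed) / len(agreed) if agreed else 0.0
    return rate, precision

# Toy top-1 labels for five queries.
phikon = ["GBM", "BRCA", "SKCM", "COAD", "BRCA"]
uni = ["GBM", "BRCA", "BRCA", "COAD", "SKCM"]
truth = ["GBM", "SKCM", "SKCM", "COAD", "BRCA"]
rate, precision = agreement_gate(phikon, uni, truth)
print(rate, precision)

# Recheck of the reported figures from their counts.
print(f"{4172 / 9870:.1%}")   # share of queries where the encoders agree
print(f"{3817 / 4172:.2%}")   # share of agreements that are correct
```

The gate trades coverage for precision: the queries on which the encoders disagree are not scored by it at all and need another route, such as human review.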
Fig 07 · Per‑encoder Hamming fingerprints · Phase B · NEW · CONCH · Virchow2 · Phikon‑v2 · UNI

Each encoder has its own “tightness”

Lower median nearest‑neighbor distance = more confident matches. The diversity across encoders is what makes the 4‑encoder consensus meaningful — they make different mistakes, so when they agree the signal is uncommonly trustworthy.

Chart: median nearest‑neighbor Hamming distance per encoder (x‑axis: 0.00 to 0.20; lower = tighter matches).
CONCH · Mahmood Lab · vision‑language · 0.078 · tightest matcher
Virchow2 · Paige AI · largest model · 0.102
Phikon‑v2 · Owkin · 2‑encoder baseline · 0.132
UNI · Mahmood Lab · 2‑encoder baseline · 0.169
CONCH (vision‑language, newer) is the tightest matcher; Virchow2 (largest model) is second‑tightest. Phikon‑v2 and UNI form the earlier 2‑encoder baseline. None is “best” alone: because the encoders make different mistakes, their consensus is a more trustworthy signal than any single encoder.
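A per-encoder "tightness" figure of this kind could be computed as below. The sketch assumes the reported values are the median nearest-neighbor Hamming distance divided by the signature length (so 0.078 would be roughly 80 of 1024 bits); the source does not state the normalization, and the function names and 8-bit toy signatures are hypothetical.

```python
from statistics import median
from typing import Sequence

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-packed signatures."""
    return bin(a ^ b).count("1")

def median_nn_distance(queries: Sequence[int],
                       library: Sequence[int],
                       bits: int = 1024) -> float:
    """Median over queries of the distance to the closest library signature, as a fraction of bits."""
    nearest = [min(hamming(q, ref) for ref in library) for q in queries]
    return median(nearest) / bits

# Toy 8-bit signatures stand in for 1024-bit ones.
library = [0b00000000, 0b11111111]
queries = [0b00000001, 0b00000111, 0b11110000]
print(median_nn_distance(queries, library, bits=8))
```

Running the same function once per encoder over a shared query set would give one number per encoder, comparable across encoders only if every signature has the same length.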