Validation results · case-based retrieval · UPDATED 4‑encoder follow‑up

Pathology AI — seven charts from four runs and one cross-hospital test

Andy Grossberg · Waving Cat Learning Systems · 2026‑05‑23 · 93,522 in‑dist patches · 11,111 cancers · 9,870 OOD queries

Multi‑encoder retrieval over 1024‑bit sign‑bit signatures; Hamming distance against a labeled reference library; dual‑filter decision logic with built‑in OOD deferral. No model training in the pipeline.
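A minimal sketch of the retrieval primitive described above, under stated assumptions: each signature bit is taken to be 1 when the corresponding embedding dimension is positive, and retrieval is a plain nearest-neighbor scan. The function names, the toy 8-dimensional embeddings, and the "tumor"/"benign" labels are hypothetical illustrations, not the pipeline's actual code or data.

```python
from typing import List, Sequence, Tuple


def sign_bit_signature(embedding: Sequence[float]) -> int:
    """Pack the sign of each dimension into one integer (bit i = 1 if x_i > 0).

    Assumption: zero and negative values both map to bit 0.
    """
    signature = 0
    for i, value in enumerate(embedding):
        if value > 0:
            signature |= 1 << i
    return signature


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def nearest_reference(query: int, library: List[Tuple[int, str]]) -> Tuple[str, int]:
    """Return (label, distance) of the closest labeled reference signature."""
    best_sig, best_label = min(library, key=lambda item: hamming(query, item[0]))
    return best_label, hamming(query, best_sig)


# Toy 8-dimensional embeddings; the signatures in the report are 1024 bits.
library = [
    (sign_bit_signature([0.3, -1.2, 0.8, 0.1, -0.4, 0.9, -0.2, 0.5]), "tumor"),
    (sign_bit_signature([-0.7, 0.6, -0.1, -0.9, 0.2, -0.3, 0.4, -0.8]), "benign"),
]
query = sign_bit_signature([0.2, -0.5, 0.9, -0.1, -0.6, 0.7, -0.3, 0.1])
label, distance = nearest_reference(query, library)
print(label, distance)
```

Because the library is only looked up, never fitted, a sketch like this involves no training step; adding a labeled reference is an append.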

Fig 01 · In‑distribution reproducibility · Phase B · 21× scale range · 4 runs · latest is 4‑encoder follow‑up

Tumor sensitivity holds at 100.00% across a 21× range of cart size

Every run reproduced the same result: zero cancers missed, zero false benigns. The over‑flag rate on non‑tumor tissue improved with scale in the 2‑encoder runs, then rose under the more conservative 4‑encoder consensus.

Missed cancers per run: 0 · 0 · 0 · 0. Release‑tier specificity: 100.00% on every run. Over‑flag rate at 50K → 100K (2‑enc) → 100K (4‑enc): 0.10% → 0.067% → 0.25% — more conservative consensus catches more borderline tissue, none of which were cancers. The 4‑encoder run uses 11,111 real cancer patches (up from 5,555 in the 2‑encoder stratified subset). Encoders: Phikon‑v2 + UNI + Virchow2 + CONCH.

Fig 02 · OOD per‑cancer top‑1 · Phase A · Phikon‑v2 · TCGA‑UT external split · 2‑encoder

The 30‑point gap between solid‑organ cancers and skin is the visual story

Cross‑hospital retrieval per cancer. The SKCM gap is what the deferral layer is meant to catch — those queries are flagged for human review, not silently misclassified.

Lower SKCM accuracy is paired with an 88.2% OOD flag rate — uncertain queries are routed to human‑in‑the‑loop, not committed to.

Fig 03 · Encoder comparison · Phase A · Phikon‑v2 vs UNI · OOD set · 2‑encoder

Phikon‑v2 leads on retrieval and on its own honesty

Phikon‑v2 retrieves better at top‑1 and top‑5, and flags far more of its uncertain queries for human review. UNI is over‑confident on out‑of‑distribution data.

Legend: Phikon‑v2 · UNI

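A query is OOD‑flagged when its Hamming distance to the nearest reference exceeds the 95th percentile of intra‑cart distances; the OOD flag rate is the share of queries flagged. Higher is more honest, not less accurate.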
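The flagging rule can be sketched as follows. This is an illustration under assumptions: "intra‑cart distances" is read here as a list of reference-to-reference distances already computed, the percentile uses linear interpolation, and all names and toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def ood_flags(
    query_distances: Sequence[float],
    intra_cart_distances: Sequence[float],
    q: float = 95.0,
) -> List[bool]:
    """Flag each query whose distance exceeds the q-th percentile threshold."""
    threshold = percentile(intra_cart_distances, q)
    return [d > threshold for d in query_distances]


def flag_rate(flags: Sequence[bool]) -> float:
    """Share of queries flagged as out-of-distribution."""
    return sum(flags) / len(flags)


# Toy numbers: 101 intra-cart distances (100..200) put the threshold at 195.
intra_cart = list(range(100, 201))
queries = [120, 190, 196, 240]
flags = ood_flags(queries, intra_cart)
print(flags, flag_rate(flags))
```

Calibrating the threshold on the reference library alone means the rule needs no OOD examples to set up, which is why a high flag rate on external data reads as caution rather than error.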

Fig 04 · Cross‑hospital confusion matrix · Phase A · Phikon‑v2 · 9,870 queries · 2‑encoder

Diagonal dominance — with one interesting off‑diagonal cell

Rows are truth, columns are top‑1 retrieved. The SKCM → BRCA cell (279) is one of the largest off‑diagonal cells; soft‑tissue / connective texture appears to be where the skin‑cancer error budget lives.

Diagonal (correct) = 8,604 / 9,870 = 87.2%. Off‑diagonal mass is concentrated in BRCA confusions (BRCA→SKCM 324, BRCA→COAD 239, SKCM→BRCA 279).
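For readers who want to reproduce the summary numbers from a confusion matrix, here is a short sketch. The 3×3 matrix below is a toy with placeholder class names, not the report's data; only the 8,604 / 9,870 totals come from the caption.

```python
from typing import Dict, Tuple

# confusion[truth][retrieved] = count
Confusion = Dict[str, Dict[str, int]]


def diagonal_accuracy(confusion: Confusion) -> float:
    """Fraction of queries whose top-1 retrieved label equals the truth."""
    correct = sum(row.get(truth, 0) for truth, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def largest_off_diagonal(confusion: Confusion) -> Tuple[str, str, int]:
    """Return (truth, retrieved, count) for the biggest misclassification cell."""
    cells = [
        (truth, retrieved, count)
        for truth, row in confusion.items()
        for retrieved, count in row.items()
        if truth != retrieved
    ]
    return max(cells, key=lambda cell: cell[2])


# Toy matrix with placeholder classes A, B, C (illustrative only).
toy = {
    "A": {"A": 90, "B": 7, "C": 3},
    "B": {"A": 12, "B": 80, "C": 8},
    "C": {"A": 5, "B": 5, "C": 90},
}
print(diagonal_accuracy(toy))
print(largest_off_diagonal(toy))

# Totals reported in the caption.
reported_accuracy = 8604 / 9870
print(f"{reported_accuracy:.2%}")
```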

Fig 05 · Clinical‑workflow tier distribution · Phase B · Full NCT‑CRC · 93,522 patches · 4‑encoder

30 / 12 / 58 (release‑benign / tumor‑flag / human‑in‑the‑loop, % of patches): consensus tightens, safety statistic holds

Release‑benign patches contain zero cancers. Tumor‑flag patches are 98.2% true cancer. Human‑in‑the‑loop (HIL) captures more of the long tail. There are no contradictions. The 2 cancers the system was uncertain about went to HIL, not to silent release.

Over‑flag on non‑tumor tissue: 207 / 82,411 = 0.25%. The shift from 44/12/44 (2‑enc) to 30/12/58 (4‑enc) shows the consensus tightening: more conservative about auto‑release, more patches routed to human review — and the safety statistic (0 cancers released as benign) holds at clinical scale.
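The reported counts can be cross-checked with a few lines of arithmetic. One assumption is made and labeled in the code: the tumor‑flag tier is taken to contain every cancer not sent to HIL plus the 207 over‑flagged non‑tumor patches. Variable names are illustrative.

```python
# Counts taken from the report.
total_patches = 93_522
cancers = 11_111
cancers_to_hil = 2
cancers_released = 0
overflagged_non_tumor = 207

# Derived quantities.
non_tumor = total_patches - cancers
cancers_flagged = cancers - cancers_to_hil - cancers_released

# Assumption: tumor-flag tier = flagged cancers + over-flagged non-tumor patches.
tumor_flag_tier = cancers_flagged + overflagged_non_tumor

over_flag_rate = overflagged_non_tumor / non_tumor
tumor_flag_precision = cancers_flagged / tumor_flag_tier
tumor_flag_share = tumor_flag_tier / total_patches

print(f"non-tumor patches:    {non_tumor:,}")
print(f"over-flag rate:       {over_flag_rate:.2%}")
print(f"tumor-flag precision: {tumor_flag_precision:.1%}")
print(f"tumor-flag share:     {tumor_flag_share:.1%}")
```

Under that assumption the derived figures line up with the stated 82,411 denominator, the 0.25% over‑flag rate, the 98.2% tumor‑flag purity, and the 12% middle tier.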

Fig 06 · Dual‑encoder agreement · Phase A · Phikon‑v2 ∩ UNI · 2‑encoder

When the two encoders agree, they are right 91.49% of the time — they agree on 42.3% of queries

Agreement frequency (labeled recall here; strictly, coverage) and agreement correctness (precision) read together as a usable confidence signal. Separately, both encoders independently OOD‑flag 8.9% of queries.

Recall · how often they agree

42.3%

4,172 of 9,870 OOD queries. Agreement is a useful gate precisely because it doesn't fire on everything.

Precision · correct when they agree

91.49%

3,817 of 4,172 agreements. Joint agreement on top‑1 retrieval is a strong high‑confidence signal in OOD conditions.

Independent OOD‑flag agreement (both encoders flag the query as out‑of‑distribution): 883 / 9,870 = 8.9% — the high‑confidence “punt to a human” bucket.
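The two agreement statistics can be computed from per-query top‑1 labels as sketched below. The function name and the five toy queries are hypothetical; the final lines only re-derive the percentages from the counts reported above.

```python
from typing import Sequence, Tuple


def agreement_stats(
    top1_a: Sequence[str], top1_b: Sequence[str], truth: Sequence[str]
) -> Tuple[float, float]:
    """Return (coverage, precision) for two encoders' top-1 labels.

    coverage  = share of queries on which both encoders retrieve the same label
    precision = share of those agreements that match the true label
    """
    agreed = [(a, t) for a, b, t in zip(top1_a, top1_b, truth) if a == b]
    coverage = len(agreed) / len(truth)
    precision = sum(a == t for a, t in agreed) / len(agreed) if agreed else 0.0
    return coverage, precision


# Toy labels for five queries (illustrative only).
enc_a = ["BRCA", "SKCM", "COAD", "BRCA", "SKCM"]
enc_b = ["BRCA", "BRCA", "COAD", "SKCM", "SKCM"]
truth = ["BRCA", "SKCM", "COAD", "BRCA", "BRCA"]
coverage, precision = agreement_stats(enc_a, enc_b, truth)
print(coverage, precision)

# Percentages re-derived from the reported counts.
print(f"{4172 / 9870:.1%}")  # agreement coverage
print(f"{3817 / 4172:.2%}")  # precision when the encoders agree
print(f"{883 / 9870:.1%}")   # both encoders OOD-flag the query
```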

Fig 07 · Per‑encoder Hamming fingerprints · Phase B · NEW CONCH · Virchow2 · Phikon‑v2 · UNI

Each encoder has its own “tightness”

Lower median nearest‑neighbor distance = more confident matches. The diversity across encoders is what makes the 4‑encoder consensus meaningful — they make different mistakes, so when they agree the signal is uncommonly trustworthy.

CONCH (vision‑language, newer) is the tightest matcher. Virchow2 (largest model) is second‑tightest. Phikon‑v2 and UNI are the historical 2‑encoder baseline. None is “best” alone; the value lies in the consensus, because the encoders make different mistakes.
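To make the consensus idea concrete, here is one plausible routing rule. It is an assumption for illustration, not the decision logic used in these runs, which the report does not spell out: a patch is committed only when every encoder retrieves the same class and none flags it as OOD; everything else is deferred. The `EncoderVote` type, the `route` function, and the tier strings are hypothetical.

```python
from typing import Dict, NamedTuple


class EncoderVote(NamedTuple):
    label: str  # top-1 retrieved label, e.g. "tumor" or a non-tumor tissue class
    ood: bool   # True if this encoder flagged the query as out-of-distribution


def route(votes: Dict[str, EncoderVote]) -> str:
    """Assign a workflow tier from per-encoder votes (illustrative rule only)."""
    labels = {vote.label for vote in votes.values()}
    # Any OOD flag or any disagreement defers the patch to a human.
    if any(vote.ood for vote in votes.values()) or len(labels) != 1:
        return "human_review"
    # Unanimous, in-distribution: commit to the shared label.
    return "tumor_flag" if labels == {"tumor"} else "release_benign"


unanimous_tumor = {
    "CONCH": EncoderVote("tumor", False),
    "Virchow2": EncoderVote("tumor", False),
    "Phikon-v2": EncoderVote("tumor", False),
    "UNI": EncoderVote("tumor", False),
}
split_vote = dict(unanimous_tumor, UNI=EncoderVote("benign", False))
print(route(unanimous_tumor))
print(route(split_vote))
```

A rule of this shape becomes more conservative as encoders are added, since each extra encoder is another chance to disagree or to raise an OOD flag. That is consistent with the direction of the 2‑encoder to 4‑encoder tier shift, though it does not establish that this is the rule behind it.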