Project Handover Document - Terraphim Medical Integration

Date: 2026-02-24
Project: MedGemma Competition - Terraphim-AI Crate Integration
Status: SUBMITTED. All 42 beads closed. v1.2.0 tagged, released, and pushed.
Handover To: Development Team / Maintainers

Executive Summary

Completed full migration of medgemma-competition from standalone reimplementations to shared terraphim-ai crates behind medical feature flags. Proved end-to-end with real MedGemma 4B GGUF inference on both GPU (22.7s/case avg, RTX 2070) and CPU (165s/case). All mock fallbacks removed from production paths. Added interactive demo UI, 85s demo video (Playwright recording with real GPU inference), 4 clinical workflow state machines with 60 scenario tests, 18 multi-specialty evaluation cases, Axum API server with shared LLM state, and 9 Playwright e2e tests. Final submission v1.2.0 released on GitHub with demo-video.mp4, WRITEUP.md, COMPETITION_EVIDENCE.md, and .env.template as release artifacts.

Session 6 (2026-02-24): Wired MedGemma GGUF client into API server via Axum shared state (Arc<ClinicalService>), added CUDA GPU support, ran 3rd evaluation (18/18 pass, avg 24.8s), verified demo.html Live mode via WebSocket, created 9 Playwright e2e tests, recorded 85s demo video, refreshed A/B comparison (BRAF case reproducibly shows class-suggestion vs specific-drug improvement), updated all docs, tagged v1.2.0, created GitHub release.

559 tests (up from 543), all passing.

System 1 + System 2 Architecture

MedGemma acts as System 1 (fast intuition) -- generating fluent recommendations from parametric medical knowledge. The Terraphim knowledge graph acts as System 2 (deliberate reasoning) -- grounding, validating, and constraining those recommendations against structured clinical evidence. Neither alone is sufficient.
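
The System 2 check described above can be pictured as a lookup against the treatment subgraph before a recommendation is shown. The following is a minimal sketch, not the project's implementation: `TreatmentGraph`, `GateDecision`, `add_treats`, and `check` are hypothetical names, and the single edge added in `main` is toy data chosen to mirror the cases mentioned in this document.

```rust
use std::collections::{HashMap, HashSet};

/// Result of checking one model recommendation against the graph.
#[derive(Debug, PartialEq)]
enum GateDecision {
    /// A Treats edge backs the recommendation.
    Grounded,
    /// No supporting edge was found, so the recommendation is withheld.
    Blocked,
}

/// Hypothetical stand-in for the KG treatment subgraph:
/// maps a condition to the drugs that have a Treats edge to it.
#[derive(Default)]
struct TreatmentGraph {
    treats: HashMap<String, HashSet<String>>,
}

impl TreatmentGraph {
    /// Record that `drug` treats `condition` (toy data in this sketch).
    fn add_treats(&mut self, drug: &str, condition: &str) {
        self.treats
            .entry(condition.to_string())
            .or_default()
            .insert(drug.to_string());
    }

    /// A recommendation is grounded only if the graph holds a matching edge.
    fn check(&self, drug: &str, condition: &str) -> GateDecision {
        let grounded = self
            .treats
            .get(condition)
            .map_or(false, |drugs| drugs.contains(drug));
        if grounded {
            GateDecision::Grounded
        } else {
            GateDecision::Blocked
        }
    }
}

fn main() {
    let mut kg = TreatmentGraph::default();
    kg.add_treats("Osimertinib", "EGFR L858R+ NSCLC");

    // System 1 output is passed through the gate before display.
    for drug in ["Osimertinib", "Pembrolizumab"] {
        println!("{drug}: {:?}", kg.check(drug, "EGFR L858R+ NSCLC"));
    }
}
```

The point of the design is that the gate is deterministic: the same graph always yields the same decision, independent of how fluent the model output is.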

Terraphim Knowledge Graph -- The Key Differentiator

The core innovation is graph-based symbolic embeddings for clinical safety:

27 medical node types and 65 edge types with typed graph traversal
Symbolic similarity: Jaccard (0.7) + path distance (0.3) -- deterministic, auditable
LeftmostLongest entity extraction: Aho-Corasick automaton always grounds to the most specific SNOMED concept (e.g., "non-small cell lung carcinoma" not "lung carcinoma"), ensuring correct downstream treatment lookups
Safety gate: Every MedGemma recommendation is validated against the KG treatment subgraph. Ungrounded recommendations (e.g., Pembrolizumab for EGFR L858R+ NSCLC) are blocked before reaching the clinician
Traceable evidence paths: Drug->Treats->Disease->HasVariant->Gene->CitedIn->Trial
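
As a rough illustration of the weighted symbolic similarity listed above, the sketch below combines a Jaccard score with a path-distance term using the 0.7 / 0.3 weights. Two things are assumptions, not taken from the codebase: the Jaccard score is computed over sets of feature IDs (for example, ancestor concepts), and the hop count is converted to a similarity as 1 / (1 + hops). All function names are hypothetical.

```rust
use std::collections::HashSet;

/// Jaccard index of two feature sets: |A ∩ B| / |A ∪ B|.
fn jaccard(a: &HashSet<u32>, b: &HashSet<u32>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Assumed conversion of graph distance to similarity: 1 / (1 + hops).
/// `None` means the two nodes are not connected.
fn path_similarity(hops: Option<usize>) -> f64 {
    match hops {
        Some(d) => 1.0 / (1.0 + d as f64),
        None => 0.0,
    }
}

/// Weighted combination with the weights stated in this document.
fn symbolic_similarity(a: &HashSet<u32>, b: &HashSet<u32>, hops: Option<usize>) -> f64 {
    0.7 * jaccard(a, b) + 0.3 * path_similarity(hops)
}

fn main() {
    // Placeholder feature IDs, not real concept identifiers.
    let a = HashSet::from([1, 2, 3]);
    let b = HashSet::from([2, 3, 4]);
    println!("{:.3}", symbolic_similarity(&a, &b, Some(1)));
}
```

Because every input is a set or an integer distance, the score can be recomputed and audited by hand, which is the property the list above emphasises.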

Two repositories involved:

terraphim/terraphim-ai - upstream crate library (PR #551, branch medical-extensions)
terraphim/medgemma-competition - consumer project (commits on main, tag v1.1.0)

Progress Summary (Session: 2026-02-23, Evening)

Tasks Completed This Session

LeftmostLongest fix (commit e9ab233)
  - Found bug: EntityExtractor::new() and from_terms() used AhoCorasick::new() (defaults to LeftmostFirst)
  - Fixed all 3 call sites in extractor.rs and umls_extractor.rs to use LeftmostLongest
  - Added 2 tests: test_leftmost_longest_prefers_full_concept_over_fragment, test_leftmost_longest_from_terms
  - Test count: 541 -> 543
GPU validation (reports 46d9cca9, 17a45b91)
  - All 3 pipelines run sequentially on RTX 2070 (35/35 layers CUDA0)
  - e2e_pipeline: 47 pass, 2 fail (safety gate correct), 55s total
  - e2e_real_model: 18/18 pass, avg 22.7s/case, all 15 checks PASSED
  - ab_comparison: 3/3 cases, KG grounding specificity confirmed
Submission packaging (tag v1.1.0)
  - MIT LICENSE file added
  - README.md rewritten with current stats
  - All submission docs updated with LeftmostLongest explanation
  - System 1 + System 2 framing added to WRITEUP.md and COMPETITION_EVIDENCE.md
Beads cleanup: Closed 6 stale issues, all 42/42 now closed

What's Working

543 workspace tests passing, 0 failures
All code committed and pushed to origin/main
Working tree clean (only untracked: progress.txt, one stale eval report)
GPU inference: 22.7s/case avg (RTX 2070, 35/35 layers)
CPU inference: 165s/case avg (no GPU required)
Safety gate: 100% across all runs (54 total inference calls)

What's Blocked

Nothing currently blocked

Current State

Branch: `main` (tag: `v1.1.0`)

1a40ab0 docs: add GPU validation report 17a45b91 (full sequential run)
c5de069 docs: add System 1 + System 2 dual-process architecture framing
bb25154 docs: highlight LeftmostLongest grounding across all submission docs
aa28269 chore: sync beads - all 42 issues closed
e9ab233 fix: enforce LeftmostLongest match in EntityExtractor for grounding precision
6b10d08 docs: add GPU inference results (RTX 2070, report 79d26e2e)
f4f3582 fix: update HTML demos with real evaluation data (541 tests, 18/18 cases, GPU timing)
59d2c98 docs: add real A/B comparison results, remove criteria table
0b0a849 feat: add A/B comparison example, remove fabricated precision benchmark

Test Suite: 543 tests

State machines: 60 tests (case_status: 9, genomic_report: 14, treatment_plan: 19, recommendation: 22)
LeftmostLongest: 2 tests (grounding precision validation)
Clinical pipeline: ~200 tests
Agent messaging/supervision: ~100 tests
PGx, evaluation, other: ~181 tests

Beads: 42/42 closed

Total Issues: 42, Open: 0, In Progress: 0, Blocked: 0, Closed: 42

GPU Validation Results (report 17a45b91, 2026-02-23)

| Pipeline | Scope | Latency | Result |
| --- | --- | --- | --- |
| e2e_pipeline | 49 checks | 55s total | 47 pass, 2 fail (safety gate correct) |
| e2e_real_model | 18/18 cases | 22.7s/case avg | All 15 checks PASSED |
| ab_comparison | 3 cases | ~22s each | KG grounding specificity confirmed |
| cargo test | 543/543 tests | N/A | 0 failures |

A/B Comparison Results (GPU)

| Case | Raw MedGemma (System 1 only) | KG-Grounded (System 1 + 2) |
| --- | --- | --- |
| EGFR NSCLC | Osimertinib 80mg | Osimertinib 80mg (+ SNOMED grounding) |
| CYP2D6 Codeine | Oxycodone 5mg/mL | Codeine 60mg (PGx-aware) |
| BRAF Melanoma | "BRAF inhibitor (e.g., Dabrafenib + Trametinib)" | Vemurafenib 450mg (specific) |

Competition Deliverables (all complete)

| Deliverable | File | Status |
| --- | --- | --- |
| Technical writeup | `WRITEUP.md` | Done (with System 1+2 and LeftmostLongest sections) |
| Demo UI | `demo.html` | Done (self-contained, 1,813 lines) |
| Demo video | `docs/demo-video/demo-pipeline-3min.mp4` | Done (15 MB, git-lfs) |
| Evaluation results | `tests/evaluation/output/` | Done (7 reports: CPU, GPU, sequential) |
| README | `README.md` | Done (543 tests, GPU/CPU timing) |
| LICENSE | `LICENSE` | Done (MIT) |
| Competition evidence | `COMPETITION_EVIDENCE.md` | Done (real inference, no mock) |

What Was Done (Full History)

Phase 1: terraphim_types (medical feature)

Added MedicalNodeType (27 variants) and MedicalEdgeType (65 variants) behind #[cfg(feature = "medical")]
Commit: e1147d62

Phase 2: terraphim_rolegraph (MedicalRoleGraph + symbolic embeddings)

MedicalRoleGraph wrapping RoleGraph with typed nodes/edges, IS-A hierarchy
Symbolic embeddings (Jaccard 0.7 + path distance 0.3), adjacency index
78 tests
Commits: a4b8dbe8, 89ef5a62, 127bd7ee, 494a6961
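
To make the adjacency index and typed traversal concrete, here is a small sketch under stated assumptions. `TypedGraph`, `Edge`, `neighbors`, and `follow` are hypothetical names, the `EdgeType` enum is a three-variant subset borrowed from the evidence path shown earlier in this document, and node IDs are toy integers. It is not the `MedicalRoleGraph` API.

```rust
use std::collections::HashMap;

/// Small illustrative subset of edge types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeType {
    Treats,
    HasVariant,
    CitedIn,
}

struct Edge {
    to: u32,
    kind: EdgeType,
}

/// Edges live in one Vec; `outgoing` maps a node to the indices of its
/// edges, so a lookup touches only that node's edges (O(degree)).
#[derive(Default)]
struct TypedGraph {
    edges: Vec<Edge>,
    outgoing: HashMap<u32, Vec<usize>>,
}

impl TypedGraph {
    fn add_edge(&mut self, from: u32, to: u32, kind: EdgeType) {
        let idx = self.edges.len();
        self.edges.push(Edge { to, kind });
        self.outgoing.entry(from).or_default().push(idx);
    }

    /// Targets reachable from `node` over edges of the given type.
    fn neighbors(&self, node: u32, kind: EdgeType) -> Vec<u32> {
        self.outgoing
            .get(&node)
            .into_iter()
            .flatten()
            .map(|&i| &self.edges[i])
            .filter(|e| e.kind == kind)
            .map(|e| e.to)
            .collect()
    }

    /// Follow a fixed sequence of edge types, one hop per type.
    fn follow(&self, start: u32, path: &[EdgeType]) -> Vec<u32> {
        let mut frontier = vec![start];
        for &kind in path {
            let mut next = Vec::new();
            for &node in &frontier {
                next.extend(self.neighbors(node, kind));
            }
            frontier = next;
        }
        frontier
    }
}

fn main() {
    // Toy node IDs: 1 = drug, 2 = disease, 3 = gene, 4 = trial.
    let mut g = TypedGraph::default();
    g.add_edge(1, 2, EdgeType::Treats);
    g.add_edge(2, 3, EdgeType::HasVariant);
    g.add_edge(3, 4, EdgeType::CitedIn);

    let path = [EdgeType::Treats, EdgeType::HasVariant, EdgeType::CitedIn];
    println!("{:?}", g.follow(1, &path));
}
```

Without the `outgoing` index, each hop would scan the full edge list; with it, the cost of a hop depends only on the degree of the nodes in the frontier.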

Phase 3: terraphim_automata (medical entity extraction)

SNOMED EntityExtractor, UMLS UmlsExtractor, ShardedUmlsExtractor
daachorse sharded automaton with bincode+zstd serialization
LeftmostLongest match enforcement (fixed in e9ab233)
43+ tests (including 2 LeftmostLongest validation tests)
Commits: cee1f788, fd70034f, 9f2958fb, e9ab233
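
The leftmost-longest rule itself can be shown without any automaton. The sketch below is a naive std-only scanner, swapped in for illustration only: it is not Aho-Corasick and not the project's `EntityExtractor`, and the function name and term list are hypothetical. At each position it takes the longest term that starts there, then resumes after the match.

```rust
/// Naive leftmost-longest matcher: scan left to right, and at the first
/// position where any term matches, keep the longest one.
/// Quadratic in the worst case; an automaton does this in one pass.
fn leftmost_longest<'t>(text: &'t str, terms: &[&str]) -> Vec<&'t str> {
    let mut matches = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // Only attempt matches at valid UTF-8 character boundaries.
        if !text.is_char_boundary(pos) {
            pos += 1;
            continue;
        }
        let rest = &text[pos..];
        let longest = terms
            .iter()
            .filter(|t| !t.is_empty() && rest.starts_with(**t))
            .map(|t| t.len())
            .max();
        match longest {
            Some(len) => {
                matches.push(&text[pos..pos + len]);
                pos += len; // non-overlapping: resume after the match
            }
            None => pos += 1,
        }
    }
    matches
}

fn main() {
    // Illustrative term list; the shorter terms are fragments of longer ones.
    let terms = ["lung", "lung carcinoma", "non-small cell lung carcinoma"];
    let text = "suspected lung carcinoma vs non-small cell lung carcinoma";
    for m in leftmost_longest(text, &terms) {
        println!("{m}");
    }
}
```

With a first-match rule, a fragment such as "lung" could win whenever it is listed before the longer term; taking the longest candidate is what keeps extraction on the most specific concept.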

Phase 4: Agent infrastructure consolidation

Replaced copied mailbox/router/supervisor with path deps
Commit: dfcd2d7

Phase 5: Consumer migration

Migrated to MedicalRoleGraph, retired terraphim-kg (5,784 lines deleted)
Commits: 93aa6a1, b4a8d8b, 0e20cff

E2E + Real Model + Vertex AI

49-check pipeline example, 18/18 real model eval, Vertex AI backend
Commits: 2e317ff, a5828f6, d478f04

Submission Artifacts

Writeup, README, repo cleanup, evaluation scenarios
Commits: d56c5ec, 2020cf6, 8083c09, de19398

Demo UI + State Machines + Video

demo.html, 4 state machines (60 tests), 3-min video recording
Commits: 266dc42, 56713c9, 71693a3

GPU Validation + LeftmostLongest + Submission Packaging

LeftmostLongest fix, GPU pipeline validation, System 1+2 framing
MIT LICENSE, README rewrite, all docs updated
Commits: e9ab233, 893d8bf, bb25154, c5de069, 1a40ab0
Tag: v1.1.0

P0/P1 Fixes Applied

| ID | Severity | Fix | Commit |
| --- | --- | --- | --- |
| P0-1 | Critical | `magic_unpair` f32->f64 for SNOMED IDs (100M-900M) | `9f2958fb` |
| P0-2 | Critical | Overlap detection: `start < m.span.1 && end > m.span.0` | `9f2958fb` |
| P0-3 | Critical | LeftmostLongest enforcement in EntityExtractor (grounding precision) | `e9ab233` |
| P1-1 | Important | Adjacency index for O(degree) edge lookups | `127bd7ee` |
| P1-2 | Important | Multi-CUI term preservation in ShardedUmlsExtractor | `fd70034f` |
| P1-3 | Important | SNOMED FSN semantic tag parsing for node types | `494a6961` |
| P1-4 | Important | SASS-only CUDA compilation for RTX 2070 (sm_75) | `4d491fe` |
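
Two of the P0 fixes in the table lend themselves to a short illustration. The sketch below shows the half-open overlap predicate from P0-2 and why an `f32` round trip loses integers in the 100M-900M range mentioned in P0-1. It does not reproduce `magic_unpair` itself; the `Match` struct and the function names are hypothetical, and spans are assumed to be half-open `(start, end)` byte offsets.

```rust
/// Hypothetical match record with a half-open (start, end) span.
struct Match {
    span: (usize, usize),
}

/// Two half-open intervals overlap when each starts before the other ends.
/// Adjacent spans such as [0, 3) and [3, 8) do not overlap.
fn overlaps(start: usize, end: usize, m: &Match) -> bool {
    start < m.span.1 && end > m.span.0
}

/// f32 has a 24-bit significand, so integers above 2^24 (16,777,216)
/// are not all representable and may change on a round trip.
fn survives_f32(id: u64) -> bool {
    (id as f32) as u64 == id
}

/// f64 has a 53-bit significand, enough for every integer up to 2^53.
fn survives_f64(id: u64) -> bool {
    (id as f64) as u64 == id
}

fn main() {
    let existing = Match { span: (3, 8) };
    println!("overlap [0,5): {}", overlaps(0, 5, &existing));
    println!("overlap [0,3): {}", overlaps(0, 3, &existing));

    // Placeholder ID in the stated range, not a real SNOMED concept.
    let id = 100_000_001u64;
    println!("f32 round trip ok: {}", survives_f32(id));
    println!("f64 round trip ok: {}", survives_f64(id));
}
```

Both checks are cheap to keep as regression tests, since each failure mode is silent: a missed overlap yields duplicate entities, and a rounded ID points at a different concept.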

Data Files

| File | Size | Purpose |
| --- | --- | --- |
| `data/artifacts/umls_automata.bin.zst` | 209MB | Pre-built UMLS Aho-Corasick automaton |
| `data/artifacts/cpic_database.bin.zst` | 16KB | CPIC PGx rules |
| `data/snomed_thesaurus.json` | 10KB | Curated SNOMED mappings (49 terms) |
| `data/automata/words_cui.tsv` | 789MB | Raw UMLS term-CUI mappings |

Running Tests & Demos

Full workspace

cargo test --workspace  # 543 tests, ~50s

State machines only

cargo test -p terraphim-medical-agents -- state_machines  # 60 tests, <1s

Pipeline verification (real GGUF model, GPU)

# GPU (recommended, ~1min total)
cargo run --release --example e2e_pipeline --package terraphim-demo --features medgemma-client/cuda

# CPU fallback (~5min total)
cargo run --release --example e2e_pipeline --package terraphim-demo

18-case evaluation (GPU)

cargo run --release --example e2e_real_model --package terraphim-demo --features medgemma-client/cuda

A/B Comparison (raw vs KG-grounded)

cargo run --release --example ab_comparison --package terraphim-demo --features medgemma-client/cuda

IMPORTANT: Run GPU pipelines sequentially

RTX 2070 has 8GB VRAM. Running multiple GGUF inference processes simultaneously causes `Failed to create context: NullReturn` errors. Always run one inference pipeline at a time.

Vertex AI cloud inference (GPU instances currently unavailable)

./scripts/setup_vertex_ai.sh  # one-time
cargo run --release --example e2e_vertex_ai --package terraphim-demo

Demo UI

python3 -m http.server 8091  # then visit http://localhost:8091/demo.html

Next Steps (Prioritized)

P1 - Competition Submission

Submit to MedGemma Impact Challenge -- all deliverables ready at tag v1.1.0
Merge PR #551 (terraphim-ai) -- unblock path dep cleanup

P2 - Post-Competition (additive, no breaking changes)

Error propagation scenarios (#48) - 11 cross-object failure path tests
Vaccine design pipeline (#44) - new state machine, same pattern as existing 4
Evidence retrieval service (#47) - replace stub with real implementation

P3 - Requires Design Decision

Clinical trial matching (#45) - needs external data source decision
Rare disease differential diagnosis (#46) - complex domain logic
Meta-Cortex (#33) - multi-disciplinary coordination architecture

Known Issues

GPU VRAM contention: RTX 2070 (8GB) cannot run two GGUF inference processes simultaneously. Run pipelines sequentially.
UMLS extraction quality: Full UMLS dataset includes single-character terms producing noisy results. Use SNOMED EntityExtractor with curated terms for cleaner output.
PR #551 branch history: medical-extensions branch carries commits from PR #543. Squash-merge recommended.
Path dependencies: Relative path deps (../../terraphim-ai/crates/...) require both repos checked out as siblings. For CI, consider git deps.
CUDA version mismatch: nvcc 13.1 vs driver 13.0 requires SASS-only compilation. .cargo/config.toml has CMAKE_CUDA_ARCHITECTURES=75 and NVCC_FLAGS set for sm_75.
Untracked files: progress.txt and one stale eval report (e4e4bada) are untracked. Safe to delete or .gitignore.
