Project Handover Document - Terraphim Medical Integration

Date: 2026-02-24
Project: MedGemma Competition - Terraphim-AI Crate Integration
Status: SUBMITTED. All 42 beads closed. v1.2.0 tagged, released, and pushed.
Handover To: Development Team / Maintainers


Executive Summary

Completed full migration of medgemma-competition from standalone reimplementations to shared terraphim-ai crates behind medical feature flags. Proved end-to-end with real MedGemma 4B GGUF inference on both GPU (22.7s/case avg, RTX 2070) and CPU (165s/case). All mock fallbacks removed from production paths. Added interactive demo UI, 85s demo video (Playwright recording with real GPU inference), 4 clinical workflow state machines with 60 scenario tests, 18 multi-specialty evaluation cases, Axum API server with shared LLM state, and 9 Playwright e2e tests. Final submission v1.2.0 released on GitHub with demo-video.mp4, WRITEUP.md, COMPETITION_EVIDENCE.md, and .env.template as release artifacts.

Session 6 (2026-02-24): Wired MedGemma GGUF client into API server via Axum shared state (Arc<ClinicalService>), added CUDA GPU support, ran 3rd evaluation (18/18 pass, avg 24.8s), verified demo.html Live mode via WebSocket, created 9 Playwright e2e tests, recorded 85s demo video, refreshed A/B comparison (BRAF case reproducibly shows class-suggestion vs specific-drug improvement), updated all docs, tagged v1.2.0, created GitHub release.

559 tests (up from 543), all passing.

System 1 + System 2 Architecture

MedGemma acts as System 1 (fast intuition) -- generating fluent recommendations from parametric medical knowledge. The Terraphim knowledge graph acts as System 2 (deliberate reasoning) -- grounding, validating, and constraining those recommendations against structured clinical evidence. Neither alone is sufficient.
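
As a rough illustration of the System 2 role, the sketch below checks a model recommendation against a set of known treatment edges and blocks anything that is not grounded. It is a minimal stand-in, not the project's implementation: `TreatmentIndex`, `GateDecision`, and `check` are hypothetical names, and the single edge loaded here is taken from the examples in this document rather than from the real knowledge graph.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of checking one model recommendation against the knowledge graph.
#[derive(Debug, PartialEq)]
enum GateDecision {
    Grounded,
    Blocked,
}

/// Hypothetical stand-in for the KG treatment subgraph:
/// condition -> set of drugs that have a "Treats" edge to it.
struct TreatmentIndex {
    treats: HashMap<String, HashSet<String>>,
}

impl TreatmentIndex {
    /// Builds the index from (drug, condition) pairs, normalising case.
    fn new(edges: &[(&str, &str)]) -> Self {
        let mut treats: HashMap<String, HashSet<String>> = HashMap::new();
        for (drug, condition) in edges {
            treats
                .entry(condition.to_lowercase())
                .or_default()
                .insert(drug.to_lowercase());
        }
        Self { treats }
    }

    /// A recommendation passes only if the graph holds a matching edge.
    fn check(&self, drug: &str, condition: &str) -> GateDecision {
        let grounded = self
            .treats
            .get(&condition.to_lowercase())
            .map_or(false, |drugs| drugs.contains(&drug.to_lowercase()));
        if grounded {
            GateDecision::Grounded
        } else {
            GateDecision::Blocked
        }
    }
}

fn main() {
    let kg = TreatmentIndex::new(&[("Osimertinib", "EGFR L858R+ NSCLC")]);
    // System 1 output is only forwarded when System 2 can ground it.
    println!("{:?}", kg.check("Osimertinib", "EGFR L858R+ NSCLC"));
    println!("{:?}", kg.check("Pembrolizumab", "EGFR L858R+ NSCLC"));
}
```

The gate is deliberately deny-by-default: an unknown condition or an unknown drug both yield `Blocked`, so a gap in the graph cannot silently pass a recommendation through.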

Terraphim Knowledge Graph -- The Key Differentiator

The core innovation is graph-based symbolic embeddings for clinical safety:

  • 27 medical node types and 65 edge types with typed graph traversal
  • Symbolic similarity: Jaccard (0.7) + path distance (0.3) -- deterministic, auditable
  • LeftmostLongest entity extraction: Aho-Corasick automaton always grounds to the most specific SNOMED concept (e.g., "non-small cell lung carcinoma" not "lung carcinoma"), ensuring correct downstream treatment lookups
  • Safety gate: Every MedGemma recommendation validated against KG treatment subgraph. Ungrounded recommendations (e.g., Pembrolizumab for EGFR L858R+ NSCLC) blocked before reaching clinician
  • Traceable evidence paths: Drug->Treats->Disease->HasVariant->Gene->CitedIn->Trial
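
The weighted similarity in the list above can be sketched as follows. Only the 0.7 and 0.3 weights come from this document. The feature sets being compared, the function names, and in particular the mapping from path distance to a similarity value (`1 / (1 + hops)`, with no path scoring 0) are assumptions made for illustration.

```rust
use std::collections::HashSet;

const JACCARD_WEIGHT: f64 = 0.7;
const PATH_WEIGHT: f64 = 0.3;

/// Jaccard index |A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty.
fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Assumption: hop count is mapped to [0, 1] via 1 / (1 + hops);
/// `None` means the nodes are not connected and scores 0.
fn path_similarity(hops: Option<u32>) -> f64 {
    hops.map_or(0.0, |d| 1.0 / (1.0 + f64::from(d)))
}

/// Deterministic weighted combination of the two components.
fn symbolic_similarity(a: &HashSet<&str>, b: &HashSet<&str>, hops: Option<u32>) -> f64 {
    JACCARD_WEIGHT * jaccard(a, b) + PATH_WEIGHT * path_similarity(hops)
}

fn main() {
    // Hypothetical feature sets for two graph nodes.
    let a: HashSet<&str> = ["x", "y", "z"].into_iter().collect();
    let b: HashSet<&str> = ["y", "z", "w"].into_iter().collect();
    println!("similarity = {:.3}", symbolic_similarity(&a, &b, Some(1)));
}
```

Because both components are pure functions of the graph, the same pair of nodes always yields the same score, which is what makes the measure auditable.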

Two repositories involved:

  • terraphim/terraphim-ai - upstream crate library (PR #551, branch medical-extensions)
  • terraphim/medgemma-competition - consumer project (commits on main, tag v1.1.0)

Progress Summary (Session: 2026-02-23, Evening)

Tasks Completed This Session

  1. LeftmostLongest fix (commit e9ab233)

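    • Found bug: EntityExtractor::new() and from_terms() built the automaton with AhoCorasick::new(), which does not apply leftmost-longest matching (in the aho-corasick crate the default match kind is Standard)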
    • Fixed all 3 call sites in extractor.rs and umls_extractor.rs to use LeftmostLongest
    • Added 2 tests: test_leftmost_longest_prefers_full_concept_over_fragment, test_leftmost_longest_from_terms
    • Test count: 541 -> 543
  2. GPU validation (reports 46d9cca9, 17a45b91)

    • All 3 pipelines run sequentially on RTX 2070 (35/35 layers CUDA0)
    • e2e_pipeline: 47 pass, 2 fail (safety gate correct), 55s total
    • e2e_real_model: 18/18 pass, avg 22.7s/case, all 15 checks PASSED
    • ab_comparison: 3/3 cases, KG grounding specificity confirmed
  3. Submission packaging (tag v1.1.0)

    • MIT LICENSE file added
    • README.md rewritten with current stats
    • All submission docs updated with LeftmostLongest explanation
    • System 1 + System 2 framing added to WRITEUP.md and COMPETITION_EVIDENCE.md
  4. Beads cleanup: Closed 6 stale issues, all 42/42 now closed

What's Working

  • 543 workspace tests passing, 0 failures
  • All code committed and pushed to origin/main
  • Working tree clean (only untracked: progress.txt, one stale eval report)
  • GPU inference: 22.7s/case avg (RTX 2070, 35/35 layers)
  • CPU inference: 165s/case avg (no GPU required)
  • Safety gate: 100% across all runs (54 total inference calls)

What's Blocked

  • Nothing currently blocked

Current State

Branch: main (tag: v1.1.0)

1a40ab0 docs: add GPU validation report 17a45b91 (full sequential run)
c5de069 docs: add System 1 + System 2 dual-process architecture framing
bb25154 docs: highlight LeftmostLongest grounding across all submission docs
aa28269 chore: sync beads - all 42 issues closed
e9ab233 fix: enforce LeftmostLongest match in EntityExtractor for grounding precision
6b10d08 docs: add GPU inference results (RTX 2070, report 79d26e2e)
f4f3582 fix: update HTML demos with real evaluation data (541 tests, 18/18 cases, GPU timing)
59d2c98 docs: add real A/B comparison results, remove criteria table
0b0a849 feat: add A/B comparison example, remove fabricated precision benchmark

Test Suite: 543 tests

  • State machines: 60 tests (case_status: 9, genomic_report: 14, treatment_plan: 19, recommendation: 22)
  • LeftmostLongest: 2 tests (grounding precision validation)
  • Clinical pipeline: ~200 tests
  • Agent messaging/supervision: ~100 tests
  • PGx, evaluation, other: ~181 tests

Beads: 42/42 closed

Total Issues: 42, Open: 0, In Progress: 0, Blocked: 0, Closed: 42

GPU Validation Results (report 17a45b91, 2026-02-23)

| Pipeline | Cases | Avg Latency | Result |
| --- | --- | --- | --- |
| e2e_pipeline | 49 checks | 55s total | 47 pass, 2 fail (safety gate correct) |
| e2e_real_model | 18/18 | 22.7s/case | All 15 checks PASSED |
| ab_comparison | 3 cases | ~22s each | KG grounding specificity confirmed |
| cargo test | 543/543 | N/A | 0 failures |

A/B Comparison Results (GPU)

| Case | Raw MedGemma (System 1 only) | KG-Grounded (System 1 + 2) |
| --- | --- | --- |
| EGFR NSCLC | Osimertinib 80mg | Osimertinib 80mg (+ SNOMED grounding) |
| CYP2D6 Codeine | Oxycodone 5mg/mL | Codeine 60mg (PGx-aware) |
| BRAF Melanoma | "BRAF inhibitor (e.g., Dabrafenib + Trametinib)" | Vemurafenib 450mg (specific) |

Competition Deliverables (all complete)

| Deliverable | File | Status |
| --- | --- | --- |
| Technical writeup | WRITEUP.md | Done (with System 1+2 and LeftmostLongest sections) |
| Demo UI | demo.html | Done (self-contained, 1,813 lines) |
| Demo video | docs/demo-video/demo-pipeline-3min.mp4 | Done (15 MB, git-lfs) |
| Evaluation results | tests/evaluation/output/ | Done (7 reports: CPU, GPU, sequential) |
| README | README.md | Done (543 tests, GPU/CPU timing) |
| LICENSE | LICENSE | Done (MIT) |
| Competition evidence | COMPETITION_EVIDENCE.md | Done (real inference, no mock) |

What Was Done (Full History)

Phase 1: terraphim_types (medical feature)

  • Added MedicalNodeType (27 variants) and MedicalEdgeType (65 variants) behind #[cfg(feature = "medical")]
  • Commit: e1147d62

Phase 2: terraphim_rolegraph (MedicalRoleGraph + symbolic embeddings)

  • MedicalRoleGraph wrapping RoleGraph with typed nodes/edges, IS-A hierarchy
  • Symbolic embeddings (Jaccard 0.7 + path distance 0.3), adjacency index
  • 78 tests
  • Commits: a4b8dbe8, 89ef5a62, 127bd7ee, 494a6961
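
The adjacency index idea can be sketched roughly as below: edges live in one flat list, and a per-node map records where each node's outgoing edges sit, so a lookup touches only that node's edges instead of scanning the whole list. The `Graph`, `Edge`, and `EdgeKind` types are hypothetical simplifications and do not mirror the actual MedicalRoleGraph layout. The three edge kinds are borrowed from the evidence-path example in this document.

```rust
use std::collections::HashMap;

/// Tiny illustrative subset of edge types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EdgeKind {
    Treats,
    HasVariant,
    CitedIn,
}

#[derive(Debug, Clone, PartialEq)]
struct Edge {
    from: u64,
    to: u64,
    kind: EdgeKind,
}

#[derive(Default)]
struct Graph {
    edges: Vec<Edge>,
    /// Adjacency index: node id -> positions in `edges` of its outgoing edges.
    outgoing: HashMap<u64, Vec<usize>>,
}

impl Graph {
    /// Appends the edge and records its position under the source node.
    fn add_edge(&mut self, from: u64, to: u64, kind: EdgeKind) {
        self.outgoing.entry(from).or_default().push(self.edges.len());
        self.edges.push(Edge { from, to, kind });
    }

    /// Visits only the edges leaving `node`: O(degree), not O(total edges).
    fn edges_from(&self, node: u64) -> impl Iterator<Item = &Edge> + '_ {
        self.outgoing
            .get(&node)
            .into_iter()
            .flatten()
            .map(move |&i| &self.edges[i])
    }
}

fn main() {
    let mut g = Graph::default();
    g.add_edge(1, 2, EdgeKind::Treats);
    g.add_edge(2, 3, EdgeKind::HasVariant);
    g.add_edge(1, 4, EdgeKind::CitedIn);
    for e in g.edges_from(1) {
        println!("{} -{:?}-> {}", e.from, e.kind, e.to);
    }
}
```

The trade-off is one extra `usize` per edge plus the map overhead, in exchange for lookups whose cost no longer grows with the size of the graph.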

Phase 3: terraphim_automata (medical entity extraction)

  • SNOMED EntityExtractor, UMLS UmlsExtractor, ShardedUmlsExtractor
  • daachorse sharded automaton with bincode+zstd serialization
  • LeftmostLongest match enforcement (fixed in e9ab233)
  • 43+ tests (including 2 LeftmostLongest validation tests)
  • Commits: cee1f788, fd70034f, 9f2958fb, e9ab233
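
To show why the match policy matters for grounding, the sketch below contrasts two ways of choosing among terms that start at the same position: taking the first-listed term versus taking the longest one. This is a naive std-only scan written for illustration. It is not an Aho-Corasick automaton and not the extractor's code, and `find_terms` and `Policy` are hypothetical names. Input is assumed to be lowercased already.

```rust
/// How to choose when several terms match at the same start position.
#[derive(Clone, Copy)]
enum Policy {
    /// The term listed first wins, regardless of length.
    FirstListed,
    /// The longest term wins (leftmost-longest semantics).
    Longest,
}

/// Scans left to right, emitting non-overlapping matches.
/// Quadratic in the worst case; fine for a demonstration only.
fn find_terms<'a>(text: &str, terms: &[&'a str], policy: Policy) -> Vec<&'a str> {
    let mut found = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let rest = &text[pos..];
        // All terms that match exactly at the current position.
        let mut candidates = terms
            .iter()
            .copied()
            .filter(|t| !t.is_empty() && rest.starts_with(*t));
        let chosen = match policy {
            Policy::FirstListed => candidates.next(),
            Policy::Longest => candidates.max_by_key(|t| t.len()),
        };
        match chosen {
            Some(term) => {
                found.push(term);
                pos += term.len(); // resume after the match
            }
            // No match here: advance by one character (UTF-8 safe).
            None => pos += rest.chars().next().map_or(1, char::len_utf8),
        }
    }
    found
}

fn main() {
    let terms = ["lung", "lung carcinoma", "non-small cell lung carcinoma"];
    let text = "patient has lung carcinoma";
    println!("{:?}", find_terms(text, &terms, Policy::FirstListed));
    println!("{:?}", find_terms(text, &terms, Policy::Longest));
}
```

With the first-listed policy the fragment "lung" is reported and the more specific concept is lost. With the longest policy the full "lung carcinoma" is reported, which is the behaviour the grounding step depends on.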

Phase 4: Agent infrastructure consolidation

  • Replaced copied mailbox/router/supervisor with path deps
  • Commit: dfcd2d7

Phase 5: Consumer migration

  • Migrated to MedicalRoleGraph, retired terraphim-kg (5,784 lines deleted)
  • Commits: 93aa6a1, b4a8d8b, 0e20cff

E2E + Real Model + Vertex AI

  • 49-check pipeline example, 18/18 real model eval, Vertex AI backend
  • Commits: 2e317ff, a5828f6, d478f04

Submission Artifacts

  • Writeup, README, repo cleanup, evaluation scenarios
  • Commits: d56c5ec, 2020cf6, 8083c09, de19398

Demo UI + State Machines + Video

  • demo.html, 4 state machines (60 tests), 3-min video recording
  • Commits: 266dc42, 56713c9, 71693a3

GPU Validation + LeftmostLongest + Submission Packaging

  • LeftmostLongest fix, GPU pipeline validation, System 1+2 framing
  • MIT LICENSE, README rewrite, all docs updated
  • Commits: e9ab233, 893d8bf, bb25154, c5de069, 1a40ab0
  • Tag: v1.1.0

P0/P1 Fixes Applied

| ID | Severity | Fix | Commit |
| --- | --- | --- | --- |
| P0-1 | Critical | magic_unpair f32->f64 for SNOMED IDs (100M-900M) | 9f2958fb |
| P0-2 | Critical | Overlap detection: start < m.span.1 && end > m.span.0 | 9f2958fb |
| P0-3 | Critical | LeftmostLongest enforcement in EntityExtractor (grounding precision) | e9ab233 |
| P1-1 | Important | Adjacency index for O(degree) edge lookups | 127bd7ee |
| P1-2 | Important | Multi-CUI term preservation in ShardedUmlsExtractor | fd70034f |
| P1-3 | Important | SNOMED FSN semantic tag parsing for node types | 494a6961 |
| P1-4 | Important | SASS-only CUDA compilation for RTX 2070 (sm_75) | 4d491fe |
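
The reasoning behind P0-1 and P0-2 can be illustrated with a short sketch. For P0-1, an f32 has a 24-bit significand, so integers above 2^24 (16,777,216) are not all representable and IDs in the 100M-900M range can change value when routed through f32. An f64 represents every integer up to 2^53 exactly. For P0-2, the quoted condition is the standard test for two ranges sharing at least one position. The helper names are hypothetical, the sample ID is arbitrary rather than a real SNOMED concept, and spans are assumed to be half-open `(start, end)` pairs. How `magic_unpair` uses floating point internally is not shown here.

```rust
/// Round-trips an integer ID through f32; lossy for many values above 2^24.
fn roundtrip_f32(id: u64) -> u64 {
    id as f32 as u64
}

/// Round-trips an integer ID through f64; exact for all values up to 2^53.
fn roundtrip_f64(id: u64) -> u64 {
    id as f64 as u64
}

/// Overlap test for half-open spans (start, end), matching the P0-2 condition:
/// a starts before b ends, and a ends after b starts.
fn spans_overlap(a: (usize, usize), b: (usize, usize)) -> bool {
    a.0 < b.1 && a.1 > b.0
}

fn main() {
    let id: u64 = 123_456_789; // arbitrary value in the 100M-900M range
    println!("via f32: {}", roundtrip_f32(id));
    println!("via f64: {}", roundtrip_f64(id));

    println!("overlap (0,5) vs (3,8): {}", spans_overlap((0, 5), (3, 8)));
    println!("overlap (0,5) vs (5,8): {}", spans_overlap((0, 5), (5, 8)));
}
```

Under the half-open assumption, spans that merely touch, such as (0,5) and (5,8), are not treated as overlapping, so adjacent entities are both kept.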

Data Files

| File | Size | Purpose |
| --- | --- | --- |
| data/artifacts/umls_automata.bin.zst | 209MB | Pre-built UMLS Aho-Corasick automaton |
| data/artifacts/cpic_database.bin.zst | 16KB | CPIC PGx rules |
| data/snomed_thesaurus.json | 10KB | Curated SNOMED mappings (49 terms) |
| data/automata/words_cui.tsv | 789MB | Raw UMLS term-CUI mappings |

Running Tests & Demos

Full workspace

cargo test --workspace  # 543 tests, ~50s

State machines only

cargo test -p terraphim-medical-agents -- state_machines  # 60 tests, <1s

Pipeline verification (real GGUF model, GPU)

# GPU (recommended, ~1min total)
cargo run --release --example e2e_pipeline --package terraphim-demo --features medgemma-client/cuda

# CPU fallback (~5min total)
cargo run --release --example e2e_pipeline --package terraphim-demo

18-case evaluation (GPU)

cargo run --release --example e2e_real_model --package terraphim-demo --features medgemma-client/cuda

A/B Comparison (raw vs KG-grounded)

cargo run --release --example ab_comparison --package terraphim-demo --features medgemma-client/cuda

IMPORTANT: Run GPU pipelines sequentially

The RTX 2070 has 8GB VRAM. Running multiple GGUF inference processes simultaneously causes "Failed to create context: NullReturn" errors. Always run one inference pipeline at a time.

Vertex AI cloud inference (GPU instances currently unavailable)

./scripts/setup_vertex_ai.sh  # one-time
cargo run --release --example e2e_vertex_ai --package terraphim-demo

Demo UI

python3 -m http.server 8091  # then visit http://localhost:8091/demo.html

Next Steps (Prioritized)

P1 - Competition Submission

  1. Submit to MedGemma Impact Challenge -- all deliverables ready at tag v1.1.0
  2. Merge PR #551 (terraphim-ai) -- unblock path dep cleanup

P2 - Post-Competition (additive, no breaking changes)

  1. Error propagation scenarios (#48) - 11 cross-object failure path tests
  2. Vaccine design pipeline (#44) - new state machine, same pattern as existing 4
  3. Evidence retrieval service (#47) - replace stub with real implementation

P3 - Requires Design Decision

  1. Clinical trial matching (#45) - needs external data source decision
  2. Rare disease differential diagnosis (#46) - complex domain logic
  3. Meta-Cortex (#33) - multi-disciplinary coordination architecture

Known Issues

  1. GPU VRAM contention: RTX 2070 (8GB) cannot run two GGUF inference processes simultaneously. Run pipelines sequentially.

  2. UMLS extraction quality: Full UMLS dataset includes single-character terms producing noisy results. Use SNOMED EntityExtractor with curated terms for cleaner output.

  3. PR #551 branch history: medical-extensions branch carries commits from PR #543. Squash-merge recommended.

  4. Path dependencies: Relative path deps (../../terraphim-ai/crates/...) require both repos checked out as siblings. For CI, consider git deps.

  5. CUDA version mismatch: nvcc 13.1 vs driver 13.0 requires SASS-only compilation. .cargo/config.toml has CMAKE_CUDA_ARCHITECTURES=75 and NVCC_FLAGS set for sm_75.

  6. Untracked files: progress.txt and one stale eval report (e4e4bada) are untracked. Safe to delete or .gitignore.