Add ClinicalJargonDataset and ClinicalJargonVerification benchmark task #941
John-Carson wants to merge 1 commit into sunlabuiuc:master from
Conversation
Pull request overview
Adds a new public clinical jargon benchmark dataset and an associated binary verification task, plus supporting docs, example usage, and unit tests.
Changes:
- Introduces `ClinicalJargonDataset` with normalized MedLingo + CASI metadata and a YAML dataset config.
- Adds a `ClinicalJargonVerification` task that generates paired-text binary samples over candidate expansions.
- Adds a runnable Transformers example, Sphinx API docs, synthetic test resources, and unit tests.
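As a rough illustration of the paired-text binary sampling described above, the sketch below shows one way a benchmark item could be expanded into verification samples. This is hypothetical — `make_verification_samples` and its field names are invented for illustration and are not the PR's actual implementation:

```python
from typing import Dict, List


def make_verification_samples(
    jargon: str, context: str, gold: str, distractors: List[str]
) -> List[Dict]:
    """Hypothetical sketch: one benchmark item becomes paired-text binary
    samples, with the gold expansion labeled 1 and each distractor 0."""
    samples = []
    for candidate, label in [(gold, 1)] + [(d, 0) for d in distractors]:
        samples.append(
            {
                "text_a": f"{jargon} (context: {context})",  # jargon in context
                "text_b": candidate,  # candidate expansion to verify
                "label": label,
            }
        )
    return samples
```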
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pyhealth/datasets/clinical_jargon.py` | Implements dataset normalization and (currently automatic) remote asset fetching. |
| `pyhealth/datasets/configs/clinical_jargon.yaml` | Declares the examples table schema for the dataset. |
| `pyhealth/datasets/__init__.py` | Exposes `ClinicalJargonDataset` at package import level. |
| `pyhealth/tasks/clinical_jargon_verification.py` | Implements the candidate-verification task and sample generation. |
| `pyhealth/tasks/__init__.py` | Exposes `ClinicalJargonVerification` at package import level. |
| `examples/clinical_jargon_clinical_jargon_verification_transformers.py` | Demonstrates training/evaluating a Transformers model on the task. |
| `tests/core/test_clinical_jargon.py` | Adds unit tests covering dataset/task loading and sample structure. |
| `test-resources/clinical_jargon/clinical_jargon_examples.csv` | Adds synthetic/demo benchmark rows used by tests/examples. |
| `docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst` | Adds Sphinx API stub for the dataset. |
| `docs/api/datasets.rst` | Adds dataset entry to the datasets API index. |
| `docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst` | Adds Sphinx API stub for the task. |
| `docs/api/tasks.rst` | Adds task entry to the tasks API index. |
```python
root_path = Path(root)
root_path.mkdir(parents=True, exist_ok=True)
if config_path is None:
    config_path = Path(__file__).parent / "configs" / "clinical_jargon.yaml"
normalized_csv = root_path / "clinical_jargon_examples.csv"
if not normalized_csv.exists():
    self.prepare_metadata(root_path)
super().__init__(
    root=str(root_path),
    tables=["examples"],
    dataset_name=dataset_name or "clinical_jargon",
    config_path=str(config_path),
    **kwargs,
)
```
ClinicalJargonDataset.__init__ automatically calls prepare_metadata() (which performs network downloads) whenever clinical_jargon_examples.csv is missing. This makes dataset construction unexpectedly fail/hang in offline or restricted environments. Consider adding an explicit flag (e.g., download / prepare_metadata) to control this behavior; when disabled, raise a clear FileNotFoundError explaining how to obtain or generate the CSV.
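One possible shape for that guard, as a standalone sketch. The helper name `ensure_normalized_csv` and the `download` flag are hypothetical; in the PR this logic would live in the constructor, with `prepare_metadata` called where the comment indicates:

```python
from pathlib import Path


def ensure_normalized_csv(root: str, download: bool = False) -> Path:
    """Hypothetical helper: locate the normalized CSV, only fetching
    remote assets when download=True; otherwise fail loudly."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    normalized_csv = root_path / "clinical_jargon_examples.csv"
    if normalized_csv.exists():
        return normalized_csv
    if not download:
        raise FileNotFoundError(
            f"{normalized_csv} not found. Pass download=True to fetch the "
            "MedLingo/CASI assets, or place the normalized CSV there manually."
        )
    # In the real class, self.prepare_metadata(root_path) would run here.
    return normalized_csv
```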
```python
def _download_text(url: str, destination: Path) -> str:
    if destination.exists():
        return destination.read_text()
    payload = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    destination.write_text(payload)
    return payload
```
_download_text uses urllib.request.urlopen(url) without a timeout / context manager and reads/writes text without specifying an encoding. This can hang indefinitely on network issues and can mis-decode/encode on non-UTF-8 locales. Use a request with an explicit timeout (and close the response via a context manager), and pass encoding="utf-8" (plus errors=) to read_text/write_text.
```python
for entry in entries:
    file_name = entry["name"]
    file_text = cls._download_text(entry["download_url"], cache_dir / file_name)
    for row in csv.DictReader(file_text.splitlines()):
```
In _fetch_casi_rows, file_name from the GitHub API is used directly in cache_dir / file_name. If an unexpected value contains path separators (e.g., ../), this would write outside the cache directory. Please sanitize/validate file_name (e.g., enforce Path(file_name).name == file_name and reject absolute/parent paths) before using it as a local filename.
```python
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
```
This example mutates sys.path at runtime to import pyhealth. Other repo examples import pyhealth directly (e.g., examples/omop_dataset_demo.py) and rely on running from an installed/editable environment. Consider removing the sys.path manipulation and instead documenting the expected invocation (e.g., pip install -e . or running from repo root) to avoid masking import/path issues.
PyHealth PR Description
Summary
- `ClinicalJargonDataset`, backed by public MedLingo and CASI benchmark assets.
- `ClinicalJargonVerification`, a binary candidate-verification task for public clinical jargon evaluation.

Contributors
Contribution Type
Original Paper
Implementation Overview
- `ClinicalJargonDataset` downloads and normalizes the public MedLingo and CASI assets into a PyHealth dataset.
- `ClinicalJargonVerification` converts each benchmark item into paired-text binary classification samples over candidate expansions.
- Configurable via `benchmark`, `casi_variant`, and `medlingo_distractors`.

Files To Review
- `pyhealth/datasets/clinical_jargon.py`
- `pyhealth/datasets/configs/clinical_jargon.yaml`
- `pyhealth/tasks/clinical_jargon_verification.py`
- `examples/clinical_jargon_clinical_jargon_verification_transformers.py`
- `tests/core/test_clinical_jargon.py`
- `docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst`
- `docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst`

Validation
- `python3 -m unittest discover -s 598-DLH/clinical_jargon_project/tests -p 'test_*.py'`
- `PYTHONPATH=598-DLH/PyHealth python3 -m unittest 598-DLH/PyHealth/tests/core/test_clinical_jargon.py`
- `python3 598-DLH/PyHealth/examples/clinical_jargon_clinical_jargon_verification_transformers.py --model-name hf-internal-testing/tiny-random-bert --benchmark medlingo --medlingo-distractors 1 --epochs 1 --batch-size 2`