Add ClinicalJargonDataset and ClinicalJargonVerification benchmark task #941
John-Carson wants to merge 1 commit into sunlabuiuc:master from
Conversation
Pull request overview
Adds a new public clinical jargon benchmark dataset and an associated binary verification task, plus supporting docs, example usage, and unit tests.
Changes:
- Introduces `ClinicalJargonDataset` with normalized MedLingo + CASI metadata and a YAML dataset config.
- Adds a `ClinicalJargonVerification` task that generates paired-text binary samples over candidate expansions.
- Adds a runnable Transformers example, Sphinx API docs, synthetic test resources, and unit tests.
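As a rough illustration of the paired-text binary sampling described above, the sketch below shows one way a benchmark item could be expanded into verification samples. This is hypothetical — `make_verification_samples` and its field names are invented for illustration and are not the PR's actual implementation:

```python
from typing import Dict, List


def make_verification_samples(
    jargon: str, context: str, gold: str, distractors: List[str]
) -> List[Dict]:
    """Hypothetical sketch: one benchmark item becomes paired-text binary
    samples, with the gold expansion labeled 1 and each distractor 0."""
    samples = []
    for candidate, label in [(gold, 1)] + [(d, 0) for d in distractors]:
        samples.append(
            {
                "text_a": f"{jargon} (context: {context})",  # jargon in context
                "text_b": candidate,  # candidate expansion to verify
                "label": label,
            }
        )
    return samples
```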
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pyhealth/datasets/clinical_jargon.py` | Implements dataset normalization and (currently automatic) remote asset fetching. |
| `pyhealth/datasets/configs/clinical_jargon.yaml` | Declares the examples table schema for the dataset. |
| `pyhealth/datasets/__init__.py` | Exposes `ClinicalJargonDataset` at package import level. |
| `pyhealth/tasks/clinical_jargon_verification.py` | Implements the candidate-verification task and sample generation. |
| `pyhealth/tasks/__init__.py` | Exposes `ClinicalJargonVerification` at package import level. |
| `examples/clinical_jargon_clinical_jargon_verification_transformers.py` | Demonstrates training/evaluating a Transformers model on the task. |
| `tests/core/test_clinical_jargon.py` | Adds unit tests covering dataset/task loading and sample structure. |
| `test-resources/clinical_jargon/clinical_jargon_examples.csv` | Adds synthetic/demo benchmark rows used by tests/examples. |
| `docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst` | Adds Sphinx API stub for the dataset. |
| `docs/api/datasets.rst` | Adds dataset entry to the datasets API index. |
| `docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst` | Adds Sphinx API stub for the task. |
| `docs/api/tasks.rst` | Adds task entry to the tasks API index. |
```python
root_path = Path(root)
root_path.mkdir(parents=True, exist_ok=True)
if config_path is None:
    config_path = Path(__file__).parent / "configs" / "clinical_jargon.yaml"
normalized_csv = root_path / "clinical_jargon_examples.csv"
if not normalized_csv.exists():
    self.prepare_metadata(root_path)
super().__init__(
    root=str(root_path),
    tables=["examples"],
    dataset_name=dataset_name or "clinical_jargon",
    config_path=str(config_path),
    **kwargs,
)
```
ClinicalJargonDataset.__init__ automatically calls prepare_metadata() (which performs network downloads) whenever clinical_jargon_examples.csv is missing. This makes dataset construction unexpectedly fail/hang in offline or restricted environments. Consider adding an explicit flag (e.g., download / prepare_metadata) to control this behavior; when disabled, raise a clear FileNotFoundError explaining how to obtain or generate the CSV.
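One possible shape for that guard, as a standalone sketch. The helper name `ensure_normalized_csv` and the `download` flag are hypothetical; in the PR this logic would live in the constructor, with `prepare_metadata` called where the comment indicates:

```python
from pathlib import Path


def ensure_normalized_csv(root: str, download: bool = False) -> Path:
    """Hypothetical helper: locate the normalized CSV, only fetching
    remote assets when download=True; otherwise fail loudly."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    normalized_csv = root_path / "clinical_jargon_examples.csv"
    if normalized_csv.exists():
        return normalized_csv
    if not download:
        raise FileNotFoundError(
            f"{normalized_csv} not found. Pass download=True to fetch the "
            "MedLingo/CASI assets, or place the normalized CSV there manually."
        )
    # In the real class, self.prepare_metadata(root_path) would run here.
    return normalized_csv
```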
```python
def _download_text(url: str, destination: Path) -> str:
    if destination.exists():
        return destination.read_text()
    payload = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    destination.write_text(payload)
    return payload
```
_download_text uses urllib.request.urlopen(url) without a timeout / context manager and reads/writes text without specifying an encoding. This can hang indefinitely on network issues and can mis-decode/encode on non-UTF-8 locales. Use a request with an explicit timeout (and close the response via a context manager), and pass encoding="utf-8" (plus errors=) to read_text/write_text.
```python
for entry in entries:
    file_name = entry["name"]
    file_text = cls._download_text(entry["download_url"], cache_dir / file_name)
    for row in csv.DictReader(file_text.splitlines()):
```
In _fetch_casi_rows, file_name from the GitHub API is used directly in cache_dir / file_name. If an unexpected value contains path separators (e.g., ../), this would write outside the cache directory. Please sanitize/validate file_name (e.g., enforce Path(file_name).name == file_name and reject absolute/parent paths) before using it as a local filename.
```python
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
```
This example mutates sys.path at runtime to import pyhealth. Other repo examples import pyhealth directly (e.g., examples/omop_dataset_demo.py) and rely on running from an installed/editable environment. Consider removing the sys.path manipulation and instead documenting the expected invocation (e.g., pip install -e . or running from repo root) to avoid masking import/path issues.
PyHealth PR Description
Summary
- `ClinicalJargonDataset`, backed by public MedLingo and CASI benchmark assets.
- `ClinicalJargonVerification`, a binary candidate-verification task for public clinical jargon evaluation.

Contributors
Contribution Type
Original Paper
Implementation Overview
- `ClinicalJargonDataset` downloads and normalizes the public MedLingo and CASI assets into a PyHealth dataset.
- `ClinicalJargonVerification` converts each benchmark item into paired-text binary classification samples over candidate expansions.
- Configurable via `benchmark`, `casi_variant`, and `medlingo_distractors`.

Files To Review
- `pyhealth/datasets/clinical_jargon.py`
- `pyhealth/datasets/configs/clinical_jargon.yaml`
- `pyhealth/tasks/clinical_jargon_verification.py`
- `examples/clinical_jargon_clinical_jargon_verification_transformers.py`
- `tests/core/test_clinical_jargon.py`
- `docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst`
- `docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst`

Validation
- `python3 -m unittest discover -s 598-DLH/clinical_jargon_project/tests -p 'test_*.py'`
- `PYTHONPATH=598-DLH/PyHealth python3 -m unittest 598-DLH/PyHealth/tests/core/test_clinical_jargon.py`
- `python3 598-DLH/PyHealth/examples/clinical_jargon_clinical_jargon_verification_transformers.py --model-name hf-internal-testing/tiny-random-bert --benchmark medlingo --medlingo-distractors 1 --epochs 1 --batch-size 2`