
Add ClinicalJargonDataset and ClinicalJargonVerification benchmark task#941

Draft
John-Carson wants to merge 1 commit into sunlabuiuc:master from John-Carson:cs598-clinical-jargon

Conversation

@John-Carson

PyHealth PR Description

Summary

  • Adds ClinicalJargonDataset backed by public MedLingo and CASI benchmark assets.
  • Adds ClinicalJargonVerification, a binary candidate-verification task for public clinical jargon evaluation.
  • Adds docs, example usage, synthetic test resources, and unit tests.

Contributors

Contribution Type

  • Dataset + Task

Original Paper

Implementation Overview

  • ClinicalJargonDataset downloads and normalizes the public MedLingo and CASI assets into a PyHealth dataset.
  • ClinicalJargonVerification converts each benchmark item into paired-text binary classification samples over candidate expansions.
  • The example script demonstrates task configuration ablations via the benchmark, casi_variant, and medlingo_distractors options.
  • The tests use only synthetic/demo resources and validate dataset loading, patient parsing, task generation, and sample structure.
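To make the conversion concrete, here is a rough sketch of turning one benchmark item into paired-text binary classification samples over candidate expansions, as described above. The function name and field names (context, abbreviation, gold expansion, distractors) are illustrative assumptions, not the PR's actual API.

```python
def make_verification_samples(context, abbreviation, gold, distractors):
    """Pair the source context with each candidate expansion.

    Illustrative sketch: the gold expansion yields a positive sample,
    each distractor a negative one. Field names are assumptions.
    """
    samples = []
    for candidate in [gold] + list(distractors):
        samples.append({
            "text_a": f"{context} [{abbreviation}]",
            "text_b": candidate,
            "label": int(candidate == gold),
        })
    return samples


samples = make_verification_samples(
    context="Pt c/o SOB on exertion",
    abbreviation="SOB",
    gold="shortness of breath",
    distractors=["side of bed"],
)
```

With one distractor, each item expands into two samples: one positive pair and one negative pair, matching the paired-text binary setup the task describes.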

Files To Review

  • pyhealth/datasets/clinical_jargon.py
  • pyhealth/datasets/configs/clinical_jargon.yaml
  • pyhealth/tasks/clinical_jargon_verification.py
  • examples/clinical_jargon_clinical_jargon_verification_transformers.py
  • tests/core/test_clinical_jargon.py
  • docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst
  • docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst

Validation

  • python3 -m unittest discover -s 598-DLH/clinical_jargon_project/tests -p 'test_*.py'
  • PYTHONPATH=598-DLH/PyHealth python3 -m unittest 598-DLH/PyHealth/tests/core/test_clinical_jargon.py
  • python3 598-DLH/PyHealth/examples/clinical_jargon_clinical_jargon_verification_transformers.py --model-name hf-internal-testing/tiny-random-bert --benchmark medlingo --medlingo-distractors 1 --epochs 1 --batch-size 2

Copilot AI review requested due to automatic review settings April 4, 2026 19:34
Contributor

Copilot AI left a comment


Pull request overview

Adds a new public clinical jargon benchmark dataset and an associated binary verification task, plus supporting docs, example usage, and unit tests.

Changes:

  • Introduces ClinicalJargonDataset with normalized MedLingo + CASI metadata and a YAML dataset config.
  • Adds ClinicalJargonVerification task that generates paired-text binary samples over candidate expansions.
  • Adds a runnable Transformers example, Sphinx API docs, synthetic test resources, and unit tests.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

pyhealth/datasets/clinical_jargon.py Implements dataset normalization and (currently automatic) remote asset fetching.
pyhealth/datasets/configs/clinical_jargon.yaml Declares the examples table schema for the dataset.
pyhealth/datasets/__init__.py Exposes ClinicalJargonDataset at package import level.
pyhealth/tasks/clinical_jargon_verification.py Implements the candidate-verification task and sample generation.
pyhealth/tasks/__init__.py Exposes ClinicalJargonVerification at package import level.
examples/clinical_jargon_clinical_jargon_verification_transformers.py Demonstrates training/evaluating a Transformers model on the task.
tests/core/test_clinical_jargon.py Adds unit tests covering dataset/task loading and sample structure.
test-resources/clinical_jargon/clinical_jargon_examples.csv Adds synthetic/demo benchmark rows used by tests/examples.
docs/api/datasets/pyhealth.datasets.ClinicalJargonDataset.rst Adds Sphinx API stub for the dataset.
docs/api/datasets.rst Adds dataset entry to the datasets API index.
docs/api/tasks/pyhealth.tasks.ClinicalJargonVerification.rst Adds Sphinx API stub for the task.
docs/api/tasks.rst Adds task entry to the tasks API index.


Comment on lines +122 to +135
        root_path = Path(root)
        root_path.mkdir(parents=True, exist_ok=True)
        if config_path is None:
            config_path = Path(__file__).parent / "configs" / "clinical_jargon.yaml"
        normalized_csv = root_path / "clinical_jargon_examples.csv"
        if not normalized_csv.exists():
            self.prepare_metadata(root_path)
        super().__init__(
            root=str(root_path),
            tables=["examples"],
            dataset_name=dataset_name or "clinical_jargon",
            config_path=str(config_path),
            **kwargs,
        )

Copilot AI Apr 4, 2026


ClinicalJargonDataset.__init__ automatically calls prepare_metadata() (which performs network downloads) whenever clinical_jargon_examples.csv is missing. This makes dataset construction unexpectedly fail/hang in offline or restricted environments. Consider adding an explicit flag (e.g., download / prepare_metadata) to control this behavior; when disabled, raise a clear FileNotFoundError explaining how to obtain or generate the CSV.
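One possible shape for such a flag, sketched below. The `download` parameter name, the error message, and the `prepare_metadata` stub are assumptions for illustration, not the PR's actual code.

```python
from pathlib import Path


def prepare_metadata(root: Path) -> None:
    """Stand-in for the dataset's real download/normalization step."""
    (root / "clinical_jargon_examples.csv").write_text("", encoding="utf-8")


def ensure_examples_csv(root: Path, download: bool = False) -> Path:
    """Return the normalized CSV path, optionally preparing it.

    With download=False (a safe default for offline or restricted
    environments), a missing CSV raises a clear error instead of
    silently triggering a network fetch.
    """
    csv_path = root / "clinical_jargon_examples.csv"
    if csv_path.exists():
        return csv_path
    if not download:
        raise FileNotFoundError(
            f"{csv_path} not found; pass download=True or run "
            "prepare_metadata(root) manually to generate it."
        )
    prepare_metadata(root)  # assumed to create the CSV
    return csv_path
```

The point of the default is that constructing the dataset never touches the network unless the caller opts in.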

Comment on lines +138 to +143
    def _download_text(url: str, destination: Path) -> str:
        if destination.exists():
            return destination.read_text()
        payload = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        destination.write_text(payload)
        return payload

Copilot AI Apr 4, 2026


_download_text uses urllib.request.urlopen(url) without a timeout / context manager and reads/writes text without specifying an encoding. This can hang indefinitely on network issues and can mis-decode/encode on non-UTF-8 locales. Use a request with an explicit timeout (and close the response via a context manager), and pass encoding="utf-8" (plus errors=) to read_text/write_text.
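A hedged sketch of the suggested fix (the 30-second timeout is an illustrative choice, and the function is a standalone version of the method being reviewed):

```python
import urllib.request
from pathlib import Path


def download_text(url: str, destination: Path, timeout: float = 30.0) -> str:
    """Fetch text with an explicit timeout and explicit UTF-8 handling.

    The cached-read branch avoids any network access when the file
    already exists; the urlopen call is closed via a context manager
    and bounded by the timeout.
    """
    if destination.exists():
        return destination.read_text(encoding="utf-8", errors="replace")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read().decode("utf-8", errors="replace")
    destination.write_text(payload, encoding="utf-8")
    return payload
```

Passing `encoding="utf-8"` explicitly makes the round trip locale-independent, which matters on platforms where the default text encoding is not UTF-8.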

Comment on lines +163 to +166
        for entry in entries:
            file_name = entry["name"]
            file_text = cls._download_text(entry["download_url"], cache_dir / file_name)
            for row in csv.DictReader(file_text.splitlines()):

Copilot AI Apr 4, 2026


In _fetch_casi_rows, file_name from the GitHub API is used directly in cache_dir / file_name. If an unexpected value contains path separators (e.g., ../), this would write outside the cache directory. Please sanitize/validate file_name (e.g., enforce Path(file_name).name == file_name and reject absolute/parent paths) before using it as a local filename.
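One way the validation could look, as an illustrative helper (not the PR's code): reject empty names, path separators, and dot entries so the result cannot escape the cache directory.

```python
from pathlib import Path


def safe_cache_path(cache_dir: Path, file_name: str) -> Path:
    """Validate an API-supplied file name before using it locally.

    Illustrative check: a name containing "/" or "\\", or equal to
    "." or "..", could resolve outside cache_dir, so it is rejected.
    """
    if (
        not file_name
        or any(sep in file_name for sep in ("/", "\\"))
        or file_name in (".", "..")
    ):
        raise ValueError(f"unsafe file name from API: {file_name!r}")
    return cache_dir / file_name
```

This also rejects absolute paths, since they necessarily contain a separator.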

Comment on lines +5 to +7
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

Copilot AI Apr 4, 2026


This example mutates sys.path at runtime to import pyhealth. Other repo examples import pyhealth directly (e.g., examples/omop_dataset_demo.py) and rely on running from an installed/editable environment. Consider removing the sys.path manipulation and instead documenting the expected invocation (e.g., pip install -e . or running from repo root) to avoid masking import/path issues.

@John-Carson John-Carson marked this pull request as draft April 4, 2026 19:50