A config-driven, modular pipeline for preprocessing multiple sign language datasets. It supports multiple extractors, including MediaPipe Holistic, MMPose, MMDet, and YOLO, and two pipeline modes: pose landmarks and video clips.
- Config-Driven -- YAML job configs, experiment configs, and CLI overrides
- Multiple Extractors -- MediaPipe Holistic, MMPose, MMDet, and YOLO
- Two Pipeline Modes -- `pose` (landmarks) and `video` (clip extraction)
- WebDataset Output -- sharded tar archives for efficient training data loading
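
To make the "sharded tar archives" in the list above concrete, here is a minimal sketch of the general WebDataset convention: files in a tar that share a key (the name before the first dot) form one sample, and the extension names the field. It uses only the Python standard library. The sample keys, field names (`json`, `txt`), and shard file name are hypothetical, not necessarily what SignDATA writes; see Pipeline Stages for the actual output format.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(path, samples):
    """Write samples to one tar shard; files sharing a key form one sample."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path):
    """Group tar members back into {key: {extension: bytes}}."""
    grouped = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            # Split at the first dot: "clip_000001.json" -> ("clip_000001", "json")
            key, _, ext = member.name.partition(".")
            grouped.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return grouped


# Hypothetical samples: keys, field names, and contents are illustrative only.
samples = [
    ("clip_000001", {"json": json.dumps({"frames": 2}).encode(), "txt": b"hello"}),
    ("clip_000002", {"json": json.dumps({"frames": 3}).encode(), "txt": b"thanks"}),
]

with tempfile.TemporaryDirectory() as tmp:
    shard_path = Path(tmp) / "shard-000000.tar"
    write_shard(shard_path, samples)
    restored = read_shard(shard_path)

print(sorted(restored))
```

Because each shard is a plain tar file read sequentially, a training loader can stream samples without random disk access, which is the main reason this layout suits large datasets.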
| Dataset | Venue | Description | License |
|---|---|---|---|
| YouTube-ASL | NeurIPS 2023 | 11,000+ videos, 73,000+ segments -- open-domain ASL-English parallel corpus | Apache-2.0 |
| How2Sign | CVPR 2021 | 80+ hours of instructional ASL in a controlled studio environment | CC BY-NC 4.0 |
| WLASL | WACV 2020 | 12,000+ isolated sign clips across 2,000 ASL glosses | Dataset site |
| MS-ASL | CVPR 2019 | Large-scale isolated ASL dataset with signer-diverse lexical clips | Microsoft Download Center terms |
For paper-aligned preprocessing methodology, see Research-Aligned Preprocessing.
git clone https://github.com/balaboom123/signdata-slt.git
cd signdata-slt
python -m venv venv
source venv/bin/activate # Linux/macOS — use venv\Scripts\activate on Windows
pip install -r requirements.txt

MediaPipe works on CPU out of the box. MMPose, MMDet, and YOLO require a CUDA-capable GPU and additional dependencies -- see the Installation Guide for full setup instructions.
# YouTube-ASL: download, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml
# How2Sign: extract MMPose landmarks (CUDA required)
python -m signdata run configs/jobs/how2sign/mmpose.yaml
# MS-ASL: validate local clips, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/msasl/mediapipe.yaml
# Override config values from the command line
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml \
--override processing.max_workers=8 stop_at=extract

Both modes produce WebDataset tar shards for efficient training data loading. See Pipeline Stages for detailed output formats and data shapes.
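
The `--override` arguments in the command above use dotted keys such as `processing.max_workers=8`. As a rough illustration of how dotted-key overrides are commonly merged into a nested config, here is a minimal sketch. It is not SignDATA's actual implementation: the helper names `parse_value` and `apply_overrides` and the base config contents are assumptions made for this example.

```python
import copy
import json


def parse_value(raw):
    """Interpret "8" as an int and "true" as a bool; leave other text as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with each "a.b.c=value" override applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            # Create intermediate sections that the base config does not define.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return merged


# Hypothetical config contents; a real job YAML may be structured differently.
base = {"processing": {"max_workers": 4}, "stop_at": None}
result = apply_overrides(base, ["processing.max_workers=8", "stop_at=extract"])
print(result)
```

Working on a deep copy keeps the loaded job config unchanged, so the same base config can be reused across runs with different overrides.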
- Installation Guide -- base setup and MMPose GPU dependencies
- Architecture -- system design, registry, pipeline flow
- Configuration -- job/experiment layout and CLI overrides
- Pipeline Stages -- recipe stages and optional stages
- Datasets -- YouTube-ASL, How2Sign, WLASL, and MS-ASL setup
- Contributing -- required dataset package structure and extension guide
- Research-Aligned Preprocessing -- paper-aligned preprocessing notes
If you use SignDATA in your research, please cite:
@Article{chen2026signdata,
author = {Kuanwei Chen and Tingyi Lin},
journal = {arXiv:2604.20357},
title = {SignDATA: Data Pipeline for Sign Language Translation},
year = {2026},
}

The MIT license in this repository applies to the code and documentation in this project. Use of external datasets, research artifacts, and upstream repos referenced above must comply with their original licenses and usage terms.
MIT -- see LICENSE.