SignDATA: Data Pipeline for Sign Language Translation

A config-driven, modular pipeline for preprocessing multiple sign language datasets. It supports four extractors (MediaPipe Holistic, MMPose, MMDet, and YOLO) and two pipeline modes: pose landmarks and video clips.

Key Features

Config-Driven — YAML job configs, experiment configs, and CLI overrides
Multiple Extractors — MediaPipe Holistic, MMPose, MMDet, and YOLO
Two Pipeline Modes — pose (landmarks) and video (clip extraction)
WebDataset Output — sharded tar archives for efficient training data loading

Supported Datasets

Dataset	Venue	Description	License
YouTube-ASL	NeurIPS 2023	11,000+ videos, 73,000+ segments -- open-domain ASL-English parallel corpus	Apache-2.0
How2Sign	CVPR 2021	80+ hours of instructional ASL in a controlled studio environment	CC BY-NC 4.0
WLASL	WACV 2020	12,000+ isolated sign clips across 2,000 ASL glosses	See dataset site
MS-ASL	CVPR 2019	Large-scale isolated ASL dataset with signer-diverse lexical clips	Microsoft Download Center terms

For paper-aligned preprocessing methodology, see Research-Aligned Preprocessing.

Installation

git clone https://github.com/balaboom123/signdata-slt.git
cd signdata-slt
python -m venv venv
source venv/bin/activate  # Linux/macOS — use venv\Scripts\activate on Windows
pip install -r requirements.txt

Optional: GPU-based Extractors (MMPose, MMDet, YOLO)

MediaPipe works on CPU out of the box. MMPose, MMDet, and YOLO require a CUDA-capable GPU and additional dependencies -- see the Installation Guide for full setup instructions.

Quick Start

# YouTube-ASL: download, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml

# How2Sign: extract MMPose landmarks (CUDA required)
python -m signdata run configs/jobs/how2sign/mmpose.yaml

# MS-ASL: validate local clips, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/msasl/mediapipe.yaml

# Override config values from the command line
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml \
  --override processing.max_workers=8 stop_at=extract
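
To illustrate what a dotted override such as `processing.max_workers=8` means, here is a minimal sketch of merging `key=value` pairs into a nested config dictionary. It is an assumption about the general technique, not SignDATA's actual implementation: `parse_value`, `apply_overrides`, and the starting `config` values are hypothetical names and data chosen for the example.

```python
import ast
from typing import Any


def parse_value(raw: str) -> Any:
    """Read the value as a Python literal (8 -> int); otherwise keep the string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply dotted key=value overrides to a nested dict, in place."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections that the base config does not define.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical base values standing in for a loaded YAML job config.
config = {"processing": {"max_workers": 4}, "stop_at": None}
apply_overrides(config, ["processing.max_workers=8", "stop_at=extract"])
print(config)
```

Each dot descends one level in the config, so `processing.max_workers` addresses the `max_workers` key inside the `processing` section, while `stop_at` is a top-level key.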

Both modes produce WebDataset tar shards for efficient training data loading. See Pipeline Stages for detailed output formats and data shapes.
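
For readers new to the format, the sketch below shows the WebDataset naming convention using only the Python standard library: files in a tar archive that share a basename belong to one sample, and the extension names the field. The sample key `clip_0001`, the `json` and `bin` fields, and both helper functions are hypothetical; the real SignDATA shard layout is described in Pipeline Stages.

```python
import io
import json
import tarfile


def write_shard(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Pack samples into an in-memory tar shard, one file per field."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(data: bytes) -> dict[str, dict[str, bytes]]:
    """Group tar members back into samples by the part before the first dot."""
    samples: dict[str, dict[str, bytes]] = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


# Placeholder payloads: JSON metadata plus eight zero bytes standing in for landmark data.
samples = {
    "clip_0001": {
        "json": json.dumps({"num_frames": 2}).encode(),
        "bin": bytes(8),
    }
}
restored = read_shard(write_shard(samples))
print(restored == samples)
```

Because a shard is an ordinary tar file, it can be read sequentially, which is what makes the format suited to streaming large training sets. In practice a loader such as the `webdataset` library would handle the grouping and decoding.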

Documentation

Installation Guide -- base setup and MMPose GPU dependencies
Architecture -- system design, registry, pipeline flow
Configuration -- job/experiment layout and CLI overrides
Pipeline Stages -- recipe stages and optional stages
Datasets -- YouTube-ASL, How2Sign, WLASL, and MS-ASL setup
Contributing -- required dataset package structure and extension guide
Research-Aligned Preprocessing -- paper-aligned preprocessing notes

Citation

If you use SignDATA in your research, please cite:

@Article{chen2026signdata,
    author  = {Kuanwei Chen and Tingyi Lin},
    journal = {arXiv:2604.20357},
    title   = {SignDATA: Data Pipeline for Sign Language Translation},
    year    = {2026},
}

License

The MIT license in this repository applies to the code and documentation in this project. Use of external datasets, research artifacts, and upstream repos referenced above must comply with their original licenses and usage terms.

MIT -- see LICENSE.
