SignDATA: Data Pipeline for Sign Language Translation

arXiv   License   Python 3.11+

A config-driven, modular pipeline for preprocessing sign language datasets. It supports four extractors (MediaPipe Holistic, MMPose, MMDet, and YOLO) and two pipeline modes: pose (landmark extraction) and video (clip extraction).


Key Features

  • Config-Driven — YAML job configs, experiment configs, and CLI overrides
  • Multiple Extractors — MediaPipe Holistic, MMPose, MMDet, and YOLO
  • Two Pipeline Modes — pose (landmarks) and video (clip extraction)
  • WebDataset Output — sharded tar archives for efficient training data loading

Supported Datasets

| Dataset | Venue | Description | License |
|---|---|---|---|
| YouTube-ASL | NeurIPS 2023 | 11,000+ videos, 73,000+ segments -- open-domain ASL-English parallel corpus | Apache-2.0 |
| How2Sign | CVPR 2021 | 80+ hours of instructional ASL in a controlled studio environment | CC BY-NC 4.0 |
| WLASL | WACV 2020 | 12,000+ isolated sign clips across 2,000 ASL glosses | Dataset site |
| MS-ASL | CVPR 2019 | Large-scale isolated ASL dataset with signer-diverse lexical clips | Microsoft Download Center terms |

For paper-aligned preprocessing methodology, see Research-Aligned Preprocessing.


Installation

git clone https://github.com/balaboom123/signdata-slt.git
cd signdata-slt
python -m venv venv
source venv/bin/activate  # Linux/macOS — use venv\Scripts\activate on Windows
pip install -r requirements.txt

Optional: GPU-based Extractors (MMPose, MMDet, YOLO)

MediaPipe works on CPU out of the box. MMPose, MMDet, and YOLO require a CUDA-capable GPU and additional dependencies -- see the Installation Guide for full setup instructions.


Quick Start

# YouTube-ASL: download, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml

# How2Sign: extract MMPose landmarks (CUDA required)
python -m signdata run configs/jobs/how2sign/mmpose.yaml

# MS-ASL: validate local clips, extract MediaPipe landmarks, normalize, package
python -m signdata run configs/jobs/msasl/mediapipe.yaml

# Override config values from the command line
python -m signdata run configs/jobs/youtube_asl/mediapipe.yaml \
  --override processing.max_workers=8 stop_at=extract
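
The `--override` arguments above use dotted keys to reach into the nested job config. The sketch below shows one plausible way such overrides could be merged into a config dictionary. It is not SignDATA's actual implementation: the helper names `parse_value` and `apply_overrides` and the shape of the base config are assumptions made for illustration.

```python
import ast
from copy import deepcopy


def parse_value(raw: str):
    """Interpret an override value as a Python literal when possible."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # fall back to the plain string, e.g. "extract"


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            # Create intermediate sections that the base config lacks.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical base config; the real job schema may differ.
base = {"processing": {"max_workers": 4}, "stop_at": None}
merged = apply_overrides(base, ["processing.max_workers=8", "stop_at=extract"])
print(merged)  # {'processing': {'max_workers': 8}, 'stop_at': 'extract'}
```

Working on a deep copy leaves the values loaded from YAML untouched, so the same base config can be reused across runs with different overrides.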

Both modes produce WebDataset tar shards for efficient training data loading. See Pipeline Stages for detailed output formats and data shapes.
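
For readers unfamiliar with the WebDataset layout, a shard is an ordinary tar archive in which files that share a name up to the first dot belong to the same sample. The following is a minimal sketch of that convention using only the Python standard library. The sample keys, the `.json` and `.txt` members, and the helper names are assumptions for illustration; the actual member names and array formats that SignDATA writes are described in Pipeline Stages.

```python
import io
import json
import os
import tarfile
import tempfile


def write_shard(path: str, samples: list[dict]) -> None:
    """Write samples to one tar shard; files sharing a key form one sample."""
    with tarfile.open(path, "w") as tar:
        for sample in samples:
            key = sample["key"]
            members = {
                f"{key}.json": json.dumps(sample["meta"]).encode("utf-8"),
                f"{key}.txt": sample["text"].encode("utf-8"),
            }
            for name, payload in members.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(path: str) -> dict[str, dict[str, bytes]]:
    """Group shard members by key (the file name up to the first dot)."""
    grouped: dict[str, dict[str, bytes]] = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return grouped


# Hypothetical samples; real shards would carry landmarks or video clips.
samples = [
    {"key": "clip_000000", "meta": {"num_frames": 2}, "text": "hello"},
    {"key": "clip_000001", "meta": {"num_frames": 3}, "text": "thank you"},
]
with tempfile.TemporaryDirectory() as tmp:
    shard_path = os.path.join(tmp, "shard-000000.tar")
    write_shard(shard_path, samples)
    loaded = read_shard(shard_path)

print(sorted(loaded))  # ['clip_000000', 'clip_000001']
```

Because each shard is read sequentially, a training loader can stream samples from disk or network storage without random access to individual files.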


Documentation

Citation

If you use SignDATA in your research, please cite:

@Article{chen2026signdata,
    author  = {Kuanwei Chen and Tingyi Lin},
    journal = {arXiv:2604.20357},
    title   = {SignDATA: Data Pipeline for Sign Language Translation},
    year    = {2026},
}

License

The MIT license in this repository applies to the code and documentation in this project. Use of external datasets, research artifacts, and upstream repos referenced above must comply with their original licenses and usage terms.

MIT -- see LICENSE.
