A comprehensive toolkit designed to convert PDF documents into structured, high-quality Markdown files. The conversion process is executed in three distinct stages:
- PDF to Images: High-resolution conversion of PDF pages into image formats.
- Image to Clean MD (EN): Automated OCR error correction and translation from Khmer to English using AI models.
- Markdown Merging: Aggregation of individual markdown pages into a consolidated document.
The first stage uses `pdftoppm` to convert PDF files into PNG, JPEG, or TIFF images.
- Haskell Stack
- poppler-utils (`pdftoppm` required)

      # Debian/Ubuntu
      sudo apt-get install poppler-utils
      # macOS
      brew install poppler

    cd 1-pdf-to-images
    stack build
    stack run -- <INPUT_PDF> [OPTIONS]

| Option | Shorthand | Description | Default |
|---|---|---|---|
| `--output-dir DIR` | `-o` | Directory to save output images | `.` |
| `--prefix PREFIX` | `-p` | Prefix for output filenames | `page` |
| `--format FORMAT` | `-f` | Output format: `png`, `jpeg`, `tiff` | `png` |

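
For orientation, the options above map naturally onto `pdftoppm`'s own flags. The sketch below only prints a roughly equivalent command rather than running it. The function name `pdftoppm_command` and the 300 DPI default are assumptions made for illustration; the Haskell tool may call `pdftoppm` with different settings.

```shell
# Sketch only: prints a pdftoppm command resembling
#   stack run -- INPUT -o DIR -p PREFIX -f FORMAT
# The 300 DPI default is an assumption, not a documented setting of the toolkit.
pdftoppm_command() (
  input=$1
  out_dir=${2:-.}      # mirrors the documented default for --output-dir
  prefix=${3:-page}    # mirrors the documented default for --prefix
  format=${4:-png}     # mirrors the documented default for --format
  dpi=${5:-300}

  # pdftoppm selects the output format with -png, -jpeg, or -tiff.
  case $format in
    png|jpeg|tiff) ;;
    *) echo "unsupported format: $format" >&2; exit 1 ;;
  esac

  # The final argument is the output root; pdftoppm appends the page
  # number and file extension to it. Paths are printed unquoted.
  printf 'pdftoppm -%s -r %s %s %s/%s\n' \
    "$format" "$dpi" "$input" "$out_dir" "$prefix"
)
```

Calling `pdftoppm_command document.pdf out page png` prints `pdftoppm -png -r 300 document.pdf out/page`.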
The second stage automates OCR error correction and translates Khmer markdown content into English while maintaining the original document structure.
- AI Access (e.g., Google Gemini API or a compatible AI assistant).
Refer to the instructions and implementation prompts located in 2-img-to-clean-md-en/gemini.prompt.md.
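
Because this stage runs outside the Stack tools, it can help to confirm that every page image produced a markdown file before merging. The following is a minimal sketch, assuming each image `page-001.png` should yield a same-named `page-001.md`; that naming convention and the helper name `missing_markdown` are assumptions, not part of the toolkit.

```shell
# Prints the base name of every image in IMG_DIR that has no same-named
# .md file in MD_DIR. The image/markdown naming match is an assumption.
missing_markdown() (
  img_dir=$1
  md_dir=$2
  ext=${3:-png}

  for img in "$img_dir"/*."$ext"; do
    [ -e "$img" ] || continue              # the glob matched nothing
    base=$(basename "$img" ".$ext")
    [ -f "$md_dir/$base.md" ] || echo "$base"
  done
)
```

An empty result means every image has a counterpart; otherwise the listed pages still need processing.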
The third stage aggregates multiple markdown files from a specified directory into a single document, ensuring correct sequential order (e.g., page-001.md, page-002.md).
    cd 3-combine-markdown
    stack build
    stack run -- -i <INPUT_DIR> -o <OUTPUT_DIR> [-n <FILE_NAME>]

The output document will be saved as `<OUTPUT_DIR>/<FILE_NAME or INPUT_DIR_NAME>.md`.
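
To illustrate the ordering idea, here is a plain shell stand-in for the merge step. It is not the Haskell implementation: the function name `combine_markdown`, the blank-line separator, and the reliance on zero-padded file names are all assumptions.

```shell
# Concatenates every .md file in IN_DIR into OUT_FILE in name order.
# Shell globs expand in sorted order, so zero-padded names such as
# page-001.md, page-002.md, page-010.md come out in page order.
# OUT_FILE should live outside IN_DIR so it is not merged into itself.
combine_markdown() (
  in_dir=$1
  out_file=$2
  LC_ALL=C                       # byte-wise ordering, independent of locale

  : > "$out_file"                # start from an empty output file
  for f in "$in_dir"/*.md; do
    [ -f "$f" ] || continue      # the glob matched nothing
    cat "$f" >> "$out_file"
    printf '\n' >> "$out_file"   # blank line between pages
  done
)
```

Names without zero padding (`page-2.md`, `page-10.md`) would sort incorrectly under this approach, which is one reason padded page numbers matter.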
To convert a standard PDF (such as Khmer legal text) to clean, translated markdown, follow this recommended procedure:
- Preparation: Create a dedicated project directory within `temp/` and establish the required output subdirectories:

      # Replace 'my-pdf-name' with the actual document identifier
      mkdir -p temp/my-pdf-name/1-output-images
      mkdir -p temp/my-pdf-name/2-clean-markdown
      mkdir -p temp/my-pdf-name/2.1-en-markdown
      mkdir -p temp/my-pdf-name/3-combine-markdown
      # Copy the source PDF into the project directory
      cp path/to/your/document.pdf temp/my-pdf-name/

- Stage 1: Image Generation:

      cd 1-pdf-to-images
      stack run -- ../temp/my-pdf-name/document.pdf \
        -o ../temp/my-pdf-name/1-output-images \
        -p page -f png

- Stage 2: Processing and Translation: Utilize the provided instructions in `2-img-to-clean-md-en/gemini.prompt.md` to process images from `temp/my-pdf-name/1-output-images/`. Output should be directed to `temp/my-pdf-name/2-clean-markdown/` (Khmer) and `temp/my-pdf-name/2.1-en-markdown/` (English).

- Stage 3: Final Document Consolidation: Merge the processed Khmer and English files:

      cd ../3-combine-markdown
      # Consolidate Khmer version
      stack run -- \
        -i ../temp/my-pdf-name/2-clean-markdown \
        -o ../temp/my-pdf-name/3-combine-markdown \
        -n my-pdf-name
      # Consolidate English version
      stack run -- \
        -i ../temp/my-pdf-name/2.1-en-markdown \
        -o ../temp/my-pdf-name/3-combine-markdown \
        -n my-pdf-name-en

  This process generates `temp/my-pdf-name/3-combine-markdown/my-pdf-name.md` and `temp/my-pdf-name/3-combine-markdown/my-pdf-name-en.md`.

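
The Preparation step lends itself to a small helper when several documents are processed. The sketch below wraps the same `mkdir -p` and `cp` commands; the function name `prepare_project` and its optional third argument (the root directory, defaulting to `temp`) are hypothetical and not taken from the scripts in this repository.

```shell
# Creates the per-document directory layout described in the Preparation
# step and copies the source PDF into it.
# Usage: prepare_project NAME PDF [ROOT]   (ROOT defaults to "temp")
prepare_project() (
  name=$1
  pdf=$2
  root=${3:-temp}

  for sub in 1-output-images 2-clean-markdown 2.1-en-markdown 3-combine-markdown; do
    mkdir -p "$root/$name/$sub" || exit 1
  done

  # Keep the PDF's original file name inside the project directory.
  cp "$pdf" "$root/$name/"
)
```

For example, `prepare_project my-pdf-name path/to/your/document.pdf` reproduces the layout used in the stages above.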
The examples/ directory contains sample legal documents processed with this toolkit, including automation scripts.
This project is licensed under the GNU General Public License v2.0. For more details, see the LICENSE file.