PDF to Markdown Toolkit

A toolkit for converting PDF documents into structured Markdown files. The conversion runs in three stages:

1. PDF to Images: High-resolution conversion of PDF pages into image formats.
2. Image to Clean MD (EN): Automated OCR error correction and translation from Khmer to English using AI models.
3. Markdown Merging: Aggregation of individual markdown pages into a consolidated document.

Tools Overview

1. `1-pdf-to-images`

Uses pdftoppm to convert the pages of a PDF file into images (PNG, JPEG, or TIFF).

Prerequisites

Haskell Stack

poppler-utils (provides the required pdftoppm command)

# Debian/Ubuntu
sudo apt-get install poppler-utils
# macOS
brew install poppler

Installation

cd 1-pdf-to-images
stack build

Usage

stack run -- <INPUT_PDF> [OPTIONS]

| Option | Shorthand | Description | Default |
| --- | --- | --- | --- |
| `--output-dir DIR` | `-o` | Directory to save output images | `.` |
| `--prefix PREFIX` | `-p` | Prefix for output filenames | `page` |
| `--format FORMAT` | `-f` | Output format: `png`, `jpeg`, `tiff` | `png` |
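
For orientation, the options above correspond roughly to a direct pdftoppm call. The sketch below is not the tool's implementation: `pdftoppm_cmd` is a hypothetical helper, and the 300 DPI default is an assumption, since this README does not state the resolution the tool uses. It only prints the command rather than running it. The `-png`, `-jpeg`, `-tiff`, and `-r` flags are standard pdftoppm options.

```shell
# Hypothetical helper (not part of the toolkit): print the pdftoppm command
# that would correspond to a given set of options, without running it.
# The function body runs in a subshell, so its variables stay local.
pdftoppm_cmd() (
  pdf="$1"
  out_dir="${2:-.}"      # mirrors --output-dir (default: .)
  prefix="${3:-page}"    # mirrors --prefix (default: page)
  format="${4:-png}"     # mirrors --format (default: png)
  dpi="${5:-300}"        # assumed resolution, not taken from the tool

  case "$format" in
    png|jpeg|tiff) ;;
    *) echo "unsupported format: $format" >&2; exit 1 ;;
  esac

  # pdftoppm takes the output "root" (directory + prefix) as its last
  # argument and appends the page number and extension itself.
  printf 'pdftoppm -%s -r %s "%s" "%s"\n' "$format" "$dpi" "$pdf" "$out_dir/$prefix"
)

pdftoppm_cmd document.pdf out
# prints: pdftoppm -png -r 300 "document.pdf" "out/page"
```

Because pdftoppm appends the page number to the root, a root of `out/page` yields files such as `out/page-01.png`, with the number zero-padded according to the document's page count.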

2. `2-img-to-clean-md-en`

Automates OCR error correction and translates Khmer markdown content into English while maintaining the original document structure.

Prerequisites

AI Access (e.g., Google Gemini API or a compatible AI assistant).

Usage

Refer to the instructions and implementation prompts located in 2-img-to-clean-md-en/gemini.prompt.md.

3. `3-combine-markdown`

Combines the markdown files in a specified directory into a single document, keeping the pages in sequential order (e.g., page-001.md, then page-002.md).

Installation

cd 3-combine-markdown
stack build

Usage

stack run -- -i <INPUT_DIR> -o <OUTPUT_DIR> [-n <FILE_NAME>]

The output document is saved as <OUTPUT_DIR>/<FILE_NAME>.md. If -n is omitted, the name of the input directory is used in place of <FILE_NAME>.
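
The merge itself is conceptually simple. As a rough illustration only, the following shell sketch concatenates the pages in filename order. `combine_markdown` is a hypothetical stand-in, not the Haskell tool's code, and it assumes zero-padded filenames (so that lexical order matches page order) and an output directory different from the input directory.

```shell
# Hypothetical stand-in for the merge step: concatenate every .md file in a
# directory, in lexical filename order, into <out_dir>/<name>.md.
# The function body runs in a subshell, so its variables stay local.
combine_markdown() (
  in_dir="$1"
  out_dir="$2"
  name="${3:-$(basename "$in_dir")}"   # fall back to the input directory name

  mkdir -p "$out_dir"
  out_file="$out_dir/$name.md"
  : > "$out_file"                      # start from an empty output file

  # Glob expansion is sorted, so zero-padded names (page-001.md, page-002.md)
  # come out in page order. Unpadded names (page-2.md, page-10.md) would not.
  for f in "$in_dir"/*.md; do
    [ -f "$f" ] || continue            # skip the literal pattern if nothing matches
    cat "$f" >> "$out_file"
    printf '\n' >> "$out_file"         # extra newline after each page
  done
)

# Example: combine_markdown temp/my-pdf-name/2-clean-markdown temp/my-pdf-name/3-combine-markdown my-pdf-name
```

Relying on sorted glob expansion keeps the sketch short, but it is also why the zero-padding matters: with unpadded page numbers a numeric sort would be needed instead.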

Workflow Example

To convert a standard PDF (such as Khmer legal text) to clean, translated markdown, follow this recommended procedure:

Preparation: Create a dedicated project directory within temp/ and establish the required output subdirectories:

# Replace 'my-pdf-name' with the actual document identifier
mkdir -p temp/my-pdf-name/1-output-images
mkdir -p temp/my-pdf-name/2-clean-markdown
mkdir -p temp/my-pdf-name/2.1-en-markdown
mkdir -p temp/my-pdf-name/3-combine-markdown

# Copy the source PDF into the project directory
cp path/to/your/document.pdf temp/my-pdf-name/

Stage 1: Image Generation: Convert the pages of the PDF to PNG images:

cd 1-pdf-to-images
stack run -- ../temp/my-pdf-name/document.pdf \
  -o ../temp/my-pdf-name/1-output-images \
  -p page -f png

Stage 2: Processing and Translation: Follow the instructions in 2-img-to-clean-md-en/gemini.prompt.md to process the images in temp/my-pdf-name/1-output-images/. Write the cleaned Khmer markdown to temp/my-pdf-name/2-clean-markdown/ and the English translation to temp/my-pdf-name/2.1-en-markdown/.
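
Since Stage 2 is driven by an AI model rather than a deterministic tool, it can be worth checking that no page was skipped before merging. The sketch below is one possible check, not part of the toolkit: `missing_pages` is a hypothetical helper, and it assumes each markdown file shares its image's base name (for example page-01.png and page-01.md), which this README does not confirm.

```shell
# Hypothetical check: print the base name of every image in img_dir that has
# no same-named .md file in md_dir. Returns non-zero if any page is missing.
# The function body runs in a subshell, so its variables stay local.
missing_pages() (
  img_dir="$1"
  md_dir="$2"
  ext="${3:-png}"                      # image extension to look for
  status=0

  for img in "$img_dir"/*."$ext"; do
    [ -f "$img" ] || continue          # skip the literal pattern if nothing matches
    stem=$(basename "$img" ".$ext")
    if [ ! -f "$md_dir/$stem.md" ]; then
      echo "$stem"
      status=1
    fi
  done
  exit "$status"
)

# Example: missing_pages temp/my-pdf-name/1-output-images temp/my-pdf-name/2.1-en-markdown
```

An empty result with a zero exit status means every image has a matching markdown page under the assumed naming scheme.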

Stage 3: Final Document Consolidation: Merge the processed Khmer and English files:

cd ../3-combine-markdown
# Consolidate Khmer version
stack run -- \
  -i ../temp/my-pdf-name/2-clean-markdown \
  -o ../temp/my-pdf-name/3-combine-markdown \
  -n my-pdf-name

# Consolidate English version
stack run -- \
  -i ../temp/my-pdf-name/2.1-en-markdown \
  -o ../temp/my-pdf-name/3-combine-markdown \
  -n my-pdf-name-en

This process generates temp/my-pdf-name/3-combine-markdown/my-pdf-name.md and temp/my-pdf-name/3-combine-markdown/my-pdf-name-en.md.

Examples

The examples/ directory contains sample legal documents processed with this toolkit, including automation scripts.

License

This project is licensed under the GNU General Public License v2.0. For more details, see the LICENSE file.
