PDF Annotation Extraction

A TypeScript-based tool for extracting highlight annotations from PDF files and exporting them to structured formats like JSON and Markdown.

Overview

This project extracts highlighted text and associated annotations from PDF files, particularly those annotated in Microsoft Edge or other PDF readers that follow the PDF specification. The extracted data can be exported to JSON or Markdown for further processing, note-taking, or integration with AI/LLM pipelines.

Features

✅ Implemented:

- Extract highlight annotations from PDF files
- Capture the actual highlighted text (not just annotation metadata)
- Export to JSON and Markdown formats
- CLI interface for command-line usage
- Programmatic API for integration into other projects
- Preserve annotation metadata:
  - Page numbers
  - Highlighted text content
  - Annotation comments/notes
  - Color information
  - Author and modification dates
  - Position information

Technology Stack

- Runtime: Node.js
- Language: TypeScript
- Core PDF Engine: pdf.js (pdfjs-dist)
- CLI: Commander.js
- Output Formatting: Chalk for colorized terminal output

Installation

# Clone the repository
git clone https://github.com/MikeORed/pdf-annotation-extraction.git
cd pdf-annotation-extraction

# Install dependencies
npm install

# Build the project
npm run build

Usage

CLI Usage

After building, you can use the CLI tool:

# Basic extraction to JSON (default)
npx ts-node src/cli.ts document.pdf

# Or using the dev script
npm run dev document.pdf

# Note: place `--` before the arguments so npm forwards the flags
# to the script instead of parsing them itself

# Extract to Markdown
npm run dev -- document.pdf --format markdown

# Specify output file
npm run dev -- document.pdf --output my-notes.json

# Group highlights by page (Markdown only)
npm run dev -- document.pdf --format markdown --group-by-page

# Include additional metadata
npm run dev -- document.pdf --include-dates --include-authors

# Get help
npm run dev -- --help

CLI Options

- -o, --output <path>: Output file path (auto-generated if not specified)
- -f, --format <format>: Output format, json or markdown (default: json)
- --no-colors: Exclude color information from output
- --no-toc: Exclude table of contents (markdown only)
- --group-by-page: Group highlights by page (markdown only)
- --include-dates: Include modification dates
- --include-authors: Include author information
- --include-raw: Include raw annotation data (json only)
- --pretty: Pretty print JSON output (default: true)

Programmatic Usage

import { extractHighlights, exportToJSON, exportToMarkdown } from 'pdf-annotation-extraction';

async function extractPDFHighlights() {
  // Extract highlights
  const result = await extractHighlights('document.pdf', {
    includeDates: true,
    includeAuthor: true
  });

  console.log(`Found ${result.highlights.length} highlights`);

  // Access individual highlights
  result.highlights.forEach(highlight => {
    console.log(`Page ${highlight.page}: "${highlight.text}"`);
    if (highlight.comment) {
      console.log(`  Note: ${highlight.comment}`);
    }
  });

  // Export to JSON
  await exportToJSON(result, 'output.json', { pretty: true });

  // Export to Markdown
  await exportToMarkdown(result, 'output.md', {
    groupByPage: true,
    includeTOC: true
  });
}
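
Once highlights are extracted, they are plain objects and can be post-processed freely. The following is a minimal sketch of grouping them by page in your own code. The `HighlightLike` interface and the `groupByPage` helper are hypothetical names introduced here for illustration; only the `page`, `text`, and `comment` fields are taken from the sample output in this README, and the package's real exported types may differ.

```typescript
// Minimal shape assumed from the sample output in this README
// (hypothetical; not the package's actual exported type).
interface HighlightLike {
  page: number;
  text: string;
  comment?: string | null;
}

// Group highlights by page number, keeping their original order within each page.
function groupByPage<T extends HighlightLike>(highlights: T[]): Record<number, T[]> {
  const pages: Record<number, T[]> = {};
  for (const h of highlights) {
    const bucket = pages[h.page] ?? [];
    bucket.push(h);
    pages[h.page] = bucket;
  }
  return pages;
}

// Illustrative data standing in for `result.highlights`.
const sampleHighlights: HighlightLike[] = [
  { page: 2, text: "Second page quote" },
  { page: 1, text: "First quote", comment: "Follow up" },
  { page: 2, text: "Another quote on page two" },
];

const grouped = groupByPage(sampleHighlights);
const pageNumbers = Object.keys(grouped).map(Number).sort((a, b) => a - b);
for (const page of pageNumbers) {
  console.log(`Page ${page}: ${(grouped[page] ?? []).length} highlight(s)`);
}
```

A plain object keyed by page number is used rather than a `Map` so the result serializes directly with `JSON.stringify`.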

Testing

A test script is provided to quickly test extraction on a PDF:

npx ts-node test-extract.ts path/to/your/annotated.pdf

This will:

- Extract all highlights from the PDF
- Display a summary in the terminal
- Show sample JSON and Markdown output

Output Formats

JSON Output

{
  "sourceFile": "document.pdf",
  "totalPages": 10,
  "highlights": [
    {
      "page": 1,
      "text": "This is the highlighted text content",
      "comment": "Optional annotation comment",
      "color": { "r": 255, "g": 255, "b": 0 },
      "position": { "x": 100, "y": 500, "width": 200, "height": 20 },
      "author": null,
      "modificationDate": "2024-01-15T10:30:00.000Z"
    }
  ],
  "extractedAt": "2024-01-20T15:45:00.000Z"
}
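
For consumers reading this JSON from another program, it can help to describe the shape with types and validate it before use. The sketch below infers its interfaces from the sample above, so the names (`ExtractionResult`, `Highlight`, `isExtractionResult`) and the choice of which fields are optional are assumptions, not the definitions in `src/types/annotations.ts`.

```typescript
// Types inferred from the sample JSON in this README (assumptions,
// not the package's actual exported definitions).
interface RGBColor { r: number; g: number; b: number; }
interface Rect { x: number; y: number; width: number; height: number; }

interface Highlight {
  page: number;
  text: string;
  comment?: string | null;
  color?: RGBColor;
  position?: Rect;
  author?: string | null;
  modificationDate?: string | null; // ISO 8601 timestamp in the sample
}

interface ExtractionResult {
  sourceFile: string;
  totalPages: number;
  highlights: Highlight[];
  extractedAt: string;
}

// Type guard that checks only the fields assumed to be required.
function isExtractionResult(value: unknown): value is ExtractionResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const hs = v["highlights"];
  if (!Array.isArray(hs)) return false;
  const highlightsOk = hs.every((h: unknown) => {
    if (typeof h !== "object" || h === null) return false;
    const item = h as Record<string, unknown>;
    return typeof item["page"] === "number" && typeof item["text"] === "string";
  });
  return (
    highlightsOk &&
    typeof v["sourceFile"] === "string" &&
    typeof v["totalPages"] === "number" &&
    typeof v["extractedAt"] === "string"
  );
}

// Stand-in for the contents of an exported JSON file.
const sampleJson = JSON.stringify({
  sourceFile: "document.pdf",
  totalPages: 10,
  highlights: [
    { page: 1, text: "This is the highlighted text content", comment: "Optional annotation comment" },
  ],
  extractedAt: "2024-01-20T15:45:00.000Z",
});

const parsed: unknown = JSON.parse(sampleJson);
if (isExtractionResult(parsed)) {
  const withNotes = parsed.highlights.filter((h) => Boolean(h.comment));
  console.log(`${withNotes.length} of ${parsed.highlights.length} highlights have notes`);
}
```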

Markdown Output

# PDF Highlights: document.pdf

**Extracted:** 1/20/2024, 3:45:00 PM
**Total Pages:** 10
**Total Highlights:** 5

## Table of Contents

- [Highlight 1](#highlight-1) - "This is the highlighted text..."
- [Highlight 2](#highlight-2) - "Another highlight..."

## Highlights

### Highlight 1

**Page:** 1 | **Color:** Yellow

> This is the highlighted text content

**Note:** Optional annotation comment

---
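
To make the layout above concrete, here is a minimal sketch of how a single highlight entry could be rendered. It is not the code in `markdown-exporter.ts`; the `renderHighlight` and `colorName` helpers, the `MdHighlight` type, and the small color palette are all hypothetical, and the nearest-color lookup is just one plausible way to turn an RGB value into a label such as "Yellow".

```typescript
// Hypothetical input shape, based on the JSON sample in this README.
interface MdHighlight {
  page: number;
  text: string;
  comment?: string | null;
  color?: { r: number; g: number; b: number };
}

// Small illustrative palette; a real exporter may name colors differently.
const PALETTE: Array<{ name: string; r: number; g: number; b: number }> = [
  { name: "Yellow", r: 255, g: 255, b: 0 },
  { name: "Green", r: 0, g: 255, b: 0 },
  { name: "Blue", r: 0, g: 0, b: 255 },
  { name: "Red", r: 255, g: 0, b: 0 },
];

// Pick the palette entry with the smallest squared RGB distance.
function colorName(c: { r: number; g: number; b: number }): string {
  let bestName = "Unknown";
  let bestDist = Infinity;
  for (const p of PALETTE) {
    const dist = (p.r - c.r) ** 2 + (p.g - c.g) ** 2 + (p.b - c.b) ** 2;
    if (dist < bestDist) {
      bestDist = dist;
      bestName = p.name;
    }
  }
  return bestName;
}

// Render one highlight in the layout shown in the sample Markdown output.
function renderHighlight(h: MdHighlight, index: number): string {
  const meta = [`**Page:** ${h.page}`];
  if (h.color) meta.push(`**Color:** ${colorName(h.color)}`);
  const lines = [`### Highlight ${index}`, "", meta.join(" | "), ""];
  // Prefix every line so multi-line highlights stay inside the blockquote.
  lines.push(h.text.split("\n").map((l) => `> ${l}`).join("\n"), "");
  if (h.comment) lines.push(`**Note:** ${h.comment}`, "");
  lines.push("---");
  return lines.join("\n");
}

console.log(
  renderHighlight(
    {
      page: 1,
      text: "This is the highlighted text content",
      comment: "Optional annotation comment",
      color: { r: 255, g: 255, b: 0 },
    },
    1
  )
);
```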

Use Cases

- Research Compilation: Extract highlights from academic papers
- Document Review: Gather annotated quotes and notes from documents
- Note-Taking: Automate workflows for managing reading notes
- AI/LLM Integration: Feed annotated content into AI pipelines
- Knowledge Management: Create searchable databases of highlighted content

Project Structure

pdf-annotation-extraction/
├── src/
│   ├── types/
│   │   └── annotations.ts       # Type definitions
│   ├── core/
│   │   ├── pdf-loader.ts        # PDF document loading
│   │   ├── annotation-extractor.ts # Main extraction logic
│   │   ├── text-matcher.ts      # Text-to-annotation matching
│   │   └── geometry.ts          # Coordinate utilities
│   ├── exporters/
│   │   ├── json-exporter.ts     # JSON export
│   │   └── markdown-exporter.ts # Markdown export
│   ├── cli.ts                   # CLI entry point
│   └── index.ts                 # Main API exports
├── docs/                        # Documentation
│   └── PROJECT_PLAN.md          # Detailed project plan
├── dist/                        # Compiled JavaScript (after build)
└── test-extract.ts              # Test script

How It Works

- PDF Loading: Uses pdf.js to load and parse PDF documents
- Annotation Extraction: Retrieves highlight annotations with their quadPoints (geometry)
- Text Matching: Matches text items to highlight regions using geometric intersection
- Text Assembly: Sorts and joins matched text fragments in reading order
- Export: Formats the extracted data into JSON or Markdown
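
The matching and assembly steps above can be sketched without pdf.js, using plain rectangles. This is an illustration of the general technique, not the project's actual `text-matcher.ts` or `geometry.ts`. It assumes quadPoints arrive as a flat array of 8 numbers per quad, that text items have already been reduced to a simple `{ str, x, y, width, height }` box, and that PDF coordinates have their origin at the bottom-left (so a larger `y` is higher on the page). The 0.5 overlap threshold and the 2-unit line tolerance are arbitrary illustrative values, and all function names are hypothetical.

```typescript
interface Box { x: number; y: number; width: number; height: number; }
interface TextBox extends Box { str: string; }

// Convert a flat quadPoints array (8 numbers per quad) into bounding boxes.
function quadsToBoxes(quadPoints: number[]): Box[] {
  const boxes: Box[] = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    const quad = quadPoints.slice(i, i + 8);
    const xs = quad.filter((_, k) => k % 2 === 0);
    const ys = quad.filter((_, k) => k % 2 === 1);
    const minX = Math.min(...xs);
    const minY = Math.min(...ys);
    boxes.push({ x: minX, y: minY, width: Math.max(...xs) - minX, height: Math.max(...ys) - minY });
  }
  return boxes;
}

// Fraction of the text box's area that lies inside the region.
function overlapRatio(text: Box, region: Box): number {
  const ix = Math.max(0, Math.min(text.x + text.width, region.x + region.width) - Math.max(text.x, region.x));
  const iy = Math.max(0, Math.min(text.y + text.height, region.y + region.height) - Math.max(text.y, region.y));
  const area = text.width * text.height;
  return area > 0 ? (ix * iy) / area : 0;
}

// Group items into lines (top to bottom), then order each line left to right.
function assembleText(items: TextBox[], lineTolerance = 2): string {
  const sorted = items.slice().sort((a, b) => b.y - a.y);
  const lines: TextBox[][] = [];
  let lineY = 0;
  for (const item of sorted) {
    if (lines.length > 0 && Math.abs(lineY - item.y) <= lineTolerance) {
      lines[lines.length - 1].push(item);
    } else {
      lines.push([item]);
      lineY = item.y;
    }
  }
  return lines
    .map((line) => line.sort((a, b) => a.x - b.x).map((t) => t.str.trim()).join(" "))
    .join(" ");
}

// Keep text items that mostly fall inside any quad, then join them in reading order.
function extractHighlightText(quadPoints: number[], textItems: TextBox[], minOverlap = 0.5): string {
  const regions = quadsToBoxes(quadPoints);
  const matched = textItems.filter((t) => regions.some((r) => overlapRatio(t, r) >= minOverlap));
  return assembleText(matched);
}

// A two-line highlight: one quad per line.
const highlightQuads = [
  100, 712, 300, 712, 100, 700, 300, 700,
  72, 698, 200, 698, 72, 686, 200, 686,
];
const pageText: TextBox[] = [
  { str: "intersection", x: 140, y: 687, width: 60, height: 10 },
  { str: "using", x: 200, y: 701, width: 35, height: 10 },
  { str: "unrelated", x: 320, y: 701, width: 50, height: 10 },
  { str: "Matches", x: 100, y: 701, width: 50, height: 10 },
  { str: "geometric", x: 72, y: 687, width: 60, height: 10 },
  { str: "text", x: 160, y: 701, width: 30, height: 10 },
];

console.log(extractHighlightText(highlightQuads, pageText));
// prints "Matches text using geometric intersection"
```

Sorting by line first and by `x` second is also why multi-column pages can interleave: fragments from two columns that share a baseline end up on the same "line".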

Limitations

- Currently supports only highlight annotations (not underlines, strikethroughs, etc.)
- Assumes horizontal, left-to-right text (rotated text may not order correctly)
- Multi-column layouts may produce mixed text order
- Requires standard PDF annotations (works with Edge, Adobe, etc.)

Development

# Build the project
npm run build

# Run in development mode
npm run dev <pdf-file>

# Run tests
npm test  # (not yet implemented)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT

Acknowledgments

- Built with pdf.js by Mozilla
- Uses Commander.js for CLI
- Terminal colors by Chalk

Future Enhancements

- Support for other annotation types (underline, strikethrough, freehand)
- Batch processing with parallel execution
- Integration with note-taking apps (Obsidian)
- Advanced filtering (by color, date, author)
- Export to additional formats (CSV, HTML)
