PDF Annotation Extraction

A TypeScript-based tool for extracting highlight annotations from PDF files and exporting them to structured formats like JSON and Markdown.

Overview

This project extracts highlighted text and its associated annotation data from PDF files, particularly those annotated in Microsoft Edge or in other PDF readers that write standard PDF annotations. The extracted data can be exported to JSON or Markdown for further processing, note-taking, or integration with AI/LLM pipelines.

Features

Implemented:

  • Extract highlight annotations from PDF files
  • Capture the actual highlighted text (not just annotation metadata)
  • Export to JSON and Markdown formats
  • Command-line interface (CLI)
  • Programmatic API for integration into other projects
  • Preserve annotation metadata:
    • Page numbers
    • Highlighted text content
    • Annotation comments/notes
    • Color information
    • Author and modification dates
    • Position information

Technology Stack

  • Runtime: Node.js
  • Language: TypeScript
  • Core PDF Engine: pdf.js (pdfjs-dist)
  • CLI: Commander.js
  • Output Formatting: Chalk for colorized terminal output

Installation

# Clone the repository
git clone https://github.com/MikeORed/pdf-annotation-extraction.git
cd pdf-annotation-extraction

# Install dependencies
npm install

# Build the project
npm run build

Usage

CLI Usage

The examples below run the CLI from the TypeScript source, either through ts-node directly or through the dev script:

# Basic extraction to JSON (default)
npx ts-node src/cli.ts document.pdf

# Or using the dev script
npm run dev document.pdf

# npm consumes flags itself unless they follow a `--` separator,
# so pass CLI options after `--` when using the dev script.

# Extract to Markdown
npm run dev -- document.pdf --format markdown

# Specify output file
npm run dev -- document.pdf --output my-notes.json

# Group highlights by page (Markdown only)
npm run dev -- document.pdf --format markdown --group-by-page

# Include additional metadata
npm run dev -- document.pdf --include-dates --include-authors

# Get help
npm run dev -- --help

CLI Options

  • -o, --output <path> - Output file path (auto-generated if not specified)
  • -f, --format <format> - Output format: json or markdown (default: json)
  • --no-colors - Exclude color information from output
  • --no-toc - Exclude table of contents (markdown only)
  • --group-by-page - Group highlights by page (markdown only)
  • --include-dates - Include modification dates
  • --include-authors - Include author information
  • --include-raw - Include raw annotation data (json only)
  • --pretty - Pretty print JSON output (default: true)

Programmatic Usage

import { extractHighlights, exportToJSON, exportToMarkdown } from 'pdf-annotation-extraction';

async function extractPDFHighlights() {
  // Extract highlights
  const result = await extractHighlights('document.pdf', {
    includeDates: true,
    includeAuthor: true
  });

  console.log(`Found ${result.highlights.length} highlights`);

  // Access individual highlights
  result.highlights.forEach(highlight => {
    console.log(`Page ${highlight.page}: "${highlight.text}"`);
    if (highlight.comment) {
      console.log(`  Note: ${highlight.comment}`);
    }
  });

  // Export to JSON
  await exportToJSON(result, 'output.json', { pretty: true });

  // Export to Markdown
  await exportToMarkdown(result, 'output.md', {
    groupByPage: true,
    includeTOC: true
  });
}

Testing

A test script is provided to quickly test extraction on a PDF:

npx ts-node test-extract.ts path/to/your/annotated.pdf

This will:

  • Extract all highlights from the PDF
  • Display a summary in the terminal
  • Show sample JSON and Markdown output

Output Formats

JSON Output

{
  "sourceFile": "document.pdf",
  "totalPages": 10,
  "highlights": [
    {
      "page": 1,
      "text": "This is the highlighted text content",
      "comment": "Optional annotation comment",
      "color": { "r": 255, "g": 255, "b": 0 },
      "position": { "x": 100, "y": 500, "width": 200, "height": 20 },
      "author": null,
      "modificationDate": "2024-01-15T10:30:00.000Z"
    }
  ],
  "extractedAt": "2024-01-20T15:45:00.000Z"
}
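
For code that consumes the JSON file, the sample above suggests a shape along the following lines. This is a sketch under assumptions: the names `RGBColor`, `Rect`, `Highlight`, `ExtractionResult`, and the `groupByPage` helper are hypothetical and inferred only from the sample; the project's real definitions live in src/types/annotations.ts and may differ, including which fields are optional.

```typescript
// Hypothetical types inferred from the sample JSON above; the project's real
// definitions live in src/types/annotations.ts and may differ.
interface RGBColor { r: number; g: number; b: number; }
interface Rect { x: number; y: number; width: number; height: number; }

interface Highlight {
  page: number;
  text: string;
  comment?: string | null;
  color?: RGBColor;
  position?: Rect;
  author?: string | null;
  modificationDate?: string | null;
}

interface ExtractionResult {
  sourceFile: string;
  totalPages: number;
  highlights: Highlight[];
  extractedAt: string;
}

// Group highlights by page number, preserving their original order within each page.
function groupByPage(result: ExtractionResult): Map<number, Highlight[]> {
  const pages = new Map<number, Highlight[]>();
  for (const h of result.highlights) {
    const bucket = pages.get(h.page) ?? [];
    bucket.push(h);
    pages.set(h.page, bucket);
  }
  return pages;
}
```

A parsed output file can be cast to `ExtractionResult` and passed to `groupByPage` to post-process highlights page by page without re-running the extraction.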

Markdown Output

# PDF Highlights: document.pdf

**Extracted:** 1/20/2024, 3:45:00 PM
**Total Pages:** 10
**Total Highlights:** 5

## Table of Contents

- [Highlight 1](#highlight-1) - "This is the highlighted text..."
- [Highlight 2](#highlight-2) - "Another highlight..."

## Highlights

### Highlight 1

**Page:** 1 | **Color:** Yellow

> This is the highlighted text content

**Note:** Optional annotation comment

---
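
To make the layout above concrete, here is a minimal sketch of how one highlight could be rendered into such a section. The `MdHighlight` type and the `renderHighlight` function are hypothetical and are not the project's markdown exporter. The color label is left out because the README does not describe how RGB values map to names such as "Yellow".

```typescript
// Hypothetical minimal shape; not the project's actual types.
interface MdHighlight {
  page: number;
  text: string;
  comment?: string | null;
}

// Render one highlight as a Markdown section resembling the sample above.
function renderHighlight(h: MdHighlight, index: number): string {
  const lines: string[] = [
    `### Highlight ${index}`,
    "",
    `**Page:** ${h.page}`,
    "",
    // Prefix every line so multi-line highlights stay inside the blockquote.
    ...h.text.split("\n").map((line) => `> ${line}`),
  ];
  if (h.comment) {
    lines.push("", `**Note:** ${h.comment}`);
  }
  // A horizontal rule separates this section from the next one.
  lines.push("", "---");
  return lines.join("\n");
}
```

Calling `renderHighlight` for each highlight and joining the results with blank lines would give the body of the "Highlights" section. The header and table of contents would be built separately.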

Use Cases

  • Research Compilation: Extract highlights from academic papers
  • Document Review: Gather annotated quotes and notes from documents
  • Note-Taking: Automate workflows for managing reading notes
  • AI/LLM Integration: Feed annotated content into AI pipelines
  • Knowledge Management: Create searchable databases of highlighted content

Project Structure

pdf-annotation-extraction/
├── src/
│   ├── types/
│   │   └── annotations.ts       # Type definitions
│   ├── core/
│   │   ├── pdf-loader.ts        # PDF document loading
│   │   ├── annotation-extractor.ts # Main extraction logic
│   │   ├── text-matcher.ts      # Text-to-annotation matching
│   │   └── geometry.ts          # Coordinate utilities
│   ├── exporters/
│   │   ├── json-exporter.ts     # JSON export
│   │   └── markdown-exporter.ts # Markdown export
│   ├── cli.ts                   # CLI entry point
│   └── index.ts                 # Main API exports
├── docs/                        # Documentation
│   └── PROJECT_PLAN.md          # Detailed project plan
├── dist/                        # Compiled JavaScript (after build)
└── test-extract.ts              # Test script

How It Works

  1. PDF Loading: Uses pdf.js to load and parse PDF documents
  2. Annotation Extraction: Retrieves highlight annotations with their quadPoints (geometry)
  3. Text Matching: Matches text items to highlight regions using geometric intersection
  4. Text Assembly: Sorts and joins matched text fragments in reading order
  5. Export: Formats the extracted data into JSON or Markdown
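
Steps 2 to 4 can be sketched as follows. This is an illustration under assumptions, not the code in src/core: the `Box` and `TextBox` shapes, the function names, and the 0.5 overlap threshold are all hypothetical. It assumes quadPoints arrive as a flat array of 8 numbers per quad (four x,y corner pairs, as in the PDF specification) and that each text item has already been reduced to a bounding box in the same coordinate space.

```typescript
// Simplified, hypothetical shapes; pdf.js text items and annotations carry more fields.
interface Box { x: number; y: number; width: number; height: number; }
interface TextBox extends Box { str: string; }

// Convert one quad (8 numbers: four x,y corner pairs) to its axis-aligned bounding box.
function quadToBox(quad: number[]): Box {
  const xs = [quad[0], quad[2], quad[4], quad[6]];
  const ys = [quad[1], quad[3], quad[5], quad[7]];
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return { x: minX, y: minY, width: Math.max(...xs) - minX, height: Math.max(...ys) - minY };
}

// Fraction of the text box's area that lies inside the highlight region (0..1).
function overlapRatio(text: Box, region: Box): number {
  const w = Math.min(text.x + text.width, region.x + region.width) - Math.max(text.x, region.x);
  const h = Math.min(text.y + text.height, region.y + region.height) - Math.max(text.y, region.y);
  const area = text.width * text.height;
  return w > 0 && h > 0 && area > 0 ? (w * h) / area : 0;
}

// Collect text items that overlap any quad, then join them in reading order.
// PDF user space has y increasing upward, so a higher y sorts first.
function textForHighlight(quadPoints: number[], items: TextBox[], minOverlap = 0.5): string {
  const regions: Box[] = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    regions.push(quadToBox(quadPoints.slice(i, i + 8)));
  }
  return items
    .filter((item) => regions.some((r) => overlapRatio(item, r) >= minOverlap))
    // Items on roughly the same line sort left to right; otherwise top to bottom.
    .sort((a, b) => (Math.abs(a.y - b.y) > a.height / 2 ? b.y - a.y : a.x - b.x))
    .map((item) => item.str)
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}
```

Because this sketch matches whole text items, a highlight that covers only part of an item still returns the full item. The top-to-bottom, left-to-right sort is also why rotated text and multi-column layouts can come out in the wrong order.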

Limitations

  • Currently supports only highlight annotations (not underlines, strikethroughs, etc.)
  • Assumes horizontal, left-to-right text (rotated text may not order correctly)
  • Multi-column layouts may produce mixed text order
  • Requires standard PDF annotations (works with Edge, Adobe, etc.)

Development

# Build the project
npm run build

# Run in development mode
npm run dev <pdf-file>

# Run tests
npm test  # (not yet implemented)

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT

Acknowledgments

Future Enhancements

  • Support for other annotation types (underline, strikethrough, freehand)
  • Batch processing with parallel execution
  • Integration with note-taking apps (Obsidian)
  • Advanced filtering (by color, date, author)
  • Export to additional formats (CSV, HTML)
