A TypeScript-based tool for extracting highlight annotations from PDF files and exporting them to structured formats like JSON and Markdown.
This project extracts highlighted text and associated annotations from PDF files, particularly those annotated in Microsoft Edge or other PDF readers that follow the PDF specification. The extracted data can be exported to JSON or Markdown for further processing, note-taking, or integration with AI/LLM pipelines.
✅ Implemented:
- Extract highlight annotations from PDF files
- Capture the actual highlighted text (not just annotation metadata)
- Export to JSON and Markdown formats
- Command-line interface (CLI)
- Programmatic API for integration into other projects
- Preserve annotation metadata:
  - Page numbers
  - Highlighted text content
  - Annotation comments/notes
  - Color information
  - Author and modification dates
  - Position information
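The metadata listed above can be pictured as a small set of TypeScript types. The following is only a sketch inferred from the sample JSON output in this README; the actual definitions live in `src/types/annotations.ts` and may differ. The names `RGBColor`, `Rect`, `Highlight`, `ExtractionResult`, and `isHighlight` are illustrative assumptions.

```typescript
// Hypothetical shapes inferred from the sample JSON output in this README.
// The project's real types (src/types/annotations.ts) may differ.
interface RGBColor {
  r: number;
  g: number;
  b: number;
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Highlight {
  page: number; // page number as shown in the sample output
  text: string; // highlighted text content
  comment?: string | null; // annotation comment/note, if any
  color?: RGBColor;
  position?: Rect;
  author?: string | null;
  modificationDate?: string | null; // ISO 8601 timestamp in the sample output
}

interface ExtractionResult {
  sourceFile: string;
  totalPages: number;
  highlights: Highlight[];
  extractedAt: string;
}

// Minimal runtime guard for data read back from an exported JSON file.
// It checks only the two fields every sample highlight carries.
function isHighlight(value: unknown): value is Highlight {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.page === "number" && typeof v.text === "string";
}
```

A guard like this is useful when a downstream script parses an exported file and cannot rely on compile-time types.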
- Runtime: Node.js
- Language: TypeScript
- Core PDF Engine: pdf.js (pdfjs-dist)
- CLI: Commander.js
- Output Formatting: Chalk for colorized terminal output
# Clone the repository
git clone https://github.com/MikeORed/pdf-annotation-extraction.git
cd pdf-annotation-extraction
# Install dependencies
npm install
# Build the project
npm run build

After building, you can use the CLI tool:
# Basic extraction to JSON (default)
npx ts-node src/cli.ts document.pdf
# Or using the dev script
npm run dev document.pdf
# Extract to Markdown (the "--" stops npm from consuming the flags itself)
npm run dev -- document.pdf --format markdown
# Specify output file
npm run dev -- document.pdf --output my-notes.json
# Group highlights by page (Markdown only)
npm run dev -- document.pdf --format markdown --group-by-page
# Include additional metadata
npm run dev -- document.pdf --include-dates --include-authors
# Get help
npm run dev -- --help

- `-o, --output <path>` - Output file path (auto-generated if not specified)
- `-f, --format <format>` - Output format: `json` or `markdown` (default: `json`)
- `--no-colors` - Exclude color information from output
- `--no-toc` - Exclude table of contents (Markdown only)
- `--group-by-page` - Group highlights by page (Markdown only)
- `--include-dates` - Include modification dates
- `--include-authors` - Include author information
- `--include-raw` - Include raw annotation data (JSON only)
- `--pretty` - Pretty-print JSON output (default: true)
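To illustrate what grouping highlights by page could look like, here is a minimal sketch. It is not the project's Markdown exporter: `PageHighlight`, `groupByPage`, `renderGroupedMarkdown`, and the `## Page N` heading layout are all assumptions made for this example.

```typescript
// Minimal stand-in for a highlight record (hypothetical, for illustration only).
interface PageHighlight {
  page: number;
  text: string;
  comment?: string;
}

// Bucket highlights by page number and return the buckets in ascending page order.
function groupByPage(
  highlights: PageHighlight[]
): { page: number; items: PageHighlight[] }[] {
  const buckets: Record<number, PageHighlight[]> = {};
  for (const h of highlights) {
    (buckets[h.page] = buckets[h.page] || []).push(h);
  }
  return Object.keys(buckets)
    .map(Number)
    .sort((a, b) => a - b)
    .map((page) => ({ page, items: buckets[page] }));
}

// Render one heading per page, with each highlight as a blockquote.
// The heading format is an assumption, not the tool's documented output.
function renderGroupedMarkdown(highlights: PageHighlight[]): string {
  const lines: string[] = [];
  for (const group of groupByPage(highlights)) {
    lines.push(`## Page ${group.page}`, "");
    for (const h of group.items) {
      lines.push(`> ${h.text}`, "");
      if (h.comment) lines.push(`**Note:** ${h.comment}`, "");
    }
  }
  // Drop the trailing blank line.
  return lines.join("\n").replace(/\s+$/, "");
}
```

Within each page the original order of the highlights is preserved, so any reading-order sorting done during extraction carries through.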
import { extractHighlights, exportToJSON, exportToMarkdown } from 'pdf-annotation-extraction';
async function extractPDFHighlights() {
// Extract highlights
const result = await extractHighlights('document.pdf', {
includeDates: true,
includeAuthor: true
});
console.log(`Found ${result.highlights.length} highlights`);
// Access individual highlights
result.highlights.forEach(highlight => {
console.log(`Page ${highlight.page}: "${highlight.text}"`);
if (highlight.comment) {
console.log(` Note: ${highlight.comment}`);
}
});
// Export to JSON
await exportToJSON(result, 'output.json', { pretty: true });
// Export to Markdown
await exportToMarkdown(result, 'output.md', {
groupByPage: true,
includeTOC: true
});
}

A test script is provided to quickly test extraction on a PDF:
npx ts-node test-extract.ts path/to/your/annotated.pdf

This will:
- Extract all highlights from the PDF
- Display a summary in the terminal
- Show sample JSON and Markdown output
{
"sourceFile": "document.pdf",
"totalPages": 10,
"highlights": [
{
"page": 1,
"text": "This is the highlighted text content",
"comment": "Optional annotation comment",
"color": { "r": 255, "g": 255, "b": 0 },
"position": { "x": 100, "y": 500, "width": 200, "height": 20 },
"author": null,
"modificationDate": "2024-01-15T10:30:00.000Z"
}
],
"extractedAt": "2024-01-20T15:45:00.000Z"
}

# PDF Highlights: document.pdf
**Extracted:** 1/20/2024, 3:45:00 PM
**Total Pages:** 10
**Total Highlights:** 5
## Table of Contents
- [Highlight 1](#highlight-1) - "This is the highlighted text..."
- [Highlight 2](#highlight-2) - "Another highlight..."
## Highlights
### Highlight 1
**Page:** 1 | **Color:** Yellow
> This is the highlighted text content
**Note:** Optional annotation comment
---

- Research Compilation: Extract highlights from academic papers
- Document Review: Gather annotated quotes and notes from documents
- Note-Taking: Automate workflows for managing reading notes
- AI/LLM Integration: Feed annotated content into AI pipelines
- Knowledge Management: Create searchable databases of highlighted content
pdf-annotation-extraction/
├── src/
│ ├── types/
│ │ └── annotations.ts # Type definitions
│ ├── core/
│ │ ├── pdf-loader.ts # PDF document loading
│ │ ├── annotation-extractor.ts # Main extraction logic
│ │ ├── text-matcher.ts # Text-to-annotation matching
│ │ └── geometry.ts # Coordinate utilities
│ ├── exporters/
│ │ ├── json-exporter.ts # JSON export
│ │ └── markdown-exporter.ts # Markdown export
│ ├── cli.ts # CLI entry point
│ └── index.ts # Main API exports
├── docs/ # Documentation
│ └── PROJECT_PLAN.md # Detailed project plan
├── dist/ # Compiled JavaScript (after build)
└── test-extract.ts # Test script
- PDF Loading: Uses pdf.js to load and parse PDF documents
- Annotation Extraction: Retrieves highlight annotations with their quadPoints (geometry)
- Text Matching: Matches text items to highlight regions using geometric intersection
- Text Assembly: Sorts and joins matched text fragments in reading order
- Export: Formats the extracted data into JSON or Markdown
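The text matching and text assembly steps above can be sketched as follows. This is a simplified illustration, not the code in `text-matcher.ts` or `geometry.ts`. It assumes quadPoints arrive as a flat array of eight numbers per quadrilateral (some pdf.js versions expose a different shape), that text items already carry bounding boxes in PDF user space (origin at the bottom left, y increasing upward), and that an item counts as highlighted when at least half of its area is covered. All names, the `minOverlap` threshold, and the `lineTolerance` value are hypothetical.

```typescript
// Axis-aligned box in PDF user space (y grows upward).
interface Box {
  left: number;
  bottom: number;
  right: number;
  top: number;
}

interface TextItem {
  str: string;
  box: Box;
}

// Convert a flat quadPoints array (8 numbers per quad) into bounding boxes.
function quadPointsToBoxes(quadPoints: number[]): Box[] {
  const boxes: Box[] = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    const xs = [quadPoints[i], quadPoints[i + 2], quadPoints[i + 4], quadPoints[i + 6]];
    const ys = [quadPoints[i + 1], quadPoints[i + 3], quadPoints[i + 5], quadPoints[i + 7]];
    boxes.push({
      left: Math.min(...xs),
      bottom: Math.min(...ys),
      right: Math.max(...xs),
      top: Math.max(...ys),
    });
  }
  return boxes;
}

// Fraction of the text item's area that lies inside the highlight region.
function overlapRatio(item: Box, region: Box): number {
  const w = Math.min(item.right, region.right) - Math.max(item.left, region.left);
  const h = Math.min(item.top, region.top) - Math.max(item.bottom, region.bottom);
  if (w <= 0 || h <= 0) return 0;
  const area = (item.right - item.left) * (item.top - item.bottom);
  return area > 0 ? (w * h) / area : 0;
}

// Keep items that overlap any highlight quad, then sort top-to-bottom and
// left-to-right. Items whose tops differ by at most lineTolerance are
// treated as being on the same line.
function extractHighlightedText(
  items: TextItem[],
  quadPoints: number[],
  minOverlap = 0.5,
  lineTolerance = 2
): string {
  const regions = quadPointsToBoxes(quadPoints);
  const matched = items.filter((item) =>
    regions.some((r) => overlapRatio(item.box, r) >= minOverlap)
  );
  matched.sort((a, b) => {
    const dy = b.box.top - a.box.top; // larger y means higher on the page
    return Math.abs(dy) > lineTolerance ? dy : a.box.left - b.box.left;
  });
  return matched.map((m) => m.str).join(" ");
}
```

Sorting purely by vertical position and then by x is also why multi-column layouts can interleave text: two columns at the same height look like one long line to this kind of ordering.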
- Currently supports only highlight annotations (not underlines, strikethroughs, etc.)
- Assumes horizontal, left-to-right text (rotated text may not order correctly)
- Multi-column layouts may produce mixed text order
- Requires standard PDF annotations (works with Edge, Adobe, etc.)
# Build the project
npm run build
# Run in development mode
npm run dev <pdf-file>
# Run tests
npm test # (not yet implemented)

Contributions are welcome! Please feel free to submit issues or pull requests.
License: MIT
- Built with pdf.js by Mozilla
- Uses Commander.js for CLI
- Terminal colors by Chalk
- Support for other annotation types (underline, strikethrough, freehand)
- Batch processing with parallel execution
- Integration with note-taking apps (Obsidian)
- Advanced filtering (by color, date, author)
- Export to additional formats (CSV, HTML)
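Until filtering is built in, it can be approximated on the consumer side by post-processing the extracted highlights. The sketch below is an assumption-laden illustration, not a planned API: `FilterableHighlight` mirrors the fields in the sample JSON output, and `HighlightFilter`, `colorsMatch`, and `filterHighlights` are hypothetical names.

```typescript
interface Rgb {
  r: number;
  g: number;
  b: number;
}

// Mirrors the fields of the sample JSON output (hypothetical type).
interface FilterableHighlight {
  page: number;
  text: string;
  color?: Rgb;
  author?: string | null;
  modificationDate?: string | null;
}

// Every criterion is optional; a highlight must satisfy all that are set.
interface HighlightFilter {
  author?: string;
  modifiedAfter?: Date;
  color?: Rgb;
  colorTolerance?: number; // maximum per-channel difference, default 0
}

function colorsMatch(a: Rgb, b: Rgb, tolerance: number): boolean {
  return (
    Math.abs(a.r - b.r) <= tolerance &&
    Math.abs(a.g - b.g) <= tolerance &&
    Math.abs(a.b - b.b) <= tolerance
  );
}

function filterHighlights(
  highlights: FilterableHighlight[],
  filter: HighlightFilter
): FilterableHighlight[] {
  return highlights.filter((h) => {
    if (filter.author !== undefined && h.author !== filter.author) return false;
    if (filter.modifiedAfter !== undefined) {
      // Highlights without a parseable date are excluded by a date filter.
      const t = h.modificationDate ? Date.parse(h.modificationDate) : NaN;
      if (isNaN(t) || t <= filter.modifiedAfter.getTime()) return false;
    }
    if (filter.color !== undefined) {
      const tolerance = filter.colorTolerance ?? 0;
      if (!h.color || !colorsMatch(h.color, filter.color, tolerance)) return false;
    }
    return true;
  });
}
```

A per-channel tolerance is included because different PDF readers may store slightly different RGB values for what a user perceives as the same highlight color.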