bogerman1/docharbor

DocHarbor: The Universal Document Tool for AI Agents

Portable document intelligence for local corpora, agent workflows, and evidence-backed answers.

Chinese Documentation | Operator Manual | Architecture | Ask Retrieval Benchmark

DocHarbor banner

Overview

DocHarbor is a Windows-first, agent-oriented document pipeline for local corpora. It is designed for cases where an LLM should not guess: engineering bids, technical packs, PDFs with extraction stages, Office files that need format-preserving delivery translation, and large mixed folders that need auditable routing and indexing.

The backend package and CLI are named doc-agent. DocHarbor is the product and repository name.

Core idea:

  • inventory the corpus
  • route each file to the right parser
  • normalize outputs into a stable internal contract
  • build retrieval indexes
  • answer with evidence instead of synthesis-only guesses
  • expose the same workflow through CLI, MCP, and thin agent adapters
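The routing step in the list above can be pictured as a lookup from file type to parser stage. This is a minimal sketch, not DocHarbor's actual router: the stage names mirror CLI commands shown later in this README, the Office and legacy Office extensions come from this README, and the native-text extensions (`.txt`, `.md`) and the `route` and `build_routing` helpers are assumptions made for illustration.

```python
from pathlib import Path

# Stage names mirror CLI commands in this README; the extension sets are assumptions.
NATIVE_TEXT = {".txt", ".md"}               # assumed examples of native text
OFFICE = {".docx", ".xlsx", ".pptx"}        # formats named for delivery translation
LEGACY_OFFICE = {".doc", ".ppt", ".xlsm"}   # formats listed in the legacy Office note


def route(path: str) -> str:
    """Return the parser stage a file would be sent to, or 'unrouted'."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "mineru"
    if suffix in OFFICE:
        return "parse-office"
    if suffix in LEGACY_OFFICE:
        return "parse-office-convert"
    if suffix in NATIVE_TEXT:
        return "parse-native"
    return "unrouted"


def build_routing(paths: list[str]) -> dict[str, str]:
    """Map every path to its stage so the parser choice is explicit and auditable."""
    return {p: route(p) for p in paths}
```

Writing the result of `build_routing` to disk is one way to get a reviewable record of which parser handled which file, which is the point of an explicit routing manifest.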

Why DocHarbor

Most local-document workflows fail in one of three ways:

  • the agent opens files ad hoc and loses repeatability
  • the parser choice is implicit and impossible to audit
  • translation and question answering are detached from the indexed evidence

DocHarbor addresses those failure modes directly:

  • explicit inventory and routing manifests
  • parser-specific stages for native text, Office, and PDF
  • stable normalized artifacts for indexing and downstream tooling
  • preserve-format delivery translation for Office files
  • constrained PDF delivery translation with configurable skip policy
  • MCP-first integration for local desktop and IDE agents

Screenshots

Product banner

DocHarbor banner

Pipeline overview

DocHarbor workflow

Integration architecture

DocHarbor architecture

What It Does

  • Inventories large local folder trees.
  • Routes files to the right parser based on type and workflow constraints.
  • Uses MinerU v4 for production PDF extraction.
  • Parses native text and Office files locally.
  • Supports a legacy Office conversion path through LibreOffice for older formats.
  • Normalizes parser outputs into a stable internal contract.
  • Builds retrieval indexes for grounded answers.
  • Builds Agent-facing document maps for large project routing.
  • Supports Agent query planning with safe synonym, unit, section, bilingual, and positive/negative retrieval hints.
  • Supports JSON, SQLite FTS5 sidecar, and hybrid answer retrieval backends.
  • Creates translated normalized variants for retrieval.
  • Creates preserve-format delivery translations for .docx, .xlsx, and .pptx.
  • Creates constrained preserve-layout PDF delivery translations through overlay rendering.
  • Exposes the same backend through CLI, MCP, and thin agent adapters.
  • Includes a local browser UI adapter (doc-agent web) that wraps the existing CLI backend with job polling and artifact download APIs.
  • Installs a native OpenClaw plugin for path retrieval, translation, answer, and status.

Current Capability Matrix

Area | Current State
Local text corpora | Supported
PDF parsing | MinerU v4-based pipeline
Office parsing | markitdown plus legacy Office conversion path
Delivery translation: DOCX | Supported
Delivery translation: XLSX | Supported
Delivery translation: PPTX | Supported
Delivery translation: PDF | Supported with configurable skip policy
Retrieval translation | Supported via normalized translated variants
Local web UI adapter | Supported via doc-agent web
MCP server | Supported
Codex / Claude Code / Cursor / Windsurf adapters | Supported
OpenClaw native plugin | Supported

Workflow

DocHarbor’s normal operating model is:

  1. Create or choose a project.
  2. Inventory the source tree.
  3. Parse local-native and Office content.
  4. Queue and process PDF-heavy documents through MinerU when needed.
  5. Normalize artifacts.
  6. Build an index.
  7. Ask questions against indexed evidence.
  8. Optionally produce translated retrieval variants, delivery artifacts, or both.

End-to-End Pipeline

flowchart LR
    A["Source Folder / File"] --> B["inventory"]
    B --> C["routing.json"]
    C --> D["parse-native / parse-office / parse-office-convert"]
    C --> E["queue -> mineru-submit -> mineru-poll -> normalize-mineru"]
    D --> F["normalized artifacts"]
    E --> F
    F --> G["build-index"]
    G --> H["answer / ask-path"]
    F --> I["translate"]
    I --> J["translated normalized variants"]
    I --> K["delivery artifacts (.docx/.xlsx/.pptx/.pdf)"]
Architecture

DocHarbor is intentionally layered:

  1. doc-agent CLI as the canonical backend
  2. docharbor MCP server for MCP-capable clients
  3. thin adapter files, skills, rules, and native commands for agent clients

If a client supports MCP, use MCP first. Shell wrappers and agent-specific instructions are fallback integration surfaces, not the core backend.

Install

Option 1. One-Click Windows Install From Git

This is the recommended source-install path.

Minimum requirement:

  • Python 3.11 or newer

Run from the repo root:

.\one-click-installation.bat

What it does:

  • bootstraps .\.venv if needed
  • installs DocHarbor into that local virtual environment
  • runs the guided setup wizard
  • writes or updates .env
  • auto-detects LibreOffice and ODA File Converter
  • can install supported external tools with winget
  • installs agent adapters
  • installs MCP config for supported clients
  • can install the native OpenClaw plugin

Compatibility note:

  • doc-agent-setup.bat forwards to one-click-installation.bat
  • helper batch files have been moved under tools/windows/ so the repo root stays clean
  • those helper scripts are for manual or partial setup, not for first-time onboarding

Installer Helper Batches

  • tools/windows/doc-agent-bootstrap.bat
  • tools/windows/doc-agent-env-setup.bat
  • tools/windows/doc-agent-install-agents.bat
  • tools/windows/doc-agent-install-mcp.bat
  • tools/windows/doc-agent-openclaw-setup.bat

Option 2. Manual Developer Install

Install the package yourself if you want explicit control over dependency groups.

Minimal development path:

python -m pip install -e ".[inventory,office]"

Full translation-capable path:

python -m pip install -e ".[full]"

Recommended verification:

doc-agent setup-env
doc-agent doctor --format json

Local Web Adapter

Run the local browser UI and REST-style adapter over the existing CLI backend:

doc-agent web --host 127.0.0.1 --port 8799 --open-browser

The web adapter is a thin local shell around DocHarbor commands. It starts asynchronous translation jobs, polls logs/status, exposes detected artifacts for download, and wires glossary/PDF review actions back to the same CLI contracts.

Option 3. Portable Windows Bundle

Build a user-friendly release bundle:

doc-agent build-portable --mode self-contained --profile lite --build-wheelhouse
powershell -ExecutionPolicy Bypass -File .\scripts\package_release.ps1 -Mode self-contained -Profile lite -BuildWheelhouse

Portable bundle characteristics:

  • contains doc-agent.exe
  • bootstraps an embedded Python runtime on first use
  • installs from the bundled wheelhouse
  • keeps runtime data in .\doc-agent-home

Quick Start

Standard Corpus Workflow

doc-agent init-project --project sample --project-root ".\proj" --source-root "D:\docs\sample"
doc-agent inventory --project sample --project-root ".\proj" --source-root "D:\docs\sample"

doc-agent parse-native --project sample --project-root ".\proj"
doc-agent parse-office --project sample --project-root ".\proj"
doc-agent parse-office-convert --project sample --project-root ".\proj"

doc-agent queue --project sample --project-root ".\proj" --priority core --priority high
doc-agent mineru-submit --project sample --project-root ".\proj"
doc-agent mineru-poll --project sample --project-root ".\proj" --wait --download
doc-agent normalize-mineru --project sample --project-root ".\proj"

doc-agent build-index --project sample --project-root ".\proj"
doc-agent answer --project sample --project-root ".\proj" --question "What changed?" --format json --no-write
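The corpus-preparation commands above can also be driven from a script. The following is a rough sketch that only assembles the argument lists already shown in this section; `pipeline_commands` and `run_all` are hypothetical helper names, and the dry-run default is a design choice of this example rather than a DocHarbor feature.

```python
import subprocess


def pipeline_commands(project: str, project_root: str, source_root: str) -> list[list[str]]:
    """Build the inventory-to-index command sequence from the Quick Start."""
    common = ["--project", project, "--project-root", project_root]
    src = ["--source-root", source_root]
    return [
        ["doc-agent", "init-project", *common, *src],
        ["doc-agent", "inventory", *common, *src],
        ["doc-agent", "parse-native", *common],
        ["doc-agent", "parse-office", *common],
        ["doc-agent", "parse-office-convert", *common],
        ["doc-agent", "queue", *common, "--priority", "core", "--priority", "high"],
        ["doc-agent", "mineru-submit", *common],
        ["doc-agent", "mineru-poll", *common, "--wait", "--download"],
        ["doc-agent", "normalize-mineru", *common],
        ["doc-agent", "build-index", *common],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing stage.
            subprocess.run(argv, check=True)
```

Calling `run_all(pipeline_commands("sample", r".\proj", r"D:\docs\sample"))` prints the sequence without running anything, which is a convenient way to review the stages before committing to a long MinerU run.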

One-Off Path Workflow

Use this when the user gives a direct file or folder path:

doc-agent ask-path --source-path "D:\docs\sample\spec.pdf" --question "What does this say?" --format json --no-write
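An agent or script that calls `ask-path` will usually want the JSON result as a data structure. The sketch below wraps the command shown above; it assumes that `--format json` writes a single JSON document to stdout, and it makes no assumption about the fields inside that document. `ask_path_argv`, `ask_path`, and the injectable `runner` parameter are illustrative names, not part of DocHarbor.

```python
import json
import subprocess
from typing import Any, Callable


def ask_path_argv(source_path: str, question: str) -> list[str]:
    """Build the ask-path command line shown in this README."""
    return [
        "doc-agent", "ask-path",
        "--source-path", source_path,
        "--question", question,
        "--format", "json",
        "--no-write",
    ]


def _run(argv: list[str]) -> str:
    """Execute a command and return its stdout; raises if the command fails."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def ask_path(source_path: str, question: str,
             runner: Callable[[list[str]], str] = _run) -> Any:
    """Run ask-path and decode stdout as JSON (assumption: stdout is pure JSON)."""
    return json.loads(runner(ask_path_argv(source_path, question)))
```

Passing a fake `runner` makes the wrapper testable without an installed `doc-agent`.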

Translation Workflow

DocHarbor supports three translation output modes:

Mode | What It Produces
normalized | translated normalized variants for retrieval/indexing
delivery | translated openable source-format artifacts
both | both translated retrieval artifacts and delivery artifacts

DocHarbor also supports two translation process modes:

Mode | Behavior
rough | direct translation with no glossary review step
precise | extract a project glossary from normalized source text, pause for human approval, then rerun translation with the approved glossary

Output style is a separate axis:

Style | Behavior
target_only | write only the translated text
bilingual | write source plus translation in the delivery artifact

Current output-style constraints:

  • bilingual is supported in delivery mode only
  • bilingual delivery is currently intended for Office preserve-format output (.docx, .xlsx, .pptx)
  • PDF and DXF delivery use target_only
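The constraints above can be checked before a request is submitted. This is a minimal sketch of such a pre-flight check, not DocHarbor's own validation: `style_problems` is a hypothetical helper, and it reads "delivery mode only" strictly, so the `both` output mode is also flagged for bilingual. That strict reading is an assumption.

```python
from pathlib import Path

OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx"}
KNOWN_STYLES = {"target_only", "bilingual"}


def style_problems(output_mode: str, style: str, source_path: str) -> list[str]:
    """Return human-readable problems with a mode/style/file combination."""
    problems = []
    if style not in KNOWN_STYLES:
        problems.append(f"unknown style: {style}")
    if style == "bilingual":
        # Assumption: only the exact mode "delivery" qualifies, not "both".
        if output_mode != "delivery":
            problems.append("bilingual is supported in delivery mode only")
        suffix = Path(source_path).suffix.lower()
        if suffix not in OFFICE_SUFFIXES:
            problems.append("bilingual delivery is intended for .docx, .xlsx, and .pptx")
    return problems
```

An empty list means the combination is consistent with the constraints as listed here; PDF and DXF sources pass only with `target_only`.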

Translate an Existing Project

doc-agent translate --project sample --project-root ".\proj" --target-lang en --output-mode both --format json

Translate a Direct File or Folder Path

doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --format json

Precise Translation with Human Review

precise is a two-pass workflow:

  1. Run precise translation once. If DocHarbor finds new glossary terms, it updates the Pending Review section inside glossary.md and exits with approval_required.
  2. Review or edit the pending glossary terms in glossary.md.
  3. Approve the glossary.
  4. Rerun the same translation command. DocHarbor uses the approved glossary and completes translation.

CLI example:

doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --translation-mode precise --format json
doc-agent glossary-status --project <project-slug> --target-lang zh --format json
doc-agent glossary-approve --project <project-slug> --target-lang zh --format json
doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --translation-mode precise --project <project-slug> --format json

If you want the precise workflow to reuse the same approved glossary, keep the same project slug on reruns.
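The two-pass flow can be expressed as a small control loop. In this sketch, `translate` and `approve` are caller-supplied functions: `translate` would wrap the `translate-path` command in precise mode and return a status string, and `approve` would run after a human has reviewed glossary.md and would wrap `glossary-approve`. Only the `approval_required` status comes from this README; how it is surfaced in the JSON output, and any other status value such as `completed`, are assumptions.

```python
from typing import Callable


def run_precise(translate: Callable[[], str],
                approve: Callable[[], None],
                max_passes: int = 2) -> str:
    """Translate, approve the glossary when asked, then translate again."""
    status = translate()
    passes = 1
    while status == "approval_required" and passes < max_passes:
        approve()            # must only run after human review of glossary.md
        status = translate()
        passes += 1
    return status
```

The `max_passes` cap keeps the loop from spinning if approval keeps being requested, for example when the rerun uses a different project slug and therefore a different glossary.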

Preserve Original Office Format

doc-agent translate-path --source-path "D:\docs\deck.pptx" --target-lang zh --output-mode delivery --format json

PDF Delivery Skip Policy

PDF delivery translation exposes block-level skip control through --pdf-skip-block-types:

doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types default --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types none --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types table,header --format json

Semantics:

  • omitted or default = built-in policy (header,footer)
  • none = skip nothing
  • explicit list = comma-separated block types

Allowed values:

  • table
  • text
  • title
  • list
  • aside_text
  • page_footnote
  • header
  • footer
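The semantics and allowed values above amount to a small parsing rule. The sketch below restates them in Python for clarity; it is not DocHarbor's parser. `parse_skip_block_types` is a hypothetical name, and treating an empty string the same as an omitted value is an assumption.

```python
from typing import Optional

ALLOWED_BLOCK_TYPES = (
    "table", "text", "title", "list",
    "aside_text", "page_footnote", "header", "footer",
)
DEFAULT_SKIP = ("header", "footer")


def parse_skip_block_types(value: Optional[str]) -> tuple[str, ...]:
    """Resolve a --pdf-skip-block-types value to the block types to skip."""
    if value is None or value.strip() in ("", "default"):
        return DEFAULT_SKIP          # omitted or "default": built-in policy
    if value.strip() == "none":
        return ()                    # "none": skip nothing
    items = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [item for item in items if item not in ALLOWED_BLOCK_TYPES]
    if unknown:
        raise ValueError(f"unsupported block types: {', '.join(unknown)}")
    # Drop duplicates while keeping the order the user wrote.
    return tuple(dict.fromkeys(items))
```

For example, `parse_skip_block_types("table,header")` resolves to `("table", "header")`, while an unlisted type raises `ValueError` instead of being silently ignored.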

Agent Integrations

MCP Server

Run the MCP server directly:

doc-agent mcp-serve --transport stdio

Equivalent entrypoint:

doc-agent-mcp --transport stdio

Install MCP config:

doc-agent install-mcp-config --agent claude --agent cursor --agent windsurf --workspace-root ".\workspace"

Supported MCP-first clients:

  • Claude Code
  • Cursor
  • Windsurf

Adapter Install

Install thin adapters:

doc-agent install-adapters --agent codex --agent claude --agent cursor --agent windsurf --agent openclaw --workspace-root ".\workspace"

OpenClaw

OpenClaw is not just a prompt-level integration. DocHarbor ships a native plugin path:

  • helper install: .\tools\windows\doc-agent-openclaw-setup.bat
  • native tools:
    • docharbor_ask_path
    • docharbor_translate_path
    • docharbor_glossary_status
    • docharbor_glossary_approve
    • docharbor_answer
    • docharbor_status
  • slash command:
    • /doctranslatepath is guidance-only for write operations and points users to the tool path

OpenClaw should prefer the native plugin tools over ad hoc shell extraction.

Recommended OpenClaw translation workflow:

  1. Use docharbor_translate_path for all translations.
  2. For translationMode=precise, stop if the tool returns approval_required.
  3. Review pending terms with docharbor_glossary_status.
  4. Approve them with docharbor_glossary_approve.
  5. Rerun docharbor_translate_path with the same project and translationMode=precise.

Do not use Exec, raw CLI commands, or /doctranslatepath to test the human-in-the-loop (HITL) translation flow.

Client Notes

Client | Preferred Integration
Codex | skill + CLI path in this repo
Claude Code | MCP first, skill/commands as thin guidance
Cursor | MCP first
Windsurf | MCP first
OpenClaw | native plugin first

Project Layout

Typical repository/project directories:

DocHarbor/
├─ src/doc_agent/              # canonical backend package
├─ scripts/                    # pipeline scripts and helpers
├─ docs/                       # manuals, architecture docs, and static media
├─ docs/media/                 # README images and demo assets
├─ tools/windows/              # helper batch scripts
├─ openclaw-plugin/            # native OpenClaw plugin
├─ launcher/                   # portable launcher assets
├─ proj/                       # generated project artifacts
│  └─ <project>/
│     ├─ manifests/
│     ├─ parsed/
│     ├─ normalized/
│     ├─ index/
│     ├─ source/derived/
│     └─ logs/
├─ doc-agent-setup.bat         # compatibility wrapper
└─ one-click-installation.bat  # main Windows entrypoint
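A quick way to see how far a project has progressed is to check which of the per-project directories in the tree above exist. This is an illustrative sketch only: `missing_project_dirs` is a hypothetical helper, and it assumes the layout shown above, with project directories sitting directly under the project root.

```python
from pathlib import Path

# Subdirectories listed under proj/<project>/ in the layout above.
PROJECT_SUBDIRS = ("manifests", "parsed", "normalized", "index", "source/derived", "logs")


def missing_project_dirs(project_root: str, project: str) -> list[str]:
    """Return the expected project subdirectories that do not exist yet."""
    base = Path(project_root) / project
    return [sub for sub in PROJECT_SUBDIRS if not (base / sub).is_dir()]
```

A missing `index` directory, for instance, suggests that `build-index` has not been run for that project.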

Environment Notes

Preview analysis is opt-in. DocHarbor will not call external vision providers unless DOC_AGENT_PREVIEW_ANALYSIS_PROVIDER is set.

Legacy Office note:

  • .doc, .ppt, and .xlsm require LibreOffice for the pre-conversion stage
  • set DOC_AGENT_LIBREOFFICE_BIN manually only if auto-detection fails
  • confirm with doc-agent doctor --format json

Translation note:

  • supported providers in the current implementation are OpenAI-compatible APIs and Google/Gemini
  • leave DOC_AGENT_TRANSLATE_TARGET_LANG blank if you want per-request --target-lang
  • structured JSON glossary files enable terminology validation and retry
  • inline glossary text is guidance only

Troubleshooting

Use doctor first:

doc-agent doctor --format json

Common checks:

  • Python version
  • requests
  • PyPDF2
  • openpyxl
  • python-docx
  • python-pptx
  • PyMuPDF
  • markitdown
  • dotenv
  • ezdxf
  • matplotlib
  • MinerU API token
  • LibreOffice
  • translation provider config
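When `doc-agent` itself will not start, the Python package checks in the list above can be approximated by hand. The sketch below is independent of `doctor` and does not reproduce its logic; it only tests whether each module can be located in the current environment. Note that some import names differ from the package names (`python-docx` imports as `docx`, `python-pptx` as `pptx`, PyMuPDF as `fitz`). `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Package name as listed above -> module name used at import time.
MODULES = {
    "requests": "requests",
    "PyPDF2": "PyPDF2",
    "openpyxl": "openpyxl",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "PyMuPDF": "fitz",
    "markitdown": "markitdown",
    "dotenv": "dotenv",
    "ezdxf": "ezdxf",
    "matplotlib": "matplotlib",
}


def missing_packages(modules: dict[str, str] = MODULES) -> list[str]:
    """Return package names whose module cannot be found, without importing it."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    print("missing:", missing_packages() or "none")
```

Run it with the same interpreter that DocHarbor uses (for the one-click install, the one inside `.\.venv`), otherwise the result describes a different environment.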

Documentation

Repository Scope

This repository intentionally excludes:

  • private corpora
  • generated project artifacts
  • local SDK caches
  • staging workspaces
  • secrets and tokens

Security Notes

  • do not commit .env
  • do not commit real project corpora under proj/
  • keep MinerU and model provider credentials in environment variables or local .env
  • prefer doc-agent setup-env or tools/windows/doc-agent-env-setup.bat over ad hoc environment editing on new machines

Release Files

Portable Windows release assets are produced under:

  • dist/release/

Recommended first-download artifact:

  • DocHarbor-windows-x64-lite-v0.1.0.zip

License

MIT
