Portable document intelligence for local corpora, agent workflows, and evidence-backed answers.
Chinese documentation | Operator Manual | Architecture | Ask Retrieval Benchmark
DocHarbor is a Windows-first, agent-oriented document pipeline for local corpora. It is designed for cases where an LLM should not guess: engineering bids, technical packs, PDFs with extraction stages, Office files that need format-preserving delivery translation, and large mixed folders that need auditable routing and indexing.
The backend package and CLI are named doc-agent. DocHarbor is the product and repository name.
Core idea:
- inventory the corpus
- route each file to the right parser
- normalize outputs into a stable internal contract
- build retrieval indexes
- answer with evidence instead of synthesis-only guesses
- expose the same workflow through CLI, MCP, and thin agent adapters
Most local-document workflows fail in one of three ways:
- the agent opens files ad hoc and loses repeatability
- the parser choice is implicit and impossible to audit
- translation and question answering are detached from the indexed evidence
DocHarbor addresses those failure modes directly:
- explicit inventory and routing manifests
- parser-specific stages for native text, Office, and PDF
- stable normalized artifacts for indexing and downstream tooling
- preserve-format delivery translation for Office files
- constrained PDF delivery translation with configurable skip policy
- MCP-first integration for local desktop and IDE agents
- Inventories large local folder trees.
- Routes files to the right parser based on type and workflow constraints.
- Uses MinerU v4 for production PDF extraction.
- Parses native text and Office files locally.
- Supports a legacy Office conversion path through LibreOffice for older formats.
- Normalizes parser outputs into a stable internal contract.
- Builds retrieval indexes for grounded answers.
- Builds Agent-facing document maps for large project routing.
- Supports Agent query planning with safe synonym, unit, section, bilingual, and positive/negative retrieval hints.
- Supports JSON, SQLite FTS5 sidecar, and hybrid answer retrieval backends.
- Creates translated normalized variants for retrieval.
- Creates preserve-format delivery translations for .docx, .xlsx, and .pptx.
- Creates constrained preserve-layout PDF delivery translations through overlay rendering.
- Exposes the same backend through CLI, MCP, and thin agent adapters.
- Includes a local browser UI adapter (doc-agent web) that wraps the existing CLI backend with job polling and artifact download APIs.
- Installs a native OpenClaw plugin for path retrieval, translation, answer, and status.
| Area | Current State |
|---|---|
| Local text corpora | Supported |
| PDF parsing | MinerU v4-based pipeline |
| Office parsing | markitdown plus legacy Office conversion path |
| Delivery translation: DOCX | Supported |
| Delivery translation: XLSX | Supported |
| Delivery translation: PPTX | Supported |
| Delivery translation: PDF | Supported with configurable skip policy |
| Retrieval translation | Supported via normalized translated variants |
| Local web UI adapter | Supported via doc-agent web |
| MCP server | Supported |
| Codex / Claude Code / Cursor / Windsurf adapters | Supported |
| OpenClaw native plugin | Supported |
DocHarbor’s normal operating model is:
- Create or choose a project.
- Inventory the source tree.
- Parse local-native and Office content.
- Queue and process PDF-heavy documents through MinerU when needed.
- Normalize artifacts.
- Build an index.
- Ask questions against indexed evidence.
- Optionally produce translated retrieval variants, delivery artifacts, or both.
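The operating model above maps onto an ordered list of CLI invocations. The following is a minimal sketch, not part of DocHarbor: the helper name `pipeline_commands` is hypothetical, and the subcommands and flags are copied from the quick-start commands shown later in this README. It only builds argument lists and does not run anything. The `answer` step is left out because it needs a question.

```python
from typing import List


def pipeline_commands(project: str, project_root: str, source_root: str) -> List[List[str]]:
    """Return doc-agent argument lists for one full pass, in execution order.

    Hypothetical helper: it mirrors the README quick start and nothing more.
    """
    common = ["--project", project, "--project-root", project_root]
    steps = [
        ["init-project", *common, "--source-root", source_root],
        ["inventory", *common, "--source-root", source_root],
        ["parse-native", *common],
        ["parse-office", *common],
        ["parse-office-convert", *common],
        ["queue", *common, "--priority", "core", "--priority", "high"],
        ["mineru-submit", *common],
        ["mineru-poll", *common, "--wait", "--download"],
        ["normalize-mineru", *common],
        ["build-index", *common],
    ]
    return [["doc-agent", *step] for step in steps]


if __name__ == "__main__":
    for argv in pipeline_commands("sample", r".\proj", r"D:\docs\sample"):
        print(argv)
```

Each list can be handed to `subprocess.run(argv, check=True)` so that a failing stage stops the pass before later stages run.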
flowchart LR
A["Source Folder / File"] --> B["inventory"]
B --> C["routing.json"]
C --> D["parse-native / parse-office / parse-office-convert"]
C --> E["queue -> mineru-submit -> mineru-poll -> normalize-mineru"]
D --> F["normalized artifacts"]
E --> F
F --> G["build-index"]
G --> H["answer / ask-path"]
F --> I["translate"]
I --> J["translated normalized variants"]
I --> K["delivery artifacts (.docx/.xlsx/.pptx/.pdf)"]
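The routing step in the flowchart can be pictured as a lookup from file type to the first stage that handles it. This is an illustrative sketch only: the function `route`, the extension groups for native text, and the `"unrouted"` fallback are assumptions, and the real routing.json may also consider workflow constraints that are not modeled here. The legacy Office set follows the LibreOffice note in this README.

```python
from pathlib import PureWindowsPath

# Assumed extension groups; only the legacy Office set is taken from the README.
NATIVE_TEXT = {".txt", ".md"}              # assumption
OFFICE = {".docx", ".xlsx", ".pptx"}
LEGACY_OFFICE = {".doc", ".ppt", ".xlsm"}  # needs LibreOffice pre-conversion
PDF = {".pdf"}


def route(path: str) -> str:
    """Return the name of the first pipeline stage for a file (hypothetical)."""
    suffix = PureWindowsPath(path).suffix.lower()
    if suffix in NATIVE_TEXT:
        return "parse-native"
    if suffix in OFFICE:
        return "parse-office"
    if suffix in LEGACY_OFFICE:
        return "parse-office-convert"
    if suffix in PDF:
        return "queue"  # entry point of the MinerU branch
    return "unrouted"


if __name__ == "__main__":
    for name in (r"D:\docs\sample\spec.pdf", "quote.docx", "old.doc", "notes.md"):
        print(name, "->", route(name))
```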
DocHarbor is intentionally layered:
- doc-agent CLI as the canonical backend
- docharbor MCP server for MCP-capable clients
- thin adapter files, skills, rules, and native commands for agent clients
If a client supports MCP, use MCP first. Shell wrappers and agent-specific instructions are fallback integration surfaces, not the core backend.
This is the recommended source-install path.
Minimum requirement:
- Python 3.11 or newer
Run from the repo root:
.\one-click-installation.bat
What it does:
- bootstraps .\.venv if needed
- installs DocHarbor into that local virtual environment
- runs the guided setup wizard
- writes or updates .env
- auto-detects LibreOffice and ODA File Converter
- can install supported external tools with winget
- installs agent adapters
- installs MCP config for supported clients
- can install the native OpenClaw plugin
Compatibility note:
- doc-agent-setup.bat forwards to one-click-installation.bat
- helper batch files have been moved under tools/windows/ so the repo root stays clean
- those helper scripts are for manual or partial setup, not for first-time onboarding:
  - tools/windows/doc-agent-bootstrap.bat
  - tools/windows/doc-agent-env-setup.bat
  - tools/windows/doc-agent-install-agents.bat
  - tools/windows/doc-agent-install-mcp.bat
  - tools/windows/doc-agent-openclaw-setup.bat
Install the package yourself if you want explicit control over dependency groups.
Minimal development path:
python -m pip install -e ".[inventory,office]"
Full translation-capable path:
python -m pip install -e ".[full]"
Recommended verification:
doc-agent setup-env
doc-agent doctor --format json
Run the local browser UI and REST-style adapter over the existing CLI backend:
doc-agent web --host 127.0.0.1 --port 8799 --open-browser
The web adapter is a thin local shell around DocHarbor commands. It starts asynchronous translation jobs, polls logs/status, exposes detected artifacts for download, and wires glossary/PDF review actions back to the same CLI contracts.
Build a user-friendly release bundle:
doc-agent build-portable --mode self-contained --profile lite --build-wheelhouse
powershell -ExecutionPolicy Bypass -File .\scripts\package_release.ps1 -Mode self-contained -Profile lite -BuildWheelhouse
Portable bundle characteristics:
- contains doc-agent.exe
- bootstraps an embedded Python runtime on first use
- installs from the bundled wheelhouse
- keeps runtime data in .\doc-agent-home
doc-agent init-project --project sample --project-root ".\proj" --source-root "D:\docs\sample"
doc-agent inventory --project sample --project-root ".\proj" --source-root "D:\docs\sample"
doc-agent parse-native --project sample --project-root ".\proj"
doc-agent parse-office --project sample --project-root ".\proj"
doc-agent parse-office-convert --project sample --project-root ".\proj"
doc-agent queue --project sample --project-root ".\proj" --priority core --priority high
doc-agent mineru-submit --project sample --project-root ".\proj"
doc-agent mineru-poll --project sample --project-root ".\proj" --wait --download
doc-agent normalize-mineru --project sample --project-root ".\proj"
doc-agent build-index --project sample --project-root ".\proj"
doc-agent answer --project sample --project-root ".\proj" --question "What changed?" --format json --no-write
Use this when the user gives a direct file or folder path:
doc-agent ask-path --source-path "D:\docs\sample\spec.pdf" --question "What does this say?" --format json --no-write
DocHarbor supports three translation output modes:
| Mode | What It Produces |
|---|---|
| normalized | translated normalized variants for retrieval/indexing |
| delivery | translated openable source-format artifacts |
| both | both translated retrieval artifacts and delivery artifacts |
DocHarbor also supports two translation process modes:
| Mode | Behavior |
|---|---|
| rough | direct translation with no glossary review step |
| precise | extract a project glossary from normalized source text, pause for human approval, then rerun translation with the approved glossary |
Output style is a separate axis:
| Style | Behavior |
|---|---|
| target_only | write only the translated text |
| bilingual | write source plus translation in the delivery artifact |
Current output-style constraints:
- bilingual supports delivery mode only
- bilingual delivery is currently intended for Office preserve-format output (.docx, .xlsx, .pptx)
- PDF and DXF delivery use target_only
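These constraints can be checked before a translation request is submitted. The sketch below is one possible pre-flight check, not DocHarbor's own validation: the function name `check_output_style` and the wording of the messages are hypothetical, and it treats every non-Office source as unsuitable for bilingual output, which is a stricter reading than "currently intended for".

```python
from pathlib import PureWindowsPath
from typing import List

OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx"}
KNOWN_STYLES = {"target_only", "bilingual"}


def check_output_style(source_path: str, output_mode: str, output_style: str) -> List[str]:
    """Return a list of constraint violations; an empty list means the combination is fine."""
    if output_style not in KNOWN_STYLES:
        return [f"unknown output style: {output_style}"]

    problems: List[str] = []
    if output_style == "bilingual":
        # Constraint 1: bilingual is only defined for delivery output.
        if output_mode != "delivery":
            problems.append("bilingual supports delivery mode only")
        # Constraints 2 and 3: bilingual targets Office files; PDF and DXF use target_only.
        suffix = PureWindowsPath(source_path).suffix.lower()
        if suffix not in OFFICE_SUFFIXES:
            problems.append(f"bilingual delivery is intended for Office files, got {suffix or 'no extension'}")
    return problems


if __name__ == "__main__":
    print(check_output_style(r"D:\docs\quote.docx", "delivery", "bilingual"))
    print(check_output_style(r"D:\docs\spec.pdf", "delivery", "bilingual"))
```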
doc-agent translate --project sample --project-root ".\proj" --target-lang en --output-mode both --format json
doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --format json
precise is a two-pass workflow:
- Run precise translation once. If DocHarbor finds new glossary terms, it updates the Pending Review section inside glossary.md and exits with approval_required.
- Review or edit the pending glossary terms in glossary.md.
- Approve the glossary.
- Rerun the same translation command. DocHarbor uses the approved glossary and completes translation.
CLI example:
doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --translation-mode precise --format json
doc-agent glossary-status --project <project-slug> --target-lang zh --format json
doc-agent glossary-approve --project <project-slug> --target-lang zh --format json
doc-agent translate-path --source-path "D:\docs\quote.docx" --target-lang zh --output-mode delivery --translation-mode precise --project <project-slug> --format json
If you want the precise workflow to reuse the same approved glossary, keep the same project slug on reruns.
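The two-pass control flow can be summarized in a few lines. This is a sketch under stated assumptions: the three callables are hypothetical stand-ins for running `translate-path`, letting a person review glossary.md, and running `glossary-approve`; the result is assumed to be a dict with a `"status"` key, which this README does not confirm; and `"completed"` is a made-up value used only for the demo.

```python
from typing import Callable, Dict

Result = Dict[str, str]


def run_precise(
    translate: Callable[[], Result],
    review: Callable[[], None],
    approve: Callable[[], None],
) -> Result:
    """Drive the precise workflow: translate, pause for approval if needed, rerun."""
    first = translate()
    if first.get("status") != "approval_required":
        # No new glossary terms were found, so the first pass already finished.
        return first
    review()   # a human reviews or edits the pending terms in glossary.md
    approve()  # the reviewed glossary is approved
    return translate()  # same command, same project slug


if __name__ == "__main__":
    # Fake backend for demonstration: the first call asks for approval.
    replies = iter([{"status": "approval_required"}, {"status": "completed"}])
    final = run_precise(
        translate=lambda: next(replies),
        review=lambda: print("review glossary.md"),
        approve=lambda: print("approve glossary"),
    )
    print(final)
```

Keeping the human review step as an explicit callable makes it harder to automate past the approval gate by accident.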
doc-agent translate-path --source-path "D:\docs\deck.pptx" --target-lang zh --output-mode delivery --format json
PDF delivery translation supports public block-skip control:
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types default --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types none --format json
doc-agent translate-path --source-path "D:\docs\spec.pdf" --target-lang en --output-mode delivery --pdf-skip-block-types table,header --format json
Semantics:
- omitted or default = built-in policy (header, footer)
- none = skip nothing
- explicit list = comma-separated block types
Allowed values:
- table
- text
- title
- list
- aside_text
- page_footnote
- header
- footer
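The skip-policy semantics above can be expressed as a small resolver. This is a sketch of the documented behavior, not the CLI's parser: `resolve_skip_types` is a hypothetical name, and rejecting unknown or empty values with `ValueError` is an assumption about how invalid input should be handled.

```python
from typing import FrozenSet, Optional

ALLOWED_BLOCK_TYPES = (
    "table", "text", "title", "list",
    "aside_text", "page_footnote", "header", "footer",
)
DEFAULT_SKIP = frozenset({"header", "footer"})  # built-in policy per the README


def resolve_skip_types(value: Optional[str]) -> FrozenSet[str]:
    """Turn a --pdf-skip-block-types value into the set of block types to skip."""
    if value is None or value.strip() == "default":
        return DEFAULT_SKIP
    if value.strip() == "none":
        return frozenset()
    names = [part.strip() for part in value.split(",") if part.strip()]
    if not names:
        raise ValueError("expected 'default', 'none', or a comma-separated list")
    unknown = [name for name in names if name not in ALLOWED_BLOCK_TYPES]
    if unknown:
        raise ValueError(f"unknown block types: {', '.join(unknown)}")
    return frozenset(names)


if __name__ == "__main__":
    for raw in (None, "default", "none", "table,header"):
        print(repr(raw), "->", sorted(resolve_skip_types(raw)))
```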
Run the MCP server directly:
doc-agent mcp-serve --transport stdio
Equivalent entrypoint:
doc-agent-mcp --transport stdio
Install MCP config:
doc-agent install-mcp-config --agent claude --agent cursor --agent windsurf --workspace-root ".\workspace"
Supported MCP-first clients:
- Claude Code
- Cursor
- Windsurf
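For orientation, many MCP clients register a stdio server through a JSON entry under an `mcpServers` key. The sketch below generates such an entry for the `mcp-serve` command shown above. What `install-mcp-config` actually writes, and where, is not shown in this README, so the layout, the `docharbor` server key, and the helper name `mcp_server_entry` should all be read as assumptions.

```python
import json
from typing import Any, Dict


def mcp_server_entry(transport: str = "stdio") -> Dict[str, Any]:
    """Build a client-side MCP server entry (assumed layout, hypothetical helper)."""
    return {
        "mcpServers": {
            "docharbor": {
                "command": "doc-agent",
                "args": ["mcp-serve", "--transport", transport],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(mcp_server_entry(), indent=2))
```

Prefer the installer command for real setups; a hand-written entry like this is mainly useful for checking what a client is expected to launch.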
Install thin adapters:
doc-agent install-adapters --agent codex --agent claude --agent cursor --agent windsurf --agent openclaw --workspace-root ".\workspace"
OpenClaw is not just a prompt-level integration. DocHarbor ships a native plugin path:
- helper install: .\tools\windows\doc-agent-openclaw-setup.bat
- native tools:
  - docharbor_ask_path
  - docharbor_translate_path
  - docharbor_glossary_status
  - docharbor_glossary_approve
  - docharbor_answer
  - docharbor_status
- slash command: /doctranslatepath is guidance-only for write operations and points users to the tool path
OpenClaw should prefer the native plugin tools over ad hoc shell extraction.
Recommended OpenClaw translation workflow:
- Use docharbor_translate_path for all translations.
- For translationMode=precise, stop if the tool returns approval_required.
- Review pending terms with docharbor_glossary_status.
- Approve them with docharbor_glossary_approve.
- Rerun docharbor_translate_path with the same project and translationMode=precise.
Do not use Exec, raw CLI commands, or /doctranslatepath to test the human-in-the-loop (HITL) translation flow.
| Client | Preferred Integration |
|---|---|
| Codex | skill + CLI path in this repo |
| Claude Code | MCP first, skill/commands as thin guidance |
| Cursor | MCP first |
| Windsurf | MCP first |
| OpenClaw | native plugin first |
Typical repository/project directories:
DocHarbor/
├─ src/doc_agent/ # canonical backend package
├─ scripts/ # pipeline scripts and helpers
├─ docs/ # manuals, architecture docs, and static media
├─ docs/media/ # README images and demo assets
├─ tools/windows/ # helper batch scripts
├─ openclaw-plugin/ # native OpenClaw plugin
├─ launcher/ # portable launcher assets
├─ proj/ # generated project artifacts
│ └─ <project>/
│ ├─ manifests/
│ ├─ parsed/
│ ├─ normalized/
│ ├─ index/
│ ├─ source/derived/
│ └─ logs/
├─ doc-agent-setup.bat # compatibility wrapper
└─ one-click-installation.bat # main Windows entrypoint
Preview analysis is opt-in. DocHarbor will not call external vision providers unless DOC_AGENT_PREVIEW_ANALYSIS_PROVIDER is set.
Legacy Office note:
- .doc, .ppt, and .xlsm require LibreOffice for the pre-conversion stage
- set DOC_AGENT_LIBREOFFICE_BIN manually only if auto-detection fails
- confirm with doc-agent doctor --format json
Translation note:
- supported providers in the current implementation are OpenAI-compatible APIs and Google/Gemini
- leave DOC_AGENT_TRANSLATE_TARGET_LANG blank if you want per-request --target-lang
- structured JSON glossary files enable terminology validation and retry
- inline glossary text is guidance only
Use doctor first:
doc-agent doctor --format json
Common checks:
- Python version
- requests
- PyPDF2
- openpyxl
- python-docx
- python-pptx
- PyMuPDF
- markitdown
- dotenv
- ezdxf
- matplotlib
- MinerU API token
- LibreOffice
- translation provider config
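When `doctor` is not available yet (for example before the package is installed), the Python-side checks can be approximated by hand. This is a rough sketch and not how `doctor` is implemented: the helper names are hypothetical, the import names for python-docx, python-pptx, and PyMuPDF (`docx`, `pptx`, `fitz`) are the usual ones for those packages, and the 3.11 minimum comes from the install section of this README. It does not check the MinerU token, LibreOffice, or provider config.

```python
import importlib.util
import sys
from typing import Dict, List, Sequence, Tuple

# Package name -> import name. Several packages install under a different module name.
IMPORT_NAMES: Dict[str, str] = {
    "requests": "requests",
    "PyPDF2": "PyPDF2",
    "openpyxl": "openpyxl",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "PyMuPDF": "fitz",
    "markitdown": "markitdown",
    "dotenv": "dotenv",
    "ezdxf": "ezdxf",
    "matplotlib": "matplotlib",
}


def missing_packages(import_names: Dict[str, str]) -> List[str]:
    """Return the package names whose top-level module cannot be found."""
    return sorted(
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


def python_ok(version: Sequence[int] = sys.version_info, minimum: Tuple[int, int] = (3, 11)) -> bool:
    """True if the interpreter version meets the minimum (major, minor)."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    print("python ok:", python_ok())
    print("missing:", missing_packages(IMPORT_NAMES))
```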
- Architecture
- Format Integration Plan
- Translation Architecture Advice
- Universal Agent Architecture
- Operator Manual
- Launcher Notes
This repository intentionally excludes:
- private corpora
- generated project artifacts
- local SDK caches
- staging workspaces
- secrets and tokens
- do not commit .env
- do not commit real project corpora under proj/
- keep MinerU and model provider credentials in environment variables or local .env
- prefer doc-agent setup-env or tools/windows/doc-agent-env-setup.bat over ad hoc environment editing on new machines
Portable Windows release assets are produced under:
dist/release/
Recommended first-download artifact:
DocHarbor-windows-x64-lite-v0.1.0.zip
MIT


