feat(read): extract .ipynb/.docx/.xlsx to text in read_file #37082

Merged
teknium1 merged 1 commit into main from kilocode-port/read-document-extraction on Jun 13, 2026

Conversation

teknium1 (Contributor) commented Jun 2, 2026

Summary

read_file can now read Jupyter notebooks, Word documents, and Excel workbooks directly — they're auto-extracted to plain text instead of being rejected as binary (.docx/.xlsx) or dumping raw JSON with output payloads (.ipynb).

Ports the structured-document reading Kilo Code added in Kilo-Org/kilocode#10733 (notebooks), #10737 (DOCX), and #10740 (XLSX), adapted to hermes-agent's Python architecture.

How it was adapted

Kilo bundles the mammoth JS library for DOCX (~526 KB into their compiled binary). hermes-agent instead uses a pure-stdlib approach — .docx and .xlsx are Zip+OOXML containers that zipfile + xml.etree can unpack, and .ipynb is JSON. No new dependency is added.
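
As a rough illustration of the stdlib-only idea, the sketch below pulls paragraph text out of a .docx payload with zipfile and xml.etree. It is not the code in tools/read_extract.py: the names docx_to_text and make_sample_docx are hypothetical, and the sample archive contains only word/document.xml rather than a complete Word package.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(data: bytes) -> str:
    """Return paragraph text from a .docx payload (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        parts = []
        # Only look at direct children of runs, so tab-stop definitions
        # in paragraph properties are not mistaken for tab characters.
        for run in para.iter(f"{{{W_NS}}}r"):
            for node in run:
                if node.tag == f"{{{W_NS}}}t":
                    parts.append(node.text or "")
                elif node.tag == f"{{{W_NS}}}tab":
                    parts.append("\t")
                elif node.tag in (f"{{{W_NS}}}br", f"{{{W_NS}}}cr"):
                    parts.append("\n")
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


def make_sample_docx() -> bytes:
    """Build a minimal in-memory archive holding only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t><w:tab/><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


text = docx_to_text(make_sample_docx())
```

A malformed archive would raise zipfile.BadZipFile, KeyError, or ET.ParseError here; a caller would need to catch those to get the fall-through behavior described in this PR.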

Extracted text flows through the existing read pipeline (pagination, LINE|CONTENT numbering, char-limit guard, secret redaction), so the output is indistinguishable in shape from reading a source file. Malformed documents fall through to the normal read path so they stay inspectable — matching Kilo's malformed-notebook fallback behavior.
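
To make the pipeline shape concrete, here is a minimal sketch of pagination plus line numbering applied to already-extracted text. The function name, the offset and limit parameters, and the exact `N|content` format are assumptions for illustration; the real logic lives in _read_extracted_document() and may differ.

```python
def paginate_with_line_numbers(text: str, offset: int = 1, limit: int = 500) -> str:
    """Return up to `limit` lines starting at 1-based `offset`, each prefixed 'N|'."""
    lines = text.splitlines()
    start = max(offset, 1) - 1  # convert the 1-based offset to a 0-based index
    window = lines[start:start + limit]
    return "\n".join(f"{start + i + 1}|{line}" for i, line in enumerate(window))


extracted = "Title\nFirst paragraph\nSecond paragraph\nThird paragraph"
page = paginate_with_line_numbers(extracted, offset=2, limit=2)
```

Because the numbering is computed from the extracted text rather than the bytes on disk, line numbers refer to the plain-text view of the document.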

Changes

  • tools/read_extract.py (new): stdlib extractors for .ipynb/.docx/.xlsx behind an extract_document_text() router. Hidden Excel sheets omitted; notebook output payloads/metadata stripped; bounded rows/cols.
  • tools/file_tools.py: intercept extractable documents in read_file_tool before the binary guard; _read_extracted_document() applies pagination + line-numbering + char-limit + redaction. Updated tool description.
  • tests/tools/test_read_extract.py (new): 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and read_file_tool integration (pagination, line-numbering, binary-guard fallthrough).
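
The notebook behavior listed above (cells kept in order, output payloads and metadata dropped, malformed input signalled to the caller) can be sketched as follows. The helper name notebook_to_text and the `[n] cell_type` header format are hypothetical; returning None for malformed input is one assumed way to let a caller fall through to the normal read path.

```python
import json
from typing import Optional


def notebook_to_text(raw: str) -> Optional[str]:
    """Return cell sources in order, or None if `raw` is not a usable notebook."""
    try:
        cells = json.loads(raw)["cells"]
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, or no top-level "cells" key
    if not isinstance(cells, list):
        return None

    blocks = []
    for index, cell in enumerate(cells, start=1):
        if not isinstance(cell, dict):
            continue
        kind = cell.get("cell_type", "unknown")
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of line strings
            source = "".join(str(part) for part in source)
        # Only the source is copied; "outputs" and "metadata" are ignored.
        blocks.append(f"[{index}] {kind}\n{source}")
    return "\n\n".join(blocks)


sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "source": "print(1)",
         "outputs": [{"data": {"image/png": "iVBORw0KGgo="}}]},
    ],
    "metadata": {"kernelspec": {"name": "python3"}},
})
result = notebook_to_text(sample)
```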

Validation

| Input | Before | After |
|---|---|---|
| read_file foo.ipynb | Raw JSON incl. base64 outputs + metadata | Markdown + code cells in order, outputs stripped |
| read_file foo.docx | Cannot read binary file | Paragraph text (tabs/breaks preserved) |
| read_file foo.xlsx | Cannot read binary file | Visible sheets as labelled TSV; hidden sheets omitted |
| Malformed .docx | n/a | Falls through to binary guard (no crash) |
  • New tests: 18 passed
  • Existing read tests (test_file_read_guards, test_file_tools, test_read_loop_detection): 100 passed — no regression.
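
For the hidden-sheet row in the table above, one possible way to decide which sheets are visible is to read the `state` attribute on each `<sheet>` element in xl/workbook.xml, where OOXML uses the values `visible`, `hidden`, and `veryHidden`. This is a sketch only: visible_sheet_names is a hypothetical helper, and the sample XML omits the relationship ids a real workbook carries.

```python
import xml.etree.ElementTree as ET
from typing import List

# SpreadsheetML main namespace used by xl/workbook.xml.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def visible_sheet_names(workbook_xml: str) -> List[str]:
    """Return the names of sheets that are not hidden, in workbook order."""
    root = ET.fromstring(workbook_xml)
    names = []
    for sheet in root.iter(f"{{{MAIN_NS}}}sheet"):
        # A missing state attribute means the sheet is visible.
        if sheet.get("state", "visible") in ("hidden", "veryHidden"):
            continue
        names.append(sheet.get("name", ""))
    return names


WORKBOOK_XML = (
    f'<workbook xmlns="{MAIN_NS}"><sheets>'
    '<sheet name="Summary" sheetId="1"/>'
    '<sheet name="Scratch" sheetId="2" state="hidden"/>'
    '<sheet name="Data" sheetId="3" state="visible"/>'
    "</sheets></workbook>"
)
names = visible_sheet_names(WORKBOOK_XML)
```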

Source PRs: Kilo-Org/kilocode#10733, #10737, #10740

github-actions (Bot, Contributor) commented Jun 2, 2026

🔎 Lint report: kilocode-port/read-document-extraction vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10876 on HEAD, 10878 on base (✅ -2)

🆕 New issues (1):

| Rule | Count |
|---|---|
| invalid-assignment | 1 |

First entries:
tests/run_agent/test_credits_notices_toggle.py:76: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to attribute `_credits_session_start_micros` of type `int`

✅ Fixed issues (2):

| Rule | Count |
|---|---|
| unresolved-attribute | 2 |

First entries:
run_agent.py:2891: [unresolved-attribute] unresolved-attribute: Object of type `Self@get_credits_spent_micros` has no attribute `_credits_session_start_micros`
tests/run_agent/test_credits_notices_toggle.py:76: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_credits_session_start_micros` on type `AIAgent`

Unchanged: 5707 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

tonydwb left a comment

Code Review Summary

Verdict: Approved

Review Notes

  • PR #37082 feat: read_file gains structured-document extraction (.ipynb / .docx / .xlsx) by @teknium1

✅ Looks Good

  • Clean pure-stdlib approach: Uses zipfile + xml.etree for OOXML and json for notebooks — zero new dependencies. Well adapted from Kilo Code's pattern.
  • Comprehensive tests: 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and read_file_tool integration. Good boundary coverage.
  • Fail-safe design: Malformed documents fall through to the normal read path; errors in extraction are caught gracefully.
  • Pagination + redaction pipeline integration: Extracted text flows through existing output handling unchanged.
  • Hidden sheets intentionally omitted — prevents data leakage.
  • Notebook output payloads/metadata stripped — clean extraction.

💡 Suggestions (Non-Blocking)

  • Consider adding a max_rows / max_cols configurable limit for XLSX to prevent memory issues with very large spreadsheets (the current bounded approach is good, but a config knob would make it user-tunable).
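
A tunable cap of the kind suggested here might look like the sketch below. The function name rows_to_tsv and the default values are assumptions, not the limits used in tools/read_extract.py. Using itertools.islice means rows and cells beyond the cap are never materialized, even when the input is a generator.

```python
import itertools
from typing import Iterable


def rows_to_tsv(rows: Iterable[Iterable[object]],
                max_rows: int = 1000, max_cols: int = 50) -> str:
    """Render rows as TSV, keeping at most max_rows rows and max_cols columns."""
    lines = []
    for row in itertools.islice(rows, max_rows):
        cells = ["" if value is None else str(value)
                 for value in itertools.islice(row, max_cols)]
        lines.append("\t".join(cells))
    return "\n".join(lines)


table = [[1, 2, 3], [4, None, 6], [7, 8, 9]]
clipped = rows_to_tsv(table, max_rows=2, max_cols=2)
```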

Reviewed by Hermes Agent

alt-glitch added labels on Jun 2, 2026: type/feature (New feature or request); comp/tools (Tool registry, model_tools, toolsets); tool/file (File tools: read, write, patch, search); P3 (Low: cosmetic, nice to have)
Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.
teknium1 force-pushed the kilocode-port/read-document-extraction branch from ff7f013 to b7ab168 on June 13, 2026 13:52
teknium1 (Contributor, Author) commented

Refreshed against current origin/main and force-pushed a smaller salvage commit.

Cleanup:

  • reduced tools/read_extract.py from about 409 LOC to about 248 LOC
  • reduced tools/file_tools.py integration to about 48 insertions
  • kept extraction stdlib-only with lazy read_file integration
  • malformed documents fall back to existing read/binary guards
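
The fallback in the last bullet can be pictured as a router that returns None whenever no extractor applies or an extractor fails, leaving the caller free to continue with its existing read and binary guards. The sketch below is an assumption about the shape, not the actual extract_document_text(); extract_or_none, EXTRACTORS, and the stand-in extractors are all hypothetical.

```python
from pathlib import PurePath
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]


def extract_or_none(filename: str, data: bytes,
                    extractors: Dict[str, Extractor]) -> Optional[str]:
    """Return extracted text, or None so the caller can use its normal read path."""
    extractor = extractors.get(PurePath(filename).suffix.lower())
    if extractor is None:
        return None  # not an extractable document type
    try:
        return extractor(data)
    except Exception:
        return None  # malformed document: defer to the existing guards


def _broken(data: bytes) -> str:
    """Stand-in extractor that always fails, simulating a corrupt archive."""
    raise ValueError("not a valid archive")


EXTRACTORS: Dict[str, Extractor] = {
    ".docx": _broken,
    ".ipynb": lambda data: data.decode("utf-8"),  # placeholder, not a real parser
}
```

Catching a broad Exception keeps the read path from crashing on unexpected input, at the cost of hiding extractor bugs; logging the exception before returning None would be a reasonable refinement.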

Validation:

  • python3 -m pytest tests/tools/test_read_extract.py -q -> 18 passed
  • python3 -m pytest tests/tools/test_file_read_guards.py tests/tools/test_file_tools.py tests/tools/test_read_loop_detection.py tests/tools/test_read_extract.py -q -> 118 passed
  • python3 -m compileall -q tools/file_tools.py tools/read_extract.py tests/tools/test_read_extract.py -> passed
  • python3 -m ruff check tools/file_tools.py tools/read_extract.py tests/tools/test_read_extract.py -> passed
  • git diff --check -> passed

@teknium1 teknium1 merged commit 817f392 into main Jun 13, 2026
28 checks passed
@teknium1 teknium1 deleted the kilocode-port/read-document-extraction branch June 13, 2026 21:42
AIalliAI pushed a commit to AIalliAI/Hermes that referenced this pull request Jun 14, 2026
Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.

Labels

  • comp/tools: Tool registry, model_tools, toolsets
  • P3: Low (cosmetic, nice to have)
  • tool/file: File tools (read, write, patch, search)
  • type/feature: New feature or request

3 participants