feat(read): extract .ipynb/.docx/.xlsx to text in read_file by teknium1 · Pull Request #37082 · NousResearch/hermes-agent

teknium1 · 2026-06-02T00:07:32Z

Summary

read_file can now read Jupyter notebooks, Word documents, and Excel workbooks directly — they're auto-extracted to plain text instead of being rejected as binary (.docx/.xlsx) or dumping raw JSON with output payloads (.ipynb).

Ports the structured-document reading Kilo Code added in Kilo-Org/kilocode#10733 (notebooks), #10737 (DOCX), and #10740 (XLSX), adapted to hermes-agent's Python architecture.

How it was adapted

Kilo bundles the mammoth JS library for DOCX, which adds about 526 KB to their compiled binary. hermes-agent instead uses a pure-stdlib approach: .docx and .xlsx are ZIP containers holding OOXML parts that zipfile + xml.etree can unpack, and .ipynb is JSON. No new dependency is added.
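To illustrate the stdlib approach, here is a minimal sketch of DOCX text extraction using only zipfile and xml.etree. It is not the code in tools/read_extract.py; the function name `docx_to_text` is hypothetical, and the sketch assumes the main body lives in `word/document.xml` and ignores headers, footers, and footnotes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Return body text from .docx bytes, one paragraph per line (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{W}p"):
        parts = []
        # Only look at direct children of runs, so tab-stop definitions
        # inside paragraph properties are not mistaken for tab characters.
        for run in para.iter(f"{W}r"):
            for node in run:
                if node.tag == f"{W}t":
                    parts.append(node.text or "")
                elif node.tag == f"{W}tab":
                    parts.append("\t")
                elif node.tag in (f"{W}br", f"{W}cr"):
                    parts.append("\n")
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)
```

A corrupt archive raises `zipfile.BadZipFile` and a missing part raises `KeyError`, so a caller can catch those and fall back to the normal read path.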

Extracted text flows through the existing read pipeline (pagination, LINE|CONTENT numbering, char-limit guard, secret redaction), so the output is indistinguishable in shape from reading a source file. Malformed documents fall through to the normal read path so they stay inspectable — matching Kilo's malformed-notebook fallback behavior.
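As a rough illustration of why extracted text looks like any other read, the sketch below applies pagination and line numbering to a plain string. The function name, the `offset`/`limit` parameters, and the exact `LINE|CONTENT` rendering are assumptions for illustration; the char-limit guard and secret redaction are left out.

```python
def paginate_numbered(text: str, offset: int = 1, limit: int = 500) -> str:
    """Return up to `limit` lines starting at 1-based line `offset`.

    Each line is rendered as "<line number>|<content>" (assumed format).
    """
    lines = text.splitlines()
    start = max(offset, 1) - 1  # convert to a 0-based index
    window = lines[start:start + limit]
    return "\n".join(
        f"{start + i + 1}|{line}" for i, line in enumerate(window)
    )
```

Because the function only sees a string, it does not matter whether that string came from a source file or from an extracted notebook, document, or workbook.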

Changes

tools/read_extract.py (new): stdlib extractors for .ipynb/.docx/.xlsx behind an extract_document_text() router. Hidden Excel sheets omitted; notebook output payloads/metadata stripped; bounded rows/cols.
tools/file_tools.py: intercept extractable documents in read_file_tool before the binary guard; _read_extracted_document() applies pagination + line-numbering + char-limit + redaction. Updated tool description.
tests/tools/test_read_extract.py (new): 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and read_file_tool integration (pagination, line-numbering, binary-guard fallthrough).
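The router-plus-fallback shape described above can be sketched as follows. This is a simplified stand-in, not the real extract_document_text(): the names `notebook_to_text` and `extract_text_or_none`, the `[markdown]`/`[code]` cell labels, and the "return None to fall through" convention are all assumptions made for the example, and only the notebook extractor is shown.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def notebook_to_text(raw: str) -> str:
    """Keep markdown and code cell sources in order; drop outputs and metadata."""
    nb = json.loads(raw)
    chunks = []
    for cell in nb["cells"]:
        kind = cell.get("cell_type")
        if kind not in ("markdown", "code"):
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        chunks.append(f"[{kind}]\n{source}")
    return "\n\n".join(chunks)


# Hypothetical registry; .docx and .xlsx extractors would be added here.
EXTRACTORS: Dict[str, Callable[[str], str]] = {".ipynb": notebook_to_text}


def extract_text_or_none(path: str) -> Optional[str]:
    """Return extracted text, or None so the caller uses the normal read path."""
    extractor = EXTRACTORS.get(Path(path).suffix.lower())
    if extractor is None:
        return None
    try:
        return extractor(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Malformed or unreadable document: fall through instead of crashing.
        return None
```

Returning None rather than raising keeps the decision about what to do with a malformed file in one place, the caller, which can then apply its existing read and binary guards.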

Validation

| Input | Before | After |
| --- | --- | --- |
| `read_file foo.ipynb` | Raw JSON incl. base64 outputs + metadata | Markdown + code cells in order, outputs stripped |
| `read_file foo.docx` | `Cannot read binary file` | Paragraph text (tabs/breaks preserved) |
| `read_file foo.xlsx` | `Cannot read binary file` | Visible sheets as labelled TSV; hidden sheets omitted |
| Malformed `.docx` | n/a | Falls through to binary guard (no crash) |

New tests: 18 passed
Existing read tests (test_file_read_guards, test_file_tools, test_read_loop_detection): 100 passed — no regression.
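For the XLSX behaviour listed in the table (visible sheets as labelled TSV, hidden sheets omitted), a minimal stdlib sketch might look like the following. It is not the PR's implementation: the function name `xlsx_to_tsv` and the `# Sheet: <name>` label are assumptions, the row/column bounds are omitted, and gaps between cells are not padded.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

MAIN = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
DOC_REL = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
PKG_REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def _text(node) -> str:
    """Concatenate every <t> descendant (plain text and rich-text runs)."""
    return "".join(t.text or "" for t in node.iter(f"{MAIN}t"))


def xlsx_to_tsv(data: bytes) -> str:
    """Render visible sheets as tab-separated rows under a sheet label."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            shared = [_text(si) for si in sst.iter(f"{MAIN}si")]

        # Map relationship ids (rId1, ...) to worksheet part paths.
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
        targets = {r.get("Id"): r.get("Target")
                   for r in rels.iter(f"{PKG_REL}Relationship")}

        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        blocks = []
        for sheet in workbook.iter(f"{MAIN}sheet"):
            if sheet.get("state", "visible") != "visible":
                continue  # skips both "hidden" and "veryHidden"
            target = targets[sheet.get(f"{DOC_REL}id")]
            part = (target.lstrip("/") if target.startswith("/")
                    else posixpath.join("xl", target))
            root = ET.fromstring(zf.read(part))
            rows = []
            for row in root.iter(f"{MAIN}row"):
                cells = []
                for c in row.iter(f"{MAIN}c"):
                    kind = c.get("t")
                    v = c.find(f"{MAIN}v")
                    if kind == "s" and v is not None:
                        cells.append(shared[int(v.text)])  # shared-string index
                    elif kind == "inlineStr":
                        cells.append(_text(c))
                    else:
                        cells.append((v.text or "") if v is not None else "")
                rows.append("\t".join(cells))
            blocks.append(f"# Sheet: {sheet.get('name')}\n" + "\n".join(rows))
    return "\n\n".join(blocks)
```

Sheet visibility is read from the `state` attribute in `xl/workbook.xml`, so hidden sheets are skipped before their worksheet parts are ever parsed.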

Source PRs: Kilo-Org/kilocode#10733, #10737, #10740

github-actions · 2026-06-02T00:08:21Z

🔎 Lint report: `kilocode-port/read-document-extraction` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10876 on HEAD, 10878 on base (✅ -2)

🆕 New issues (1):

| Rule | Count |
| --- | --- |
| `invalid-assignment` | 1 |

First entries

tests/run_agent/test_credits_notices_toggle.py:76: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to attribute `_credits_session_start_micros` of type `int`

✅ Fixed issues (2):

| Rule | Count |
| --- | --- |
| `unresolved-attribute` | 2 |

First entries

run_agent.py:2891: [unresolved-attribute] unresolved-attribute: Object of type `Self@get_credits_spent_micros` has no attribute `_credits_session_start_micros`
tests/run_agent/test_credits_notices_toggle.py:76: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_credits_session_start_micros` on type `AIAgent`

Unchanged: 5707 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

tonydwb

Code Review Summary

Verdict: Approved ✅

Review Notes

PR #37082 — feat: read_file gains structured-document extraction (.ipynb / .docx / .xlsx) by @teknium1

✅ Looks Good

Clean pure-stdlib approach: Uses zipfile + xml.etree for OOXML and json for notebooks — zero new dependencies. Well adapted from Kilo Code's pattern.
Comprehensive tests: 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and read_file_tool integration. Good boundary coverage.
Fail-safe design: Malformed documents fall through to the normal read path; errors in extraction are caught gracefully.
Pagination + redaction pipeline integration: Extracted text flows through existing output handling unchanged.
Hidden sheets intentionally omitted — prevents data leakage.
Notebook output payloads/metadata stripped — clean extraction.

💡 Suggestions (Non-Blocking)

Consider adding a max_rows / max_cols configurable limit for XLSX to prevent memory issues with very large spreadsheets (the current bounded approach is good, but a config knob would make it user-tunable).
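One way the suggested knob could look is sketched below. Everything here is hypothetical: the `SheetLimits` dataclass, its default values, the `clamp_rows` helper, and the truncation marker are illustrative and do not reflect the bounds actually used in the PR.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class SheetLimits:
    """User-tunable bounds for spreadsheet extraction (hypothetical defaults)."""
    max_rows: int = 1000
    max_cols: int = 50


_DONE = object()  # sentinel: distinguishes "no more rows" from an empty row


def clamp_rows(rows: Iterable[Sequence[str]],
               limits: SheetLimits = SheetLimits()) -> List[str]:
    """Render at most max_rows rows of at most max_cols cells as TSV lines."""
    it = iter(rows)
    # islice stops pulling from the iterator once the row limit is reached,
    # so a lazily parsed sheet is never fully materialised.
    out = ["\t".join(row[:limits.max_cols]) for row in islice(it, limits.max_rows)]
    if next(it, _DONE) is not _DONE:
        out.append("[truncated: row limit reached]")
    return out
```

Taking an iterable rather than a list matters for the memory concern raised here: the limit only helps if rows beyond it are never read.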

Reviewed by Hermes Agent

teknium1 · 2026-06-13T13:53:03Z

Refreshed against current origin/main and force-pushed a smaller salvage commit.

Cleanup:

reduced tools/read_extract.py from about 409 LOC to about 248 LOC
reduced tools/file_tools.py integration to about 48 insertions
kept extraction stdlib-only with lazy read_file integration
malformed documents fall back to existing read/binary guards

Validation:

python3 -m pytest tests/tools/test_read_extract.py -q -> 18 passed
python3 -m pytest tests/tools/test_file_read_guards.py tests/tools/test_file_tools.py tests/tools/test_read_loop_detection.py tests/tools/test_read_extract.py -q -> 118 passed
python3 -m compileall -q tools/file_tools.py tools/read_extract.py tests/tools/test_read_extract.py -> passed
python3 -m ruff check tools/file_tools.py tools/read_extract.py tests/tools/test_read_extract.py -> passed
git diff --check -> passed

tonydwb approved these changes Jun 2, 2026

alt-glitch added the labels type/feature (New feature or request), comp/tools (Tool registry, model_tools, toolsets), tool/file (File tools: read, write, patch, search), and P3 (Low: cosmetic, nice to have) on Jun 2, 2026

feat(read): extract notebook and office documents

b7ab168

Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.

teknium1 force-pushed the kilocode-port/read-document-extraction branch from ff7f013 to b7ab168 on June 13, 2026 13:52

teknium1 merged commit 817f392 into main Jun 13, 2026
28 checks passed

teknium1 deleted the kilocode-port/read-document-extraction branch June 13, 2026 21:42

Haderach-Ram mentioned this pull request Jun 14, 2026

Ecosystem Digest — 2026-06-14 Haderach-Ram/openclaw-radar#39
