feat(read): extract .ipynb/.docx/.xlsx to text in read_file#37082
Merged
Contributor
🔎 Lint report:
| Rule | Count |
|---|---|
| invalid-assignment | 1 |
First entries
tests/run_agent/test_credits_notices_toggle.py:76: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to attribute `_credits_session_start_micros` of type `int`
✅ Fixed issues (2):
| Rule | Count |
|---|---|
| unresolved-attribute | 2 |
First entries
run_agent.py:2891: [unresolved-attribute] unresolved-attribute: Object of type `Self@get_credits_spent_micros` has no attribute `_credits_session_start_micros`
tests/run_agent/test_credits_notices_toggle.py:76: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_credits_session_start_micros` on type `AIAgent`
Unchanged: 5707 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
tonydwb
approved these changes
Jun 2, 2026
tonydwb
left a comment
Code Review Summary
Verdict: Approved ✅
Review Notes
- PR #37082: `feat: read_file gains structured-document extraction (.ipynb / .docx / .xlsx)` by @teknium1
✅ Looks Good
- Clean pure-stdlib approach: Uses `zipfile` + `xml.etree` for OOXML and `json` for notebooks, with zero new dependencies. Well adapted from Kilo Code's pattern.
- Comprehensive tests: 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and `read_file_tool` integration. Good boundary coverage.
- Fail-safe design: Malformed documents fall through to the normal read path; errors in extraction are caught gracefully.
- Pagination + redaction pipeline integration: Extracted text flows through existing output handling unchanged.
- Hidden sheets intentionally omitted — prevents data leakage.
- Notebook output payloads/metadata stripped — clean extraction.
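The notebook behavior described above (sources kept, outputs and metadata dropped, malformed input left to the normal read path) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the function name `extract_notebook_text` and the `# [n] cell_type` header format are hypothetical.

```python
import json
from typing import Optional


def extract_notebook_text(raw: str) -> Optional[str]:
    """Return cell sources as plain text, or None if `raw` is not a notebook.

    Hypothetical helper: the name and the per-cell header format are
    assumptions made for illustration only.
    """
    try:
        notebook = json.loads(raw)
    except json.JSONDecodeError:
        return None  # caller can fall back to the normal read path

    if not isinstance(notebook, dict):
        return None
    cells = notebook.get("cells")
    if not isinstance(cells, list):
        return None

    parts = []
    for index, cell in enumerate(cells, start=1):
        if not isinstance(cell, dict):
            continue
        source = cell.get("source", "")
        # nbformat stores a cell source as one string or a list of strings.
        if isinstance(source, list):
            source = "".join(str(piece) for piece in source)
        kind = cell.get("cell_type", "unknown")
        # Only the source is kept; "outputs" and "metadata" are never read.
        parts.append(f"# [{index}] {kind}\n{source}")
    return "\n\n".join(parts)
```

Returning `None` rather than raising keeps the fail-safe property: a caller can treat `None` as "not extractable" and read the file as ordinary text.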
💡 Suggestions (Non-Blocking)
- Consider adding a configurable `max_rows`/`max_cols` limit for XLSX to prevent memory issues with very large spreadsheets (the current bounded approach is good, but a config knob would make it user-tunable).
Reviewed by Hermes Agent
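To make the `max_rows`/`max_cols` suggestion concrete, here is one way a tunable bound could look. It is only a sketch: the function name `bounded_sheet_text`, the default limits, the tab-separated layout, and the truncation marker are all assumptions, and the PR's actual bounding logic is not shown on this page.

```python
from typing import Iterable, List, Sequence


def bounded_sheet_text(
    rows: Iterable[Sequence[object]],
    max_rows: int = 1000,  # hypothetical default
    max_cols: int = 50,    # hypothetical default
) -> str:
    """Render rows as tab-separated text, cutting off at the given limits."""
    lines: List[str] = []
    truncated = False
    for index, row in enumerate(rows):
        if index >= max_rows:
            # Stop consuming the iterable once the row limit is reached.
            truncated = True
            break
        if len(row) > max_cols:
            truncated = True
        lines.append("\t".join(str(cell) for cell in row[:max_cols]))
    if truncated:
        # Tell the reader that the sheet was cut, and at which limits.
        lines.append(f"[truncated to {max_rows} rows x {max_cols} cols]")
    return "\n".join(lines)
```

Because `rows` is only iterated up to the limit, a caller that yields rows lazily from the sheet XML would not need to hold the whole sheet in memory.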
Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.
ff7f013 to b7ab168
Contributor
Author
Refreshed against current Cleanup:
Validation:
AIalliAI pushed a commit to AIalliAI/Hermes that referenced this pull request on Jun 14, 2026:
Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.
Summary
`read_file` can now read Jupyter notebooks, Word documents, and Excel workbooks directly: they are auto-extracted to plain text instead of being rejected as binary (`.docx`/`.xlsx`) or dumped as raw JSON with output payloads (`.ipynb`).

Ports the structured-document reading Kilo Code added in Kilo-Org/kilocode#10733 (notebooks), #10737 (DOCX), and #10740 (XLSX), adapted to hermes-agent's Python architecture.
How it was adapted
Kilo bundles the `mammoth` JS library for DOCX (~526 KB into their compiled binary). hermes-agent instead uses a pure-stdlib approach: `.docx` and `.xlsx` are Zip+OOXML containers that `zipfile` + `xml.etree` can unpack, and `.ipynb` is JSON. No new dependency is added.

Extracted text flows through the existing read pipeline (pagination, `LINE|CONTENT` numbering, char-limit guard, secret redaction), so the output is indistinguishable in shape from reading a source file. Malformed documents fall through to the normal read path so they stay inspectable, matching Kilo's malformed-notebook fallback behavior.
Changes
- `tools/read_extract.py` (new): stdlib extractors for `.ipynb`/`.docx`/`.xlsx` behind an `extract_document_text()` router. Hidden Excel sheets omitted; notebook output payloads/metadata stripped; bounded rows/cols.
- `tools/file_tools.py`: intercept extractable documents in `read_file_tool` before the binary guard; `_read_extracted_document()` applies pagination + line-numbering + char-limit + redaction. Updated tool description.
- `tests/tools/test_read_extract.py` (new): 18 tests covering extraction correctness, hidden-sheet omission, output-payload stripping, malformed-input fallback, and `read_file_tool` integration (pagination, line-numbering, binary-guard fallthrough).
Validation
- `read_file foo.ipynb`
- `read_file foo.docx`: `Cannot read binary file`
- `read_file foo.xlsx`: `Cannot read binary file`
- `.docx`
- 18 passed
- (`test_file_read_guards`, `test_file_tools`, `test_read_loop_detection`): 100 passed, no regression.

Source PRs: Kilo-Org/kilocode#10733, #10737, #10740
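For readers unfamiliar with the stdlib approach described under "How it was adapted", the sketch below shows roughly how `zipfile` + `xml.etree` can pull paragraph text out of a `.docx`, and how extracted text could then be paginated with `LINE|CONTENT` numbering. It is an illustration only: `extract_docx_text`, `number_lines`, and the `offset`/`limit` parameters are hypothetical names, not the functions in `tools/read_extract.py` or `tools/file_tools.py`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> Optional[str]:
    """Return one line per paragraph, or None if the document is malformed."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            xml_bytes = archive.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return None  # caller falls back to the normal read path

    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across <w:t> runs; join them in order.
        paragraphs.append("".join(t.text or "" for t in para.iter(f"{W_NS}t")))
    return "\n".join(paragraphs)


def number_lines(text: str, offset: int = 1, limit: int = 500) -> str:
    """Return `limit` lines starting at 1-based `offset`, as LINE|CONTENT."""
    lines = text.split("\n")
    window = lines[offset - 1 : offset - 1 + limit]
    return "\n".join(f"{offset + i}|{line}" for i, line in enumerate(window))
```

A caller would try `extract_docx_text()` first and, on `None`, continue to the ordinary read path. Note that `xml.etree.ElementTree` is not hardened against maliciously constructed XML, so a real implementation may want additional size or entity guards.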