Fix panic on non-ASCII input by using byte offsets instead of char in…#1

Merged
grainier merged 1 commit into agents-sh:main from irshadnilam:non-ascii
Mar 28, 2026
@irshadnilam
Contributor

Problem

Any LLM response containing multibyte UTF-8 characters (Thai, Chinese,
Japanese, Arabic, emoji, accented Latin) caused a hard panic:

    byte index 77 is not a char boundary; it is inside 'ก' (bytes 75..78)

find_balanced_boundaries collected chars with chars() and stored the
loop index i, which is a character count, in the boundaries vec as if
it were a byte offset. For pure-ASCII input the two are identical. Once
a multibyte codepoint appears before a boundary they diverge, so the
subsequent &input[start..end] slice landed inside a codepoint and
panicked. The same bug existed in both HeuristicStrategy (heuristic.rs)
and HeuristicExtractor (extractor.rs).
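
To make the divergence concrete, here is a minimal sketch. It is not code from this PR: first_brace_positions and slice_from are hypothetical helpers written only for illustration. It reports both the character index and the byte offset of the first { and uses the non-panicking str::get to show what happens when each is used as a slice start.

```rust
/// Hypothetical helper (not from the PR): returns
/// (char_index, byte_offset) of the first `{` in `input`, if any.
fn first_brace_positions(input: &str) -> Option<(usize, usize)> {
    input
        .char_indices() // yields (byte_offset, char)
        .enumerate() // adds the running character index
        .find(|(_, (_, c))| *c == '{')
        .map(|(char_idx, (byte_off, _))| (char_idx, byte_off))
}

/// Hypothetical helper (not from the PR): non-panicking version of
/// `&input[offset..]`. Returns None when `offset` is out of range or
/// falls inside a multibyte codepoint, which is exactly the situation
/// where the indexing syntax would panic.
fn slice_from(input: &str, offset: usize) -> Option<&str> {
    input.get(offset..)
}
```

For the input "ก {}" the Thai letter occupies bytes 0..3, so the brace sits at character index 2 but byte offset 4. Slicing from 2 fails because byte 2 is inside 'ก', while slicing from 4 yields "{}". For an ASCII-only input such as "x {}" both numbers are 2, which is why the bug went unnoticed there.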

Fix

Replace chars() with char_indices() throughout both implementations.
char_indices() yields (byte_offset, char) pairs, so byte_start is
read directly from the tuple and byte_end is derived from the start
offset of the next character (or input.len() when at the end of the
string). find_matching_close is updated to accept &[(usize, char)]
and destructure accordingly; its return value remains a vec index (not a
byte offset), which is correct and unchanged.
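
The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the names matching_close and balanced_spans, their signatures, and the simple string-literal handling are all hypothetical. Only the use of char_indices(), the &[(usize, char)] parameter, the vec-index return value, and the byte_end derivation come from the description.

```rust
/// Hypothetical stand-in for find_matching_close: takes the index
/// (into `chars`) of an opening `{` and returns the index (into
/// `chars`, not a byte offset) of the matching `}`.
fn matching_close(chars: &[(usize, char)], open_idx: usize) -> Option<usize> {
    // Guard: only start counting from an actual opening brace.
    if chars.get(open_idx).map(|&(_, c)| c) != Some('{') {
        return None;
    }
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &(_, c)) in chars.iter().enumerate().skip(open_idx) {
        if in_string {
            // Braces inside string literals do not affect nesting.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(i);
                }
            }
            _ => {}
        }
    }
    None
}

/// Hypothetical stand-in for find_balanced_boundaries: returns
/// (byte_start, byte_end) pairs that are always safe to slice with.
fn balanced_spans(input: &str) -> Vec<(usize, usize)> {
    let chars: Vec<(usize, char)> = input.char_indices().collect();
    let mut spans = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // byte_start is read directly from the (byte_offset, char) tuple.
        let (byte_start, c) = chars[i];
        if c == '{' {
            if let Some(close_idx) = matching_close(&chars, i) {
                // byte_end is the start offset of the next character,
                // or input.len() when the `}` is the last character.
                let byte_end = chars
                    .get(close_idx + 1)
                    .map(|&(off, _)| off)
                    .unwrap_or(input.len());
                spans.push((byte_start, byte_end));
                i = close_idx + 1;
                continue;
            }
        }
        i += 1;
    }
    spans
}
```

Because every stored offset comes from char_indices() or input.len(), each span is guaranteed to start and end on a char boundary, so &input[start..end] cannot panic regardless of what precedes or follows the JSON object.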

What is not changed

  • Public API is identical
  • Parsing behaviour for ASCII input is identical
  • No meaningful performance impact: char_indices() does the same UTF-8
    decoding as chars() and only tracks the byte offset alongside

Tests added

8 new regression tests across both files covering Thai (3-byte), CJK
(3-byte), emoji (4-byte), accented Latin (2-byte), and multibyte
characters appearing in prose before the JSON object (the case where
char index and byte offset diverge for the opening { itself).
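
As a quick check of the byte widths listed above, the following sketch uses the standard char::len_utf8 method. The helper name utf8_widths is hypothetical and is not one of the regression tests added in this PR.

```rust
/// Hypothetical helper (not from the PR): the UTF-8 encoded width,
/// in bytes, of each character in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

Applied to the five characters 'a', 'é' (U+00E9), 'ก' (U+0E01), '中' (U+4E2D) and '🎉' (U+1F389), it returns 1, 2, 3, 3 and 4 respectively. Any of the non-ASCII characters placed before the opening { is therefore enough to push its byte offset ahead of its character index.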

@irshadnilam irshadnilam requested a review from grainier March 27, 2026 22:49
@grainier grainier merged commit 19b9adf into agents-sh:main Mar 28, 2026
1 check passed