Fix panic on non-ASCII input by using byte offsets instead of char in… by irshadnilam · Pull Request #1 · agents-sh/tryparse

irshadnilam · 2026-03-27T22:49:50Z

Problem

Any LLM response containing multibyte UTF-8 characters (Thai, Chinese,
Japanese, Arabic, emoji, accented Latin) caused a hard panic:
byte index 77 is not a char boundary; it is inside 'ก' (bytes 75..78)

find_balanced_boundaries collected chars with chars() and stored the
loop index i — a character count — directly into the boundaries vec
as a byte offset. For ASCII the two are identical. For any multibyte
codepoint they diverge, so the subsequent &input[start..end] slice
landed inside a codepoint and panicked. The same bug existed in both
HeuristicStrategy (heuristic.rs) and HeuristicExtractor (extractor.rs).
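
The divergence described above can be reproduced in isolation. The following is a minimal sketch, not the crate's code: the helper names first_brace_char_index and first_brace_byte_offset are hypothetical and exist only to contrast a character count with a byte offset.

```rust
/// Hypothetical helper mirroring the bug: returns the *character index*
/// of the first `{`, i.e. how many chars precede it.
fn first_brace_char_index(input: &str) -> Option<usize> {
    input.chars().position(|c| c == '{')
}

/// Hypothetical helper mirroring the fix: returns the *byte offset*
/// of the first `{`, which is what string slicing expects.
fn first_brace_byte_offset(input: &str) -> Option<usize> {
    input
        .char_indices()
        .find(|&(_, c)| c == '{')
        .map(|(byte_offset, _)| byte_offset)
}

/// Returns true when `index` can safely be used as a slice bound for `input`.
fn is_safe_slice_start(input: &str, index: usize) -> bool {
    input.is_char_boundary(index)
}
```

For the input "ก{}", the character index of `{` is 1 while its byte offset is 3, because 'ก' occupies three bytes in UTF-8. Index 1 is not a char boundary, so `&input[1..]` would panic, whereas `&input[3..]` is valid. For pure ASCII input the two helpers return the same value, which is why the bug stayed hidden.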

Fix

Replace chars() with char_indices() throughout both implementations.
char_indices() yields (byte_offset, char) pairs, so byte_start is
read directly from the tuple and byte_end is derived from the start
offset of the next character (or input.len() when at the end of the
string). find_matching_close is updated to accept &[(usize, char)]
and destructure accordingly; its return value remains a vec index (not a
byte offset), which is correct and unchanged.
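
As an illustration of the approach, here is a minimal sketch of boundary detection built on char_indices(). It is written under assumptions and is not the code from this PR: the function name balanced_object_ranges, its return type, and the simplified string-literal handling are all hypothetical.

```rust
/// Hypothetical sketch: returns `(byte_start, byte_end)` ranges of the
/// top-level balanced `{ ... }` spans found in `input`.
fn balanced_object_ranges(input: &str) -> Vec<(usize, usize)> {
    // Each entry pairs a char with the byte offset where it starts.
    let chars: Vec<(usize, char)> = input.char_indices().collect();
    let mut ranges = Vec::new();
    let mut depth = 0usize;
    let mut start = 0usize;
    let mut in_string = false;
    let mut escaped = false;

    for (i, &(byte_start, c)) in chars.iter().enumerate() {
        // End of this char in bytes: the next char's start offset,
        // or input.len() when this is the last char.
        let byte_end = chars.get(i + 1).map_or(input.len(), |&(next, _)| next);

        if in_string {
            // Braces inside a JSON string literal must not affect depth.
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }

        match c {
            '"' if depth > 0 => in_string = true,
            '{' => {
                if depth == 0 {
                    start = byte_start;
                }
                depth += 1;
            }
            '}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    ranges.push((start, byte_end));
                }
            }
            _ => {}
        }
    }
    ranges
}
```

Every range produced this way starts and ends on a char boundary, so `&input[start..end]` cannot panic regardless of which scripts appear in the surrounding prose. The loop counter `i` is used only to look up the next entry in the vec, never as a slice bound.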

What is not changed

Public API is identical
Parsing behaviour for ASCII input is identical
No meaningful performance impact: char_indices() walks the string in a single pass, just as chars() does, and only tracks the running byte offset in addition

Tests added

8 new regression tests across both files covering Thai (3-byte), CJK
(3-byte), emoji (4-byte), accented Latin (2-byte), and multibyte
characters appearing in prose before the JSON object (the case where
char index and byte offset diverge for the opening { itself).
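
A regression test in this spirit might look like the sketch below. It does not reproduce the tests added in this PR: first_object is a hypothetical stand-in for the crate's extractor (it ignores braces inside string literals for brevity), and multibyte_cases is an illustrative table of inputs where multibyte prose precedes the JSON object.

```rust
/// Hypothetical stand-in for the extractor under test: returns the first
/// balanced `{ ... }` span, sliced by byte offsets from char_indices().
fn first_object(input: &str) -> Option<&str> {
    let mut depth = 0usize;
    let mut start = None;
    for (byte_start, c) in input.char_indices() {
        match c {
            '{' => {
                if depth == 0 {
                    start = Some(byte_start);
                }
                depth += 1;
            }
            '}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    // '}' is one byte, but len_utf8() keeps this general.
                    let end = byte_start + c.len_utf8();
                    return start.map(|s| &input[s..end]);
                }
            }
            _ => {}
        }
    }
    None
}

/// Illustrative (input, expected object) pairs, one per script family.
fn multibyte_cases() -> Vec<(&'static str, &'static str)> {
    vec![
        (r#"คำตอบ: {"a": 1}"#, r#"{"a": 1}"#),              // Thai, 3-byte
        (r#"结果如下 {"名": "值"}"#, r#"{"名": "值"}"#),      // CJK, 3-byte
        (r#"🎉 {"ok": true} 🎉"#, r#"{"ok": true}"#),        // emoji, 4-byte
        (r#"Voilà: {"café": "crème"}"#, r#"{"café": "crème"}"#), // accented Latin, 2-byte
    ]
}

#[test]
fn extracts_object_after_multibyte_prose() {
    for (input, expected) in multibyte_cases() {
        assert_eq!(first_object(input), Some(expected));
    }
}
```

Placing the multibyte text before the opening `{` is the important part: it forces the char index and the byte offset of the brace to differ, so a chars()-based implementation would either panic or return the wrong slice.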

Commit 543410a: Fix panic on non-ASCII input by using byte offsets instead of char indices in heuristic JSON boundary detection

irshadnilam requested a review from grainier March 27, 2026 22:49

grainier approved these changes Mar 28, 2026

grainier merged commit 19b9adf into agents-sh:main Mar 28, 2026
1 check passed

grainier merged 1 commit into agents-sh:main from irshadnilam:non-ascii
