
Add custom tokenizer for argparse.zig #3137

Draft
Argmaster wants to merge 2 commits into PixelGuys:master from Argmaster:new-args-tokenizer

@Argmaster Argmaster commented May 30, 2026

Collaborator

This pull request adds a prototype custom tokenizer to argparse.zig, used for parsing the argument list string.

As of now, the tokenizer recognizes two types of tokens: identifiers and strings.
Anything that is not a string is classified as an identifier. Identifiers are separated by any amount of ASCII whitespace, so they never contain whitespace themselves.
Strings are any sequence of characters starting with a single quote, double quote, or backtick character:

'
"
`

and ending with a second occurrence of the same quote character.
Additionally, strings support the following escape sequences:

\\
\'
\"
\`
\n

These resolve to, respectively: a single backslash, a single quote, a double quote, a backtick, and a line feed.
When one type of quote starts and ends the string, the other two types can be used inside it without escaping.
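
The escape rules above can be sketched as a small Zig function. This is an illustration only, not the code from this PR: the name `unescape`, the error names, and the choice to reject unknown escapes or a trailing backslash are all assumptions.

```zig
const std = @import("std");

/// Hypothetical helper: copies `body` (the text between the quotes) into
/// `out`, resolving the five escape sequences listed above.
/// Returns the part of `out` that was written.
/// Assumption: unknown escapes and a trailing backslash are errors.
fn unescape(body: []const u8, out: []u8) error{ InvalidEscape, BufferTooSmall }![]u8 {
    var written: usize = 0;
    var i: usize = 0;
    while (i < body.len) : (i += 1) {
        var c = body[i];
        if (c == '\\') {
            i += 1;
            if (i >= body.len) return error.InvalidEscape;
            c = switch (body[i]) {
                '\\', '\'', '"', '`' => body[i], // escaped character stands for itself
                'n' => '\n', // line feed
                else => return error.InvalidEscape,
            };
        }
        if (written >= out.len) return error.BufferTooSmall;
        out[written] = c;
        written += 1;
    }
    return out[0..written];
}

test "escape sequences resolve to single characters" {
    var buf: [32]u8 = undefined;
    const resolved = try unescape("a\\\"b\\nc", &buf);
    try std.testing.expectEqualStrings("a\"b\nc", resolved);
}
```

Every escape sequence shrinks two input bytes to one output byte, so the resolved string is never longer than its source.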

Escape sequences are resolved before a token is returned, so strings returned from next() do not contain escape sequences. This is somewhat problematic, because the resolved versions of strings have to be written somewhere, preferably without multiple allocations or reallocations. For that purpose I decided to pre-allocate a buffer with capacity equal to the length of the input string and use it to store string literals. This approach creates somewhat awkward memory ownership rules: string tokens are owned by the Tokenizer, while identifier tokens belong to the caller, since they are substrings of the original string, and there is no good way to tell the two apart. Therefore the caller has to copy tokens that are supposed to outlive the Tokenizer. This is problematic if we assume that parsed arguments have the lifetime of the args string passed to Parser and we don't copy them when constructing the Args struct (one of its members).
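
To make the ownership split concrete, here is a minimal sketch of a tokenizer with a single pre-allocated buffer. It is not the implementation from this PR: the field names, error names, and the handling of malformed input are assumptions. The test shows that identifier tokens point into the caller's input while string tokens point into the tokenizer's buffer.

```zig
const std = @import("std");

/// Sketch only; names and error handling are assumptions.
const Tokenizer = struct {
    input: []const u8,
    pos: usize = 0,
    /// Storage for resolved string tokens, allocated once with input.len bytes.
    buffer: []u8,
    used: usize = 0,
    allocator: std.mem.Allocator,

    fn init(allocator: std.mem.Allocator, input: []const u8) !Tokenizer {
        return Tokenizer{
            .input = input,
            .buffer = try allocator.alloc(u8, input.len),
            .allocator = allocator,
        };
    }

    fn deinit(self: *Tokenizer) void {
        self.allocator.free(self.buffer);
    }

    fn next(self: *Tokenizer) error{ UnterminatedString, InvalidEscape }!?[]const u8 {
        while (self.pos < self.input.len and std.ascii.isWhitespace(self.input[self.pos])) self.pos += 1;
        if (self.pos == self.input.len) return null;
        const first = self.input[self.pos];
        if (first == '\'' or first == '"' or first == '`') return try self.readString(first);
        const start = self.pos;
        while (self.pos < self.input.len and !std.ascii.isWhitespace(self.input[self.pos])) self.pos += 1;
        return self.input[start..self.pos]; // identifier: borrowed from the caller's input
    }

    fn readString(self: *Tokenizer, quote: u8) error{ UnterminatedString, InvalidEscape }![]const u8 {
        const start = self.used;
        self.pos += 1; // skip the opening quote
        while (self.pos < self.input.len) : (self.pos += 1) {
            var c = self.input[self.pos];
            if (c == quote) {
                self.pos += 1; // skip the closing quote
                return self.buffer[start..self.used]; // string: owned by the tokenizer
            }
            if (c == '\\') {
                self.pos += 1;
                if (self.pos >= self.input.len) return error.UnterminatedString;
                c = switch (self.input[self.pos]) {
                    '\\', '\'', '"', '`' => self.input[self.pos],
                    'n' => '\n',
                    else => return error.InvalidEscape,
                };
            }
            // Each byte written consumes at least one input byte, so `used` never exceeds input.len.
            self.buffer[self.used] = c;
            self.used += 1;
        }
        return error.UnterminatedString;
    }
};

test "identifiers borrow from the input, strings live in the buffer" {
    const input: []const u8 = "give 'it\\'s' \"a b\"";
    var tok = try Tokenizer.init(std.testing.allocator, input);
    defer tok.deinit();
    const a = (try tok.next()).?;
    const b = (try tok.next()).?;
    const c = (try tok.next()).?;
    try std.testing.expectEqualStrings("give", a);
    try std.testing.expectEqualStrings("it's", b);
    try std.testing.expectEqualStrings("a b", c);
    try std.testing.expect((try tok.next()) == null);
    try std.testing.expect(@intFromPtr(a.ptr) == @intFromPtr(input.ptr));
    try std.testing.expect(@intFromPtr(b.ptr) == @intFromPtr(tok.buffer.ptr));
}
```

After `deinit()`, `a` would still be valid but `b` and `c` would dangle, and nothing in the `[]const u8` type tells the caller which is which.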

We could avoid resolving escape sequences immediately, but then every use of an argument token string would have to assume that the token may contain an escape sequence, and thus every use would require:

  • re-iterating over the string to rediscover escape sequences,
  • allocating a buffer for the string with resolved escape sequences,
  • copying characters and resolving escape sequences.

That could be a lot of boilerplate.

I'm open to counterproposals.
We should not compromise on support for escape sequences, since without them we cannot effectively support arbitrary strings, e.g. Zon, either for our own use or for use in mods.

Resolves: #3102

@Argmaster Argmaster marked this pull request as ready for review May 30, 2026 14:53
@IntegratedQuantum

Member

> This is problematic if we assume that parsed arguments have lifetime of args string passed to Parser and we don't copy them when constructing Args struct (one of its members).

This could be resolved by passing in the tokenizer. That way the caller can decide how long its memory is supposed to live.

Argmaster commented May 30, 2026

Collaborator Author

> This could be resolved by passing in the tokenizer. That way the caller can decide how long its memory is supposed to live.

That may fix the issue once we extract the parser and tokenizer outside of execute(); until then it will add more boilerplate. That's a good option though.

Another approach I was considering is a "tokenize it yourself" approach: Parser passes the full remaining args string to struct.parse() and expects argument structs to find the end of their token themselves (and then return the remaining string without that token). That would allow much more advanced tokenization, where e.g. a Zon argument struct could have bounds based on { } rather than having to be passed as a string. For trivial types we could implement it by finding the first whitespace. That means there would never be built-in []const u8 support, but rather a separate struct String or similar that would handle it.
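
A rough sketch of what this contract could look like, purely for illustration. `Parsed`, `Int`, `ZonText`, and `skipWhitespace` are hypothetical names, the result shape is an assumption, and the brace matching deliberately ignores braces inside string literals.

```zig
const std = @import("std");

/// Hypothetical result shape: the parsed value plus the unconsumed input.
fn Parsed(comptime T: type) type {
    return struct { value: T, rest: []const u8 };
}

fn skipWhitespace(s: []const u8) []const u8 {
    var i: usize = 0;
    while (i < s.len and std.ascii.isWhitespace(s[i])) i += 1;
    return s[i..];
}

/// Trivial type: the token ends at the first whitespace character.
const Int = struct {
    fn parse(remaining: []const u8) !Parsed(i64) {
        const s = skipWhitespace(remaining);
        var end: usize = 0;
        while (end < s.len and !std.ascii.isWhitespace(s[end])) end += 1;
        return Parsed(i64){
            .value = try std.fmt.parseInt(i64, s[0..end], 10),
            .rest = s[end..],
        };
    }
};

/// Structured type: the token is bounded by balanced braces, so it may
/// contain whitespace. Simplification: braces inside strings are not handled.
const ZonText = struct {
    fn parse(remaining: []const u8) error{ ExpectedBrace, UnbalancedBraces }!Parsed([]const u8) {
        const s = skipWhitespace(remaining);
        var i: usize = if (s.len > 0 and s[0] == '.') 1 else 0; // optional leading '.'
        if (i >= s.len or s[i] != '{') return error.ExpectedBrace;
        var depth: usize = 0;
        while (i < s.len) : (i += 1) {
            if (s[i] == '{') depth += 1;
            if (s[i] == '}') {
                depth -= 1; // cannot underflow: the scan starts on '{'
                if (depth == 0) {
                    return Parsed([]const u8){ .value = s[0 .. i + 1], .rest = s[i + 1 ..] };
                }
            }
        }
        return error.UnbalancedBraces;
    }
};

test "each argument type finds the end of its own token" {
    const n = try Int.parse("  42 .{ .x = 1, .y = .{ 2 } } tail");
    try std.testing.expectEqual(@as(i64, 42), n.value);
    const z = try ZonText.parse(n.rest);
    try std.testing.expectEqualStrings(".{ .x = 1, .y = .{ 2 } }", z.value);
    try std.testing.expectEqualStrings(" tail", z.rest);
}
```

In this shape the Parser would only thread `rest` from one argument type to the next, and each type decides its own token boundaries.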

This does not fix the lifetime issue; it removes it from our scope of interest. Possible implementations will not run into the same issue, though, since they will be aware that they have to own the string and that they have to handle escape sequences according to their own rules.

Unfortunately it completely changes the current contract of Parser in a very disruptive way, so I'm not sure whether the power it brings outweighs the possible extra boilerplate and the amount of extra refactoring. Thankfully both would mostly be confined to custom types and Parser itself, so command execute code would not have to change.

@IntegratedQuantum

Member

> That may fix the issue when we extract parser and tokenizer outside of execute(), until then it will add more boilerplate.

Then I'd suggest we wait for that. The particles command can be written without spaces in the zon, so I think it wouldn't be too bad to regress this behavior temporarily.

> Another approach I was considering was "tokenize it yourself" approach

I don't like this. We shouldn't allow mods to do their own thing; we should provide a simple and consistent baseline instead of allowing mods to do their own style just because they like to use JSON syntax or whatever.

If we do want to support zon parsing (and in my opinion there is no reason to; the particles command should be done with flags), then we should hardcode it into our tokenizer instead.

@Argmaster

Collaborator Author

OK, I will convert this PR to a draft and proceed with migrating the WE commands.

@Argmaster Argmaster marked this pull request as draft May 31, 2026 13:29
Development

Successfully merging this pull request may close these issues.

Add custom splitter to argparse Parser
