Add custom tokenizer for `argparse.zig` by Argmaster · Pull Request #3137 · PixelGuys/Cubyz

Argmaster · 2026-05-30T14:27:22Z

This pull request adds a prototype custom tokenizer to argparse.zig for parsing the argument list string.

For now the tokenizer recognizes two types of tokens: identifiers and strings.
Anything that is not a string is classified as an identifier. Identifiers are separated by any amount of ASCII whitespace, so they never contain whitespace themselves.
A string is any sequence of characters that starts with a single quote, double quote, or backtick character:

'
"
`

and ends with the next unescaped occurrence of the same quote character.
Additionally, strings support the following escape sequences:

\\
\'
\"
\`
\n

These resolve to: a single backslash, a single quote, a double quote, a backtick, and a line feed.
When one type of quote delimits the string, the other two types can be used inside it without escaping.
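
To make the rules above concrete, here is a minimal sketch of such a tokenizer. It is not the Tokenizer from this PR: the name SketchTokenizer, the caller-provided buffer, the error names, and the handling of quotes inside identifiers, unknown escapes and unterminated strings are all assumptions made for illustration.

```zig
const std = @import("std");

/// Hypothetical sketch of the token rules; not the code from this PR.
const SketchTokenizer = struct {
    input: []const u8,
    /// Caller-provided storage for resolved strings (assumed >= input.len bytes).
    buffer: []u8,
    pos: usize = 0,
    used: usize = 0,

    fn isQuote(c: u8) bool {
        return c == '\'' or c == '"' or c == '`';
    }

    fn next(self: *SketchTokenizer) error{ UnterminatedString, InvalidEscape }!?[]const u8 {
        // Skip any amount of ASCII whitespace between tokens.
        while (self.pos < self.input.len and std.ascii.isWhitespace(self.input[self.pos])) self.pos += 1;
        if (self.pos >= self.input.len) return null;

        const quote = self.input[self.pos];
        if (!isQuote(quote)) {
            // Identifier: runs to the next whitespace and is a slice of the input.
            // Assumption: a quote in the middle of an identifier is kept as-is.
            const start = self.pos;
            while (self.pos < self.input.len and !std.ascii.isWhitespace(self.input[self.pos])) self.pos += 1;
            return self.input[start..self.pos];
        }

        // String: copy into the buffer while resolving escape sequences.
        self.pos += 1; // consume the opening quote
        const start = self.used;
        while (self.pos < self.input.len) {
            const c = self.input[self.pos];
            self.pos += 1;
            if (c == quote) {
                const token: []const u8 = self.buffer[start..self.used];
                return token;
            }
            var out = c;
            if (c == '\\') {
                if (self.pos >= self.input.len) return error.UnterminatedString;
                out = switch (self.input[self.pos]) {
                    '\\' => '\\',
                    '\'' => '\'',
                    '"' => '"',
                    '`' => '`',
                    'n' => '\n',
                    else => return error.InvalidEscape, // assumption: unknown escapes are errors
                };
                self.pos += 1;
            }
            self.buffer[self.used] = out;
            self.used += 1;
        }
        return error.UnterminatedString;
    }
};

test "identifiers and quoted strings" {
    const input = "say 'it\\'s' \"a b\"";
    var buffer: [input.len]u8 = undefined;
    var t = SketchTokenizer{ .input = input, .buffer = &buffer };
    try std.testing.expectEqualStrings("say", (try t.next()).?);
    try std.testing.expectEqualStrings("it's", (try t.next()).?);
    try std.testing.expectEqualStrings("a b", (try t.next()).?);
    try std.testing.expect((try t.next()) == null);
}
```

In this sketch the identifier token points into the input, while the two string tokens point into the buffer, because resolving escapes can only shrink a string and so a buffer as long as the input is always large enough.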

Escape sequences are resolved before a token is returned, so strings returned from next() do not contain escape sequences. This is somewhat problematic, because the resolved versions of strings have to be written somewhere, preferably without multiple allocations/reallocations. For that purpose I decided to pre-allocate a buffer with capacity equal to the string size and use it to store string literals. This approach creates somewhat awkward memory ownership rules: string tokens are owned by the Tokenizer, while identifier tokens belong to the caller, since they are substrings of the original string, and there is no good way to tell the two apart. The caller will therefore have to copy tokens if they are supposed to outlive the Tokenizer. This is problematic if we assume that parsed arguments have the lifetime of the args string passed to Parser and we don't copy them when constructing the Args struct (one of its members).

We could avoid resolving escape sequences immediately, but then every use of an argument token string would have to assume that the token may contain an escape sequence, and thus every use would require:

reiterating the string to rediscover escape sequences,
allocating a buffer for the string with resolved escape sequences,
copying characters and resolving escape sequences.
That could be a lot of boilerplate.
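
For comparison, the three steps above would look roughly like this at every use site. This is only a sketch: resolveInto and its error name are hypothetical helpers, not part of argparse.zig, and it supports just the five escape sequences listed earlier.

```zig
const std = @import("std");

/// Hypothetical helper: walks `raw`, resolves escape sequences and copies the
/// result into `out`. `out` must be at least as long as `raw`.
fn resolveInto(raw: []const u8, out: []u8) error{InvalidEscape}![]const u8 {
    std.debug.assert(out.len >= raw.len);
    var len: usize = 0;
    var i: usize = 0;
    while (i < raw.len) : (i += 1) {
        var c = raw[i];
        if (c == '\\') {
            // Step 1: rediscover the escape sequence.
            i += 1;
            if (i >= raw.len) return error.InvalidEscape;
            c = switch (raw[i]) {
                '\\' => '\\',
                '\'' => '\'',
                '"' => '"',
                '`' => '`',
                'n' => '\n',
                else => return error.InvalidEscape,
            };
        }
        // Step 3: copy the (possibly resolved) character.
        out[len] = c;
        len += 1;
    }
    const resolved: []const u8 = out[0..len];
    return resolved;
}

test "what every use of an unresolved token would need" {
    const allocator = std.testing.allocator;
    const raw = "first\\nsecond"; // token as it appeared in the args string
    // Step 2: allocate a buffer for the resolved string.
    const buf = try allocator.alloc(u8, raw.len);
    defer allocator.free(buf);
    const resolved = try resolveInto(raw, buf);
    try std.testing.expectEqualStrings("first\nsecond", resolved);
}
```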

I'm open to counterproposals.
We should not compromise on support for escape sequences, as without them we cannot effectively support arbitrary strings, e.g. Zon, either for our own use or for use in mods.

Resolves: #3102

IntegratedQuantum · 2026-05-30T15:36:56Z

> This is problematic if we assume that parsed arguments have lifetime of args string passed to Parser and we don't copy them when constructing Args struct (one of its members).

This could be resolved by passing in the tokenizer. That way the caller can decide how long its memory is supposed to live.

Argmaster · 2026-05-30T15:48:40Z

> This could be resolved by passing in the tokenizer. That way the caller can decide how long its memory is supposed to live.

That may fix the issue once we extract the parser and tokenizer out of execute(); until then it will add more boilerplate. That's a good option though.

Another approach I was considering was a "tokenize it yourself" approach: Parser passes the full remaining args string to struct.parse() and expects argument structs to find the end of their token themselves (and then return the remaining string without that token). That would allow much more advanced tokenization, where e.g. a Zon argument struct could have bounds based on { } rather than having to be passed as a string. For trivial types we could implement it by finding the first whitespace. That means there would never be built-in []const u8 support, but rather a separate String struct or similar that would handle it.
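
A rough sketch of what that contract could look like, under loud assumptions: Parsed, Word and Braced are hypothetical names, the real Parser interface differs, and Braced only counts braces (it is not a Zon parser and ignores braces inside string literals).

```zig
const std = @import("std");

/// Hypothetical result type: the parsed value plus the unconsumed rest of the args string.
fn Parsed(comptime T: type) type {
    return struct { value: T, rest: []const u8 };
}

fn skipWhitespace(s: []const u8) []const u8 {
    var i: usize = 0;
    while (i < s.len and std.ascii.isWhitespace(s[i])) i += 1;
    return s[i..];
}

/// Trivial argument type: its token ends at the first whitespace.
const Word = struct {
    text: []const u8,

    fn parse(args: []const u8) error{MissingArgument}!Parsed(Word) {
        const s = skipWhitespace(args);
        var end: usize = 0;
        while (end < s.len and !std.ascii.isWhitespace(s[end])) end += 1;
        if (end == 0) return error.MissingArgument;
        return .{ .value = .{ .text = s[0..end] }, .rest = s[end..] };
    }
};

/// Argument type whose bounds are a balanced { } pair, so it may contain spaces.
const Braced = struct {
    text: []const u8,

    fn parse(args: []const u8) error{ MissingArgument, UnbalancedBraces }!Parsed(Braced) {
        const s = skipWhitespace(args);
        if (s.len == 0 or s[0] != '{') return error.MissingArgument;
        var depth: usize = 0;
        for (s, 0..) |c, i| {
            if (c == '{') depth += 1;
            if (c == '}') {
                depth -= 1; // cannot underflow: s[0] is '{' and we return at depth 0
                if (depth == 0) return .{ .value = .{ .text = s[0 .. i + 1] }, .rest = s[i + 1 ..] };
            }
        }
        return error.UnbalancedBraces;
    }
};

test "each argument type finds the end of its own token" {
    const first = try Braced.parse("{a {b c}} 42 rest");
    try std.testing.expectEqualStrings("{a {b c}}", first.value.text);
    const second = try Word.parse(first.rest);
    try std.testing.expectEqualStrings("42", second.value.text);
    try std.testing.expectEqualStrings(" rest", second.rest);
}
```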

This does not fix the lifetime issue; it moves it out of our scope. Possible implementations would not run into the same issue, though, since they would know that they have to own the string and handle escape sequences according to their own rules.

Unfortunately it completely changes the current contract of Parser in a very disruptive way, so I'm not sure whether the power it brings outweighs the possible extra boilerplate and the amount of refactoring required. Thankfully both would mostly be confined to custom types and Parser itself, so command execute code would not have to change.

IntegratedQuantum · 2026-05-31T09:26:35Z

> That may fix the issue when we extract parser and tokenizer outside of execute(), until then it will add more boilerplate.

Then I'd suggest we wait for that. The particles command can be written without spaces in the zon, so I think it wouldn't be too bad to regress this behavior temporarily.

> Another approach I was considering was "tokenize it yourself" approach

I don't like this. We shouldn't allow mods to do their own thing, we should provide a simple and consistent baseline instead of allowing mods to do their own style just because they like to use json syntax or whatever.

If we do want to support zon parsing (and in my opinion there is no reason for it; the particles command should be done with flags), then we should hardcode it into our tokenizer instead.

Argmaster · 2026-05-31T13:29:22Z

OK, I will convert this PR to a draft and proceed to migrate the WE commands.

Argmaster added 2 commits May 30, 2026 16:08

Add new Tokenizer for argparser

2ef943e

Integrate Tokenizer with Parser

48c1e10

Argmaster marked this pull request as ready for review May 30, 2026 14:53

Argmaster marked this pull request as draft May 31, 2026 13:29

Argmaster wants to merge 2 commits into PixelGuys:master from Argmaster:new-args-tokenizer
