Add custom tokenizer for argparse.zig #3137
This could be resolved by passing in the tokenizer. That way the caller can decide how long its memory is supposed to live.
That may fix the issue when we extract parser and tokenizer outside of

Another approach I was considering is a "tokenize it yourself" approach: Parser passes the full remaining args string to

This does not fix the lifetime issue; it removes it from our scope of interest. Possible implementations will not run into the same issue, since they will be aware that they have to own the string and handle escape sequences according to their own rules. Unfortunately, it changes the current contract of Parser in a very disruptive way, so I'm not sure the power it brings outweighs the possible extra boilerplate and the amount of extra refactoring. Thankfully, both would be confined mostly to custom types and Parser itself, so command execute code would not have to change.
Then I'd suggest we wait for that. particles command can be written without spaces in the zon, I think it wouldn't be too bad to regress this behavior momentarily.
I don't like this. We shouldn't allow mods to do their own thing, we should provide a simple and consistent baseline instead of allowing mods to do their own style just because they like to use json syntax or whatever. If we do want to support zon parsing (and in my opinion there is no reason for it, particles command should be done with flags), then we should hardcode it into our tokenizer instead.
Ok, I will draft this PR and proceed to migrate WE commands.
This pull request adds a prototype custom tokenizer to `argparse.zig`, used for parsing the argument list string.

As of now, the tokenizer recognizes two types of tokens: identifiers and strings.
Anything that is not a string is classified as an identifier. Identifiers are separated by any amount of ASCII whitespace, so they never contain whitespace themselves.
Strings are any sequence of characters starting with a single quote, double quote, or backtick character and ending with a second occurrence of the same quote character.
Additionally, strings support escape sequences that resolve to a single backslash, single quote, double quote, backtick, and line feed.
When one type of quote starts and ends the string, the other two types can be used freely inside it without escaping.
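The token rules above can be illustrated with a minimal sketch. This is not the tokenizer from this PR: `SketchTokenizer`, its fields, and the caller-provided `buffer` are hypothetical names chosen for the example, and the spelling of the escapes (backslash followed by `\`, `'`, `"`, a backtick, or `n`) is an assumption based on the list of resolved characters. The sketch also assumes that an unterminated string ends at the end of input and that an unknown escape resolves to the escaped character itself.

```zig
const std = @import("std");

/// Hypothetical sketch of the token rules; not the PR's actual Tokenizer.
const SketchTokenizer = struct {
    input: []const u8,
    /// Caller-provided storage for resolved string tokens (assumption).
    /// input.len bytes are always enough: resolving escapes never grows a string.
    buffer: []u8,
    pos: usize = 0,
    used: usize = 0,

    fn next(self: *SketchTokenizer) ?[]const u8 {
        // Skip any amount of ASCII whitespace between tokens.
        while (self.pos < self.input.len and std.ascii.isWhitespace(self.input[self.pos])) {
            self.pos += 1;
        }
        if (self.pos >= self.input.len) return null;

        const first = self.input[self.pos];
        if (first == '\'' or first == '"' or first == '`') return self.string(first);

        // Identifier: runs until the next whitespace, returned as a slice of the input.
        const start = self.pos;
        while (self.pos < self.input.len and !std.ascii.isWhitespace(self.input[self.pos])) {
            self.pos += 1;
        }
        return self.input[start..self.pos];
    }

    fn string(self: *SketchTokenizer, quote: u8) []const u8 {
        self.pos += 1; // consume the opening quote
        const start = self.used;
        while (self.pos < self.input.len) {
            var c = self.input[self.pos];
            self.pos += 1;
            if (c == quote) break; // matching closing quote ends the string
            if (c == '\\' and self.pos < self.input.len) {
                // Assumed escape spelling: \n is a line feed, anything else is taken literally.
                const escaped = self.input[self.pos];
                c = if (escaped == 'n') '\n' else escaped;
                self.pos += 1;
            }
            self.buffer[self.used] = c;
            self.used += 1;
        }
        // String token: a slice of the buffer, not of the input.
        return self.buffer[start..self.used];
    }
};

pub fn main() void {
    var storage: [64]u8 = undefined;
    var tokens = SketchTokenizer{ .input = "say 'it\\'s \"fine\"' now", .buffer = &storage };
    // Three tokens: an identifier, a resolved string, an identifier.
    while (tokens.next()) |token| std.debug.print("[{s}]\n", .{token});
}
```

In this sketch, identifier tokens point into `input` while string tokens point into `buffer`, so the two kinds of token have different backing memory even though both are plain slices.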
Escape sequences are resolved before a token is returned, so strings returned from `next()` do not contain escape sequences. This is a little problematic, because the resolved versions of strings have to be written somewhere, preferably without multiple allocations or reallocations. For that purpose, I decided to pre-allocate a buffer with capacity equal to the size of the input string and use it to store string literals.

This approach creates somewhat awkward memory ownership rules: string tokens are owned by the Tokenizer, while identifier tokens belong to the caller, since they are substrings of the original string, and there is no good way to tell the two apart. The caller therefore has to copy tokens that are supposed to outlive the Tokenizer. This is problematic if we assume that parsed arguments have the lifetime of the args string passed to Parser and we don't copy them when constructing the `Args` struct (one of its members).

We could avoid resolving escape sequences immediately, but then every use of an argument token string would have to assume that the token may contain an escape sequence, and thus every use would require:
That could be a lot of boilerplate.
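As a rough illustration of that kind of per-use step (not the snippet from the PR), the sketch below assumes a hypothetical helper `unescape` and the same assumed escape spelling of a backslash followed by the escaped character, with `n` standing for a line feed. Each consumer of a raw token would need its own scratch space and an explicit resolve call before using the text.

```zig
const std = @import("std");

/// Hypothetical helper: copies `raw` into `out`, resolving backslash escapes.
/// `out` must be at least raw.len bytes long.
fn unescape(raw: []const u8, out: []u8) []const u8 {
    var len: usize = 0;
    var i: usize = 0;
    while (i < raw.len) : (i += 1) {
        var c = raw[i];
        if (c == '\\' and i + 1 < raw.len) {
            // Assumed spelling: \n is a line feed, any other escaped character is literal.
            i += 1;
            c = if (raw[i] == 'n') '\n' else raw[i];
        }
        out[len] = c;
        len += 1;
    }
    return out[0..len];
}

pub fn main() void {
    // Raw token as it would sit in the args string, escapes untouched.
    const raw_token = "say \\\"hi\\\"";
    // Every use site needs scratch space and an explicit resolve step.
    var scratch: [32]u8 = undefined;
    const text = unescape(raw_token, &scratch);
    std.debug.print("{s}\n", .{text});
}
```

The trade-off is visible here: tokens stay plain substrings of the args string, so their lifetime is simple, but the resolve step and its scratch memory move to every call site.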
I'm open to counterproposals.
We should not compromise on support for escape sequences, as without them we cannot effectively support arbitrary strings, e.g. ZON, either for our own use or for use in mods.
Resolves: #3102