Fix and Enhance Scanning Involving & and - in Regular Expressions in Unicode Sets Mode#62716
Fix and Enhance Scanning Involving & and - in Regular Expressions in Unicode Sets Mode#62716graphemecluster wants to merge 2 commits intomicrosoft:mainfrom
& and - in Regular Expressions in Unicode Sets Mode#62716Conversation
… in Unicode Sets Mode - Removes the incorrect errors on individual `&`s (not `&&`s) at certain positions - Flags `-`s at the beginning and the end of class sets - Improves error recovery and makes diagnostic messages more intuitive and user-friendly when there are consecutive `&`s or `-`s or when operators are mixed within the same class
There was a problem hiding this comment.
Pull Request Overview
This pull request improves the error recovery and diagnostic messages for regular expression Unicode sets mode (/v flag) in TypeScript. The changes focus on handling complex operator combinations (&& for intersection and -- for subtraction) in character classes, providing more user-friendly error messages and better error recovery for malformed regex patterns.
Key changes:
- Enhanced scanner logic to better detect and report errors in Unicode sets mode regular expressions
- Added two new diagnostic messages (error codes 1547 and 1548) for clearer error reporting
- Comprehensive test coverage for various edge cases involving
&,-,&&, and--operators
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
tests/cases/compiler/regularExpressionUnicodeSets.ts |
New comprehensive test file covering 188 regex patterns with various operator combinations |
tests/baselines/reference/regularExpressionUnicodeSets.types |
Baseline file capturing type information for the test cases |
tests/baselines/reference/regularExpressionUnicodeSets.symbols |
Baseline file capturing symbol information for the test cases |
tests/baselines/reference/regularExpressionUnicodeSets.js |
Baseline file showing compiled JavaScript output |
tests/baselines/reference/regularExpressionUnicodeSets.errors.txt |
Baseline file documenting the 226 expected error messages |
tests/baselines/reference/regularExpressionScanning(target=esnext).errors.txt |
Updated baseline to reflect new error messages |
tests/baselines/reference/regularExpressionScanning(target=es5).errors.txt |
Updated baseline to reflect new error messages |
tests/baselines/reference/regularExpressionScanning(target=es2015).errors.txt |
Updated baseline to reflect new error messages |
src/compiler/scanner.ts |
Enhanced regex scanner with improved error recovery for Unicode sets mode |
src/compiler/diagnosticMessages.json |
Added two new error messages (codes 1547 and 1548) |
| // If we see '-', parse the expression as ClassUnion. | ||
| // An exception is made for '-&&', in which case it is more intuitive to parse as ClassIntersection and scan '-' as an invalid operand | ||
| else if ( | ||
| ch === CharacterCodes.minus && ( | ||
| charCodeChecked(pos + 1) !== CharacterCodes.ampersand || charCodeChecked(pos + 2) !== CharacterCodes.ampersand | ||
| ) | ||
| ) { | ||
| // '-' can't be an operand in a ClassUnion (a class in Unicode sets mode can't start with '-') | ||
| error(Diagnostics.Incomplete_character_class_range_Expected_a_class_set_character); | ||
| mayContainStrings = false; | ||
| } |
There was a problem hiding this comment.
I was struggling whether to treat - as an unexpected character ("Unexpected '-'. Did you mean to escape it with backslash?") or an incomplete character class with a zero-width message in cases like /[-a-b-]/v
There was a problem hiding this comment.
And temporarily went with the former.
| "Incomplete character class range. Expected a class set character.": { | ||
| "category": "Error", | ||
| "code": 1547 | ||
| }, | ||
| "Missing '{0}' in between class set operands. To form a union of these characters, wrap them in a nested class instead.": { | ||
| "category": "Error", | ||
| "code": 1548 | ||
| }, |
There was a problem hiding this comment.
I would appreciate any better suggestions regarding the wording of the messages.
There was a problem hiding this comment.
I also wonder if old diagnostic messages are amendable. Precisely, shall I alter the message of TS1519 from "Wrap it" to "Wrap them"?
| if (ch === charCodeChecked(pos + 1) && (ch === CharacterCodes.minus || ch === CharacterCodes.ampersand)) { | ||
| error(Diagnostics.Operators_must_not_be_mixed_within_a_character_class_Wrap_it_in_a_nested_class_instead, pos, 2); | ||
| pos += 2; | ||
| operand = text.slice(start, pos); | ||
| } |
There was a problem hiding this comment.
It would be even more intuitive if we were to rescan the class set as intersection/subtraction when we hit any -- or &&, but it is infeasible without a two-pass scan, the first of which must not emit errors. Unfortunately, I need to give up as it would overcomplicate the implementation.
| operand = scanClassSetOperand(); | ||
| } | ||
| // Use the first operator to determine the expression type | ||
| switch (charCodeChecked(pos)) { |
There was a problem hiding this comment.
it might not have functional difference right now but to future-proof this I'd still include this:
| switch (charCodeChecked(pos)) { | |
| switch (ch = charCodeChecked(pos)) { |
The reason why this should be reassigned here is that the above call to scanClassSetOperand could move the pos cursor
&s (not&&s) at certain positions-(class set range operator) at the end of class sets are not flagged as errors&s or-s, when operators are mixed within the same class, or when there are consecutive operands in class intersections and subtractionsFixes #62707