
feat: simplify regex wildcard pattern#15299

Merged
waynexia merged 6 commits intoapache:mainfrom
waynexia:simplify-regex
Mar 21, 2025

Conversation

@waynexia
Member

Which issue does this PR close?

  • Closes #.

Rationale for this change

Simplify dumb regex cases like `~ '.*'` or `!~ '.*'`.

What changes are included in this PR?

Handle special wildcard regex pattern in expr_simplifier rule

Are these changes tested?

Yes, via sqllogictests and unit tests

Are there any user-facing changes?

no

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 19, 2025

if let Expr::Literal(ScalarValue::Utf8(Some(pattern))) = right.as_ref() {
    // Handle the special case for ".*" pattern
    if pattern == ".*" {
Contributor

@jayzhan211 jayzhan211 Mar 19, 2025

We can make this a const similar to COUNT_STAR_EXPANSION
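
A minimal sketch of what this suggestion could look like, using only the standard library. The constant name `ANY_CHAR_REGEX_PATTERN` and the helper `is_any_char_pattern` are hypothetical; the thread does not show the identifier that was finally chosen.

```rust
// Hypothetical name: the review thread does not show the final identifier.
const ANY_CHAR_REGEX_PATTERN: &str = ".*";

/// Returns true when the regex literal is the match-anything wildcard.
fn is_any_char_pattern(pattern: &str) -> bool {
    pattern == ANY_CHAR_REGEX_PATTERN
}

fn main() {
    // Only the exact wildcard literal is treated as the special case.
    assert!(is_any_char_pattern(".*"));
    assert!(!is_any_char_pattern(".+"));
    assert!(!is_any_char_pattern("^.*$"));
}
```

Naming the literal keeps the special case searchable and avoids repeating the raw string if more wildcard checks are added later.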

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
        right: empty_lit,
    })
} else {
    // always true
Contributor

I think this needs to handle nulls too as null ~ '.*' is not true (it is null)

> create or replace table foo(x varchar) as values (1), (2), (null);
0 row(s) fetched.
Elapsed 0.004 seconds.

> select x ~ '.*' from foo;
+--------------------+
| foo.x ~ Utf8(".*") |
+--------------------+
| true               |
| true               |
| NULL               |
+--------------------+
3 row(s) fetched.
Elapsed 0.016 seconds.

So maybe instead of lit(true) it is x.is_not_null() 🤔
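
The NULL behaviour described above can be modelled without DataFusion. The sketch below uses `Option<bool>` as a stand-in for a nullable SQL boolean; every function name in it is hypothetical and none of it is DataFusion's API. It only illustrates why `lit(true)` and `x.is_not_null()` differ on the NULL row.

```rust
/// SQL-style nullable boolean: `None` stands for NULL.
type SqlBool = Option<bool>;

/// Model of `x ~ '.*'`: any non-null string matches, NULL input gives NULL.
fn matches_any(x: Option<&str>) -> SqlBool {
    x.map(|_| true)
}

/// Model of `x IS NOT NULL`: the result itself is never NULL.
fn is_not_null(x: Option<&str>) -> SqlBool {
    Some(x.is_some())
}

/// Model of the literal `true`, which ignores its input.
fn lit_true(_x: Option<&str>) -> SqlBool {
    Some(true)
}

/// A WHERE clause keeps a row only when the predicate is exactly true.
fn keeps_row(p: SqlBool) -> bool {
    p == Some(true)
}

fn main() {
    // Same three rows as the `foo` table in the comment above.
    let rows = [Some("1"), Some("2"), None];
    for x in rows {
        // IS NOT NULL keeps exactly the rows that the regex predicate keeps.
        assert_eq!(keeps_row(matches_any(x)), keeps_row(is_not_null(x)));
    }
    // A bare `true` would keep the NULL row, which the regex does not.
    assert!(keeps_row(lit_true(None)));
    assert!(!keeps_row(matches_any(None)));
    // Outside a filter the two still differ on NULL input: NULL vs false.
    assert_eq!(matches_any(None), None);
    assert_eq!(is_not_null(None), Some(false));
}
```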

Member Author

Good catch! I added two cases about null in simplify_expr.slt; it should work as expected now.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@@ -910,8 +910,8 @@ SELECT * FROM (SELECT y FROM u1 UNION ALL SELECT y FROM u2) ORDER BY y;
query I
SELECT * FROM v1;
Contributor

If this result is not deterministic, we need rowsort for it
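
As a rough illustration of the idea behind `rowsort`, the sketch below compares two result sets after sorting their rows, so the check no longer depends on the order in which the engine emits them. It is a standard-library model with a hypothetical helper name, not the sqllogictest runner's implementation.

```rust
/// Compare two result sets while ignoring row order, in the spirit of `rowsort`.
fn same_rows_ignoring_order(actual: &[&str], expected: &[&str]) -> bool {
    let mut a = actual.to_vec();
    let mut e = expected.to_vec();
    // Sort both sides as text before comparing them.
    a.sort();
    e.sort();
    a == e
}

fn main() {
    // The same rows in a different order still match.
    assert!(same_rows_ignoring_order(&["20", "3", "40"], &["3", "20", "40"]));
    // Different contents still fail, so the check stays meaningful.
    assert!(!same_rows_ignoring_order(&["20", "3"], &["3", "21"]));
}
```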

Member Author

Added in 95848ef

Contributor

@alamb alamb left a comment

Looks good to me -- thank you @waynexia

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia waynexia merged commit 4af5cfc into apache:main Mar 21, 2025
27 checks passed
@waynexia
Member Author

Thank you for reviewing @jayzhan211 @alamb ❤️

@waynexia waynexia deleted the simplify-regex branch March 21, 2025 22:06
github-merge-queue bot pushed a commit that referenced this pull request Mar 3, 2026
…puts (#20581)

## Which issue does this PR close?

- Closes #20580

## Rationale for this change

I ran into a bug that prevented some of the regexp optimizations
introduced in #15299 from working.
After #16290, some SQL types
were updated to `utf8view`. As part of that PR, some expected query
plans in sqllogictest were updated to expect the unoptimized version.

I need this fixed to avoid additional test failures while implementing a
new regexp optimization for
#20579.

## What changes are included in this PR?

- Add support for Utf8View and LargeUtf8 in `regex.rs`.
- Properly return `Transformed::no()` in cases where the plan isn't
modified (previously, it always returned `Transformed::yes()`).
- Update the tests to expect the optimized query plans again.
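
The list above can be sketched roughly as follows. The `Literal`, `Expr`, and `Transformed` types here are simplified local stand-ins written for this example, not DataFusion's real definitions, and the `IsNotNull` rewrite target is an assumption taken from the review suggestion in #15299.

```rust
/// Stand-in for a scalar literal with the three string encodings named above.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

/// Stand-in for a rewrite result that records whether anything changed.
#[derive(Debug, PartialEq)]
struct Transformed<T> {
    data: T,
    transformed: bool,
}

impl<T> Transformed<T> {
    fn yes(data: T) -> Self {
        Self { data, transformed: true }
    }
    fn no(data: T) -> Self {
        Self { data, transformed: false }
    }
}

/// Stand-in expression tree with just enough variants for the example.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    RegexMatch { left: Box<Expr>, pattern: Literal },
    IsNotNull(Box<Expr>),
}

/// Extract the pattern text from any of the three string encodings.
fn as_str_pattern(lit: &Literal) -> Option<&str> {
    match lit {
        Literal::Utf8(Some(s))
        | Literal::LargeUtf8(Some(s))
        | Literal::Utf8View(Some(s)) => Some(s.as_str()),
        _ => None,
    }
}

/// Rewrite `x ~ '.*'`; report `no` when the expression is left untouched.
fn simplify(expr: Expr) -> Transformed<Expr> {
    match expr {
        Expr::RegexMatch { left, pattern } if as_str_pattern(&pattern) == Some(".*") => {
            Transformed::yes(Expr::IsNotNull(left))
        }
        other => Transformed::no(other),
    }
}

fn main() {
    let col = || Box::new(Expr::Column("x".to_string()));
    let wildcard = || Some(".*".to_string());
    // All three string encodings take the optimized path.
    for pattern in [
        Literal::Utf8(wildcard()),
        Literal::LargeUtf8(wildcard()),
        Literal::Utf8View(wildcard()),
    ] {
        let out = simplify(Expr::RegexMatch { left: col(), pattern });
        assert!(out.transformed);
        assert_eq!(out.data, Expr::IsNotNull(col()));
    }
    // A non-string literal is returned unchanged and flagged as such.
    let untouched = Expr::RegexMatch { left: col(), pattern: Literal::Int64(Some(1)) };
    let out = simplify(untouched.clone());
    assert!(!out.transformed);
    assert_eq!(out.data, untouched);
}
```

Reporting `no` for untouched expressions matters because callers can use the flag to decide whether a pass changed anything.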

## Are these changes tested?

Fixed existing tests that previously weren't working. Now they reflect
the optimization being applied properly.

## Are there any user-facing changes?

No. Just applying the optimizations to more cases.