consider volatile function in simply_expression#13128
Conversation
| op: Or, | ||
| right, | ||
| }) if has_common_conjunction(&left, &right) => { | ||
| }) if has_common_conjunction(&left, &right)? => { |
There was a problem hiding this comment.
This doesn't seem correct. It is ok to just check the presence of common non-volatile expressions in has_common_conjunction() to trigger simplification, but you need to deal with volatile expressions during simplification too and don't extract them as common.
E.g. can you please add a test like column1 = 2 AND random() = 0 OR column1 = 2 AND random() = 0?
There was a problem hiding this comment.
That's weird and seems incorrect to me. BTW, this is what Spark does:
scala> sql("select * from t where column1 = 2 AND rand() = 0 OR column1 = 2 AND rand() = 0").explain(true);
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ((('column1 = 2) AND ('rand() = 0)) OR (('column1 = 2) AND ('rand() = 0)))
+- 'UnresolvedRelation [t], [], false
== Analyzed Logical Plan ==
column1: int, column2: int
Project [column1#0, column2#1]
+- Filter (((column1#0 = 2) AND (rand(-1409433140814249342) = cast(0 as double))) OR ((column1#0 = 2) AND (rand(2383542950853957003) = cast(0 as double))))
+- SubqueryAlias spark_catalog.default.t
+- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(column1#0) AND ((column1#0 = 2) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0))))
+- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
== Physical Plan ==
*(1) Filter ((isnotnull(column1#0) AND (column1#0 = 2)) AND ((rand(-1409433140814249342) = 0.0) OR (rand(2383542950853957003) = 0.0)))
+- Scan hive spark_catalog.default.t [column1#0, column2#1], HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [column1#0, column2#1], Partition Cols: []]
| fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> Result<bool, DataFusionError> { | ||
| let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
| iter_conjunction(rhs).any(|e| lhs.contains(&e)) | ||
| iter_conjunction(rhs).try_fold(false, |acc, e| { |
There was a problem hiding this comment.
This still looks a bit ugly. I mean, we just look for the first non-volatile expression and don't want to iterate over all expressions.
Actually, I think I made a mistake when I added is_volatile() with return type Result<bool>:
datafusion/datafusion/expr/src/expr.rs
Lines 1581 to 1583 in e22d231
As the closure we provide to
exists() can't return any error, it would be safe to unwrap the result and return only bool.
@alamb, do you think we can fix Expr::is_volatile(), it would be an API breaking change though, to make it simpler and so we can keep using any() here?
There was a problem hiding this comment.
I have refactored it, Thanks for your review
There was a problem hiding this comment.
Looks good to me, thanks for the fix.
Maybe we can consider changing the return type of Expr::is_volatile() in a follow-up PR.
There was a problem hiding this comment.
Maybe we can consider changing the return type of
Expr::is_volatile()in a follow-up PR.
I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.
There was a problem hiding this comment.
I changed the signature in #13206 which caused a logical conflict in this PR (failed to compile) and then I updated this PR to use the new signature
As @eejbyfeldt predicated, the PR became much cleaner
|
|
||
|
|
||
|
|
||
| query TT |
There was a problem hiding this comment.
Any reason for not adding these as unit tests in the expr_simplifier module instead?
They feel quite output place in file called explain.slt.
| fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> Result<bool, DataFusionError> { | ||
| let lhs: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
| iter_conjunction(rhs).any(|e| lhs.contains(&e)) | ||
| iter_conjunction(rhs).try_fold(false, |acc, e| { |
There was a problem hiding this comment.
Maybe we can consider changing the return type of
Expr::is_volatile()in a follow-up PR.
I think this PR would be much cleaner if we fixed the signature of that first. But I guess the important thing is that we do fix it before it spreads keeps spreading.
alamb
left a comment
There was a problem hiding this comment.
Thanks @Lordworms @findepi and @peter-toth
I agree with @eejbyfeldt in https://github.com/apache/datafusion/pull/13128/files#r1822606751 that we should add some unit tests in datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs in addition to the explain .slt test
I also have a proposal that I think would simplify this PR: #13206
| fn has_common_conjunction(lhs: &Expr, rhs: &Expr) -> Result<bool, DataFusionError> { | ||
| let lhs_set: HashSet<&Expr> = iter_conjunction(lhs).collect(); | ||
| for e in iter_conjunction(rhs) { | ||
| if lhs_set.contains(&e) && !e.is_volatile()? { |
There was a problem hiding this comment.
I looked a little more at Expr::is_volatile and I actually think it can't return any errors. If we changed the signature I think this PR would be significantly simpler
| .eq(lit(2)) | ||
| .or(col("column1").eq(lit(2)).and(rand.clone().eq(lit(0)))); | ||
|
|
||
| assert_eq!(simplify(expr), col("column1").eq(lit(2))); |
There was a problem hiding this comment.
It would seem to me this should still be the same thing -- namely that we shouldn't remove the rand() = 0 condition 🤔
There was a problem hiding this comment.
However, this PR still seems like it is an improvement over what is on main so we could potentially fix this in a follow on PR
There was a problem hiding this comment.
Could be something wrong here, let me double check
There was a problem hiding this comment.
Also Happy Halloween!
This comment was marked as outdated.
This comment was marked as outdated.
|
I merged up from main to resolve a logical conflict with this PR |
|
Thanks again @Lordworms and @eejbyfeldt and @peter-toth and @jonathanc-n |




Which issue does this PR close?
Closes #13060
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?