Support non-equijoin predicate for EliminateCrossJoin#4866
Support non-equijoin predicate for EliminateCrossJoin#4866ygf11 wants to merge 7 commits intoapache:mainfrom
Conversation
| left_input.schema().clone(), | ||
| right_input.schema().clone(), | ||
| )?; | ||
| let predicate_schemas = |
There was a problem hiding this comment.
we should find column compare pair. like t1.a > t2.a or t1.a > t2.a + 1.
If left or right don't contains column, we shouldn't put them into join_filter.
| is_valid_join_predicate(expr, &predicate_schemas)?; | ||
| if is_join_filter { | ||
| join_filters.push(expr.clone()); | ||
| } |
There was a problem hiding this comment.
@jackwener, this try_fold will filter join filters.
Suppose the possible_join_predicates:
- t1.a > 10
- t1.c = t2.c
- t3.b > t2.b
for t1 join t2, the try_fold will return t1.a > 1 and t1.c = t2.c, because all columns of these exprs are from t1 and t2.
This logic is the same as you comment.
I will improve the code readability after the CI fix.
There was a problem hiding this comment.
I think the try_fold result is not correct, t1.a > 1 should not be treated as join filters(Inner join/Cross join cases).
| " TableScan: lineitem projection=[l_partkey, l_quantity], partial_filters=[lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]", | ||
| " Filter: (part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)) AND part.p_size >= Int32(1) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]", | ||
| " TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[part.p_size >= Int32(1), part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]", | ||
| " Inner Join: lineitem.l_partkey = part.p_partkey Filter: part.p_brand = Utf8(\"Brand#12\") AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15) [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]", |
There was a problem hiding this comment.
Like part.p_brand = Utf8(\"Brand#12\").
left or right don't contain column, should keep in filter
There was a problem hiding this comment.
There was a problem hiding this comment.
Sorry, it's different from you comment.
I think both are correct, because the above expression is only related to part(part is one of the input).
But I prefer join filter, because our physical join executor support it, although the above expression will push down.
Do you have any concern? could you explain more?
There was a problem hiding this comment.
The new plan looks not an optimized plan, we should differ join conditions and filter conditions.
| ) => { | ||
| l_op == r_op && ((ll == rl && lr == rr) || (ll == rr && lr == rl)) | ||
| } | ||
| _ => false, |
There was a problem hiding this comment.
Should we allow 'A < B' eq 'B > A' ?
|
Any thoughts on how to make progress on this? |
Thanks for asking @ozankabak . It is blocking by #5022, seems |
|
Hey @ygf11, I saw the blocking PR was merged, do you think you'll get back to it soon? |
|
Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it. |
Which issue does this PR close?
Closes #4844.
Closes #4877
Rationale for this change
Currently datafusion will loss the
join filterof inner join when runEliminateCrossJoinrule. Following are query and optimized logical plan:Explain Projection: t1.t1_id, t2.t2_id, t3.t3_id Filter: t3.t3_int > t1.t1_int CrossJoin: Filter: t1.t1_int > t2.t2_int CrossJoin: TableScan: t1 projection=[t1_id, t1_int] TableScan: t2 projection=[t2_id, t2_int] TableScan: t3 projection=[t3_id, t3_int]",We can see the
t1.t1_id > t2.t2_id(join filter) is lost.This is because
EliminateCrossJoinonly consider equijoin predicate.This pr will rewrite
EliminateCrossJoin, and choose the right input of join based on both equijoin and non-equijoin predicate. After this pr, the logical plan will be:The join filter does not lost.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?