Generate GroupByHash output in multiple RecordBatches #11758
JasonLi-cn wants to merge 2 commits into apache:main
Thank you @JasonLi-cn. I wonder if we have tested the performance of this branch? I worry that the incremental output generation will result in copying the values multiple times (as each [...]). If this turns out to be a large performance overhead, then I think we could look into updating the accumulators to remember where they have emitted to or something (or maybe add a [...]).
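The copying concern raised here can be illustrated with a minimal, self-contained sketch. It does not use DataFusion's actual accumulator API: `DrainingState` and `CursorState` are hypothetical names introduced only for illustration. The first removes the first `n` values on every emit, which copies the remaining values each time; the second remembers how far it has emitted, so no remaining value is moved.

```rust
/// Hypothetical accumulator state (not a DataFusion type) that emits by
/// removing the first `n` values. Each call copies the remaining tail.
struct DrainingState {
    values: Vec<u64>,
}

impl DrainingState {
    fn emit_first(&mut self, n: usize) -> Vec<u64> {
        let n = n.min(self.values.len());
        // `split_off` copies everything after position `n` into a new Vec.
        let tail = self.values.split_off(n);
        // Keep the tail as the new state and return the head.
        std::mem::replace(&mut self.values, tail)
    }
}

/// Hypothetical alternative that remembers where it has emitted to.
/// Emitting only advances a cursor; the stored values are never shifted.
struct CursorState {
    values: Vec<u64>,
    emitted: usize,
}

impl CursorState {
    fn emit_next(&mut self, n: usize) -> &[u64] {
        let start = self.emitted;
        let end = (start + n).min(self.values.len());
        self.emitted = end;
        &self.values[start..end]
    }
}
```

With N stored values and an emit size of n, the draining variant copies on the order of N²/(2n) values in total across all emits, while the cursor variant copies none of the remaining values. Whether this matters in practice is exactly what the benchmarks are meant to show.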
Thank you @alamb. I'll run the aggregate benchmark.
f0daecf to c6d640b
Benchmark (main vs. this branch)
The performance of [...] This test may be incomplete. @alamb, do you have any better test suggestions? 🤔
Hi @JasonLi-cn -- yes, I think we should run the ClickBench and TPC-H benchmarks using the script here: https://github.com/apache/datafusion/tree/main/benchmarks. I am doing so now and will report results here.
I hit a bug (#11833), which has since been fixed on main, when trying to run the benchmarks on this branch: [...] Thus I am going to merge up from main and try again.
…pByHash_output_in_multiple_rb
🤔 It appears that on this branch the kernel killed the process because it ran out of memory (which does not happen on main).
Thank you @alamb 🙏. Let me analyze it further 🤔
In order to actually generate the output in multiple batches and gain performance, I think we would need to change:
This would likely require some sort of API change to the accumulators, etc. I wonder if we could find some way to do the implementation incrementally.
I agree. Ultimately this should be a big change that switches the group values and related states to being managed in blocks, like DuckDB, and I am working on this (#11931). But maybe just splitting the emitted result still has benefits? It seems that it can avoid calling the [...]
I would personally suggest sketching out what this would look like in a first PR, without worrying about getting all the tests passing / compiling etc. If we try to port all the code at once to being managed in blocks, it is going to be a very large change. I am thinking maybe we can take an incremental approach (for example, separately adding the ability to do blocked emission for https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html and [...]). If we can set this up so that we get the pattern and a few common implementations in place in the first PR, then we can make subsequent PRs to port over the other parts of the aggregation 🤔
Yes, I want to do similar things for [...]
Yes... I tried to do something like that, but it is still not thorough enough, and I found it is actually hard to support the exact [...] I am planning to switch to the special blocked emission impl now...
What do you think of this way to support the special blocked emission?
Do we need to put [...]
@JasonLi-cn I think maybe we should implement the special block-based [...]
I am trying this out in #11943 and have made some related code changes.
OK. How do we determine the block size?
I think maybe we make it equal to [...]
Yes, I think this is a good strategy.
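The block-based management discussed in this part of the thread can be sketched as follows. This is only an illustration under stated assumptions, not the design from #11931 or #11943: `BlockedValues`, `push`, and `emit_next_block` are hypothetical names, and `block_size` is left as a free parameter.

```rust
use std::collections::VecDeque;

/// Hypothetical block-managed storage (not a DataFusion type): values live in
/// fixed-capacity blocks so a whole block can be emitted without touching the rest.
struct BlockedValues {
    block_size: usize,
    blocks: VecDeque<Vec<u64>>,
}

impl BlockedValues {
    fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self {
            block_size,
            blocks: VecDeque::new(),
        }
    }

    /// Appends a value, starting a new block when the last one is full.
    fn push(&mut self, value: u64) {
        let needs_new_block = self
            .blocks
            .back()
            .map_or(true, |block| block.len() == self.block_size);
        if needs_new_block {
            self.blocks.push_back(Vec::with_capacity(self.block_size));
        }
        self.blocks
            .back_mut()
            .expect("a block was ensured above")
            .push(value);
    }

    /// Hands out the oldest block by value; no element of the remaining
    /// blocks is copied or shifted.
    fn emit_next_block(&mut self) -> Option<Vec<u64>> {
        self.blocks.pop_front()
    }
}
```

In this layout each emitted block maps naturally onto one output batch, which is why the choice of block size comes up in the discussion. A real implementation would also need to translate group indices into a (block, offset) pair, which this sketch leaves out.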
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
I believe the plan here is that we will work to improve the coverage of aggregates and then revisit / revive this design.
Which issue does this PR close?
Closes #9562
If the community thinks this PR is reasonable, I will continue the work:
In addition, we need to discuss whether we need to emit by batch_size when spill is true.
During cargo test, I found that when emitting by batch_size with spill set to true, some test cases such as aggregate_source_not_yielding_with_spill could not pass. Because the number of RecordBatches has increased, the 'BatchBuilder' 'push_batch' calls in update_merged_stream consume more memory.
datafusion/datafusion/physical-plan/src/sorts/builder.rs, lines 71 to 80 in 0f554fa
Personally, I prefer to emit by batch_size when spill is true. Otherwise, the panic (apache/arrow-rs#6112 (comment)) is easily triggered by spill.
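To make the trade-off concrete, here is a minimal sketch of emitting by batch_size. It uses plain slices instead of Arrow RecordBatches, and `split_by_batch_size` and `num_output_batches` are hypothetical helpers, not functions from this PR.

```rust
/// Splits `rows` into chunks of at most `batch_size` rows each.
/// Plain `Vec<u64>` chunks stand in for RecordBatches in this sketch.
fn split_by_batch_size(rows: &[u64], batch_size: usize) -> Vec<Vec<u64>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    rows.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Number of batches produced for `num_rows` rows: ceil(num_rows / batch_size).
fn num_output_batches(num_rows: usize, batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch_size must be non-zero");
    (num_rows + batch_size - 1) / batch_size
}
```

Emitting everything at once yields one batch per emit, while emitting by batch_size yields ceil(num_rows / batch_size) batches. That larger batch count is what increases the number of push_batch calls described above.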
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?