
perf: Optimize array_agg() using GroupsAccumulator #20504

Merged
martin-g merged 6 commits into apache:main from neilconway:neilc/optimize-array-agg
Feb 28, 2026

Conversation

@neilconway
Contributor

@neilconway neilconway commented Feb 23, 2026

Which issue does this PR close?

Rationale for this change

This PR optimizes the performance of array_agg() by adding support for the GroupsAccumulator API.

The design tries to minimize the amount of per-batch work done in update_batch(): we store a reference to the batch, and a (group_idx, row_idx) pair for each row. In evaluate(), we assemble all the requested output with a single interleave call.

This turns out to be significantly faster, because we copy much less data and assembling the results can be vectorized more effectively. For example, on a benchmark with 5000 groups and 5000 int64 values per group, this approach is roughly 190x faster than the previous approach.

Releasing memory after a partial emit is a little more involved than the previous approach, but with some determination it is still possible.

What changes are included in this PR?

Are these changes tested?

Yes, and benchmarked.

Are there any user-facing changes?

No.

AI usage

Iterated with the help of multiple AI tools; I've reviewed and understand the resulting code.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Feb 23, 2026
@neilconway
Contributor Author

Benchmarks:

group                                   array-agg-opt                          array-agg-vanilla
-----                                   -------------                          -----------------
array_agg_query_group_by_few_groups     1.04    632.3±5.29µs        ? ?/sec    1.00    609.2±5.90µs        ? ?/sec
array_agg_query_group_by_many_groups    1.00      2.2±0.03ms        ? ?/sec    12.15    26.6±0.39ms        ? ?/sec
array_agg_query_group_by_mid_groups     1.00  1104.0±12.85µs        ? ?/sec    3.32      3.7±0.02ms        ? ?/sec


# FIXME: add bool_and(v3) column when issue fixed
Contributor Author

Not related to this PR -- this test was redundant with another query in the same file. Happy to split into another PR if preferred.

@neilconway neilconway changed the title perf: Optimize array_agg() perf: Optimize array_agg() using GroupsAccumulator Feb 23, 2026
opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(values.len(), 1, "single argument to update_batch");
Member

Suggested change
assert_eq!(values.len(), 1, "single argument to update_batch");
assert_eq_or_internal_err!(values.len(), 1, "single argument to update_batch");

Contributor Author

I was using assert_eq! because that is what a bunch of the other aggregate functions use, e.g.:

I don't have a strong preference but we should probably be consistent.

Member

Let's see what others think.
My opinion is that a library should try to avoid panicking as hard as possible.
But this could be improved in a follow-up if others agree!
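For context on the suggestion above, the distinction is between panicking on an arity mismatch and returning a recoverable internal error that the caller can surface. A minimal std-only sketch of the non-panicking pattern (check_arity is an illustrative name, not the PR's code; DataFusion's real error macros live in datafusion_common):

```rust
// Returning a recoverable error instead of panicking on an arity
// mismatch. An internal-error macro expands to roughly this pattern.
fn check_arity(values_len: usize) -> Result<(), String> {
    if values_len != 1 {
        return Err(format!(
            "Internal error: update_batch expects a single argument, got {values_len}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_arity(1).is_ok());
    // No panic: the mismatch becomes an Err the query engine can report.
    assert!(check_arity(2).is_err());
}
```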

_opt_filter: Option<&BooleanArray>,
total_num_groups: usize,
) -> Result<()> {
assert_eq!(values.len(), 1, "one argument to merge_batch");
Member

Suggested change
assert_eq!(values.len(), 1, "one argument to merge_batch");
assert_eq_or_internal_err!(values.len(), 1, "one argument to merge_batch");

values: &[ArrayRef],
opt_filter: Option<&BooleanArray>,
) -> Result<Vec<ArrayRef>> {
assert_eq!(values.len(), 1, "one argument to convert_to_state");
Member

Suggested change
assert_eq!(values.len(), 1, "one argument to convert_to_state");
assert_eq_or_internal_err!(values.len(), 1, "one argument to convert_to_state");

@alamb
Contributor

alamb commented Feb 24, 2026

Also related

cc @duongcongtoai perhaps you have some time to help review this PR?

!args.is_distinct && args.order_bys.is_empty()
}

fn create_groups_accumulator(
Contributor

I was trying to compare this implementation with my own and found a 2x difference (mine is faster 😆). I tried adding some profiling to check whether there is room for improvement, but most of the runtime goes to the optimizer parts instead of the actual function under test; maybe we should refactor the benchmark / add a micro-benchmark there.

[profiling output image]

Contributor Author

i was trying to compare this implementation with #17915 and found 2x difference (mine is faster 😆 )

Interesting -- which workload did you use for this?

Contributor

I was referring to array_agg_query_group_by_many_groups

@neilconway
Contributor Author

Comparing this implementation with #17915 on the end-to-end array_agg benchmarks that were added in d303f58, here's the results I get:

  ┌─────────────┬─────────────────┬───────────────────┬───────────────────────────┐
  │  Benchmark  │ Baseline (main) │ PR #20504 (neilc) │ PR #17915 (duongcongtoai) │
  ├─────────────┼─────────────────┼───────────────────┼───────────────────────────┤
  │ few_groups  │ 600.8 µs        │ 620.1 µs          │ 867.2 µs                  │
  ├─────────────┼─────────────────┼───────────────────┼───────────────────────────┤
  │ mid_groups  │ 3.660 ms        │ 1.094 ms          │ 1.046 ms                  │
  ├─────────────┼─────────────────┼───────────────────┼───────────────────────────┤
  │ many_groups │ 26.05 ms        │ 2.248 ms          │ 1.144 ms                  │
  └─────────────┴─────────────────┴───────────────────┴───────────────────────────┘

  Change vs. Baseline

  ┌─────────────┬───────────────────────────┬───────────────────────────────┐
  │  Benchmark  │         PR #20504         │           PR #17915           │
  ├─────────────┼───────────────────────────┼───────────────────────────────┤
  │ few_groups  │ +3.0% (slight regression) │ +44% (significant regression) │
  ├─────────────┼───────────────────────────┼───────────────────────────────┤
  │ mid_groups  │ −70% (3.3x faster)        │ −71% (3.5x faster)            │
  ├─────────────┼───────────────────────────┼───────────────────────────────┤
  │ many_groups │ −91% (11.6x faster)       │ −96% (22.8x faster)           │
  └─────────────┴───────────────────────────┴───────────────────────────────┘

@neilconway force-pushed the neilc/optimize-array-agg branch from 9bc1f4e to c5238ac on February 28, 2026 at 02:58
@github-actions bot added the core (Core DataFusion crate) label Feb 28, 2026
@neilconway
Contributor Author

neilconway commented Feb 28, 2026

I think my initial approach of organizing the aggregate state by group wasn't ideal -- when there are many groups, this leads to a lot of small allocations and a more random memory access pattern. I pushed a new version of this PR that uses a per-batch organization instead: that is, for each batch we keep a reference to the batch contents, plus a Vec of (group_idx, row_idx) pairs, one for each row. There are many fewer batches than there are groups (at least in the many-groups case), so this can be a significant win.
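A std-only sketch of the per-batch state layout described above (the names RetainedBatch and ArrayAggState are illustrative, not the PR's actual types; plain Vec<i64> stands in for an Arc'd arrow array):

```rust
// One entry per batch rather than one buffer per group: a handle to the
// batch's values plus a (group_idx, row_idx) pair per retained row.
struct RetainedBatch {
    values: Vec<i64>,          // stand-in for a retained ArrayRef
    rows: Vec<(usize, usize)>, // (group_idx, row_idx), one per row
}

struct ArrayAggState {
    batches: Vec<RetainedBatch>,
}

impl ArrayAggState {
    // update_batch: O(rows) bookkeeping, no per-group allocation.
    fn update_batch(&mut self, values: Vec<i64>, group_indices: &[usize]) {
        let rows = group_indices
            .iter()
            .enumerate()
            .map(|(row_idx, &g)| (g, row_idx))
            .collect();
        self.batches.push(RetainedBatch { values, rows });
    }

    // evaluate: gather one group's rows; the real code replaces this
    // scalar loop with a single vectorized interleave over all batches.
    fn group_values(&self, group: usize) -> Vec<i64> {
        let mut out = Vec::new();
        for b in &self.batches {
            for &(g, r) in &b.rows {
                if g == group {
                    out.push(b.values[r]);
                }
            }
        }
        out
    }
}

fn main() {
    let mut st = ArrayAggState { batches: Vec::new() };
    st.update_batch(vec![10, 20, 30], &[0, 1, 0]);
    st.update_batch(vec![40, 50], &[1, 0]);
    assert_eq!(st.group_values(0), vec![10, 30, 50]);
    assert_eq!(st.group_values(1), vec![20, 40]);
}
```

The number of allocations scales with the number of batches instead of the number of groups, which is why the many-groups case benefits most.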

Updated benchmark numbers:

  ┌─────────────────────────┬─────────┬────────────────┬──────────────────────┐
  │        Benchmark        │  main   │ feature branch │        Change        │
  ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
  │ few_groups              │ 607 µs  │ 679 µs         │ +11.5% (regression)  │
  ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
  │ mid_groups              │ 3.61 ms │ 789 µs         │ -78.1% (4.6x faster) │
  ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
  │ many_groups             │ 26.0 ms │ 1.04 ms        │ -96.0% (25x faster)  │
  ├─────────────────────────┼─────────┼────────────────┼──────────────────────┤
  │ struct_mid_groups (new) │ 9.10 ms │ 996 µs         │ -89.1% (9.1x faster) │
  └─────────────────────────┴─────────┴────────────────┴──────────────────────┘

So this approach is >2x faster than my initial approach for the many-groups case, and somewhat faster for the mid-groups case as well. It is slightly slower for the few-groups case, but intuitively I'd guess that is tolerable: the regression is relatively modest compared to the speedups for other cases. But LMK if folks disagree about that.

Contributor

@duongcongtoai duongcongtoai left a comment

lgtm, nice work!

@neilconway
Contributor Author

Running the repro from #17446, main vs. this PR:

$ hyperfine '~/df/review/target/release/datafusion-cli -f report.sql'
Benchmark 1: ~/df/review/target/release/datafusion-cli -f report.sql
  Time (mean ± σ):     778.5 ms ±  18.5 ms    [User: 5937.0 ms, System: 184.9 ms]
  Range (min … max):   752.9 ms … 803.6 ms    10 runs

$ hyperfine '~/df/main/target/release/datafusion-cli -f report.sql'
Benchmark 1: ~/df/main/target/release/datafusion-cli -f report.sql
  Time (mean ± σ):     132.9 ms ±   9.0 ms    [User: 273.2 ms, System: 57.5 ms]
  Range (min … max):   125.0 ms … 153.0 ms    23 runs

(Directory names are a bit confusing: the review worktree is actually on main, and the main worktree is on this PR branch.)

@martin-g martin-g added this pull request to the merge queue Feb 28, 2026
Merged via the queue into apache:main with commit 3a23bb2 Feb 28, 2026
31 checks passed
@martin-g
Member

Thank you @neilconway, @duongcongtoai and @alamb !

@neilconway neilconway deleted the neilc/optimize-array-agg branch February 28, 2026 13:33
@alamb
Contributor

alamb commented Mar 1, 2026

This is so great -- thank you all

@alamb added the performance (Make DataFusion faster) label Mar 1, 2026

Labels

core (Core DataFusion crate), functions (Changes to functions implementation), performance (Make DataFusion faster), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

array_agg is very slow
Slow aggregrate query with array_agg, Polars is 4 times faster for equal query

4 participants