perf: Use batched row conversion for array_has_any, array_has_all #20588
neilconway wants to merge 5 commits into apache:main
Benchmarks: It's a significant win for short arrays, and a small win for large arrays. For large arrays, the N*M comparison cost probably dominates. We should be able to do something smarter with hashing; I'll look at that shortly, but in a separate PR.
🤖: Benchmark completed
This PR doesn't touch the
I've been wondering that myself. That was run on my personal machine, and since the benchmarks are largely single-threaded I wouldn't expect much variability in the results, but the numbers do seem off. I'll try another run later; otherwise I can see about running them on the EC2 instance I have.
🤖: Benchmark completed
The above was run on my EC2 dev instance with nothing else running. It's as stable a benchmark as I can realistically get.
Interesting. The regressions are mostly in the 500 element benchmarks. I didn't see similar behavior on my local dev box (M4 Max). I'll do a run on a cloud dev box and see if I can repro the results. |
Here's what I get on a Hetzner cloud box (cax31): So we do indeed see some regressions for large arrays. I'm not entirely sure why that would be. I suppose for 10k rows * 500 elements we end up pushing a lot more data out of L1/L2, whereas the previous approach uses a smaller working set. I'm surprised that the effect is that pronounced, though. Let me try doing the row conversion in smaller batches and see if that helps.
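The working-set guess above can be checked with back-of-envelope arithmetic. The numbers below are illustrative assumptions (10k rows, 500 elements per row, ~8 bytes per converted i64 value), not measurements from the PR:

```rust
fn main() {
    // Hypothetical sizes, chosen to match the benchmark shape discussed above.
    let rows: u64 = 10_000;
    let elems_per_row: u64 = 500;
    let bytes_per_elem: u64 = 8; // rough cost of one converted i64 value

    let working_set = rows * elems_per_row * bytes_per_elem;
    let mib = working_set / (1024 * 1024);
    println!("full-batch converted rows: ~{} MiB", mib);

    // ~38 MiB is far larger than a typical 1-2 MiB per-core L2, so converting
    // the whole batch up front evicts hot data; converting row-by-row (or in
    // small chunks) keeps the working set cache-resident.
    assert!(mib > 2);
}
```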
That is interesting. I ran the benchmark on an m7i.4xlarge, which is an Intel Sapphire Rapids machine. I'm curious why we're seeing it on cloud machines but not on your M4 Mac; IIRC, L2 cache is shared on the M4 Max, which might be a reason for the difference. I guess running benchmarks on cloud hardware is worth it, considering that most deployments of DataFusion will run on server-grade hardware rather than consumer hardware, even though the consumer hardware may be faster in some cases.
Alright, I implemented a variant where we do row conversion in chunks of 256 rows. Here are the results on the Hetzner box: Happily, this seems to address the regressions we saw on large arrays with the initial approach. Less happily, 256-row chunking performs slightly worse than full-batch row conversion on my M4 Max machine, although interestingly the regressions are only for the i64 benchmarks: the string benchmarks were much closer and basically in the noise. Avoiding the regressions on large arrays seems worth the small performance hit on M4 machines, but it's probably worth exploring a bigger chunk size and seeing if that helps at all.
Here are the results on the Hetzner machine with 512-row chunks: I'm inclined to go with 512-row chunking: it seems to reduce cache pressure sufficiently, while doing half as many row-conversion calls as 256-row chunking. I've updated the PR with that approach.
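The chunking pattern described above can be sketched as follows. This is a hedged simplification: `convert_chunk` is a stand-in for the actual arrow row conversion in the PR, and the per-row predicate is a placeholder; only the fixed-size chunking structure is the point.

```rust
/// Chunk size chosen in the PR discussion: large enough to amortize
/// per-conversion-call overhead, small enough that one chunk's converted
/// rows stay cache-resident.
const CHUNK_SIZE: usize = 512;

/// Stand-in for row conversion (the real code would call arrow's
/// RowConverter once per chunk rather than once per row).
fn convert_chunk(chunk: &[i64]) -> Vec<i64> {
    chunk.to_vec()
}

/// Process a whole batch in 512-row chunks, one conversion call per chunk.
fn process_batch(rows: &[i64]) -> Vec<bool> {
    let mut out = Vec::with_capacity(rows.len());
    for chunk in rows.chunks(CHUNK_SIZE) {
        let converted = convert_chunk(chunk);
        // Placeholder per-row predicate, indexing into the converted chunk.
        out.extend(converted.iter().map(|v| v % 2 == 0));
    }
    out
}

fn main() {
    let input: Vec<i64> = (0..10_000).collect();
    let result = process_batch(&input);
    assert_eq!(result.len(), 10_000);
    assert!(result[0] && !result[1]);
    println!("processed {} rows", result.len());
}
```

A 10,000-row batch becomes 20 conversion calls instead of one huge one (or 10,000 tiny ones), which is the trade-off the comment describes.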
Which issue does this PR close?
`array_has_any`, `array_has_all` #20587

Rationale for this change
`array_has_any` and `array_has_all` called `RowConverter::convert_columns` twice for every input row. `convert_columns` has a lot of per-call overhead: allocating a new `Rows` buffer, doing various schema checking, and so on. It is considerably more efficient to use the `RowConverter` twice up front and convert all of the haystack and needle inputs in bulk. We can then implement the `has_any`/`has_all` predicate comparison by indexing into the converted rows.

`array_has_any`/`array_has_all` had a special case for strings, but it had an analogous problem: it iterated over rows, materialized each row's inner list, and then called `string_array_to_vec` twice per row. That does a lot of per-row work; it is significantly faster to call `string_array_to_vec` on all input rows at once, and then index into the results to implement the per-row comparisons.

What changes are included in this PR?
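The bulk-convert-then-index idea can be sketched in a few lines. This is an illustrative simplification, not DataFusion's code: `encode` stands in for a single `RowConverter::convert_columns` call over the whole input (here it just clones strings), and the comparable encoded values stand in for arrow `Row`s.

```rust
/// Stand-in for one bulk RowConverter::convert_columns call: convert
/// every value in a single pass, rather than once per input row.
fn encode(vals: &[&str]) -> Vec<String> {
    vals.iter().map(|v| v.to_string()).collect()
}

/// has_any: does the haystack contain at least one needle?
/// Implemented purely by indexing into already-converted values.
fn has_any(haystack: &[String], needles: &[String]) -> bool {
    needles.iter().any(|n| haystack.contains(n))
}

/// has_all: does the haystack contain every needle?
fn has_all(haystack: &[String], needles: &[String]) -> bool {
    needles.iter().all(|n| haystack.contains(n))
}

fn main() {
    // Convert both inputs up front (two bulk calls total), then run the
    // N*M comparisons against the converted values.
    let haystack = encode(&["a", "b", "c"]);
    let needles = encode(&["c", "d"]);
    assert!(has_any(&haystack, &needles));
    assert!(!has_all(&haystack, &needles));
    println!("ok");
}
```

The per-call overhead (buffer allocation, schema checks) is paid twice per batch instead of twice per row, which is the win the description claims; the N*M comparison loop itself is unchanged.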
Are these changes tested?
Yes.
Are there any user-facing changes?
No.