You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
ARROW-10722: [Rust][DataFusion] Reduce overhead of some data types in aggregations / joins, improve benchmarks
This PR reduces the size of `GroupByScalar` from 32 bytes to 16 bytes by using `Box<String>`. This will reduce the size of a `Vec<GroupByScalar>` and thus the key of hashmaps used for aggregates / joins.
Also, it changes the type of the key to `Box<[GroupByScalar]>` to reduce memory usage further by 8 bytes per key needed to hold the capacity of the vec.
Finally we can remove a `Box` around the `Vec` holding the indices.
Difference in speed seems to be minimal, at least in current state.
I think in the future, it could be nice to see if the data could be packed efficiently in one `Box<[T]>` (where T is a primitive value) when having no dynamically sized types by using the schema instead of creating "dynamic" values. That should also make the hashing faster. Currently, when grouping on multiple i32 values, we need 32 bytes per value (next to 24 bytes for the Vec holding the values) instead of just 4! Also using const generics https://rust-lang.github.io/rfcs/2000-const-generics.html#:~:text=Rust%20currently%20has%20one%20type,implement%20traits%20for%20all%20arrays could provide a further improvement (by not having to store the length of the slice).
This PR also tries to improve reproducability in the benchmarks a bit by using the seed in the random number generator (still a quite noisy on my machine though).
Closes#8765 from Dandandan/reduce_key_size
Lead-authored-by: Heres, Daniel <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
0 commit comments