Commit 0dcbefd

Merge pull request #174 from shacuros/migrate_uniqExact
Add case for Numeric values. Redo example to work without external UDF
2 parents c66714d + 51c71e6 commit 0dcbefd

File tree

1 file changed

+63
-29
lines changed

content/en/altinity-kb-setup-and-maintenance/uniqExact-to-uniq-combined.md

Lines changed: 63 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -9,39 +9,65 @@ description: >-
99
## uniqExactState
1010

1111
`uniqExactState` is stored in two parts: a count of values in `LEB128` format, followed by the list of values without a delimiter.
12+
Depending on the original data type of the values to count, the data type of the list values differs.
1213

13-
In our case, the value is `sipHash128` of strings passed to uniqExact function.
14+
### Numeric Values
15+
16+
In case of numeric values like `UInt8`, `UInt64` etc. the representation of `uniqExactState` is just a simple array of the unique values encountered.
17+
Therefore it is easy to recover the values that have appeared from the state:
18+
19+
```text
20+
┌─hex(uniqExactState(arrayJoin([1, 3])))─┐
21+
│ 020103 │
22+
└────────────────────────────────────────┘
23+
02 01 03
24+
^ ^ ^
25+
LEB128 hex(1::UInt8) hex(3::UInt8)
26+
27+
28+
┌─finalizeAggregation(CAST(unhex('020103'), 'AggregateFunction(groupArray, UInt8)'))─┐
29+
│ [1,3] │
30+
└────────────────────────────────────────────────────────────────────────────────────┘
31+
```
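To make the layout above concrete, here is a minimal Python sketch that decodes the same hex string outside of ClickHouse®. It assumes the layout described above (an unsigned `LEB128` count followed by one byte per `UInt8` value); the function names `read_uleb128` and `decode_uint8_state` are made up for this illustration and are not part of ClickHouse®.

```python
def read_uleb128(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 integer at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if byte & 0x80 == 0:             # high bit clear: last byte of the number
            return value, pos
        shift += 7


def decode_uint8_state(state_hex: str) -> list[int]:
    """Split a hex dump of uniqExactState over UInt8 values into its values.

    Assumption: the state is just <LEB128 count><count single-byte values>.
    """
    data = bytes.fromhex(state_hex)
    count, pos = read_uleb128(data)
    values = list(data[pos:pos + count])
    if len(values) != count:
        raise ValueError("state is shorter than its declared count")
    return values


print(decode_uint8_state("020103"))  # [1, 3]
```

For wider numeric types the same idea should apply with more bytes per value, but that is not verified here.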
32+
33+
### String Values
34+
35+
#### Internal Representation
36+
In case of values of data type `String`, ClickHouse® applies a hashing algorithm before storing the values into the internal array; otherwise the amount of space needed could become enormous.
1437

1538
```text
1639
┌─hex(uniqExactState(toString(arrayJoin([1]))))─┐
1740
│ 01E2756D8F7A583CA23016E03447724DE7 │
1841
└───────────────────────────────────────────────┘
1942
01 E2756D8F7A583CA23016E03447724DE7
2043
^ ^
21-
LEB128 sipHash128
44+
LEB128 hash of '1'
2245
2346
2447
┌─hex(uniqExactState(toString(arrayJoin([1, 2]))))───────────────────┐
2548
│ 024809CB4528E00621CF626BE9FA14E2BFE2756D8F7A583CA23016E03447724DE7 │
2649
└────────────────────────────────────────────────────────────────────┘
27-
02 4809CB4528E00621CF626BE9FA14E2BF E2756D8F7A583CA23016E03447724DE7
50+
02 4809CB4528E00621CF626BE9FA14E2BF E2756D8F7A583CA23016E03447724DE7
2851
^ ^ ^
29-
LEB128 sipHash128 sipHash128
52+
LEB128 hash of '2' hash of '1'
3053
```
3154

32-
So, our task is to find how we can generate such values by ourself.
33-
In case of `String` data type, it just the simple `sipHash128` function.
55+
So, our task is to find out how we can generate such values ourselves, i.e. which hash function is used.
56+
In case of `String` data type, it is just the simple `sipHash128` function.
3457

3558
```text
3659
┌─hex(sipHash128(toString(2)))─────┬─hex(sipHash128(toString(1)))─────┐
3760
│ 4809CB4528E00621CF626BE9FA14E2BF │ E2756D8F7A583CA23016E03447724DE7 │
3861
└──────────────────────────────────┴──────────────────────────────────┘
3962
```
4063

41-
The second task: it needs to read a state and split it into an array of values.
64+
#### Getting the Hash Values
65+
The second task: now that we know how the state is formed, how can we decode it and convert it into an `Array` of values?
66+
Unfortunately, it is not possible to get the original values back, as `sipHash128` is a one-way conversion, but at least we can try to get an `Array` of hashes.
4267
Luckily for us, ClickHouse® uses the exact same serialization (`LEB128` + list of values) for Arrays (in this case, when both `uniqExactState` and `Array` are serialized into `RowBinary` format).
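As an illustration of that serialization, the following Python sketch splits the hex dump shown earlier into its 16-byte hashes. It is only a sketch under the assumption that the state is an unsigned `LEB128` count followed by fixed 16-byte entries; the helper names are hypothetical.

```python
HASH_SIZE = 16  # sipHash128 produces 128 bits = 16 bytes


def read_uleb128(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 integer at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def split_string_state(state_hex: str) -> list[str]:
    """Return the hashes stored in a uniqExactState(String) hex dump.

    Assumption: layout is <LEB128 count><count * 16 raw bytes>.
    """
    data = bytes.fromhex(state_hex)
    count, pos = read_uleb128(data)
    if len(data) - pos != count * HASH_SIZE:
        raise ValueError("payload length does not match the declared count")
    return [
        data[pos + i * HASH_SIZE: pos + (i + 1) * HASH_SIZE].hex().upper()
        for i in range(count)
    ]


state = "024809CB4528E00621CF626BE9FA14E2BFE2756D8F7A583CA23016E03447724DE7"
for h in split_string_state(state):
    print(h)
# 4809CB4528E00621CF626BE9FA14E2BF
# E2756D8F7A583CA23016E03447724DE7
```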
4368

44-
We need one a helper -- `UDF` function to do that conversion:
69+
One way to "convert" the `uniqExactState` to an `Array` of hashes would be via an external helper
70+
`UDF` function to do that conversion:
4571

4672
```xml
4773
cat /etc/clickhouse-server/pipe_function.xml
@@ -60,15 +86,32 @@ cat /etc/clickhouse-server/pipe_function.xml
6086
</function>
6187
</clickhouse>
6288
```
63-
This UDF -- `pipe` converts `uniqExactState` to the `Array(FixedString(16))`.
89+
This UDF -- `pipe` converts `uniqExactState` to the `Array(FixedString(16))`:
6490

6591
```text
6692
┌─arrayMap(x -> hex(x), pipe(uniqExactState(toString(arrayJoin([1, 2])))))──────────────┐
6793
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7'] │
6894
└───────────────────────────────────────────────────────────────────────────────────────┘
6995
```
7096

71-
And here is the full example, how you can convert `uniqExactState(string)` to `uniqState(string)` or `uniqCombinedState(string)` using `pipe` UDF and `arrayReduce('func', [..])`.
97+
This approach only works if you have direct access to your ClickHouse® installation.
98+
However, if you are on a managed platform like Altinity.Cloud, installing executable `UDF`s is typically not supported for security reasons.
99+
Luckily, we know that the internal representation of `sipHash128` is `FixedString(16)`, which has exactly 128 bits. `UInt128` also takes up exactly 128 bits.
100+
Therefore we can consider the `uniqExactState(String)` as a representation of `Array(UInt128)`.
101+
102+
Again, we can therefore convert our state to an `Array`:
103+
104+
```text
105+
┌─arrayMap(lambda(tuple(x), hex(reinterpretAsFixedString(x))), finalizeAggregation(CAST(unhex(hex(uniqExactState(arrayJoin(['1', '2'])))), 'AggregateFunction(groupArray, UInt128)')))─┐
106+
│ ['4809CB4528E00621CF626BE9FA14E2BF','E2756D8F7A583CA23016E03447724DE7'] │
107+
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
108+
```
109+
110+
As you can see, the `Array` is identical to the one we created with the `pipe` function.
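The reason the detour through `UInt128` loses nothing can be sketched in Python: 16 raw bytes and a 128-bit unsigned integer are two views of the same data. The sketch assumes little-endian byte order (typical for x86-64 and ARM64 hosts); the function name is hypothetical.

```python
def roundtrip_via_uint128(chunk: bytes) -> bytes:
    """Read 16 bytes as a 128-bit unsigned integer, then write them back.

    Assumption: little-endian byte order on both the read and the write.
    """
    if len(chunk) != 16:
        raise ValueError("expected exactly 16 bytes")
    as_int = int.from_bytes(chunk, "little")  # the "UInt128" view
    return as_int.to_bytes(16, "little")      # back to the "FixedString(16)" view


hash_of_2 = bytes.fromhex("4809CB4528E00621CF626BE9FA14E2BF")
print(roundtrip_via_uint128(hash_of_2).hex().upper())
# 4809CB4528E00621CF626BE9FA14E2BF
```

Because both views use the same byte order, the bytes that come out are the bytes that went in, which matches the identical arrays shown above.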
111+
112+
#### Full Example of Conversion
113+
114+
Here is a full example of how you can convert `uniqExactState(string)` to any approximate `uniq` function like `uniqState(string)` or `uniqCombinedState(string)` by using `reinterpret` and `arrayReduce('func', [..])`.
72115

73116
```sql
74117
-- Generate demo with random data, uniqs are stored as heavy uniqExact
@@ -89,25 +132,16 @@ GROUP BY id;
89132

90133
-- Let's add new columns to store optimized, approximate uniq & uniqCombined
91134
ALTER TABLE aggregates
92-
ADD COLUMN `uniq` AggregateFunction(uniq, FixedString(16))
93-
default arrayReduce('uniqState', pipe(uniqExact)),
94-
ADD COLUMN `uniqCombined` AggregateFunction(uniqCombined, FixedString(16))
95-
default arrayReduce('uniqCombinedState', pipe(uniqExact));
96-
97-
-- Materialize defaults in the new columns
98-
ALTER TABLE aggregates UPDATE uniqCombined = uniqCombined, uniq = uniq
99-
WHERE 1 settings mutations_sync=2;
100-
101-
-- Let's reset defaults to remove the dependancy of the UDF from our table
102-
ALTER TABLE aggregates
103-
modify COLUMN `uniq` remove default,
104-
modify COLUMN `uniqCombined` remove default;
105-
106-
-- Alternatively you can populate data in the new columns directly without using DEFAULT columns
107-
-- ALTER TABLE aggregates UPDATE
108-
-- uniqCombined = arrayReduce('uniqCombinedState', pipe(uniqExact)),
109-
-- uniq = arrayReduce('uniqState', pipe(uniqExact))
110-
-- WHERE 1 settings mutations_sync=2;
135+
ADD COLUMN `uniq` AggregateFunction(uniq, FixedString(16)),
136+
ADD COLUMN `uniqCombined` AggregateFunction(uniqCombined, FixedString(16));
137+
138+
-- Materialize values in the new columns
139+
ALTER TABLE aggregates
140+
UPDATE
141+
uniqCombined = arrayReduce('uniqCombinedState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128)))),
142+
uniq = arrayReduce('uniqState', arrayMap(x -> reinterpretAsFixedString(x), finalizeAggregation(unhex(hex(uniqExact))::AggregateFunction(groupArray, UInt128))))
143+
WHERE 1
144+
SETTINGS mutations_sync=2;
111145

112146
-- Check results, results are slightly different, because uniq & uniqCombined are approximate functions
113147
SELECT
