Let's assume we have a table with some existing data:
SQL> create table existing_data as
  2  select round(dbms_random.value(0, 77777)) n
  3  from dual
  4  connect by level <= 100000;

Table created

Precise number of distinct values:
SQL> select count(distinct n) from existing_data;
COUNT(DISTINCTN)
----------------
56192
Now, in order for the incremental refresh to work, we first need to create a HyperLogLog synopsis:
SQL> create table existing_hll as
  2  select mod(ora_hash(n), 1024) bucket,
  3         max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  4  from existing_data
  5  group by mod(ora_hash(n), 1024);

Table created

The table is extremely small, as it contains only 1024 rows, yet it is enough to describe a data set with billions of rows in it. Of course, we can now use that synopsis to estimate the number of distinct values using HyperLogLog:
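To make the mapping concrete, here is a worked example (the hash value is made up for illustration, and I'm assuming the num_zeroes function defined earlier in the post returns the number of trailing zero bits). ora_hash produces a hash between 0 and 2^32-1 by default; say it comes out as 41260 for some value of n. Then mod(41260, 1024) = 300, so the value lands in bucket 300, and trunc(41260/1024) = 40, which is 101000 in binary. num_zeroes(40) = 3, so the register value is 3 + 1 = 4. The group by with max() simply keeps the largest such value ever seen per bucket, which is all HyperLogLog needs to remember.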
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from existing_hll
12 )
13 );
NUM_DISTINCT
------------
57676
Again, the result is within 3% of the precise distinct count, which is pretty good considering how little information we had to store about the entire data set.
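To unpack the query above (it is just the standard estimator from the HyperLogLog paper, restated with the constants used here): current_sum is the harmonic piece sum(2^-val) over the occupied buckets, and each of the zeroes empty buckets is treated as a register holding 0, contributing 2^0 = 1. The raw estimate is then

hll = alpha * m^2 / (current_sum + zeroes) = 0.72054 * 1048576 / (current_sum + zeroes)

where m = 1024 is the number of buckets, m^2 = 1048576, and 0.72054 is the bias-correction constant alpha_m = 0.7213/(1 + 1.079/m) evaluated at m = 1024. The CASE expression applies the usual corrections: when the raw estimate is small (hll <= 2.5*m = 2560) and there are empty buckets, linear counting m*ln(m/zeroes) is more accurate; when it approaches the 32-bit hash range (hll > 2^32/30), the large-range correction -2^32*ln(1 - hll/2^32) compensates for hash collisions; otherwise the raw estimate is used as is.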
Now let's say there is some new data we want to add to the existing table:
SQL> create table new_data as
  2  select round(dbms_random.value(5555, 111111)) n
  3  from dual
  4  connect by level <= 10000;

Table created

So what we're going to do is calculate a HyperLogLog synopsis for this new data and then merge it with the HyperLogLog synopsis for the existing data:
SQL> merge into existing_hll e
  2  using (
  3    select mod(ora_hash(n), 1024) bucket,
  4           max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  5    from new_data
  6    group by mod(ora_hash(n), 1024)
  7  ) n on (e.bucket=n.bucket)
  8  when matched then update set e.val=greatest(e.val, n.val)
  9  when not matched then insert values (n.bucket, n.val);

1024 rows merged.

SQL> commit;

Commit complete.

Of course, the above is a lot more efficient than what you would have to do otherwise, i.e. recalculate the number of distinct values from scratch for the entire data set. Once the synopsis has been refreshed, we can estimate the new number of distinct values:
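The reason the merge is exact, not just approximate: each register is the maximum value observed among the rows hashed to that bucket, and the maximum over the combined data set equals the maximum of the two partial maxima. For example (illustrative numbers only), if bucket 300 holds 4 in existing_hll and the new data produces 6 for that bucket, greatest(4, 6) = 6, which is exactly the register a full recomputation over both tables would have produced.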
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from existing_hll
12 )
13 );
NUM_DISTINCT
------------
62288
And it's no different from what we would get had we computed the HyperLogLog estimate for both data sets from scratch:
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from (
15 select n from existing_data
16 union all
17 select n from new_data
18 ) group by mod(ora_hash(n), 1024)
19 )
20 )
21 );
NUM_DISTINCT
------------
62288
And the precise distinct count:
SQL> select count(distinct n) from
2 (
3 select * from existing_data
4 union all
5 select * from new_data
6 );
COUNT(DISTINCTN)
----------------
60983
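For reference, the merged estimate of 62288 is off by about 2.1% from the precise count of 60983, comfortably within the roughly 3.25% standard error (1.04/sqrt(1024)) expected from a 1024-bucket HyperLogLog.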
I think the ability to refresh the synopsis incrementally, without rescanning the existing data, is the most powerful aspect of HyperLogLog.
