Let's assume we have a table with some existing data:
SQL> create table existing_data as 2 select round(dbms_random.value(0, 77777)) n 3 from dual 4 connect by level <= 100000; Table createdPrecise number of distinct values:
SQL> select count(distinct n) from existing_data; COUNT(DISTINCTN) ---------------- 56192Now in order for the incremental refresh to work we first need to create HyperLogLog synopsis:
SQL> create table existing_hll as 2 select mod(ora_hash(n), 1024) bucket, 3 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val 4 from existing_data 5 group by mod(ora_hash(n), 1024); Table createdThe table is extremely small as it contains only 1024 rows yet it would be enough to describe data with billions of rows in it. Of course we can now use that synopsis to estimate number of distinct values we have using HyperLogLog:
SQL> select 2 case 3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes)) 4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32))) 5 else round(hll) 6 end num_distinct 7 from ( 8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes 9 from ( 10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes 11 from existing_hll 12 ) 13 ); NUM_DISTINCT ------------ 57676Again, the result is within 2% of the precise distinct, which is pretty good considering how little information we had to store about the entire data set.
Now let's say there is some new data you want to add into the existing table:
SQL> create table new_data as 2 select round(dbms_random.value(5555, 111111)) n 3 from dual 4 connect by level <= 10000; Table createdSo what we're going to do is calculate HyperLogLog synopsis about this new data and then merge it with HyperLogLog synopsis for the existing data:
SQL> merge into existing_hll e 2 using ( 3 select mod(ora_hash(n), 1024) bucket, 4 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val 5 from new_data 6 group by mod(ora_hash(n), 1024) 7 ) n on (e.bucket=n.bucket) 8 when matched then update set e.val=greatest(e.val, n.val) 9 when not matched then insert values (n.bucket, n.val); 1024 rows merged. SQL> commit; Commit complete.Of course the above is a lot more efficient than what you would have to do otherwise, i.e. calculate the number of distinct values from scratch for the entire data set. Once the synopsis has been refreshed we can estimate the new number of distinct values we have:
SQL> select 2 case 3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes)) 4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32))) 5 else round(hll) 6 end num_distinct 7 from ( 8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes 9 from ( 10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes 11 from existing_hll 12 ) 13 ); NUM_DISTINCT ------------ 62288And it's no different had I computed HyperLogLog for both data sets from scratch:
SQL> select 2 case 3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes)) 4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32))) 5 else round(hll) 6 end num_distinct 7 from ( 8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes 9 from ( 10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes 11 from ( 12 select mod(ora_hash(n), 1024) bucket, 13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val 14 from ( 15 select n from existing_data 16 union all 17 select n from new_data 18 ) group by mod(ora_hash(n), 1024) 19 ) 20 ) 21 ); NUM_DISTINCT ------------ 62288And the precise distinct count:
SQL> select count(distinct n) from 2 ( 3 select * from existing_data 4 union all 5 select * from new_data 6 ); COUNT(DISTINCTN) ---------------- 60983I think the ability to incrementally refresh the data is the most powerful aspect of HyperLogLog.
No comments:
Post a Comment