Monday, November 18, 2013

How to use HyperLogLog to incrementally maintain number of distinct values

In this post I'll show how extremely easy it is to maintain the number of distinct values when using HyperLogLog. Please reference to my previous post for some description how HyperLogLog works.

Let's assume we have a table with some existing data:
SQL> create table existing_data as
  2   select round(dbms_random.value(0, 77777)) n
  3    from dual
  4    connect by level <= 100000;
 
Table created
Precise number of distinct values:
SQL> select count(distinct n) from existing_data;
 
COUNT(DISTINCTN)
----------------
           56192
Now in order for the incremental refresh to work we first need to create HyperLogLog synopsis:
SQL> create table existing_hll as
  2   select mod(ora_hash(n), 1024) bucket,
  3     max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  4    from existing_data
  5    group by mod(ora_hash(n), 1024);
 
Table created
The table is extremely small as it contains only 1024 rows yet it would be enough to describe data with billions of rows in it. Of course we can now use that synopsis to estimate number of distinct values we have using HyperLogLog:
SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from existing_hll
 12       )
 13   );
 
NUM_DISTINCT
------------
       57676
Again, the result is within 2% of the precise distinct, which is pretty good considering how little information we had to store about the entire data set.

Now let's say there is some new data you want to add into the existing table:
SQL> create table new_data as
  2   select round(dbms_random.value(5555, 111111)) n
  3    from dual
  4    connect by level <= 10000;
 
Table created
So what we're going to do is calculate HyperLogLog synopsis about this new data and then merge it with HyperLogLog synopsis for the existing data:
SQL> merge into existing_hll e
  2   using (
  3    select mod(ora_hash(n), 1024) bucket,
  4      max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
  5     from new_data
  6     group by mod(ora_hash(n), 1024)
  7   ) n on (e.bucket=n.bucket)
  8   when matched then update set e.val=greatest(e.val, n.val)
  9   when not matched then insert values (n.bucket, n.val);

1024 rows merged.

SQL> commit;

Commit complete.
Of course the above is a lot more efficient than what you would have to do otherwise, i.e. calculate the number of distinct values from scratch for the entire data set. Once the synopsis has been refreshed we can estimate the new number of distinct values we have:
SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from existing_hll
 12       )
 13   );
 
NUM_DISTINCT
------------
       62288
And it's no different had I computed HyperLogLog for both data sets from scratch:
SQL> select
  2    case
  3     when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
  4     when hll >  1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
  5     else round(hll)
  6   end num_distinct
  7   from (
  8    select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
  9     from (
 10      select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
 11       from (
 12        select mod(ora_hash(n), 1024) bucket,
 13          max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
 14         from (
 15          select n from existing_data
 16          union all
 17          select n from new_data
 18         ) group by mod(ora_hash(n), 1024)
 19       )
 20       )
 21   );
 
NUM_DISTINCT
------------
       62288
And the precise distinct count:
SQL> select count(distinct n) from
  2   (
  3   select * from existing_data
  4   union all
  5   select * from new_data
  6   );
 
COUNT(DISTINCTN)
----------------
           60983
I think the ability to incrementally refresh the data is the most powerful aspect of HyperLogLog.

No comments:

Post a Comment