The UDF pragma tells the compiler that the PL/SQL unit is a user defined function that is used primarily in SQL statements, which might improve its performance.I though it would be very cool to try it out with my HyperLogLog post I did recently and see if it results in any measurable performance improvement.
Test Table
I'll use the same test table as I did in my original post:
SQL> create table z_hll_test as
2 select dbms_random.string('x', 4)||rpad('x', 500, 'x') n
3 from dual
4 connect by level <= 1000000;
Table created
SQL> alter table z_hll_test cache;
Table altered
Note that I'm explicitly setting the table to cache in order to make in-memory PQ kick in and eliminate disk I/O as a factor from my test case. For each test I made sure that no physical I/O has happened (including temp I/O for native distinct test).
Regular Function
create or replace function num_zeroes(
p_n binary_integer
) return binary_integer deterministic is
l_t binary_integer;
l_b binary_integer;
begin
--assume 32-bit hash value, 10-bits for bucket
if (p_n = 0) then return 22; end if;
l_t := 1;
l_b := 0;
while ( bitand(p_n,l_t) = 0 )
loop
l_t := l_t*2;
l_b := l_b+1;
end loop;
return l_b;
end num_zeroes;
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from z_hll_test z
15 group by mod(ora_hash(n), 1024)
16 )
17 )
18 );
NUM_DISTINCT
------------
748175
Executed in 0.889 seconds
Pragma UDF Function
create or replace function num_zeroes(
p_n binary_integer
) return binary_integer deterministic is
pragma udf;
l_t binary_integer;
l_b binary_integer;
begin
--assume 32-bit hash value, 10-bits for bucket
if (p_n = 0) then return 22; end if;
l_t := 1;
l_b := 0;
while ( bitand(p_n,l_t) = 0 )
loop
l_t := l_t*2;
l_b := l_b+1;
end loop;
return l_b;
end num_zeroes;
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from z_hll_test z
15 group by mod(ora_hash(n), 1024)
16 )
17 )
18 );
NUM_DISTINCT
------------
748175
Executed in 0.593 seconds
Pragma UDF gives us ~33% performance boost which is not too bad considering we didn't have to do anything else. However, that's not the most interesting part -- let's take a look at the native distinct next.
Native Distinct
SQL> select /*+ parallel(z 4) */ count(distinct n) from z_hll_test z;
COUNT(DISTINCTN)
----------------
753204
Executed in 0.983 seconds
Note that this was an optimal execution!
Summary
Let's summarize results in a table:
| Regular Function | Pragma UDF Function | Native Distinct |
| 0.889 | 0.593 | 0.983 |
As you can see pragma udf actually beats native implementation by a considerable margin which is very impressive given the fact that distict had an optimal execution.
Any query which produces such a report won't care about how much parallel you're running because virtually all the time is spent by the query coordinator (which is a serial process) being busy.
The total execution time has dropped from almost 26 minutes down to 28 seconds. Moreover, the workload is now much more skewed towards parallel query slaves, which is exactly what we want to see. Of course, the trick only works if the group by is able to collapse the data sufficiently enough.
