select case wait_rank when 1 then inst_id end "Inst Num",
case wait_rank when 1 then snap_id end "Snap Id",
case wait_rank when 1 then begin_snap end "Begin Snap",
case wait_rank when 1 then end_snap end "End Snap",
event_name "Event",
total_waits "Waits",
time_waited "Time(s)",
round((time_waited/total_waits)*1000) "Avg wait(ms)",
round((time_waited/db_time)*100, 2) "% DB time",
substr(wait_class, 1, 15) "Wait Class"
from (
select
inst_id,
snap_id, to_char(begin_snap, 'DD-MM-YY hh24:mi:ss') begin_snap,
to_char(end_snap, 'hh24:mi:ss') end_snap,
event_name,
wait_class,
total_waits,
time_waited,
dense_rank() over (partition by inst_id, snap_id order by time_waited desc)-1 wait_rank,
max(time_waited) over (partition by inst_id, snap_id) db_time
from (
select
s.instance_number inst_id,
s.snap_id,
s.begin_interval_time begin_snap,
s.end_interval_time end_snap,
event_name,
wait_class,
total_waits-lag(total_waits, 1, total_waits) over
(partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) total_waits,
time_waited-lag(time_waited, 1, time_waited) over
(partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) time_waited,
min(s.snap_id) over (partition by s.startup_time, s.instance_number, stats.event_name) min_snap_id
from (
select dbid, instance_number, snap_id, event_name, wait_class, total_waits_fg total_waits, round(time_waited_micro_fg/1000000, 2) time_waited
from dba_hist_system_event
where wait_class not in ('Idle', 'System I/O')
union all
select dbid, instance_number, snap_id, stat_name event_name, null wait_class, null total_waits, round(value/1000000, 2) time_waited
from dba_hist_sys_time_model
where stat_name in ('DB CPU', 'DB time')
) stats, dba_hist_snapshot s
where stats.instance_number=s.instance_number
and stats.snap_id=s.snap_id
and stats.dbid=s.dbid
and s.dbid=3870213301
and s.instance_number=1
and stats.snap_id between 190 and 195
) where snap_id > min_snap_id and nvl(total_waits,1) > 0
) where event_name!='DB time' and wait_rank <= 5
order by inst_id, snap_id;
Inst Snap Begin Snap End Snap Event Waits Time(s) Avg wait(ms) % DB time Wait Class
---- ----- ----------------- -------- --------------------------- ---------- ---------- ------------ ---------- ---------------
1 191 20-02-14 14:10:10 14:20:10 cell smart table scan 631829 9569.43 15 79.03 User I/O
DB CPU 1202.09 9.93
direct path read temp 66074 1006.82 15 8.32 User I/O
PX Deq: Slave Session Stats 11730 429.91 37 3.55 Other
latch: shared pool 28134 162.47 6 1.34 Concurrency
1 192 20-02-14 14:20:10 14:30:11 cell smart table scan 1391832 4620.11 3 67.39 User I/O
DB CPU 1017.78 14.85
direct path read temp 76329 977.95 13 14.26 User I/O
PX Deq: Slave Session Stats 25043 401.53 16 5.86 Other
latch free 38836 214.1 6 3.12 Other
1 193 20-02-14 14:30:11 14:40:14 cell smart table scan 2448539 11075.29 5 79.3 User I/O
DB CPU 1529.93 10.95
PX Deq: Slave Session Stats 44242 1520.01 34 10.88 Other
direct path read temp 77583 985.65 13 7.06 User I/O
latch free 67518 376.52 6 2.7 Other
1 194 20-02-14 14:40:14 14:50:15 direct path read temp 99224 857.3 9 71.63 User I/O
DB CPU 328.78 27.47
name-service call wait 91 5.4 59 0.45 Other
PX Deq: Slave Session Stats 83 0.17 2 0.01 Other
direct path write 194 0.12 1 0.01 User I/O
1 195 20-02-14 14:50:15 15:00:18 DB CPU 1188.84 98.15
log switch/archive 1 10.01 10010 0.83 Other
direct path read temp 775 3.96 5 0.33 User I/O
cell smart table scan 1393 1.1 1 0.09 User I/O
cell single block physical 148 0.9 6 0.07 User I/O
read
25 rows selected
Thursday, February 20, 2014
AWR Top 5 Timed Foreground Events
I've noticed that people post how to get AWR Top 5 Timed Foreground Events other a range of snapshots using a SQL query from time to time. Since this is something I've done for years here is the version of the SQL I use in case somebody finds it useful:
Friday, February 14, 2014
'active txn count during cleanout', part II
In part I I've shown some interesting side effects that happen when you're trying to select from a table block which have an outstanding active transaction in it. In this post we're going to make things a little bit more interesting by introducing indexes into the picture.
Test Setup
I'll create a table with two rows and an index:
I'll update one row in my first session and leave the transaction open:
Since I only have two rows in the table I will end up with a special case where my index root block will be able to hold all the data essentially playing a role of both the root block and a leaf block at the same time. This makes it easier for me to dump the relevant index block because I know there is only one index block to dump:
Session 2
I'll update another row in the second session leaving the transaction open as well:
Here is how index block dump looks right now:
What happened is just another variation of the theme we saw in part I -- when our second session updates the index block it notices an active transaction in the ITL list and tries to perform a cleanout. It will do the same for the table block as well but since I've shown all the relevant mechanics in the previous post I'll leave it at that.
Undo Segment Header Checks
The important consequence from all the above is that a session which tries to perform a cleanout will have to look into the other transaction(-s) undo segment header block in order to find out whether the other transaction has committed or not:
Test Setup
I'll create a table with two rows and an index:
SQL> create table test as 2 select level n 3 from dual 4 connect by level <= 2; Table created SQL> create index i_test on test (n); Index createdSession 1
I'll update one row in my first session and leave the transaction open:
SQL> update test set n=3 where n=1; 1 row updatedHere is xid for this transaction:
SQL> select '0x'||to_char(XIDUSN, 'fm000x')||'.'||
2 to_char(XIDSLOT, 'fm00x')||'.'||
3 to_char(XIDSQN, 'fm0000000x') xid
4 from v$transaction
5 where addr=(
6 select taddr
7 from v$session
8 where sid=sys_context('userenv','sid')
9 );
XID
----------------------
0x0004.01c.00001fd5
Index Block Dump 1
Since I only have two rows in the table I will end up with a special case where my index root block will be able to hold all the data essentially playing a role of both the root block and a leaf block at the same time. This makes it easier for me to dump the relevant index block because I know there is only one index block to dump:
Block header dump: 0x0100008b
Object id on Block? Y
seg/obj: 0x12f46 csc: 0x00.1efcb3c itc: 2 flg: E typ: 2 - INDEX
brn: 0 bdba: 0x1000088 ver: 0x01 opc: 0
inc: 0 exflg: 0
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x02 0x0004.01c.00001fd5 0x00c011ac.09a5.0b ---- 2 fsc 0x000e.00000000
We have two rows locked in the index block because the row with value=1 got deleted and a row with value=3 got inserted, as per our update. Let's notice block cleanout scn (csc) value: 0x00.1efcb3c
Session 2
I'll update another row in the second session leaving the transaction open as well:
SQL> update test set n=4 where n=2; 1 row updated XID ---------------------- 0x0003.01f.00001eabIndex Block Dump 2
Here is how index block dump looks right now:
Block header dump: 0x0100008b
Object id on Block? Y
seg/obj: 0x12f46 csc: 0x00.1efcd0c itc: 3 flg: E typ: 2 - INDEX
brn: 0 bdba: 0x1000088 ver: 0x01 opc: 0
inc: 0 exflg: 0
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
0x02 0x0004.01c.00001fd5 0x00c011ac.09a5.0b ---- 2 fsc 0x000e.00000000
0x03 0x0003.01f.00001eab 0x00c00e73.0982.31 ---- 2 fsc 0x000e.00000000
Notice that csc value has changed from 0x00.1efcb3c to 0x00.1efcd0c
What happened is just another variation of the theme we saw in part I -- when our second session updates the index block it notices an active transaction in the ITL list and tries to perform a cleanout. It will do the same for the table block as well but since I've shown all the relevant mechanics in the previous post I'll leave it at that.
Undo Segment Header Checks
The important consequence from all the above is that a session which tries to perform a cleanout will have to look into the other transaction(-s) undo segment header block in order to find out whether the other transaction has committed or not:
SQL> select
2 trn.xidusn,
3 rbs.file_id,
4 rbs.block_id header_block,
5 trn.ubablk undo_block,
6 '0x'||to_char(trn.XIDUSN, 'fm000x')||'.'||
7 to_char(trn.XIDSLOT, 'fm00x')||'.'||
8 to_char(trn.XIDSQN, 'fm0000000x') xid
9 from v$transaction trn, dba_rollback_segs rbs
10 where trn.XIDUSN=rbs.segment_id
11 order by 1;
XIDUSN FILE_ID HEADER_BLOCK UNDO_BLOCK XID
---------- ---------- ------------ ---------- ----------------------
3 3 160 3699 0x0003.01f.00001eab
4 3 176 4524 0x0004.01c.00001fd5
Our first session xid was 0x0004.01c.00001fd5 so when our second session performed the update it had to look into block 176 (undo header block) to check the other transaction status and block 4524 (undo block) in order to rollback the other session changes for write consistency checks:
WAIT #140055864053216: nam='db file sequential read' ela= 341 file#=3 block#=176 blocks=1 obj#=0 tim=1392400391392719 WAIT #140055864053216: nam='db file sequential read' ela= 675 file#=3 block#=4524 blocks=1 obj#=0 tim=1392400391393679I'll continue setting the up stage for a perfect disaster with delayed block cleanout and parallel DML in the upcoming series.
Tuesday, January 28, 2014
'active txn count during cleanout', part I
I was going to write a blog post about some peculiar side effects you can get into with the delayed block cleanout when running parallel DML but soon discovered that the entry became so big that I've decided to split it up into a series of more manageable posts.
For a good background on various themes of block cleanout check out Clean it up by Jonathan Lewis.
Active transactions, consistent reads and table scans
First I'm going to show you some interesting observations about what happens when you're trying to select from the block which has an outstanding open transaction in it. Let's create a test table with one row and update it leaving the transaction open. I'm using 11.2.0.4 here:
Before:
For a good background on various themes of block cleanout check out Clean it up by Jonathan Lewis.
Active transactions, consistent reads and table scans
First I'm going to show you some interesting observations about what happens when you're trying to select from the block which has an outstanding open transaction in it. Let's create a test table with one row and update it leaving the transaction open. I'm using 11.2.0.4 here:
SQL> create table test (n number, m number);
Table created
SQL> insert into test values (1, 1);
1 row inserted
SQL> commit;
Commit complete
SQL> select dbms_rowid.rowid_relative_fno(rowid) file#,
2 dbms_rowid.rowid_block_number(rowid) block#
3 from test;
FILE# BLOCK#
---------- ----------
4 134
SQL> update test set n=n;
1 row updated
If we were to look into the buffer headers view (x$bh) we would find the following:
FILE# DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
4 134 xcur 0 0
4 134 cur 595913 0
Now lets select from this table in a different session while checking couple stats at the same time:
NAME VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications 0
active txn count during cleanout 0
SQL> select * from test;
N M
---------- ----------
1 1
NAME VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications 1
active txn count during cleanout 1
SQL> select * from test;
N M
---------- ----------
1 1
NAME VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications 2
active txn count during cleanout 2
SQL> select * from test;
N M
---------- ----------
1 1
NAME VALUE
---------------------------------------------------------------- ----------
immediate (CR) block cleanout applications 3
active txn count during cleanout 3
The first thing worth mentioning is that both immediate (CR) block cleanout applications and active txn count during cleanout statistics increment every time we execute the select. Immediate (CR) block cleanout applications indicates that the session is performing delayed block cleanout while doing the consistent read (CR). Active txn count during cleanout indicates how many currently active transactions the cleanout process encountered in each block. From one perspective this makes sense -- when our select reads the block and discovers that there are open transactions it may not know whether these transactions are currently active or the block indeed requires a cleanout. However, something interesting happened when we look into x$bh again:
FILE# DBABLK STATE CR_SCN_BAS CR_SCN_WRP
---------- ---------- ----- ---------- ----------
4 134 xcur 0 0
4 134 cur 595913 0
4 134 cur 595922 0
4 134 cur 595926 0
4 134 cur 595940 0
Clearly each subsequent select generated a new consistent read version of the block at a different SCN. Indeed, if we were to dump the table's block before and after the last select here is what we would find:
Before:
buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917d6 seq: 0x01 flg: 0x04 tail: 0x17d60601
frmt: 0x02 chkval: 0xf347 type: 0x06=trans data
...
Block header dump: 0x01000086
Object id on Block? Y
seg/obj: 0x113da csc: 0x00.917d6 itc: 2 flg: E typ: 1 - DATA
brn: 0 bdba: 0x1000080 ver: 0x01 opc: 0
inc: 0 exflg: 0
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0002.01f.00000242 0x00c0c19c.0035.19 C--- 0 scn 0x0000.00090e7e
0x02 0x0008.00b.00000241 0x00c00989.0039.38 ---- 1 fsc 0x0000.00000000
After:
buffer tsn: 4 rdba: 0x01000086 (4/134)
scn: 0x0000.000917e4 seq: 0x01 flg: 0x04 tail: 0x17e40601
frmt: 0x02 chkval: 0xf375 type: 0x06=trans data
...
Block header dump: 0x01000086
Object id on Block? Y
seg/obj: 0x113da csc: 0x00.917e4 itc: 2 flg: E typ: 1 - DATA
brn: 0 bdba: 0x1000080 ver: 0x01 opc: 0
inc: 0 exflg: 0
Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0002.01f.00000242 0x00c0c19c.0035.19 C--- 0 scn 0x0000.00090e7e
0x02 0x0008.00b.00000241 0x00c00989.0039.38 ---- 1 fsc 0x0000.00000000
Notice that cleanout scn (cscn) is different and the block got updated with the same scn as well. As you probably have guessed by now, each select generates a block cleanout redo record too:
REDO RECORD - Thread:1 RBA: 0x00000c.00035faf.0010 LEN: 0x006c VLD: 0x05 SCN: 0x0000.000917e4 SUBSCN: 1 01/28/2014 00:16:44 (LWN RBA: 0x00000c.00035faf.0010 LEN: 0001 NST: 0001 SCN: 0x0000.000917e4) CHANGE #1 TYP:0 CLS:1 AFN:4 DBA:0x01000086 OBJ:70618 SCN:0x0000.000917d6 SEQ:1 OP:4.1 ENC:0 RBL:0 Block cleanout record, scn: 0x0000.000917e4 ver: 0x01 opt: 0x01, entries follow...From another perspective why even bother doing any of this? It almost looks like the cleanout code is moving along in tiding up the block, updates scn plus cscn and then discovers that there is really nothing to cleanup because the other transaction is still active. But it did changes to the current version of the block by now which then results in proliferation of CR copies from the selects and increased redo generation from the (essentially) shell block cleanout records. There is going to be more interesting details worth mentioning in the next post.
Saturday, January 25, 2014
crsd.bin core dumps
Core dump issues sometimes can be notoriously difficult to troubleshoot. I've got a call this morning from one of my customers saying that after a power outage Grid Infrastructure is not able to fully come up on some nodes on their Exadata cluster. After further examining the situation it turned out that crsd.bin binary is simply core dumping upon start up.
Troubleshooting Grid Infrastructure startup issues when nothing is core dumping sometimes could be a chore so what could be more fun when it's not able to fully start due to a major daemon core dumping?
One of the useful things to do when a binary core dumps is to get a stack trace to see which function raised the exception (you can examine the core file the gdb, for example, in order to do that). Let's see what the stack trace holds for us:
Armed with that knowledge you can now sift through the Grind Infrastructure logs in a much more effective way because these logs are notoriously big and "chatty" (I think my worst nightmare is when the database alert log will become like GI alert log thereby making it much less useful). And there we have it:
Troubleshooting Grid Infrastructure startup issues when nothing is core dumping sometimes could be a chore so what could be more fun when it's not able to fully start due to a major daemon core dumping?
One of the useful things to do when a binary core dumps is to get a stack trace to see which function raised the exception (you can examine the core file the gdb, for example, in order to do that). Let's see what the stack trace holds for us:
Core was generated by `/u01/app/11.2.0.3/grid/bin/crsd.bin reboot'.
Program terminated with signal 6, Aborted.
#0 0x0000003ea3e30285 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003ea3e30285 in raise () from /lib64/libc.so.6
#1 0x0000003ea3e31d30 in abort () from /lib64/libc.so.6
#2 0x0000003ea56bed94 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib64/libstdc++.so.6
#3 0x0000003ea56bce46 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x0000003ea56bce73 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x0000003ea56bcef9 in __cxa_rethrow () from /usr/lib64/libstdc++.so.6
#6 0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
aclString=..., useOcr=true, $U7=,
$U8=, $U9=,
$V0=, $V1=) at acl.cpp:120
#6 0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
aclString=..., useOcr=true, $U7=,
$U8=, $U9=,
$V0=, $V1=) at acl.cpp:120
#7 0x0000000000df879c in Acl::_ZN3CAA3AclC1ERKSsS2_S2_b (this=0x4556d440,
$U7=, $U8=,
$U9=, $V0=,
$V1=)
#8 0x0000000000a4d81e in SrvResource::initUserId (this=0x7f15803d7550,
$1=) at clsAgfwSrvResource.cpp:204
We can see that the source of the exception is in the Acl::Acl which is then propagated through the standard libraries. Moreover, function SrvResource::initUserId appears in the stack trace as well, which makes you wonder whether there is some issue with some of the resource's Access Control List, in particular with it's user id setting.
Armed with that knowledge you can now sift through the Grind Infrastructure logs in a much more effective way because these logs are notoriously big and "chatty" (I think my worst nightmare is when the database alert log will become like GI alert log thereby making it much less useful). And there we have it:
Exception: ACL entry creation failed for: owner:ggate:rwxTurned out the nodes which were core dumping were recently added to the cluster and the user ggate, which is the owner of the GoldenGate resource, simply did not exist on these nodes. Apparently that was enough to cause crsd.bin core dumps. Yikes!
Tuesday, December 03, 2013
Oracle 12cR1, UDF Pragma and HyperLogLog
One interesting enhancement in 12cR1 PL/SQL is UDF pragma which has the following description:
Test Table
I'll use the same test table as I did in my original post:
Regular Function
Native Distinct
Note that this was an optimal execution!
Summary
Let's summarize results in a table:
As you can see pragma udf actually beats native implementation by a considerable margin which is very impressive given the fact that distict had an optimal execution.
The UDF pragma tells the compiler that the PL/SQL unit is a user defined function that is used primarily in SQL statements, which might improve its performance.I though it would be very cool to try it out with my HyperLogLog post I did recently and see if it results in any measurable performance improvement.
Test Table
I'll use the same test table as I did in my original post:
SQL> create table z_hll_test as
2 select dbms_random.string('x', 4)||rpad('x', 500, 'x') n
3 from dual
4 connect by level <= 1000000;
Table created
SQL> alter table z_hll_test cache;
Table altered
Note that I'm explicitly setting the table to cache in order to make in-memory PQ kick in and eliminate disk I/O as a factor from my test case. For each test I made sure that no physical I/O has happened (including temp I/O for native distinct test).
Regular Function
create or replace function num_zeroes(
p_n binary_integer
) return binary_integer deterministic is
l_t binary_integer;
l_b binary_integer;
begin
--assume 32-bit hash value, 10-bits for bucket
if (p_n = 0) then return 22; end if;
l_t := 1;
l_b := 0;
while ( bitand(p_n,l_t) = 0 )
loop
l_t := l_t*2;
l_b := l_b+1;
end loop;
return l_b;
end num_zeroes;
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from z_hll_test z
15 group by mod(ora_hash(n), 1024)
16 )
17 )
18 );
NUM_DISTINCT
------------
748175
Executed in 0.889 seconds
Pragma UDF Function
create or replace function num_zeroes(
p_n binary_integer
) return binary_integer deterministic is
pragma udf;
l_t binary_integer;
l_b binary_integer;
begin
--assume 32-bit hash value, 10-bits for bucket
if (p_n = 0) then return 22; end if;
l_t := 1;
l_b := 0;
while ( bitand(p_n,l_t) = 0 )
loop
l_t := l_t*2;
l_b := l_b+1;
end loop;
return l_b;
end num_zeroes;
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select /*+ parallel(z 4) */ mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from z_hll_test z
15 group by mod(ora_hash(n), 1024)
16 )
17 )
18 );
NUM_DISTINCT
------------
748175
Executed in 0.593 seconds
Pragma UDF gives us ~33% performance boost which is not too bad considering we didn't have to do anything else. However, that's not the most interesting part -- let's take a look at the native distinct next.
Native Distinct
SQL> select /*+ parallel(z 4) */ count(distinct n) from z_hll_test z;
COUNT(DISTINCTN)
----------------
753204
Executed in 0.983 seconds
Note that this was an optimal execution!
Summary
Let's summarize results in a table:
| Regular Function | Pragma UDF Function | Native Distinct |
| 0.889 | 0.593 | 0.983 |
As you can see pragma udf actually beats native implementation by a considerable margin which is very impressive given the fact that distict had an optimal execution.
Tuesday, November 26, 2013
Result Cache Latch and PDBs
One interesting aspect of Oracle 12cR1 database when it comes to PDBs is how latching is done. For example, if all PDBs have to work under the same latch then contention in one PDB can easily affect users in other PDBs too.
Continuing my series of posts on the Result Cache latch today I'll check what happens when you try to acquire RC latch from two different PDBs.
Session 1
The first session is connected under container name TEST:
The second session is connected under container name TEST2:
Session 1
Continuing my series of posts on the Result Cache latch today I'll check what happens when you try to acquire RC latch from two different PDBs.
Session 1
The first session is connected under container name TEST:
SQL> select sys_context('userenv', 'con_name') con_name from dual;
CON_NAME
--------------------
TEST
SQL> select addr from v$latch where name='Result Cache: RC Latch';
ADDR
----------------
0000000060041C78
Session 2
The second session is connected under container name TEST2:
SQL> select sys_context('userenv', 'con_name') con_name from dual;
CON_NAME
--------------------
TEST2
SQL> select addr from v$latch where name='Result Cache: RC Latch';
ADDR
----------------
0000000060041C78
As you can see the address of the latch is exactly the same under both containers so both PDBs appear to share exactly the same latch. Let's confirm it by trying to acquire the latch in exclusive mode in both sessions:
Session 1
SQL> oradebug setmypid Statement processed. SQL> oradebug call kslgetsl_w 0x0000000060041C78 1 1 1 16 Function returned 1Session 2
SQL> oradebug setmypid Statement processed. SQL> oradebug call kslgetsl_w 0x0000000060041C78 1 1 1 16 ...session hangs......and we can confirm that it waits on the RC latch:
SQL> select event, p1, to_char(p1, 'fmxxxxxxxxxxx') addr, state, seconds_in_wait 2 from v$session_wait 3 where sid=18; EVENT P1 ADDR STATE SECONDS_IN_WAIT ---------- ---------- ------------ ------------------- --------------- latch free 1610882168 60041c78 WAITING 25The bottom line is that the single RC latch appears to be shared by all PDBs.
Monday, November 18, 2013
How to use HyperLogLog to incrementally maintain number of distinct values
In this post I'll show how extremely easy it is to maintain the number of distinct values when using HyperLogLog. Please reference to my previous post for some description how HyperLogLog works.
Let's assume we have a table with some existing data:
Now let's say there is some new data you want to add into the existing table:
Let's assume we have a table with some existing data:
SQL> create table existing_data as 2 select round(dbms_random.value(0, 77777)) n 3 from dual 4 connect by level <= 100000; Table createdPrecise number of distinct values:
SQL> select count(distinct n) from existing_data;
COUNT(DISTINCTN)
----------------
56192
Now in order for the incremental refresh to work we first need to create HyperLogLog synopsis:
SQL> create table existing_hll as 2 select mod(ora_hash(n), 1024) bucket, 3 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val 4 from existing_data 5 group by mod(ora_hash(n), 1024); Table createdThe table is extremely small as it contains only 1024 rows yet it would be enough to describe data with billions of rows in it. Of course we can now use that synopsis to estimate number of distinct values we have using HyperLogLog:
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from existing_hll
12 )
13 );
NUM_DISTINCT
------------
57676
Again, the result is within 2% of the precise distinct, which is pretty good considering how little information we had to store about the entire data set.
Now let's say there is some new data you want to add into the existing table:
SQL> create table new_data as 2 select round(dbms_random.value(5555, 111111)) n 3 from dual 4 connect by level <= 10000; Table createdSo what we're going to do is calculate HyperLogLog synopsis about this new data and then merge it with HyperLogLog synopsis for the existing data:
SQL> merge into existing_hll e 2 using ( 3 select mod(ora_hash(n), 1024) bucket, 4 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val 5 from new_data 6 group by mod(ora_hash(n), 1024) 7 ) n on (e.bucket=n.bucket) 8 when matched then update set e.val=greatest(e.val, n.val) 9 when not matched then insert values (n.bucket, n.val); 1024 rows merged. SQL> commit; Commit complete.Of course the above is a lot more efficient than what you would have to do otherwise, i.e. calculate the number of distinct values from scratch for the entire data set. Once the synopsis has been refreshed we can estimate the new number of distinct values we have:
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from existing_hll
12 )
13 );
NUM_DISTINCT
------------
62288
And it's no different had I computed HyperLogLog for both data sets from scratch:
SQL> select
2 case
3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
4 when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select mod(ora_hash(n), 1024) bucket,
13 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
14 from (
15 select n from existing_data
16 union all
17 select n from new_data
18 ) group by mod(ora_hash(n), 1024)
19 )
20 )
21 );
NUM_DISTINCT
------------
62288
And the precise distinct count:
SQL> select count(distinct n) from
2 (
3 select * from existing_data
4 union all
5 select * from new_data
6 );
COUNT(DISTINCTN)
----------------
60983
I think the ability to incrementally refresh the data is the most powerful aspect of HyperLogLog.
Sunday, November 17, 2013
HyperLogLog in Oracle
Calculating number of unique values using Oracle distinct for big data sets has two major problems:
Places like Facebook and Google solve the above problems by using a very interesting algorithm called HyperLogLog. There is a pretty good description of how it works but, in a nutshell, it relies on counting the number of trailing (or leading) zeroes in the binary stream and using that information to estimate number of unique values with a very high degree of accuracy.
HyperLogLog is widely used due to the following properties:
Of course the simplest way was to implement it as a user-defined aggregate, which I did. Unfortunately the performance of that solution left a lot to be desired due to a simple fact that there is quite a bit of overhead in executing a user-defined aggregate function for each row in the data set. Based on that I don't think there is a lot of practical applications for HyperLogLog in this form apart from doing a technology demo.
However, it's quite simple to compute most of the HyperLogLog directly inside the SQL statement itself and, as expected, that provides a much better performance. So here is a quick example which demonstrates HyperLogLog in action.
Fist I will create a test table:
First let's run Oracle's distinct:
Let's see what happens if we use HyperLogLog instead. First I will create a function to calculate the number of trailing zeroes we have in a number:
The algoirithm can have even more potential in the Exadata environments, if we imagine for a second that Oracle gives us a native implementation which is off-loaded, because each cell can independently compute it's part of the data and then simply sent the results out for final merge. Indeed, this is how places like Facebook and Google scale it across lots of small servers.
- It may require lots of memory for sort/hash group by.
- It is very difficult to refresh distinct numbers incrementally meaning every time you append some new data you generally have to perform distinct calculations from scratch.
Places like Facebook and Google solve the above problems by using a very interesting algorithm called HyperLogLog. There is a pretty good description of how it works but, in a nutshell, it relies on counting the number of trailing (or leading) zeroes in the binary stream and using that information to estimate number of unique values with a very high degree of accuracy.
HyperLogLog is widely used due to the following properties:
- Extremely low memory footprint -- 1KB of memory is enough to estimate the number of unique values with a very high degree of precision across billions of rows.
- It can be used to refresh the numbers incrementally by calculating HyperLogLog for the new data set and then combining it with the HyperLogLog for the existing one.
- It is very parallel friendly. Each piece of data can be independently computed before being combined for the final result -- essentially utilizing the same mechanism as would be used for incremental refresh.
Of course the simplest way was to implement it as a user-defined aggregate, which I did. Unfortunately the performance of that solution left a lot to be desired due to a simple fact that there is quite a bit of overhead in executing a user-defined aggregate function for each row in the data set. Based on that I don't think there is a lot of practical applications for HyperLogLog in this form apart from doing a technology demo.
However, it's quite simple to compute most of the HyperLogLog directly inside the SQL statement itself and, as expected, that provides a much better performance. So here is a quick example which demonstrates HyperLogLog in action.
Fist I will create a test table:
SQL> create table z_hll_test as
2 select dbms_random.string('x', 4)||rpad('x', 500, 'x') n
3 from dual
4 connect by level <= 1000000;
Table created
Next I will set my pga_aggregate_target=64m to simulate a situation where distinct will have to do a lot of spilling to disk due to large amounts of data being processed as I don't want to spent a lot of time generating hundreds of billions of rows instead.
First let's run Oracle's distinct:
SQL> set timing on
SQL> select /*+ parallel(z 4) */ count(distinct n) from z_hll_test z;
753521
Elapsed: 00:00:46.02
As you may figure out, most of the 46 seconds were spent spilling to temp:
Let's see what happens if we use HyperLogLog instead. First I will create a function to calculate the number of trailing zeroes we have in a number:
create or replace function num_zeroes( p_n binary_integer ) return binary_integer deterministic is l_t binary_integer; l_b binary_integer; begin --assume 32-bit hash value, 10-bits for bucket if (p_n = 0) then return 22; end if; l_t := 1; l_b := 0; while ( bitand(p_n,l_t) = 0 ) loop l_t := l_t*2; l_b := l_b+1; end loop; return l_b; end num_zeroes;Now we can use this function to compute the HyperLogLog value directly in a SQL statement:
SQL> select
case
2 3 when hll <= 2560 and zeroes > 0 then round(1024*ln(1024*1/zeroes))
when hll > 1/30 * power(2,32) then round(-power(2,32) * ln(1 - hll/power(2,32)))
4 5 else round(hll)
6 end num_distinct
7 from (
8 select 0.72054 * (1048576/(current_sum+zeroes)) hll, zeroes
9 from (
10 select sum(1/power(2, val)) current_sum, (1024-count(*)) zeroes
11 from (
12 select /*+ parallel(z 4) */
13 mod(ora_hash(n), 1024) bucket,
14 max(num_zeroes(trunc(ora_hash(n)/1024)))+1 val
15 from z_hll_test z
16 group by mod(ora_hash(n), 1024)
17 )
18 )
19 );
738105
Elapsed: 00:00:02.08
The SQL might look a bit complicated but it's pretty straightforward once you realize how HyperLogLog works (see the link I provided above). For you see, the performance is 22x faster because the entire thing was able to compute in-memory (there are only 1024 distinct values left after the group by as far as Oracle is concerned) and result is within 2% of the precise distinct count -- very impressive!
The algoirithm can have even more potential in the Exadata environments, if we imagine for a second that Oracle gives us a native implementation which is off-loaded, because each cell can independently compute it's part of the data and then simply sent the results out for final merge. Indeed, this is how places like Facebook and Google scale it across lots of small servers.
Tuesday, August 27, 2013
Flashback query FTS costs
There has been some information written on the subject already (see this post by Randolf Geist).
In a nutshell, the way optimizer costs full table scans when using flashback query makes it look much more expensive than without. What further complicates the problem is the fact that index access costs remain the same regardless of whether you're using flashback query or not. As a result you will be much more likely to see an index access paths when using flashback query.
Consider the following example:
Now let's see what happens if we run the same query as above but join into this view instead:
One way to deal with the problem is to see how much more expensive the table scan became when using flashback query:
In a nutshell, the way optimizer costs full table scans when using flashback query makes it look much more expensive than without. What further complicates the problem is the fact that index access costs remain the same regardless of whether you're using flashback query or not. As a result you will be much more likely to see an index access paths when using flashback query.
Consider the following example:
SQL> create table test as
2 select level n, rpad('x', 200, 'x') v
3 from dual
4 connect by level <= 10000;
Table created
SQL> alter table test add constraint pk_test primary key (n);
Table altered
SQL> exec dbms_stats.gather_table_stats(user, 'test');
PL/SQL procedure successfully completed
If I were to execute the following select it will run using a HASH JOIN:
SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
from v, test t
where v.n=t.n; 2 3 4 5
Execution Plan
----------------------------------------------------------
Plan hash value: 690578125
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 21800 | 60 (2)| 00:00:01 |
|* 1 | HASH JOIN | | 100 | 21800 | 60 (2)| 00:00:01 |
| 2 | VIEW | | 100 | 1300 | 2 (0)| 00:00:01 |
|* 3 | CONNECT BY WITHOUT FILTERING| | | | | |
| 4 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL | TEST | 10000 | 2001K| 57 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("V"."N"="T"."N")
3 - filter(LEVEL<=100)
In the particular case I'm talking about the flashback query was hidden behind the view (to make it transparent to the application) so we essentially had the following:
SQL> create or replace view v_test as 2 select * from test as of scn dbms_flashback.get_system_change_number; View created(dbms_flashback.get_system_change_number is only used here as a substitute example)
Now let's see what happens if we run the same query as above but join into this view instead:
SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
from v, v_test t
where v.n=t.n; 2 3 4 5
Execution Plan
----------------------------------------------------------
Plan hash value: 3196053776
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 21800 | 102 (0)| 00:00:02 |
| 1 | NESTED LOOPS | | | | | |
| 2 | NESTED LOOPS | | 100 | 21800 | 102 (0)| 00:00:02 |
| 3 | VIEW | | 100 | 1300 | 2 (0)| 00:00:01 |
|* 4 | CONNECT BY WITHOUT FILTERING| | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
|* 6 | INDEX UNIQUE SCAN | PK_TEST | 1 | | 0 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID | TEST | 1 | 205 | 1 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(LEVEL<=100)
6 - access("V"."N"="N")
The plan suddenly changed to NL join! What happened is flashback query cranked the FTS cost up while leaving index access cost to be the same thus making an FTS to be much less appealing choice for the optimizer. This can make a lot of a difference especially if you're running in the Exadata environment. So what do you do?
One way to deal with the problem is to see how much more expensive the table scan became when using flashback query:
SQL> select * from test union all select * from v_test; 2 3 Execution Plan ---------------------------------------------------------- Plan hash value: 2275963031 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 20000 | 4003K| 367 (85)| 00:00:05 | | 1 | UNION-ALL | | | | | | | 2 | TABLE ACCESS FULL| TEST | 10000 | 2001K| 57 (0)| 00:00:01 | | 3 | TABLE ACCESS FULL| TEST | 10000 | 2001K| 310 (0)| 00:00:04 | ---------------------------------------------------------------------------The cost went up from 57 to 310. That's how much more expensive the view will going to make an FTS look like to the optimizer. So now we can counter-balance that increase with the corresponding increase in cost of index scans using opt_param hint in our view:
SQL> select round((310/57)*100) x from dual;
X
----------
544
SQL> create or replace view v_test as
2 select /*+ opt_param('optimizer_index_cost_adj',544) */ *
3 from test as of scn dbms_flashback.get_system_change_number;
View created
The plan will now go back to a HASH JOIN
SQL> with v as
(select /*+ cardinality(100) */ level n from dual connect by level <= 100)
select *
from v, v_test t
where v.n=t.n; 2 3 4 5
Execution Plan
----------------------------------------------------------
Plan hash value: 690578125
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 21800 | 313 (1)| 00:00:04 |
|* 1 | HASH JOIN | | 100 | 21800 | 313 (1)| 00:00:04 |
| 2 | VIEW | | 100 | 1300 | 2 (0)| 00:00:01 |
|* 3 | CONNECT BY WITHOUT FILTERING| | | | | |
| 4 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL | TEST | 10000 | 2001K| 310 (0)| 00:00:04 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("V"."N"="N")
3 - filter(LEVEL<=100)
Of course there are a lot of limitations to this approach, mainly that any query which references the view will get the index access cost adjusted accordingly so if you have a bunch of flashback and non-flashback tables in the same query it deserves a careful thought.
Thursday, July 25, 2013
Enkitec E4 2013
Just a quick note that I'll be presenting at this year's Enkitec E4 conference. You can find the schedule here. I did some under the hood investigation regarding how the whole DBFS stack works from the performance perspective and, needless to say, some findings simply left me startled. If you want to see a session which will forever change the way you look at DBFS then this will definitely be the one to attend.
Tuesday, July 23, 2013
Oracle GoldenGate Integrated Capture #2
I have already written some words on the subject before. However, since then some interesting things have happened.
To recap (or save you some time if you don't want to read the original article) GoldenGate Integrated Capture is nothing else but Oracle Streams Capture in disguise. When running in the Integrated Capture mode pretty much the entire GoldenGate front-end gets yanked out of the picture and replaced with Oracle Streams Capture technology instead.
Let's take a closer look at some of the differences that happened between now and then. Here are two capture processes:
Indeed, if we take a look at the stats for both processes...
To recap (or save you some time if you don't want to read the original article) GoldenGate Integrated Capture is nothing else but Oracle Streams Capture in disguise. When running in the Integrated Capture mode pretty much the entire GoldenGate front-end gets yanked out of the picture and replaced with Oracle Streams Capture technology instead.
Let's take a closer look at some of the differences that happened between now and then. Here are two capture processes:
SQL> select capture_name, rule_set_name, purpose from dba_capture order by 1; CAPTURE_NAME RULE_SET_NAME PURPOSE ------------------------------ ------------------------------ ------------------- OGG$CAP_GG1_EXT1 OGG$GG1_EXT1_CAPTURE_I GoldenGate Capture OGG$CAP_GG1_EXT2 GoldenGate CaptureOGG$CAP_GG1_EXT1 was created using GoldenGate 11.2.1.0.3 while OGG$CAP_GG1_EXT2 was created using 11.2.1.0.6BP3. Notice how RULE_SET_NAME is empty for the second one? Without the rule set the Capture process will just capture everything that's happening in the database so my first concern was whether it's going to negatively affect the performance?
Indeed, if we take a look at the stats for both processes...
SID CAPTURE_NAME STARTUP_TIME TOTAL_PREFILTER_DISCARDED TOTAL_MESSAGES_ENQUEUED
---------- ---------------- --------------------- ------------------------- -----------------------
33 OGG$CAP_GG1_EXT1 23/07/2013 2:21:56 PM 47 71
147 OGG$CAP_GG1_EXT2 23/07/2013 2:21:56 PM 28 2589
...notice a big difference in TOTAL_MESSAGES_ENQUEUED. Both processes were started at exactly the same time and they are capturing from exactly the same objects. This difference makes sense -- without a rule set in place the second process will enqueue every change it sees. By itself, this change would be pretty bad, but something else has happened as well:
SQL> select capture_name, v.program 2 from v$streams_capture c, v$session v 3 where c.SID = v.SID; CAPTURE_NAME PROGRAM ------------------------------ ------------------------------------------------ OGG$CAP_GG1_EXT1 oracle@gg1.quadro.com (CP01) OGG$CAP_GG1_EXT2 extract@gg1.quadro.com (TNS V1-V3)Notice how OGG$CAP_GG1_EXT1 is an Oracle shadow process (a Streams Capture process) while OGG$CAP_GG1_EXT2 is the Extract process itself! What we had before is the Extract process acting as an XStreams client to the Streams Capture process. With this change it is no longer required and we can see that the XStreams client is now detached from the server:
SQL> select server_name, capture_name, status from dba_xstream_outbound order by 1; SERVER_NAME CAPTURE_NAME STATUS ------------------------------ ------------------------------ -------- OGG$GG1_EXT1 OGG$CAP_GG1_EXT1 ATTACHED OGG$GG1_EXT2 OGG$CAP_GG1_EXT2 DETACHEDThat clears the concern of an empty rule set -- the Extract can now filter the records internally because nothing needs to be shipped out anyways where before it would have resulted in a massive traffic between the Capture process, buffered queue and the XStreams client.
Wednesday, March 13, 2013
Parallel unfriendly
Take a look at the following Parallel section of a SQL Monitor report:
Any query which produces such a report won't care about how much parallel you're running because virtually all the time is spent by the query coordinator (which is a serial process) being busy.
In this case the query in question is quite simple:
So what do you do when you get impacted by a particular choice of algorithm implemented by Oracle? One way to deal with it is to see whether you can trade one set of problem for another, in case the alternative can be executed in a better way. In this particular case the fact table contains the sale transactions for a particular store chain. While there are many different ways to spent money, the number of distinct spending amounts should be relatively low compared to the number of rows we have in the table and in such a case we can calculate the median in a different way.
What we can do instead is count how many occurrences of each spending we have and, when sorted by the spending amount, that will give us a compressed form of the raw data which still retains all the information required to find a median. Let's say you have a table with the following data:
Our new real query will look like this:
The total execution time has dropped from almost 26 minutes down to 28 seconds. Moreover, the workload is now much more skewed towards parallel query slaves, which is exactly what we want to see. Of course, the trick only works if the group by is able to collapse the data sufficiently enough.
Any query which produces such a report won't care about how much parallel you're running because virtually all the time is spent by the query coordinator (which is a serial process) being busy.
In this case the query in question is quite simple:
select /*+ parallel(t,8) */ median(basket_amount) from whs.fact_sale tThe reason it behaves the way it does has everything to do with how Oracle executes it:
Execution Plan ---------------------------------------------------------- Plan hash value: 712547042 ----------------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib | ----------------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 4 | 110K (3)| 00:00:03 | | | | | | | 1 | SORT GROUP BY | | 1 | 4 | | | | | | | | | 2 | PX COORDINATOR | | | | | | | | | | | | 3 | PX SEND QC (RANDOM) | :TQ10000 | 724M| 2763M| 110K (3)| 00:00:03 | | | Q1,00 | P->S | QC (RAND) | | 4 | PX BLOCK ITERATOR | | 724M| 2763M| 110K (3)| 00:00:03 | 1 |1048575| Q1,00 | PCWC | | | 5 | TABLE ACCESS STORAGE FULL| FACT_SALE | 724M| 2763M| 110K (3)| 00:00:03 | 1 |1048575| Q1,00 | PCWP | | -----------------------------------------------------------------------------------------------------------------------------------------Each parallel query slave will gets it's own chunk of the table to read from and then simply send data back to the coordinator. The coordinator will then have to deal with all this data by sorting more than 700M rows which, of course, won't be particularly fast. In this sense median poses an interesting problem since Oracle can't calculate (or, rather, discover) it without having access to the entire data set and query coordinator is the only process which can do it.
So what do you do when you get impacted by a particular choice of algorithm implemented by Oracle? One way to deal with it is to see whether you can trade one set of problem for another, in case the alternative can be executed in a better way. In this particular case the fact table contains the sale transactions for a particular store chain. While there are many different ways to spent money, the number of distinct spending amounts should be relatively low compared to the number of rows we have in the table and in such a case we can calculate the median in a different way.
What we can do instead is count how many occurrences of each spending we have and, when sorted by the spending amount, that will give us a compressed form of the raw data which still retains all the information required to find a median. Let's say you have a table with the following data:
SQL> select n from z_t;
N
----------
1
1
2
3
3
5
7
7 rows selectedThe first step is to find how many occurrences of each value do we have:
SQL> select n, count(*) cnt
2 from z_t
3 group by n;
N CNT
---------- ----------
1 2
2 1
5 1
3 2
7 1If the number of distinct values is relatively low, the group by will be able to collapse the result set enough as to make subsequent work to be not very significant as well as do it in a very parallel friendly way. The next step is to calculate the cardinality of the data set, at which places we have each distinct value as well as how many values are there:
SQL> select n, lag(running_sum, 1, 0) over (order by n) prev_running_sum, running_sum, total_row_count
2 from (
3 select n,
4 sum(cnt) over (order by n) running_sum,
5 sum(cnt) over () total_row_count
6 from (
7 select n, count(*) cnt
8 from z_t
9 group by n
10 ));
N PREV_RUNNING_SUM RUNNING_SUM TOTAL_ROW_COUNT
---------- ---------------- ----------- ---------------
1 0 2 7
2 2 3 7
3 3 5 7
5 5 6 7
7 6 7 7
So what the above tells us is we have two 1s, followed by a single 2, followed by two 3s and so on. Because we have seven elements in our data set, we know that the median will be the item number four which we can now easily discover:
SQL> select avg(n) from (
2 select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count
3 from (
4 select n,
5 sum(cnt) over (order by n) value_begin,
6 sum(cnt) over () total_row_count
7 from (
8 select n, count(*) cnt
9 from z_t
10 group by n
11 ))) where total_row_count/2 between prev_value_begin and value_begin;
AVG(N)
----------
3The avg is there for a case where we have an even number of elements since the median in this case equals to a mean value of two values in the middle.
Our new real query will look like this:
select avg(n) from ( select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count from ( select n, sum(cnt) over (order by n) value_begin, sum(cnt) over () total_row_count from ( select /*+ parallel(t,8) */ basket_amount n, count(*) cnt from whs.fact_sale t group by basket_amount ))) where total_row_count/2 between prev_value_begin and value_begin;So what does a group by and a set of analytic functions is able to bring to a table? Let's take a look:
The total execution time has dropped from almost 26 minutes down to 28 seconds. Moreover, the workload is now much more skewed towards parallel query slaves, which is exactly what we want to see. Of course, the trick only works if the group by is able to collapse the data sufficiently enough.
Tuesday, February 26, 2013
In-memory PQ and physical reads
In my previous post I've demonstrated how in-memory PQ can access the table directly from the buffer cache even when you're using manual DOP.
One interesting question, however, is what happens when PQ slave needs to read some blocks from disk given that object has been qualified for in-memory (cached) access? Would the slave do it using direct or buffered I/O?
The answer becomes somewhat clear once you realize that in-memory PQ is enabled by simply utilizing buffered reads. Since direct path reads can not take advantage of any blocks stored in the buffer cache (local or remote), trying to do direct path reads will defeat the whole point of in-memory PQ.
To demonstrate the point here is the excerpt from a tkprof output for one of the parallel query slaves:
On another note it looks like the things came full circle. Serial sessions are now able to utilize direct path reads while parallel query slaves are able to do buffered I/O.
All tests were done on 11.2.0.3 (both Exadata and non-Exadata).
One interesting question, however, is what happens when PQ slave needs to read some blocks from disk given that object has been qualified for in-memory (cached) access? Would the slave do it using direct or buffered I/O?
The answer becomes somewhat clear once you realize that in-memory PQ is enabled by simply utilizing buffered reads. Since direct path reads can not take advantage of any blocks stored in the buffer cache (local or remote), trying to do direct path reads will defeat the whole point of in-memory PQ.
To demonstrate the point here is the excerpt from a tkprof output for one of the parallel query slaves:
Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
0 0 0 SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us)
0 0 0 PX COORDINATOR (cr=0 pr=0 pw=0 time=0 us)
0 0 0 PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
1 1 1 SORT AGGREGATE (cr=62838 pr=62500 pw=0 time=7492751 us)
57696 62500 72120 PX BLOCK ITERATOR (cr=62838 pr=62500 pw=0 time=24528381 us cost=18846 size=0 card=500000)
57696 62500 72120 TABLE ACCESS FULL Z_TEST (cr=62838 pr=62500 pw=0 time=23944184 us cost=18846 size=0 card=500000)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
PX Deq: Execution Msg 128 0.19 1.06
db file scattered read 4546 0.16 43.99
latch free 1 0.00 0.00
latch: cache buffers lru chain 31 0.01 0.03
latch: cache buffers chains 3 0.00 0.00
latch: object queue header operation 4 0.00 0.00
Note that the slave waited on db file scattered read event which is nothing else but buffered multiblock reads. If you were to run the same test on the Exadata platform you would see cell multiblock physical read event instead, given that in-memory PQ did get utilized. There are a couple of consequences for this:
- There is no need to do object level checkpoint. This makes in-memory PQ somewhat more friendly to be running in OLTP environment since it doesn't need to flush any dirty buffers to disk.
- If you running on the Exadata platform, none of the offloading will happen, even if you have to read the significant portion of the table from disk.
On another note it looks like the things came full circle. Serial sessions are now able to utilize direct path reads while parallel query slaves are able to do buffered I/O.
All tests were done on 11.2.0.3 (both Exadata and non-Exadata).
Monday, February 18, 2013
Does in-memory PQ work with PARALLEL_DEGREE_POLICY=MANUAL?
In-memory parallel execution seems to be gaining popularity especially among people running x2-8 and x3-8 Exadata systems or any other system that have large amounts of memory capable of caching lots of data.
Oracle documentation suggests that in order to utilize in-memory PQ, parallel_degree_policy needs to be set to auto.
_parallel_cluster_cache_policy
One of the parameters influenced by parallel_degree_policy is _parallel_cluster_cache_policy. When using Auto DOP _parallel_cluster_cache_policy will be set to cached. The question then becomes what happens if we set _parallel_cluster_cache_policy=cached while still keeping Manual DOP? Will the system use in-memory PQ?
Test table
Below is a test table setup:
Classic PQ #1
Without setting any additional parameters PQ behave the way it always did -- by utilizing direct path reads directly to the process' memory:
Caching the table
Of course, before testing the in-memory PQ, we need to make sure that our entire table sits in the buffer cache. The easiest way to do it is perform an FTS on the table using an index:
Classic PQ #2
Even with the table entirely cached we still get it using physical reads when utilizing classic PQ -- as it should be:
Let's flip the parameter responsible for in-memory PQ (while still keeping parallel_degree_policy=manual) and see what happens:
Conclusion
It is possible to use in-memory PQ with Manual DOP by setting _parallel_cluster_cache_policy=cached. Of course, always consult with Oracle support before flipping any of the underscore parameters.
Oracle documentation suggests that in order to utilize in-memory PQ, parallel_degree_policy needs to be set to auto.
_parallel_cluster_cache_policy
One of the parameters influenced by parallel_degree_policy is _parallel_cluster_cache_policy. When using Auto DOP _parallel_cluster_cache_policy will be set to cached. The question then becomes what happens if we set _parallel_cluster_cache_policy=cached while still keeping Manual DOP? Will the system use in-memory PQ?
Test table
Below is a test table setup:
SQL> create table z_test tablespace data as
select level n, rpad('*', 4000, '*') v
from dual
connect by level <= 500000;
Table created.
SQL> alter table z_test add constraint pk_z_test primary key (n);
Table altered.
SQL> select bytes/power(1024,2) mb
from user_segments
where segment_name='Z_TEST';
MB
----------
3968
SQL> exec dbms_stats.gather_table_stats(user, 'z_test');
PL/SQL procedure successfully completed.
The instance is running with 12G buffer cache so it'll have no problems fully caching the above table. All tests were done on my in-house test lab with Oracle 11.2.0.3.3 running inside a Linux VM.
Classic PQ #1
Without setting any additional parameters PQ behave the way it always did -- by utilizing direct path reads directly to the process' memory:
SQL> set timing on
SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:02.86
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
25 recursive calls
0 db block gets
500525 consistent gets
500000 physical reads
0 redo size
526 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
It took us about 2.86 seconds to full scan the table which equals 1387MB/s throughput (my test lab storage setup is described here). The above clearly shows that we had to do physical reads in order to access the entire table.
Caching the table
Of course, before testing the in-memory PQ, we need to make sure that our entire table sits in the buffer cache. The easiest way to do it is perform an FTS on the table using an index:
SQL> select /*+ index(z_test,pk_z_test) */ v from z_test;
500000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 579016438
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 500K| 1907M| 501K (1)| 00:00:03 |
| 1 | TABLE ACCESS BY INDEX ROWID| Z_TEST | 500K| 1907M| 501K (1)| 00:00:03 |
| 2 | INDEX FULL SCAN | PK_Z_TEST | 500K| | 1052 (1)| 00:00:01 |
-----------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
534311 consistent gets
501105 physical reads
0 redo size
2021185355 bytes sent via SQL*Net to client
367187 bytes received via SQL*Net from client
33335 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
500000 rows processed
Let's check to make sure all table blocks are in the buffer cache:
SQL> set autot off
SQL> select count(*)
from v$bh
where objd= (select data_object_id from user_objects where object_name='Z_TEST')
and status='xcur';
COUNT(*)
----------
500001
Now we're good to go!
Classic PQ #2
Even with the table entirely cached we still get it using physical reads when utilizing classic PQ -- as it should be:
SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:02.83
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
24 recursive calls
0 db block gets
500525 consistent gets
500000 physical reads
0 redo size
526 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
In-memory PQ
Let's flip the parameter responsible for in-memory PQ (while still keeping parallel_degree_policy=manual) and see what happens:
SQL> alter session set "_parallel_cluster_cache_policy"=cached;
Session altered.
Elapsed: 00:00:00.01
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:00.36
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
24 recursive calls
0 db block gets
502709 consistent gets
0 physical reads
0 redo size
526 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
For you see -- the entire table got read from the buffer cache this time and much faster! The fact that we did zero physical IOs shows in-memory PQ kicking in.
Conclusion
It is possible to use in-memory PQ with Manual DOP by setting _parallel_cluster_cache_policy=cached. Of course, always consult with Oracle support before flipping any of the underscore parameters.
Monday, January 28, 2013
GoldenGate and transient PK updates
The problem of transient PK updates is well known and comes from the fact that pretty much every replication solution on the market applies changes using row-by-row approach.
To quickly recap the problem, if you have a table like:
Oracle Streams historically dealt with that problem using internal mechanism which you could leverage by executing a specially constructed LCR. Other (third-party) replication solution were pretty much out of luck and had to rely on elaborate tricks in order to work around the problem.
This is all about to change.
The necessity to better integrate GoldenGate and bring it feature set up has driven quite a bit of exciting innovation. One of these innovations solves the transient PK problem not only for GoldenGate but for everybody else.
dbms_xstream_gg package
The above package has been available at least since 11.2.0.2 and has two procedures which are directly relevant to the problem described above. I'm talking about enable_tdup_workspace and disable_tdup_workspace. Here is a quick demonstration of how they work:
I think anybody who were into any sort of replication and its problems will find this to be one of the most significant new features made available. The only caveat is that the above package is not documented so anyone thinking about leveraging it needs to carefully think about the way it behaves.
To quickly recap the problem, if you have a table like:
SQL> create table tpk (n number primary key); Table created SQL> insert into tpk values (1); 1 row inserted SQL> insert into tpk values (2); 1 row inserted SQL> commit; Commit completeThen executing the following statement...
update tpk set n=n+1...will result in a transient PK problem since the replication solution will have to decompose it into the following two statements:
update tpk set n=2 where n=1; update tpk set n=3 where n=2;There are two immediate (and major) problems with the above statements. The first problem is that we can't execute the first statement without violating the primary key constraint. Another problem is, even if we somehow could execute the first statement, the second statement will result in updating both rows since they now have the same value!
Oracle Streams historically dealt with that problem using internal mechanism which you could leverage by executing a specially constructed LCR. Other (third-party) replication solution were pretty much out of luck and had to rely on elaborate tricks in order to work around the problem.
This is all about to change.
The necessity to better integrate GoldenGate and bring it feature set up has driven quite a bit of exciting innovation. One of these innovations solves the transient PK problem not only for GoldenGate but for everybody else.
dbms_xstream_gg package
The above package has been available at least since 11.2.0.2 and has two procedures which are directly relevant to the problem described above. I'm talking about enable_tdup_workspace and disable_tdup_workspace. Here is a quick demonstration of how they work:
SQL> --this will result in PK violation SQL> update tpk set n=2 where n=1; update tpk set n=2 where n=1 ORA-00001: unique constraint (ROOT.SYS_C005031) violated SQL> exec dbms_xstream_gg.enable_tdup_workspace; PL/SQL procedure successfully completed SQL> --this is now works! SQL> update tpk set n=2 where n=1; 1 row updated SQL> update tpk set n=3 where n=2; 1 row updated SQL> exec dbms_xstream_gg.disable_tdup_workspace; PL/SQL procedure successfully completed SQL> commit; Commit completeAs you can see, the procedure allows us to avoid classical transient PK problem! Indeed, that's what GoldenGate uses internally to avoid getting in troubles as well. The implementation seems to be leveraging the same delete+insert trick Oracle Streams did:
SQL> exec dbms_xstream_gg.enable_tdup_workspace;
PL/SQL procedure successfully completed
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
1 AAAO25AAFAAClWFAAA
2 AAAO25AAFAAClWFAAB
SQL> update tpk set n=2 where n=1;
1 row updated
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
2 AAAO25AAFAAClWFAAB
SQL> update tpk set n=3 where n=2;
1 row updated
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
2 AAAO25AAFAAClWFAAA
3 AAAO25AAFAAClWFAAB
Note how the row mysteriously disappears after the first update and then suddenly comes back after the second one?
I think anybody who were into any sort of replication and its problems will find this to be one of the most significant new features made available. The only caveat is that the above package is not documented so anyone thinking about leveraging it needs to carefully think about the way it behaves.
Thursday, December 20, 2012
Oracle GoldenGate Sequence Replication
When using Oracle GoldenGate sequence replication there is a number of issues you need to be aware of especially if you replicate quite a lot of busy sequences.
The first issue is that GoldenGate sequence replication does not use bind variables. Let's execute the following statements on the source system:
When using higher cache values the problem of hard parses gets somewhat mitigated but there is another issue. When replicating such a sequence GoldenGate follows these steps:
The first issue is that GoldenGate sequence replication does not use bind variables. Let's execute the following statements on the source system:
SQL> create sequence rep1.s1 nocache;
Sequence created
SQL> select rep1.s1.nextval from dual;
NEXTVAL
----------
1
SQL> select rep1.s1.nextval from dual;
NEXTVAL
----------
2
GoldenGate uses PL/SQL procedure called replicateSequence each time it needs to sync sequence values. The following calls will be made on the destination system as a result of the above statements:
BEGIN ggext .replicateSequence (TO_NUMBER(2), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;
BEGIN ggext .replicateSequence (TO_NUMBER(3), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;
The first parameter is a target sequence value (seq$.highwater) and it's the one which is causing most of the issues, especially if the sequence has been declared with relatively low cache value (or nocache at all, as in my example). Every time a new sequence last value gets written into the source system data dictionary we get a hard parse on the destination!
When using higher cache values the problem of hard parses gets somewhat mitigated but there is another issue. When replicating such a sequence GoldenGate follows these steps:
- Sets sequence to nocache
- Executes sequence.nextval until it reaches the target value
- Restores altered sequence properties
Saturday, October 20, 2012
ZFS Home Storage Network at 10GbE
About a year ago I've decided to put all my data on a home built ZFS storage server. The growing amount of devices around my household prompted for an easier and much faster way to share the data. Since then the box was happily serving both CIFS as well as iSCSI over 1GbE network without any issues.
I was keen on upgrading to 10GbE for quite some time as both my server as well as clients could easily saturate 1GbE link when ZFS had all the required data in ARC. 32GB RAM in my storage server usually left me with the ARC of about 20GB which in most cases happened to be enough to cache the entire workset I was working with. Bottom line is the box rarely had to touch the disk and even if it did there was 120GB L2ARC SSD to even the bumps capable of maxing out 1GbE link as well.
It so happened that I managed to get my hands on a pair of 10GBASE-T Emulex OCe11102-NT NICs which I bought at a significant discount. With 10GBASE-T switches still costing upwards of multiple thousand dollars (even when used) I decided to just get a pair of CAT6 crossover cables running from the ZFS storage box to my workstation and do some tests to see what this configuration could be capable of.
Storage Server
My storage server is running Solaris 11 and the storage pool is built using 4x3TB Hitachi 5400RPM drives in RAID10 (ZFS mirror). The box has 32GB RAM and 120GB Vertex 3 MAX IOPS SSD for L2ARC. As mentioned above, the cache subsystem is enough to keep the box from hitting the disks most of the time. All that is driven by Intel Core i7-3770 CPU (Ivy Bridge).
iSCSI network
I've decided to dedicate 10GbE adapters to the iSCSI network I have between the storage box and my workstation. First of all, this is where I need all the speed I can get. Secondly, I can utlilize both ports with iSCSI MPIO thus archiving 20Gb available bandwidth. This is probably a total overkill but since my cards are dual ported I may as well use both ports as all I need is an extra cat6 cable. The network utilizes 9K jumbo frames. ZFS volume is using 4K block size to match NTFS file system cluster size built on top of the iSCSI volume. COMSTAR is used as an iSCSI target server with Microsoft iSCSI Initiator used for the client.
Test Results - IOPS
I'll start with IOPS results with 100% random read access over 20GB of data using Iometer at different block sizes and worker counts. Each worker was set to do 16 outstanding I/Os.
With 4K blocks the system is able to archive quite impressive 226K IOPS! The storage sever CPU is simply running flat out at this point so I'm confident there is more available from the network. At 16K blocks the system is pushing over 1.5GB/s of random IO which is equally impressive and clearly goes beyond what a single 10GbE link is capable of so the second link is certainly being put to a good use.
Test Results - Bandwidth
For bandwidth test I've just set Iometer to do 1MB sequential reads with 16 outstanding IO/s per worker.
Even with a single worker the system can push 2085MB/s across the wire which is getting quite close to the maximum practical speed you can get out of 2x10GbE NICs so I'm quite happy with this result!
Conclusion
I'll be doing more testing in the upcoming days but so far it appears that the upgrade was totally worth it. Having a home system capable of pushing 226K IOPs and 2GB/s bandwidth is an impressive show of how far you can push the system consisting mostly of consumer grade components. Keep in mind that the only way I could get the above numbers is by making sure all the data is available in ZFS ARC which was the initial goal of my setup.
I was keen on upgrading to 10GbE for quite some time as both my server as well as clients could easily saturate 1GbE link when ZFS had all the required data in ARC. 32GB RAM in my storage server usually left me with the ARC of about 20GB which in most cases happened to be enough to cache the entire workset I was working with. Bottom line is the box rarely had to touch the disk and even if it did there was 120GB L2ARC SSD to even the bumps capable of maxing out 1GbE link as well.
It so happened that I managed to get my hands on a pair of 10GBASE-T Emulex OCe11102-NT NICs which I bought at a significant discount. With 10GBASE-T switches still costing upwards of multiple thousand dollars (even when used) I decided to just get a pair of CAT6 crossover cables running from the ZFS storage box to my workstation and do some tests to see what this configuration could be capable of.
Storage Server
My storage server is running Solaris 11 and the storage pool is built using 4x3TB Hitachi 5400RPM drives in RAID10 (ZFS mirror). The box has 32GB RAM and 120GB Vertex 3 MAX IOPS SSD for L2ARC. As mentioned above, the cache subsystem is enough to keep the box from hitting the disks most of the time. All that is driven by Intel Core i7-3770 CPU (Ivy Bridge).
iSCSI network
I've decided to dedicate 10GbE adapters to the iSCSI network I have between the storage box and my workstation. First of all, this is where I need all the speed I can get. Secondly, I can utlilize both ports with iSCSI MPIO thus archiving 20Gb available bandwidth. This is probably a total overkill but since my cards are dual ported I may as well use both ports as all I need is an extra cat6 cable. The network utilizes 9K jumbo frames. ZFS volume is using 4K block size to match NTFS file system cluster size built on top of the iSCSI volume. COMSTAR is used as an iSCSI target server with Microsoft iSCSI Initiator used for the client.
Test Results - IOPS
I'll start with IOPS results with 100% random read access over 20GB of data using Iometer at different block sizes and worker counts. Each worker was set to do 16 outstanding I/Os.
With 4K blocks the system is able to archive quite impressive 226K IOPS! The storage sever CPU is simply running flat out at this point so I'm confident there is more available from the network. At 16K blocks the system is pushing over 1.5GB/s of random IO which is equally impressive and clearly goes beyond what a single 10GbE link is capable of so the second link is certainly being put to a good use.
Test Results - Bandwidth
For bandwidth test I've just set Iometer to do 1MB sequential reads with 16 outstanding IO/s per worker.
Even with a single worker the system can push 2085MB/s across the wire which is getting quite close to the maximum practical speed you can get out of 2x10GbE NICs so I'm quite happy with this result!
Conclusion
I'll be doing more testing in the upcoming days but so far it appears that the upgrade was totally worth it. Having a home system capable of pushing 226K IOPs and 2GB/s bandwidth is an impressive show of how far you can push the system consisting mostly of consumer grade components. Keep in mind that the only way I could get the above numbers is by making sure all the data is available in ZFS ARC which was the initial goal of my setup.
Thursday, July 26, 2012
Exporting DBFS via NFS
Anybody who was thinking about exporting DBFS via NFS have probably stumbled upon the fact the Oracle says it can not be done:
There is nothing wrong with DBFS itself. The problem originated form the fact that FUSE did not have proper interfaces implemented to support exporting by the kernel. Newer versions of the Linux kernel fully support exporting. I know that OEL 6.x works for sure as I did the DBFS exports myself through both NFS as well as Samba. The common minimum kernel version circulating across the internet seems to be 2.6.27 but I haven't had a chance to check whether it's true.
Older Kernels
Fact of the matter is -- it was always possible to export FUSE via NFS. You just had to use user mode NFS server, like UNFS3. I did that too and know that it works regardless of the kernel version you're running. Unfortunately projects like UNFS3 are lagging in development as well as feature-wise. But they do work if you have to have the NFS exports working and stuck with the older Linux kernels.
DBFS does not support exporting NFS or SAMBA exportsWhat's wrong with DBFS?
There is nothing wrong with DBFS itself. The problem originated form the fact that FUSE did not have proper interfaces implemented to support exporting by the kernel. Newer versions of the Linux kernel fully support exporting. I know that OEL 6.x works for sure as I did the DBFS exports myself through both NFS as well as Samba. The common minimum kernel version circulating across the internet seems to be 2.6.27 but I haven't had a chance to check whether it's true.
Older Kernels
Fact of the matter is -- it was always possible to export FUSE via NFS. You just had to use user mode NFS server, like UNFS3. I did that too and know that it works regardless of the kernel version you're running. Unfortunately projects like UNFS3 are lagging in development as well as feature-wise. But they do work if you have to have the NFS exports working and stuck with the older Linux kernels.
Tuesday, July 24, 2012
Oracle GoldenGate Integrated Capture
Oracle GoldenGate 11.2 release notes contain an interesting new feature:
Stack dumps
Stack dump reveals a lot of interesting information already (I've left only relevant pieces in place):
XStream
Oracle GoldenGate Extract process became nothing else but an XStream client which receives LCRs from Streams Capture process and converts these to GoldenGate trail files. Anybody who was worried about Oracle Streams future because GoldenGate was supposed to be the new strategic direction (and so on and so forth) can stop worrying now. Oracle GoldenGate is Oracle Streams. The only part left are trail files and I wouldn't be surprised if that will eventually go away as well.
Oracle Streams have won the tech battle.
Extract can now be used in integrated capture mode with an Oracle database. Extract integrates with an Oracle database log mining server to receive change data from that server in the form of logical change records (LCR).All of that just rings too many bells so I've decided to find out what exactly have happened. This feature requires database patches to be installed (described in Note:1411356.1).
Stack dumps
Stack dump reveals a lot of interesting information already (I've left only relevant pieces in place):
... #10 0x00002b08f2ba21b7 in knxoutReceiveLCR () from /u01/app/oracle/ggs/libclntsh.so.11.1 #11 0x00002b08f2ae1048 in OCIXStreamOutLCRReceive () from /u01/app/oracle/ggs/libclntsh.so.11.1 #12 0x0000000000721a96 in IXAsyncReader::ProcessBatchNonCallbackArray() () #13 0x0000000000722dbc in IXAsyncReader::ReaderThread(void*) () #14 0x0000003ce8a0673d in start_thread () from /lib64/libpthread.so.0 ...The first point of entry into libclntsh.so.11.1 library is nothing else but OCIXStreamOutLCRReceive. This is a function to receive LCR (Logical Change Record) from an outbound XStream server. Let's confirm that we have an outbound server in our database:
SQL> select server_name, capture_name from dba_xstream_outbound; SERVER_NAME CAPTURE_NAME ------------------------------ ------------------------------ OGG$TEST_EXT OGG$CAP_TEST_EXTAn Oracle Streams Capture process itself:
SQL> select capture_name, rule_set_name, purpose 2 from dba_capture; CAPTURE_NAME RULE_SET_NAME PURPOSE ------------------------------ ------------------------------ ------------------- OGG$CAP_TEST_EXT OGG$TEST_EXT_CAPTURE_I GoldenGate CaptureAnd we can see all the familiar Oracle Streams rules:
SQL> select rsr.rule_name, r.rule_condition
from dba_rule_set_rules rsr, dba_rules r
where rsr.rule_set_owner = r.rule_owner
and rsr.rule_name = r.rule_name
and rsr.rule_set_name='OGG$TEST_EXT_CAPTURE_I'; 2 3 4 5
RULE_NAME RULE_CONDITION
--------------- ----------------------------------------------------------------------
TEST21 (((:ddl.get_source_database_name() = 'TEST' )) and (:ddl.get_source_da
tabase_name() != '$'))
GGS_TRACE20 (((:dml.get_object_owner() = 'GGEXT' and :dml.get_object_name() = 'GGS
_TRACE')) and :dml.get_source_database_name() = 'TEST' )
TEST19 ((:dml.get_object_owner() = 'TEST') and :dml.get_source_database_name(
) = 'TEST' )
These rules got generated based on the parameter file I have for my Extract.
XStream
Oracle GoldenGate Extract process became nothing else but an XStream client which receives LCRs from Streams Capture process and converts these to GoldenGate trail files. Anybody who was worried about Oracle Streams future because GoldenGate was supposed to be the new strategic direction (and so on and so forth) can stop worrying now. Oracle GoldenGate is Oracle Streams. The only part left are trail files and I wouldn't be surprised if that will eventually go away as well.
Oracle Streams have won the tech battle.
Subscribe to:
Posts (Atom)