Thursday, July 25, 2013
Enkitec E4 2013
Just a quick note that I'll be presenting at this year's Enkitec E4 conference. You can find the schedule here. I did some under-the-hood investigation into how the whole DBFS stack works from a performance perspective and, needless to say, some of the findings simply left me startled. If you want to see a session that will forever change the way you look at DBFS, this will definitely be the one to attend.
Tuesday, July 23, 2013
Oracle GoldenGate Integrated Capture #2
I have already written some words on the subject before. However, since then some interesting things have happened.
To recap (or save you some time if you don't want to read the original article) GoldenGate Integrated Capture is nothing else but Oracle Streams Capture in disguise. When running in the Integrated Capture mode pretty much the entire GoldenGate front-end gets yanked out of the picture and replaced with Oracle Streams Capture technology instead.
Let's take a closer look at some of the differences that happened between now and then. Here are two capture processes:
SQL> select capture_name, rule_set_name, purpose from dba_capture order by 1;

CAPTURE_NAME                   RULE_SET_NAME                  PURPOSE
------------------------------ ------------------------------ -------------------
OGG$CAP_GG1_EXT1               OGG$GG1_EXT1_CAPTURE_I         GoldenGate Capture
OGG$CAP_GG1_EXT2                                              GoldenGate Capture

OGG$CAP_GG1_EXT1 was created using GoldenGate 11.2.1.0.3 while OGG$CAP_GG1_EXT2 was created using 11.2.1.0.6BP3. Notice how RULE_SET_NAME is empty for the second one? Without a rule set the Capture process will capture everything that happens in the database, so my first concern was whether this would negatively affect performance.
Indeed, if we take a look at the stats for both processes...
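The output below came from v$streams_capture; presumably a query along these lines produced it (a sketch, since the exact statement isn't shown in the post):

select sid, capture_name, startup_time,
       total_prefilter_discarded, total_messages_enqueued
from v$streams_capture
order by capture_name;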
SID CAPTURE_NAME STARTUP_TIME TOTAL_PREFILTER_DISCARDED TOTAL_MESSAGES_ENQUEUED
---------- ---------------- --------------------- ------------------------- -----------------------
33 OGG$CAP_GG1_EXT1 23/07/2013 2:21:56 PM 47 71
147 OGG$CAP_GG1_EXT2 23/07/2013 2:21:56 PM 28 2589
...notice the big difference in TOTAL_MESSAGES_ENQUEUED. Both processes were started at exactly the same time and are capturing exactly the same objects. This difference makes sense -- without a rule set in place the second process will enqueue every change it sees. By itself this change would be pretty bad, but something else has happened as well:
SQL> select capture_name, v.program
  2  from v$streams_capture c, v$session v
  3  where c.SID = v.SID;

CAPTURE_NAME                   PROGRAM
------------------------------ ------------------------------------------------
OGG$CAP_GG1_EXT1               oracle@gg1.quadro.com (CP01)
OGG$CAP_GG1_EXT2               extract@gg1.quadro.com (TNS V1-V3)

Notice how OGG$CAP_GG1_EXT1 is an Oracle shadow process (a Streams Capture process) while OGG$CAP_GG1_EXT2 is the Extract process itself! What we had before was the Extract process acting as an XStream client to the Streams Capture process. With this change that is no longer required, and we can see that the XStream client is now detached from the server:
SQL> select server_name, capture_name, status from dba_xstream_outbound order by 1;

SERVER_NAME                    CAPTURE_NAME                   STATUS
------------------------------ ------------------------------ --------
OGG$GG1_EXT1                   OGG$CAP_GG1_EXT1               ATTACHED
OGG$GG1_EXT2                   OGG$CAP_GG1_EXT2               DETACHED

That clears up the concern about the empty rule set -- the Extract can now filter the records internally because nothing needs to be shipped out anymore, whereas before it would have resulted in massive traffic between the Capture process, the buffered queue and the XStream client.
Wednesday, March 13, 2013
Parallel unfriendly
Take a look at the following Parallel section of a SQL Monitor report:
Any query which produces such a report won't care how much parallelism you throw at it, because virtually all of the time is spent by the query coordinator (which is a serial process) being busy.
In this case the query in question is quite simple:
select /*+ parallel(t,8) */ median(basket_amount) from whs.fact_sale t

The reason it behaves the way it does has everything to do with how Oracle executes it:
Execution Plan
----------------------------------------------------------
Plan hash value: 712547042

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |     1 |     4 |   110K  (3)| 00:00:03 |       |       |        |      |            |
|   1 |  SORT GROUP BY                 |           |     1 |     4 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR               |           |       |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10000  |   724M|  2763M|   110K  (3)| 00:00:03 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR          |           |   724M|  2763M|   110K  (3)| 00:00:03 |     1 |1048575|  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS STORAGE FULL | FACT_SALE |   724M|  2763M|   110K  (3)| 00:00:03 |     1 |1048575|  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------------

Each parallel query slave gets its own chunk of the table to read and then simply sends the data back to the coordinator. The coordinator then has to deal with all of this data by sorting more than 700M rows which, of course, won't be particularly fast. In this sense median poses an interesting problem, since Oracle can't calculate (or, rather, discover) it without access to the entire data set, and the query coordinator is the only process which can do that.
So what do you do when you get impacted by a particular choice of algorithm implemented by Oracle? One way to deal with it is to see whether you can trade one set of problems for another, in case the alternative can be executed in a better way. In this particular case the fact table contains the sale transactions for a store chain. While there are many different ways to spend money, the number of distinct spending amounts should be relatively low compared to the number of rows in the table, and in such a case we can calculate the median in a different way.
What we can do instead is count how many occurrences of each spending we have and, when sorted by the spending amount, that will give us a compressed form of the raw data which still retains all the information required to find a median. Let's say you have a table with the following data:
SQL> select n from z_t;
N
----------
1
1
2
3
3
5
7
7 rows selected

The first step is to find how many occurrences of each value we have:
SQL> select n, count(*) cnt
2 from z_t
3 group by n;
         N        CNT
---------- ----------
         1          2
         2          1
         5          1
         3          2
         7          1

If the number of distinct values is relatively low, the group by will be able to collapse the result set enough to make the subsequent work insignificant, and to do it in a very parallel-friendly way. The next step is to calculate the total size of the data set and the positions at which each distinct value begins and ends:
SQL> select n, lag(running_sum, 1, 0) over (order by n) prev_running_sum, running_sum, total_row_count
2 from (
3 select n,
4 sum(cnt) over (order by n) running_sum,
5 sum(cnt) over () total_row_count
6 from (
7 select n, count(*) cnt
8 from z_t
9 group by n
10 ));
N PREV_RUNNING_SUM RUNNING_SUM TOTAL_ROW_COUNT
---------- ---------------- ----------- ---------------
1 0 2 7
2 2 3 7
3 3 5 7
5 5 6 7
7 6 7 7
So what the above tells us is that we have two 1s, followed by a single 2, followed by two 3s, and so on. Because we have seven elements in our data set, we know the median will be item number four, which we can now easily find:
SQL> select avg(n) from (
2 select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count
3 from (
4 select n,
5 sum(cnt) over (order by n) value_begin,
6 sum(cnt) over () total_row_count
7 from (
8 select n, count(*) cnt
9 from z_t
10 group by n
11 ))) where total_row_count/2 between prev_value_begin and value_begin;
AVG(N)
----------
         3

The avg is there for the case where we have an even number of elements, since the median then equals the mean of the two middle values.
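As a quick sanity check of the even-count case, here is a minimal sketch against a hypothetical table z_t2 (not part of the original test) holding 1, 1, 2, 3, 3, 5 -- the true median is (2+3)/2 = 2.5:

-- z_t2 is a hypothetical table with an even number of rows: 1,1,2,3,3,5
create table z_t2 (n number);

insert into z_t2
select column_value from table(sys.odcinumberlist(1, 1, 2, 3, 3, 5));
commit;

-- total_row_count/2 = 3 lands exactly on a boundary and matches two rows
-- (n=2 and n=3), so avg(n) returns (2+3)/2 = 2.5 -- the correct median
select avg(n) from (
  select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin,
         value_begin, total_row_count
  from (
    select n,
           sum(cnt) over (order by n) value_begin,
           sum(cnt) over () total_row_count
    from (
      select n, count(*) cnt
      from z_t2
      group by n
    ))) where total_row_count/2 between prev_value_begin and value_begin;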
Our new real query will look like this:
select avg(n) from (
  select n, lag(value_begin, 1, 0) over (order by n) prev_value_begin, value_begin, total_row_count
  from (
    select n,
           sum(cnt) over (order by n) value_begin,
           sum(cnt) over () total_row_count
    from (
      select /*+ parallel(t,8) */ basket_amount n, count(*) cnt
      from whs.fact_sale t
      group by basket_amount
    ))) where total_row_count/2 between prev_value_begin and value_begin;

So what are a group by and a set of analytic functions able to bring to the table? Let's take a look:
The total execution time has dropped from almost 26 minutes down to 28 seconds. Moreover, the workload is now much more skewed towards the parallel query slaves, which is exactly what we want to see. Of course, the trick only works if the group by is able to collapse the data sufficiently.
Tuesday, February 26, 2013
In-memory PQ and physical reads
In my previous post I demonstrated how in-memory PQ can access a table directly from the buffer cache even when you're using manual DOP.
One interesting question, however, is what happens when a PQ slave needs to read some blocks from disk given that the object has qualified for in-memory (cached) access. Would the slave do it using direct or buffered I/O?
The answer becomes somewhat clear once you realize that in-memory PQ is enabled by simply utilizing buffered reads. Since direct path reads cannot take advantage of any blocks stored in the buffer cache (local or remote), doing direct path reads would defeat the whole point of in-memory PQ.
To demonstrate the point, here is an excerpt from a tkprof output for one of the parallel query slaves:
Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
0 0 0 SORT AGGREGATE (cr=0 pr=0 pw=0 time=0 us)
0 0 0 PX COORDINATOR (cr=0 pr=0 pw=0 time=0 us)
0 0 0 PX SEND QC (RANDOM) :TQ10000 (cr=0 pr=0 pw=0 time=0 us)
1 1 1 SORT AGGREGATE (cr=62838 pr=62500 pw=0 time=7492751 us)
57696 62500 72120 PX BLOCK ITERATOR (cr=62838 pr=62500 pw=0 time=24528381 us cost=18846 size=0 card=500000)
57696 62500 72120 TABLE ACCESS FULL Z_TEST (cr=62838 pr=62500 pw=0 time=23944184 us cost=18846 size=0 card=500000)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
PX Deq: Execution Msg 128 0.19 1.06
db file scattered read 4546 0.16 43.99
latch free 1 0.00 0.00
latch: cache buffers lru chain 31 0.01 0.03
latch: cache buffers chains 3 0.00 0.00
latch: object queue header operation 4 0.00 0.00
Note that the slave waited on the db file scattered read event, which is nothing else but buffered multiblock reads. If you were to run the same test on the Exadata platform you would see the cell multiblock physical read event instead, given that in-memory PQ did get utilized. There are a couple of consequences of this:
- There is no need to do an object-level checkpoint. This makes in-memory PQ somewhat more friendly to OLTP environments since it doesn't need to flush any dirty buffers to disk.
- If you are running on the Exadata platform, none of the offloading will happen, even if you have to read a significant portion of the table from disk.
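If you want to double-check which read path a given PX slave took without pulling a full trace, here is a quick sketch against v$session_event (the slave's SID is assumed to be known, e.g. picked from v$px_session):

-- which read path did the slave use? :px_slave_sid is a bind you supply
select event, total_waits, time_waited
from v$session_event
where sid = :px_slave_sid
  and event in ('db file scattered read', 'direct path read',
                'cell multiblock physical read', 'cell smart table scan');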
On another note, it looks like things have come full circle. Serial sessions are now able to utilize direct path reads while parallel query slaves are able to do buffered I/O.
All tests were done on 11.2.0.3 (both Exadata and non-Exadata).
Monday, February 18, 2013
Does in-memory PQ work with PARALLEL_DEGREE_POLICY=MANUAL?
In-memory parallel execution seems to be gaining popularity, especially among people running X2-8 and X3-8 Exadata systems or any other system that has large amounts of memory capable of caching lots of data.
Oracle documentation suggests that in order to utilize in-memory PQ, parallel_degree_policy needs to be set to auto.
_parallel_cluster_cache_policy
One of the parameters influenced by parallel_degree_policy is _parallel_cluster_cache_policy. When using Auto DOP _parallel_cluster_cache_policy will be set to cached. The question then becomes what happens if we set _parallel_cluster_cache_policy=cached while still keeping Manual DOP? Will the system use in-memory PQ?
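Checking what the parameter is currently set to is straightforward; a sketch of the usual x$ query (run as SYS):

-- current value of the hidden parameter (sketch; requires SYS)
select i.ksppinm name, v.ksppstvl value
from x$ksppi i, x$ksppcv v
where i.indx = v.indx
  and i.ksppinm = '_parallel_cluster_cache_policy';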
Test table
Below is a test table setup:
SQL> create table z_test tablespace data as
select level n, rpad('*', 4000, '*') v
from dual
connect by level <= 500000;
Table created.
SQL> alter table z_test add constraint pk_z_test primary key (n);
Table altered.
SQL> select bytes/power(1024,2) mb
from user_segments
where segment_name='Z_TEST';
MB
----------
3968
SQL> exec dbms_stats.gather_table_stats(user, 'z_test');
PL/SQL procedure successfully completed.
The instance is running with a 12G buffer cache so it'll have no problem fully caching the above table. All tests were done in my in-house test lab with Oracle 11.2.0.3.3 running inside a Linux VM.
Classic PQ #1
Without setting any additional parameters, PQ behaves the way it always did -- by utilizing direct path reads straight into the process' memory:
SQL> set timing on
SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:02.86
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
25 recursive calls
0 db block gets
500525 consistent gets
500000 physical reads
0 redo size
526 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
It took us about 2.86 seconds to full scan the table, which equals 1387MB/s of throughput (my test lab storage setup is described here). The above clearly shows that we had to do physical reads in order to access the entire table.
Caching the table
Of course, before testing in-memory PQ we need to make sure the entire table sits in the buffer cache. The easiest way to do that is to read the whole table through the buffer cache using an index:
SQL> select /*+ index(z_test,pk_z_test) */ v from z_test;
500000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 579016438
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 500K| 1907M| 501K (1)| 00:00:03 |
| 1 | TABLE ACCESS BY INDEX ROWID| Z_TEST | 500K| 1907M| 501K (1)| 00:00:03 |
| 2 | INDEX FULL SCAN | PK_Z_TEST | 500K| | 1052 (1)| 00:00:01 |
-----------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
534311 consistent gets
501105 physical reads
0 redo size
2021185355 bytes sent via SQL*Net to client
367187 bytes received via SQL*Net from client
33335 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
500000 rows processed
Let's check to make sure all table blocks are in the buffer cache:
SQL> set autot off
SQL> select count(*)
from v$bh
where objd= (select data_object_id from user_objects where object_name='Z_TEST')
and status='xcur';
COUNT(*)
----------
500001
Now we're good to go!
Classic PQ #2
Even with the table entirely cached we still read it using physical reads when utilizing classic PQ -- as it should be:
SQL> set autot trace exp stat
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:02.83
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
24 recursive calls
0 db block gets
500525 consistent gets
500000 physical reads
0 redo size
526 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
In-memory PQ
Let's flip the parameter responsible for in-memory PQ (while still keeping parallel_degree_policy=manual) and see what happens:
SQL> alter session set "_parallel_cluster_cache_policy"=cached;
Session altered.
Elapsed: 00:00:00.01
SQL> select /*+ parallel(8) full(z_test) */ count(*) from z_test;
Elapsed: 00:00:00.36
Execution Plan
----------------------------------------------------------
Plan hash value: 2128527892
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 18846 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | | | | | |
| 2 | PX COORDINATOR | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 1 | | | Q1,00 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | | | Q1,00 | PCWP | |
| 5 | PX BLOCK ITERATOR | | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWC | |
| 6 | TABLE ACCESS FULL| Z_TEST | 500K| 18846 (1)| 00:00:01 | Q1,00 | PCWP | |
--------------------------------------------------------------------------------------------------------
Note
-----
- Degree of Parallelism is 8 because of hint
Statistics
----------------------------------------------------------
24 recursive calls
0 db block gets
502709 consistent gets
0 physical reads
0 redo size
526 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
As you can see, the entire table got read from the buffer cache this time, and much faster! The fact that we did zero physical I/Os shows in-memory PQ kicking in.
Conclusion
It is possible to use in-memory PQ with Manual DOP by setting _parallel_cluster_cache_policy=cached. Of course, always consult with Oracle support before flipping any of the underscore parameters.
Monday, January 28, 2013
GoldenGate and transient PK updates
The problem of transient PK updates is well known and comes from the fact that pretty much every replication solution on the market applies changes using a row-by-row approach.
To quickly recap the problem, if you have a table like:
SQL> create table tpk (n number primary key);

Table created

SQL> insert into tpk values (1);

1 row inserted

SQL> insert into tpk values (2);

1 row inserted

SQL> commit;

Commit complete

Then executing the following statement...
update tpk set n=n+1

...will result in a transient PK problem since the replication solution will have to decompose it into the following two statements:
update tpk set n=2 where n=1;
update tpk set n=3 where n=2;

There are two immediate (and major) problems with the above statements. The first is that we can't execute the first statement without violating the primary key constraint. The second is that even if we somehow could execute the first statement, the second statement would update both rows since they now have the same value!
Oracle Streams historically dealt with that problem using an internal mechanism which you could leverage by executing a specially constructed LCR. Other (third-party) replication solutions were pretty much out of luck and had to rely on elaborate tricks to work around the problem.
This is all about to change.
The necessity to better integrate GoldenGate and bring its feature set up to par has driven quite a bit of exciting innovation. One of these innovations solves the transient PK problem not only for GoldenGate but for everybody else.
dbms_xstream_gg package
The above package has been available since at least 11.2.0.2 and has two procedures directly relevant to the problem described above: enable_tdup_workspace and disable_tdup_workspace. Here is a quick demonstration of how they work:
SQL> --this will result in PK violation
SQL> update tpk set n=2 where n=1;

update tpk set n=2 where n=1

ORA-00001: unique constraint (ROOT.SYS_C005031) violated

SQL> exec dbms_xstream_gg.enable_tdup_workspace;

PL/SQL procedure successfully completed

SQL> --this now works!
SQL> update tpk set n=2 where n=1;

1 row updated

SQL> update tpk set n=3 where n=2;

1 row updated

SQL> exec dbms_xstream_gg.disable_tdup_workspace;

PL/SQL procedure successfully completed

SQL> commit;

Commit complete

As you can see, the procedures allow us to avoid the classical transient PK problem! Indeed, that's what GoldenGate uses internally to avoid getting into trouble as well. The implementation seems to leverage the same delete+insert trick Oracle Streams used:
SQL> exec dbms_xstream_gg.enable_tdup_workspace;
PL/SQL procedure successfully completed
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
1 AAAO25AAFAAClWFAAA
2 AAAO25AAFAAClWFAAB
SQL> update tpk set n=2 where n=1;
1 row updated
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
2 AAAO25AAFAAClWFAAB
SQL> update tpk set n=3 where n=2;
1 row updated
SQL> select n, rowid from tpk;
N ROWID
---------- ------------------
2 AAAO25AAFAAClWFAAA
3 AAAO25AAFAAClWFAAB
Note how the row mysteriously disappears after the first update and then suddenly comes back after the second one.
I think anybody who has been into any sort of replication and its problems will find this to be one of the most significant new features made available. The only caveat is that the above package is not documented, so anyone thinking about leveraging it needs to carefully think about the way it behaves.
Thursday, December 20, 2012
Oracle GoldenGate Sequence Replication
When using Oracle GoldenGate sequence replication there are a number of issues you need to be aware of, especially if you replicate a lot of busy sequences.
The first issue is that GoldenGate sequence replication does not use bind variables. Let's execute the following statements on the source system:
SQL> create sequence rep1.s1 nocache;
Sequence created
SQL> select rep1.s1.nextval from dual;
NEXTVAL
----------
1
SQL> select rep1.s1.nextval from dual;
NEXTVAL
----------
2
GoldenGate uses a PL/SQL procedure called replicateSequence each time it needs to sync sequence values. The following calls will be made on the destination system as a result of the above statements:
BEGIN ggext .replicateSequence (TO_NUMBER(2), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;
BEGIN ggext .replicateSequence (TO_NUMBER(3), TO_NUMBER(20), TO_NUMBER(1), 'REP1', TO_NUMBER(0), 'S1', UPPER('ggrep'), TO_NUMBER(1), TO_NUMBER (0), ''); END;
The first parameter is the target sequence value (seq$.highwater) and it's the one causing most of the issues, especially if the sequence has been declared with a relatively low cache value (or nocache, as in my example). Every time a new sequence last value gets written into the source system's data dictionary, we get a hard parse on the destination!
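One rough way to gauge the resulting churn is to count how many one-off replicateSequence cursors are sitting in the shared pool on the target (a sketch; the literal in the filter simply matches the calls shown above):

-- each hard parse leaves another single-use cursor behind
select count(*) cursor_count, sum(sharable_mem) total_sharable_mem
from v$sql
where sql_text like '%replicateSequence%';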
When using higher cache values the problem of hard parses gets somewhat mitigated, but there is another issue. When replicating such a sequence GoldenGate follows these steps (a sketch of the equivalent SQL follows the list):
- Sets sequence to nocache
- Executes sequence.nextval until it reaches the target value
- Restores altered sequence properties
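Conceptually, that replication step amounts to something like the following on the target (a sketch of the observable behavior, not GoldenGate's actual internal code; the cache value of 20 is illustrative):

alter sequence rep1.s1 nocache;
-- repeated until the target sequence catches up with the source highwater:
select rep1.s1.nextval from dual;
-- then the original properties are put back:
alter sequence rep1.s1 cache 20;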
Saturday, October 20, 2012
ZFS Home Storage Network at 10GbE
About a year ago I decided to put all my data on a home-built ZFS storage server. The growing number of devices around my household prompted an easier and much faster way to share data. Since then the box has been happily serving both CIFS and iSCSI over a 1GbE network without any issues.
I was keen on upgrading to 10GbE for quite some time, as both my server and clients could easily saturate a 1GbE link when ZFS had all the required data in ARC. 32GB of RAM in my storage server usually left me with an ARC of about 20GB, which in most cases happened to be enough to cache the entire working set I was dealing with. Bottom line: the box rarely had to touch the disks, and even when it did there was a 120GB L2ARC SSD to even out the bumps, capable of maxing out a 1GbE link as well.
It so happened that I managed to get my hands on a pair of 10GBASE-T Emulex OCe11102-NT NICs at a significant discount. With 10GBASE-T switches still costing upwards of several thousand dollars (even used), I decided to just get a pair of CAT6 crossover cables running from the ZFS storage box to my workstation and do some tests to see what this configuration is capable of.
Storage Server
My storage server is running Solaris 11 and the storage pool is built using 4x3TB Hitachi 5400RPM drives in RAID10 (ZFS mirror). The box has 32GB RAM and 120GB Vertex 3 MAX IOPS SSD for L2ARC. As mentioned above, the cache subsystem is enough to keep the box from hitting the disks most of the time. All that is driven by Intel Core i7-3770 CPU (Ivy Bridge).
iSCSI network
I've decided to dedicate the 10GbE adapters to the iSCSI network I have between the storage box and my workstation. First of all, this is where I need all the speed I can get. Secondly, I can utilize both ports with iSCSI MPIO, thus achieving 20Gb of available bandwidth. This is probably total overkill, but since my cards are dual-ported I may as well use both ports as all I need is an extra CAT6 cable. The network uses 9K jumbo frames. The ZFS volume is using a 4K block size to match the NTFS file system cluster size built on top of the iSCSI volume. COMSTAR is used as the iSCSI target server with the Microsoft iSCSI Initiator used on the client.
Test Results - IOPS
I'll start with the IOPS results: 100% random read access over 20GB of data using Iometer at different block sizes and worker counts. Each worker was set to do 16 outstanding I/Os.
With 4K blocks the system is able to achieve a quite impressive 226K IOPS! The storage server CPU is simply running flat out at this point, so I'm confident there is more available from the network. At 16K blocks the system is pushing over 1.5GB/s of random I/O, which is equally impressive and clearly goes beyond what a single 10GbE link is capable of, so the second link is certainly being put to good use.
Test Results - Bandwidth
For the bandwidth test I just set Iometer to do 1MB sequential reads with 16 outstanding I/Os per worker.
Even with a single worker the system can push 2085MB/s across the wire, which is getting quite close to the maximum practical speed you can get out of 2x10GbE NICs, so I'm quite happy with this result!
Conclusion
I'll be doing more testing in the upcoming days, but so far it appears that the upgrade was totally worth it. Having a home system capable of pushing 226K IOPS and 2GB/s of bandwidth is an impressive show of how far you can push a system consisting mostly of consumer-grade components. Keep in mind that the only way I could get the above numbers was by making sure all the data is available in the ZFS ARC, which was the initial goal of my setup.
Thursday, July 26, 2012
Exporting DBFS via NFS
Anybody who was thinking about exporting DBFS via NFS has probably stumbled upon the fact that Oracle says it cannot be done:
DBFS does not support exporting NFS or SAMBA exports

What's wrong with DBFS?
There is nothing wrong with DBFS itself. The problem originated from the fact that FUSE did not have the proper interfaces implemented to support exporting by the kernel. Newer versions of the Linux kernel fully support exporting. I know that OEL 6.x works for sure as I did the DBFS exports myself through both NFS and Samba. The common minimum kernel version circulating across the internet seems to be 2.6.27, but I haven't had a chance to check whether that's true.
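For the record, on a kernel that supports it the export itself is unremarkable; a minimal sketch, assuming DBFS is mounted at /dbfs (the path and options are illustrative, not from the original post):

# /etc/exports -- illustrative; FUSE filesystems generally need an explicit
# fsid, and the DBFS mount must be accessible to other users (allow_other)
/dbfs  *(rw,sync,no_subtree_check,fsid=1)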
Older Kernels
The fact of the matter is, it was always possible to export FUSE via NFS -- you just had to use a user-mode NFS server like UNFS3. I did that too and know that it works regardless of the kernel version you're running. Unfortunately, projects like UNFS3 are lagging in development as well as feature-wise. But they do work if you have to have NFS exports and are stuck with older Linux kernels.
Tuesday, July 24, 2012
Oracle GoldenGate Integrated Capture
Oracle GoldenGate 11.2 release notes contain an interesting new feature:
Extract can now be used in integrated capture mode with an Oracle database. Extract integrates with an Oracle database log mining server to receive change data from that server in the form of logical change records (LCR).

All of that just rings too many bells, so I decided to find out what exactly has happened. This feature requires database patches to be installed (described in Note:1411356.1).
Stack dumps
A stack dump already reveals a lot of interesting information (I've left only the relevant pieces in place):
...
#10 0x00002b08f2ba21b7 in knxoutReceiveLCR () from /u01/app/oracle/ggs/libclntsh.so.11.1
#11 0x00002b08f2ae1048 in OCIXStreamOutLCRReceive () from /u01/app/oracle/ggs/libclntsh.so.11.1
#12 0x0000000000721a96 in IXAsyncReader::ProcessBatchNonCallbackArray() ()
#13 0x0000000000722dbc in IXAsyncReader::ReaderThread(void*) ()
#14 0x0000003ce8a0673d in start_thread () from /lib64/libpthread.so.0
...

The first point of entry into the libclntsh.so.11.1 library is nothing else but OCIXStreamOutLCRReceive -- a function to receive LCRs (Logical Change Records) from an outbound XStream server. Let's confirm that we have an outbound server in our database:
SQL> select server_name, capture_name from dba_xstream_outbound;

SERVER_NAME                    CAPTURE_NAME
------------------------------ ------------------------------
OGG$TEST_EXT                   OGG$CAP_TEST_EXT

And an Oracle Streams Capture process itself:
SQL> select capture_name, rule_set_name, purpose
  2  from dba_capture;

CAPTURE_NAME                   RULE_SET_NAME                  PURPOSE
------------------------------ ------------------------------ -------------------
OGG$CAP_TEST_EXT               OGG$TEST_EXT_CAPTURE_I         GoldenGate Capture

And we can see all the familiar Oracle Streams rules:
SQL> select rsr.rule_name, r.rule_condition
from dba_rule_set_rules rsr, dba_rules r
where rsr.rule_set_owner = r.rule_owner
and rsr.rule_name = r.rule_name
and rsr.rule_set_name='OGG$TEST_EXT_CAPTURE_I';
RULE_NAME RULE_CONDITION
--------------- ----------------------------------------------------------------------
TEST21 (((:ddl.get_source_database_name() = 'TEST' )) and (:ddl.get_source_da
tabase_name() != '$'))
GGS_TRACE20 (((:dml.get_object_owner() = 'GGEXT' and :dml.get_object_name() = 'GGS
_TRACE')) and :dml.get_source_database_name() = 'TEST' )
TEST19 ((:dml.get_object_owner() = 'TEST') and :dml.get_source_database_name(
) = 'TEST' )
These rules were generated based on the parameter file I have for my Extract.
XStream
The Oracle GoldenGate Extract process became nothing else but an XStream client which receives LCRs from a Streams Capture process and converts them to GoldenGate trail files. Anybody who was worried about Oracle Streams' future because GoldenGate was supposed to be the new strategic direction (and so on and so forth) can stop worrying now. Oracle GoldenGate is Oracle Streams. The only part left is the trail files, and I wouldn't be surprised if that eventually goes away as well.
Oracle Streams has won the tech battle.
Wednesday, July 18, 2012
V$CELL_CONFIG
If you've ever wondered how to convert the XML in v$cell_config on Exadata database servers into relational rows and columns, so you can get a nice view of the cell configuration without going onto the cell itself, here is a nice piece of SQL which does the job. I'm using celldisks as an example and selecting only a handful of the columns so the output fits nicely on the screen:
SQL> select cellname,
2 name,
3 deviceName,
4 diskType,
5 round(freeSpace/power(1024,3), 2) freeSpace,
6 round(disk_size/power(1024,3), 2) disk_size
7 from (
8 select cellname, XMLTYPE.createXML(confval) confval
9 from v$cell_config
10 where conftype='CELLDISKS'
11 and cellname='192.168.12.3'
12 ) v,
13 xmltable('/cli-output/celldisk' passing v.confval
14 columns
15 name varchar(15) path 'name',
16 creationtime varchar(25) path 'creationTime',
17 deviceName varchar(9) path 'deviceName',
18 devicePartition varchar2(10) path 'devicePartition',
19 diskType varchar2(9) path 'diskType',
20 errorCount number path 'errorCount',
21 freeSpace number path 'freeSpace',
22 id varchar2(50) path 'id',
23 interleaving varchar(10) path 'interleaving',
24 lun varchar2(5) path 'lun',
25 raidLevel number path 'raidLevel',
26 disk_size number path 'size',
27 status varchar2(10) path 'status'
28 );
CELLNAME NAME DEVICENAME DISKTYPE FREESPACE DISK_SIZE
--------------- --------------- ---------- --------- ---------- ----------
192.168.12.3 CD_00_enkcel01 /dev/sda HardDisk 0 1832.59
192.168.12.3 CD_01_enkcel01 /dev/sdb HardDisk 0 1832.59
192.168.12.3 CD_02_enkcel01 /dev/sdc HardDisk 0 1861.7
192.168.12.3 CD_04_enkcel01 /dev/sdl HardDisk 0 1861.7
192.168.12.3 CD_05_enkcel01 /dev/sde HardDisk 0 1861.7
192.168.12.3 CD_06_enkcel01 /dev/sdf HardDisk 0 1861.7
192.168.12.3 CD_07_enkcel01 /dev/sdg HardDisk 0 1861.7
192.168.12.3 CD_08_enkcel01 /dev/sdh HardDisk 0 1861.7
192.168.12.3 CD_09_enkcel01 /dev/sdi HardDisk 0 1861.7
192.168.12.3 CD_10_enkcel01 /dev/sdj HardDisk 0 1861.7
192.168.12.3 CD_11_enkcel01 /dev/sdk HardDisk 0 1861.7
192.168.12.3 FD_00_enkcel01 /dev/sds FlashDisk 0 22.88
192.168.12.3 FD_01_enkcel01 /dev/sdr FlashDisk 0 22.88
192.168.12.3 FD_02_enkcel01 /dev/sdt FlashDisk 0 22.88
192.168.12.3 FD_03_enkcel01 /dev/sdu FlashDisk 0 22.88
192.168.12.3 FD_04_enkcel01 /dev/sdaa FlashDisk 0 22.88
192.168.12.3 FD_05_enkcel01 /dev/sdz FlashDisk 0 22.88
192.168.12.3 FD_06_enkcel01 /dev/sdab FlashDisk 0 22.88
192.168.12.3 FD_07_enkcel01 /dev/sdac FlashDisk 0 22.88
192.168.12.3 FD_08_enkcel01 /dev/sdn FlashDisk 0 22.88
192.168.12.3 FD_09_enkcel01 /dev/sdo FlashDisk 0 22.88
192.168.12.3 FD_10_enkcel01 /dev/sdp FlashDisk 0 22.88
192.168.12.3 FD_11_enkcel01 /dev/sdq FlashDisk 0 22.88
192.168.12.3 FD_12_enkcel01 /dev/sdv FlashDisk 0 22.88
192.168.12.3 FD_13_enkcel01 /dev/sdw FlashDisk 0 22.88
192.168.12.3 FD_14_enkcel01 /dev/sdx FlashDisk 0 22.88
192.168.12.3 FD_15_enkcel01 /dev/sdy FlashDisk 0 22.88
27 rows selected
Of course, the same approach can be applied to get other information out (like Grid Disks, etc.).
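For example, a grid disk variant might look like this (a sketch: the griddisk element and attribute names are assumptions modeled on the celldisk output above, so adjust the XPath to taste):

select cellname, name, asmDiskGroupName,
       round(disk_size/power(1024,3), 2) gb
from (
  select cellname, xmltype.createxml(confval) confval
  from v$cell_config
  where conftype = 'GRIDDISKS'
) v,
xmltable('/cli-output/griddisk' passing v.confval
  columns
    name             varchar2(30) path 'name',
    asmDiskGroupName varchar2(30) path 'asmDiskGroupName',
    disk_size        number       path 'size'
);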
Tuesday, July 17, 2012
Displaying ASM Partner Disks
Here is a quick piece of SQL I sometimes use to show people the disks and their respective partners (grouped by failure group) inside an ASM disk group, whenever I need to explain either of these concepts.
An example output from a quarter rack Exadata with a normal redundancy disk group:
SQL> column p format a80
SQL> variable group_number number
SQL> exec :group_number := 1;
PL/SQL procedure successfully completed.
SQL> select d||' => '||listagg(p, ',') within group (order by p) p
from (
select ad1.failgroup||'('||to_char(ad1.disk_number, 'fm000')||')' d,
ad2.failgroup||'('||listagg(to_char(p.number_kfdpartner, 'fm000'), ',') within group (order by ad1.disk_number)||')' p
from v$asm_disk ad1, x$kfdpartner p, v$asm_disk ad2
where ad1.disk_number = p.disk
and p.number_kfdpartner=ad2.disk_number
and ad1.group_number = p.grp
and ad2.group_number = p.grp
and p.grp = :group_number
group by ad1.failgroup, ad1.disk_number, ad2.failgroup
) group by d
order by d;
P
--------------------------------------------------------------------------------
EXAPCEL01(12) => EXAPCEL02(26,30,31,35),EXAPCEL03(06,08,09,11)
EXAPCEL01(13) => EXAPCEL02(30,31,34,35),EXAPCEL03(03,07,09,11)
EXAPCEL01(14) => EXAPCEL02(24,28,29,34),EXAPCEL03(05,09,10,11)
EXAPCEL01(15) => EXAPCEL02(28,29,33,34),EXAPCEL03(04,05,08,11)
EXAPCEL01(16) => EXAPCEL02(26,27,33,35),EXAPCEL03(02,06,07,09)
EXAPCEL01(17) => EXAPCEL02(25,32,33,35),EXAPCEL03(00,01,03,07)
EXAPCEL01(18) => EXAPCEL02(24,27,32,34),EXAPCEL03(00,03,08,10)
EXAPCEL01(19) => EXAPCEL02(25,26,32,33),EXAPCEL03(01,02,07,10)
EXAPCEL01(20) => EXAPCEL02(28,30,31,32),EXAPCEL03(04,05,06,10)
EXAPCEL01(21) => EXAPCEL02(24,27,29,31),EXAPCEL03(00,01,04,08)
EXAPCEL01(22) => EXAPCEL02(25,26,27,30),EXAPCEL03(02,03,04,06)
EXAPCEL01(23) => EXAPCEL02(24,25,28,29),EXAPCEL03(00,01,02,05)
EXAPCEL02(24) => EXAPCEL01(14,18,21,23),EXAPCEL03(05,07,09,11)
EXAPCEL02(25) => EXAPCEL01(17,19,22,23),EXAPCEL03(02,06,09,11)
EXAPCEL02(26) => EXAPCEL01(12,16,19,22),EXAPCEL03(04,08,10,11)
EXAPCEL02(27) => EXAPCEL01(16,18,21,22),EXAPCEL03(01,04,08,11)
EXAPCEL02(28) => EXAPCEL01(14,15,20,23),EXAPCEL03(00,06,08,09)
EXAPCEL02(29) => EXAPCEL01(14,15,21,23),EXAPCEL03(01,02,03,05)
EXAPCEL02(30) => EXAPCEL01(12,13,20,22),EXAPCEL03(04,05,08,10)
EXAPCEL02(31) => EXAPCEL01(12,13,20,21),EXAPCEL03(00,01,07,10)
EXAPCEL02(32) => EXAPCEL01(17,18,19,20),EXAPCEL03(00,06,07,10)
EXAPCEL02(33) => EXAPCEL01(15,16,17,19),EXAPCEL03(02,03,05,09)
EXAPCEL02(34) => EXAPCEL01(13,14,15,18),EXAPCEL03(02,03,04,07)
EXAPCEL02(35) => EXAPCEL01(12,13,16,17),EXAPCEL03(00,01,03,06)
EXAPCEL03(00) => EXAPCEL01(17,18,21,23),EXAPCEL02(28,31,32,35)
EXAPCEL03(01) => EXAPCEL01(17,19,21,23),EXAPCEL02(27,29,31,35)
EXAPCEL03(02) => EXAPCEL01(16,19,22,23),EXAPCEL02(25,29,33,34)
EXAPCEL03(03) => EXAPCEL01(13,17,18,22),EXAPCEL02(29,33,34,35)
EXAPCEL03(04) => EXAPCEL01(15,20,21,22),EXAPCEL02(26,27,30,34)
EXAPCEL03(05) => EXAPCEL01(14,15,20,23),EXAPCEL02(24,29,30,33)
EXAPCEL03(06) => EXAPCEL01(12,16,20,22),EXAPCEL02(25,28,32,35)
EXAPCEL03(07) => EXAPCEL01(13,16,17,19),EXAPCEL02(24,31,32,34)
EXAPCEL03(08) => EXAPCEL01(12,15,18,21),EXAPCEL02(26,27,28,30)
EXAPCEL03(09) => EXAPCEL01(12,13,14,16),EXAPCEL02(24,25,28,33)
EXAPCEL03(10) => EXAPCEL01(14,18,19,20),EXAPCEL02(26,30,31,32)
EXAPCEL03(11) => EXAPCEL01(12,13,14,15),EXAPCEL02(24,25,26,27)
Friday, July 06, 2012
ASM normal redundancy with high protection template
One of the ways you can control how many mirror copies ASM keeps on disk is via diskgroup templates.
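The templates themselves (including the ones you add) can be inspected in v$asm_template; a quick sketch:

-- redundancy per template for disk group 1 (adjust group_number as needed)
select name, redundancy, stripe
from v$asm_template
where group_number = 1
order by name;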
This presents an interesting question -- can you turn your normal redundancy disk group into a high redundancy by using a diskgroup template with high protection attribute thus telling ASM to use triple mirroring for related files?
I'll start by creating a diskgroup:
SQL> create diskgroup data normal redundancy
  2  disk '/dev/sdb1', '/dev/sdc1', '/dev/sdd1';

Diskgroup created.

I'm not explicitly specifying any failgroups, so each disk will end up in its own FG and triple-mirroring will spread extents across all three disks. Let's add a high protection template:
SQL> alter diskgroup data add template mirror_high attributes (high coarse);

Diskgroup altered.

Now let's create two tablespaces, one using the default mirroring template (two-way) and one using the mirror_high (three-way) template:
SQL> create tablespace mirror_normal datafile '+data' size 256m;

Tablespace created.

SQL> create tablespace mirror_high datafile '+data(mirror_high)' size 256m;

Tablespace created.

We can confirm how many mirror extents our datafiles have:
SQL> select a.name,
case LXN_KFFXP
when 0 then 'primary'
when 1 then 'secondary'
when 2 then 'third'
end, count(*)
from X$KFFXP, v$asm_alias a
where a.file_number=NUMBER_KFFXP
group by a.name, LXN_KFFXP
order by 1, 2;
NAME CASELXN_K COUNT(*)
------------------------------ --------- ----------
MIRROR_HIGH.257.787940101 primary 258
MIRROR_HIGH.257.787940101 secondary 258
MIRROR_HIGH.257.787940101 third 258
MIRROR_NORMAL.256.787940093 primary 258
MIRROR_NORMAL.256.787940093 secondary 258
MIRROR_NORMAL.256.787940093 third 1
6 rows selected.
Indeed, all extents got triple mirrored for MIRROR_HIGH tablespace and only double mirrored for MIRROR_NORMAL (with the exception of one extent which probably holds some triple-mirrored metadata).
Now, because all the data essentially got triple mirrored for our MIRROR_HIGH tablespace, it ought to survive a two-disk failure, right? Let's give it a try:
SQL> alter diskgroup data dismount;

Diskgroup altered.

... then I'm going to change the permissions and make two of the disks inaccessible to ASM:
[oracle@ora11gr2 ~]$ chmod 000 /dev/sdc1
[oracle@ora11gr2 ~]$ chmod 000 /dev/sdd1

Let's see if I can mount it back:
SQL> alter diskgroup data mount force;
alter diskgroup data mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

...looks like the answer is no. ASM still sees that the remaining disk is missing its required partners and does not allow the disk group to be mounted. How many extent copies the datafile had becomes a moot point.
The conclusion is that triple-mirroring inside a normal redundancy disk group can help you in certain cases (like physical corruption of both the primary and secondary extents, where ASM can use the third copy to read the data) but it's not a substitute for a high redundancy disk group.
Update a little bit later: if anyone is thinking about doing it the other way around, remember that the mirroring attribute works only for normal redundancy disk groups.
Thursday, July 05, 2012
Serial direct path reads in 11GR2 and Exadata environments
Serial direct path reads were first introduced in Oracle 11G and were noticed by many customers (in both positive and negative ways) who upgraded from earlier releases. Back then I did a quick write-up on the subject to try to tackle some of the key variables which affect the behavior.
To recap, the following were observed for 11G:
- Serial direct path reads start at _small_table_threshold*5.
- Serial direct path reads stop when 50% of the table blocks are cached.
- Serial direct path reads stop when 25% of the table blocks are dirty.
To refresh the memory, here is how the methodology works (all code is available in the 11G post).
Start threshold
The test grows a segment in a loop, performing a full table scan on every iteration and using the physical reads direct statistic as an indicator. The buffer cache is flushed on every iteration to make sure cached blocks do not affect the numbers.
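A condensed sketch of that loop (t and t_src are hypothetical tables; the full harness lives in the linked 11G post):

-- grow the segment until 'physical reads direct' starts moving
declare
  l_before number;
  l_after  number;
  l_cnt    number;
begin
  for i in 1 .. 100 loop
    insert /*+ append */ into t select * from t_src;  -- grow the segment
    commit;
    execute immediate 'alter system flush buffer_cache';
    select ms.value into l_before
      from v$mystat ms, v$statname sn
     where ms.statistic# = sn.statistic#
       and sn.name = 'physical reads direct';
    select count(*) into l_cnt from t;                -- full table scan
    select ms.value into l_after
      from v$mystat ms, v$statname sn
     where ms.statistic# = sn.statistic#
       and sn.name = 'physical reads direct';
    if l_after > l_before then
      dbms_output.put_line('serial direct path reads started at iteration '||i);
      exit;
    end if;
  end loop;
end;
/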
Stop cached threshold
A loop is used to cache more and more blocks from the table, each time followed by a full table scan. Again, the physical reads direct statistic is used to see when serial direct path reads stop happening. The buffer cache is flushed at the beginning of the test.
Stop dirty threshold
Similar to the above, but the blocks are updated instead of just being cached.
Results
| | Non-Exadata | Exadata |
|---|---|---|
| Start threshold | _small_table_threshold | _small_table_threshold |
| Stop cached | 50% | 50% |
| Stop dirty | 25% | 25% |
Summary
The 11GR2 threshold is indeed more aggressive and starts at _small_table_threshold (i.e. five times lower than in 11G). There seems to be no change to the rest of the numbers, and the behavior appears to be the same on both Exadata and non-Exadata systems.
The bottom line is that 11GR2 starts doing serial direct path reads earlier and stops doing them at the same points as 11G did.
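If you want to see where the threshold sits on your own system, the usual x$ query does the job -- a sketch against undocumented views, so run it as SYS on a sandbox:

SQL> select n.ksppinm parameter, v.ksppstvl value
  2    from x$ksppi n, x$ksppsv v
  3   where n.indx = v.indx
  4     and n.ksppinm = '_small_table_threshold';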
All testing was done on 11.2.0.3.
Tuesday, June 19, 2012
Online Redefinition and Reference Partitioning
Reference partitioning combined with dbms_redefinition can result in a perpetually disabled foreign key constraint (I'm using 11.2.0.3 as an example):
SQL> create table p (
  2  n number primary key
  3  ) partition by list (n)
  4  (
  5  partition p1 values (1)
  6  );

Table created

SQL> create table r1 (n number not null);

Table created

SQL> create table r2 (
  2  n number not null,
  3  constraint fk_r$n foreign key (n) references p (n)
  4  ) partition by reference (fk_r$n);

Table created

SQL> exec dbms_redefinition.start_redef_table(user, 'R1', 'R2', options_flag => dbms_redefinition.cons_use_rowid);

PL/SQL procedure successfully completed

SQL> exec dbms_redefinition.finish_redef_table(user, 'R1', 'R2');

PL/SQL procedure successfully completed

SQL> select status, validated from user_constraints where constraint_name='FK_R$N';

STATUS   VALIDATED
-------- -------------
DISABLED NOT VALIDATED

The annoying part comes when you realize that you can't do anything about it (at least not by normal means):
SQL> alter table r1 enable constraint FK_R$N;
alter table r1 enable constraint FK_R$N
ORA-14650: operation not supported for reference-partitioned tables

SQL> alter table r1 enable novalidate constraint FK_R$N;
alter table r1 enable novalidate constraint FK_R$N
ORA-14650: operation not supported for reference-partitioned tables

Surprisingly enough, the constraint actually works as far as reference partitioning is concerned. However, it results in an "unclean" status which can't be fixed! Of course the view just shows us what we have in cdef$:
SQL> select con#, enabled, defer
2 from sys.cdef$
3 where con# =
4 (select con#
5 from sys.con$
6 where owner#=(select user# from sys.user$ where name=user)
7 and name= 'FK_R$N'
8 );
CON# ENABLED DEFER
---------- ---------- ----------
10997 512
ENABLED set to NULL will decode to DISABLED while the following results in NOT VALIDATED:
SQL> select decode(bitand(512, 4), 4, 'VALIDATED', 'NOT VALIDATED') VALIDATED from dual;

VALIDATED
-------------
NOT VALIDATED

So what we really need is ENABLED set to 1 and DEFER set to 516. Of course, playing with the data dictionary in such a way is a big no-no on a production system without the blessing of Oracle support, but since I'm on my sandbox database I'll give it a shot:
SQL> update cdef$ set enabled=1, defer=516 where con#=10997;

1 row updated.

SQL> commit;

Commit complete.

SQL> startup force
ORACLE instance started.

Total System Global Area  400846848 bytes
Fixed Size                  2228784 bytes
Variable Size             146804176 bytes
Database Buffers          243269632 bytes
Redo Buffers                8544256 bytes
Database mounted.
Database opened.

SQL> select status, validated from user_constraints where constraint_name='FK_R$N';

STATUS   VALIDATED
-------- -------------
ENABLED  VALIDATED

Looks like it worked :) Note that there is practically no way for a reference partitioning constraint to be either disabled or not validated -- the rows have to go somewhere, after all, so we don't introduce any logical inconsistencies by setting cdef$ values to what they really should be.
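As a side note, a quick way to convince yourself that the constraint is enforced regardless of the dictionary status is to try inserting an orphan row -- I would expect the insert below to be rejected (with ORA-02291, parent key not found), since the value 2 is a made-up orphan with no partition to go to:

SQL> -- no parent row with n=2 exists in p, so this should fail
SQL> insert into r2 values (2);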
Footnote:
The bug number for this one is 13572659.
Wednesday, May 02, 2012
Deferred Segment Creation Quirks
If you have a habit, as I do, of quickly checking dba_segments to make sure there is nothing important in a tablespace you are about to drop, then deferred segment creation can present some surprises.
Consider the following:
SQL> create tablespace test;

Tablespace created

SQL> create table t (n number) tablespace test;

Table created

SQL> select segment_created from user_tables where table_name='T';

SEGMENT_CREATED
---------------
NO

SQL> select * from dba_segments where tablespace_name='TEST';

no rows selected

I can now go ahead and drop the tablespace. Since there isn't really any segment Oracle allows me to do that without issuing any warnings:
SQL> drop tablespace test;

Tablespace dropped

The weird part comes when you want to do something with that table:
SQL> insert into t values (1);
insert into t values (1)
ORA-00959: tablespace 'TEST' does not exist

SQL> alter table t move tablespace users;
alter table t move tablespace users
ORA-00959: tablespace 'TEST' does not exist

SQL> drop table t;
drop table t
ORA-00959: tablespace 'TEST' does not exist

The most annoying part is that you can't move or even drop the table, and I see no real reason for such behavior. After all, if Oracle allows you to drop the tablespace pretending that nothing is there, it should allow you to drop or change storage parameters for a non-existent segment as well. Otherwise it looks like the "front end" got updated to support deferred segment creation while the "back end" still operates the legacy way. Thankfully there is an easy workaround -- just create a shell tablespace with the required name and then move the table:
SQL> create tablespace test;

Tablespace created

SQL> alter table t move tablespace users;

Table altered
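If you'd rather sidestep the whole issue, deferred segment creation can be turned off up front -- both of these are standard 11GR2 options (the table name t2 is made up):

SQL> -- session-level (also settable system-wide via the init parameter)
SQL> alter session set deferred_segment_creation=false;

SQL> -- or per-table
SQL> create table t2 (n number) segment creation immediate tablespace test;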
Tuesday, May 01, 2012
Result Cache Latch in 11GR2: Shared Mode Gets
Almost two years ago I wrote a post about 11GR2 Result Cache Scalability. In the post I noted that there had been a significant improvement compared to 11GR1. What left me a little puzzled was where exactly the performance advantage came from. After all, it didn't look like the latch had been changed to allow for shared mode gets, as I still observed some sleeps during my tests. Back then I was quite happy with the results and made a note to revisit the subject and figure out exactly what happened, which then got onto my back burner and eventually slipped away.
Result Cache: RC Latch
We can begin by testing whether shared mode gets are allowed or not using an oradebug call:
SQL> select addr from v$latch where name='Result Cache: RC Latch';

ADDR
----------------
0000000060040B70

SQL> oradebug setmypid
Statement processed.
SQL> oradebug call kslgetsl_w 0x0000000060040B70 1 1 1 8
Function returned 1

The fourth argument (8) tells the function to acquire the latch in shared mode, and it does work! Now let's see what happens when another session wants to put something into the result cache:
SQL> create table t as select 1 n from dual;

Table created.

SQL> select /*+ result_cache */ * from t;
--session waits

As expected, the session waits. We do need to get the latch in exclusive mode in order to change the result cache memory contents so we can put something new in there. We can confirm that the session indeed waits on the Result Cache latch (which can be seen by the latch address):
SQL> select event, to_char(p1, 'xxxxxxxxxx') latch_addr, seconds_in_wait, state
  2  from v$session_wait
  3  where sid=63;

EVENT           LATCH_ADDR  SECONDS_IN_WAIT STATE
--------------- ----------- --------------- -------------------
latch free         60040b70              49 WAITING

Let's free the latch in our first session:
SQL> oradebug call kslfre 0x0000000060040B70
Function returned 0

Immediately our second session proceeds. Now it's time to do the most interesting part -- would the session block again if it only needs to read from the result cache?
Session 1:
SQL> oradebug call kslgetsl_w 0x0000000060040B70 1 1 1 8
Function returned 1

Session 2:
SQL> select /*+ result_cache */ * from t;
N
----------
1
There you have it -- the latch was indeed changed to allow shared mode gets! We can further verify the behavior by acquiring the Result Cache latch in exclusive mode in our first session:
SQL> oradebug call kslfre 0x0000000060040B70
Function returned 0
SQL> oradebug call kslgetsl_w 0x0000000060040B70 1 1 1 16
Function returned 1

The value of 16 in the fourth argument tells the function to acquire the latch in exclusive mode. Let's test the second session again:
SQL> select /*+ result_cache */ * from t;
--session waits

Well, you can't get a shared mode get while someone else holds the latch in exclusive mode. Again, freeing the latch in our first session allows the second session to proceed.
So it is confirmed -- in 11GR2 the Result Cache latch can be acquired in a shared mode!
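If you want to keep an eye on the latch while repeating these experiments, the standard v$latch counters are enough (keep in mind they are cumulative and instance-wide):

SQL> select gets, misses, sleeps
  2    from v$latch
  3   where name = 'Result Cache: RC Latch';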
Monday, April 30, 2012
ORA-09925: Unable to create audit trail file
Sometimes the issues you hit make no sense at all.
Consider the following when things are behaving normally:
[oracle@ora11gr2 ~]$ export ORACLE_SID=test
[oracle@ora11gr2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Apr 30 17:09:31 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> exit
Disconnected

I have no instance named test running on the box and get the expected behavior. Now let's continue:
[oracle@ora11gr2 ~]$ touch /tmp/test
[oracle@ora11gr2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Apr 30 17:10:54 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

ERROR:
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 9925
ORA-09925: Unable to create audit trail file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 9925

The only thing I changed was creating an empty file /tmp/test, and suddenly I'm getting "ORA-09925: Unable to create audit trail file" errors? What's going on? If we strace SQL*Plus we see the following:
3798 stat("/u01/app/oracle/admin/test/adump", 0x7fff1a8b5bd0) = -1 ENOENT (No such file or directory)
3798 access("/tmp/test", F_OK) = 0
3798 open("/u01/app/oracle/product/11.2.0/dbhome_2/rdbms/audit//tmp/test_ora_3798_1.aud", O_RDWR|O_CREAT|O_EXCL, 0660) = -1 ENOENT (No such file or directory)
The first call checks for the audit file directory in the default location. The second pair of calls, however, does something strange: it tests whether the file /tmp/test exists and, if it does, tries to open an audit trail file under a really weird path. But where does this path come from?
If I look at my environment variables:
[oracle@ora11gr2 ~]$ env | grep /tmp
ORACLE_PATH=/tmp

Apparently I have some strange ORACLE_PATH variable set. What if I unset it?
[oracle@ora11gr2 ~]$ unset ORACLE_PATH
[oracle@ora11gr2 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Mon Apr 30 17:22:27 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> exit
Disconnected

Strangely enough, everything works normally again! So what I discovered is the following:
- When you run "sqlplus / as sysdba", SQL*Plus checks whether you have the ORACLE_PATH environment variable set.
- If you do, it looks up a file named $ORACLE_PATH/$ORACLE_SID.
- If such a file exists you will get the above behavior.
Bug 13584524: STRANGE AUDIT DESTINATION IF ORACLE_PATH IS SET: ORA-09925

For now the bug appears to affect only 11.2.0.3, regardless of the platform.
Saturday, February 04, 2012
session_cached_cursors again
Not so long ago I had to troubleshoot a code performance regression between 9iR2 and 11gR2.
The issue was related to a business-critical stored procedure that, for some reason, was performing almost 50% slower in 11gR2 compared to 9iR2. When I took a look at the procedure I discovered that there was, strictly speaking, nothing that could regress. At least not in the common sense when we're talking about upgrades. All the procedure was doing was a lookup using a statement containing UNION ALL.
Below is an artificial test case which demonstrates the issue.
9iR2 results:
select * from dual where :"SYS_B_0"=:"SYS_B_1"
union all
select * from dual where :"SYS_B_2"=:"SYS_B_3"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse    99999      3.88       4.41          0          0          0          0
Execute  99999      1.23       1.31          0          0          0          0
Fetch        0      0.00       0.00          0          0          0          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total   199998      5.11       5.73          0          0          0          0
11gR2 results:
select * from dual where :"SYS_B_0"=:"SYS_B_1"
union all
select * from dual where :"SYS_B_2"=:"SYS_B_3"

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse    99999      6.14       7.21          0          0          0          0
Execute  99999      1.04       1.38          0          0          0          0
Fetch        0      0.00       0.00          0          0          0          0
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total   199998      7.19       8.59          0          0          0          0
As you can see from the above figures, 11gR2 spent significantly more time parsing. Given that both sessions had a 100% session cursor cache hit ratio (and yes, the procedure didn't make use of bind variables either) and were otherwise run in pretty much identical environments (apart from the Oracle version), such a difference in parse time is certainly unexpected.
It was time to gear up for a reproducible test case once again. Perhaps I was just lucky, but it boiled down to the session_cached_cursors setting again. To better demonstrate the point I'm going to run the following anonymous PL/SQL block in my 11gR2 database:
SQL> begin
  2    execute immediate 'alter session set cursor_sharing=force';
  3    for i in 1 .. 10
  4    loop
  5      execute immediate 'alter session set session_cached_cursors='||to_char(i*100);
  6      for j in 1 .. 10000
  7      loop
  8        execute immediate
  9          'select /* test_'||to_char(i*100)||' */ * from dual where '||to_char(j)||'='||to_char(j)||
 10          ' union all select * from dual where '||to_char(j)||'='||to_char(j);
 11      end loop;
 12    end loop;
 13  end;
 14  /

PL/SQL procedure successfully completed
What it does is set cursor_sharing=force and then execute the same statement with different settings for session_cached_cursors -- from 100 to 1000, in increments of 100. What do you think the results would be?
SQL> select to_number(substr(sql_text, instr(sql_text, 'test')+5, 4)) cursor_cache, executions, cpu_time
2 from v$sql
3 where sql_text like '%test%'
4 and executions=10000
5 order by 1;
CURSOR_CACHE EXECUTIONS CPU_TIME
------------ ---------- ----------
100 10000 571916
200 10000 517928
300 10000 559908
400 10000 572909
500 10000 561912
600 10000 616915
700 10000 652899
800 10000 670886
900 10000 671892
1000 10000 702893
10 rows selected

Surprisingly enough, raising session_cached_cursors from 100 to 1000 increases the CPU time required to execute the statements from 571916 to 702893 microseconds! You can also spot how the CPU time grows together with the session_cached_cursors setting.

What happened is that 9iR2 was running with session_cached_cursors=1000 and, apparently, it never had any negative effect... until the setting was transferred to 11gR2. It's somewhat hard to speculate on what exactly is happening, but based on my limited test cases it appears that you need (a) a UNION ALL in the statement, (b) cursor_sharing=force and (c) literals in your query. Under these circumstances the strange performance degradation appears in 11gR2. I need to do more investigation on that one; for now all I can say is that, after setting session_cached_cursors=100, 11gR2 exhibited the same results as 9iR2.
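For anyone reproducing this, a convenient way to confirm that the session cursor cache is actually engaged is to watch the relevant session statistics from the session doing the parsing -- a sketch:

SQL> select n.name, s.value
  2    from v$mystat s, v$statname n
  3   where s.statistic# = n.statistic#
  4     and n.name in ('parse count (total)',
  5                    'session cursor cache hits',
  6                    'session cursor cache count');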
***** UPDATE 07-Feb-2012 *****
It appears that only 11.2.0.1 and 11.2.0.2 are affected by this issue (see comments for more details).
***** UPDATE 07-Feb-2012 #2 *****
I've found the bug# for this one: "Bug 12345980 high parse time with cursor_sharing=force when session_cached_cursors set".