Saturday, February 04, 2012

session_cached_cursors again

Not so long ago I had to troubleshoot a code performance regression between 9iR2 and 11gR2.

The issue had been related to a business-critical stored procedure that, for some reason, was performing almost 50% slower when run in 11gR2 compared to 9iR2. When I took a look at the procedure I discovered that there was, strictly speaking, nothing that could regress. At least not in the common sense when we're talking about upgrades. All the procedure was doing was a lookup using a statement containing UNION ALL.

Below is the artificially created test case which demonstrates the issue.

9iR2 results:

select *
from
 dual where :"SYS_B_0"=:"SYS_B_1" union all select * from dual where
  :"SYS_B_2"=:"SYS_B_3"


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    99999      3.88       4.41          0          0          0           0
Execute  99999      1.23       1.31          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   199998      5.11       5.73          0          0          0           0

11gR2 results:

select *
from
 dual where :"SYS_B_0"=:"SYS_B_1" union all select * from dual where
  :"SYS_B_2"=:"SYS_B_3"


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    99999      6.14       7.21          0          0          0           0
Execute  99999      1.04       1.38          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   199998      7.19       8.59          0          0          0           0

As you can see from the above figures, 11gR2 spent significantly more time parsing. Given that both sessions had a 100% session cursor cache hit ratio (and yes, the procedure didn't make use of bind variables either) and were otherwise run in pretty much identical environments (apart from the Oracle version), such a difference in parse time is certainly unexpected.

It was another time to gear up for a reproducible test case. Perhaps I was just lucky, but it boiled down to the session_cached_cursors setting again. To better demonstrate the point I'm going to run the following anonymous PL/SQL block in my 11gR2 database:

SQL> begin
  2   execute immediate 'alter session set cursor_sharing=force';
  3   for i in 1 .. 10
  4   loop
  5    execute immediate 'alter session set session_cached_cursors='||to_char(i*100);
  6    for j in 1 .. 10000
  7    loop
  8     execute immediate
  9      'select /* test_'||to_char(i*100)||' */ * from dual where '||to_char(j)||'='||to_char(j)||
 10      ' union all select * from dual where '||to_char(j)||'='||to_char(j);
 11    end loop;
 12   end loop;
 13  end;
 14  /
 
PL/SQL procedure successfully completed

What it does is set cursor_sharing=force and then execute the same statement with different settings for session_cached_cursors -- from 100 to 1000 (in increments of 100). What do you think the results would be?

SQL> select to_number(substr(sql_text, instr(sql_text, 'test')+5, 4)) cursor_cache, executions, cpu_time
  2   from v$sql
  3   where sql_text like '%test%'
  4    and executions=10000
  5   order by 1;
 
CURSOR_CACHE EXECUTIONS   CPU_TIME
------------ ---------- ----------
         100      10000     571916
         200      10000     517928
         300      10000     559908
         400      10000     572909
         500      10000     561912
         600      10000     616915
         700      10000     652899
         800      10000     670886
         900      10000     671892
        1000      10000     702893
 
10 rows selected

Surprisingly enough, raising session_cached_cursors from 100 to 1000 increases the CPU time required to execute the statements from 571916 to 702893! You can also spot how the CPU time grows steadily along with the session_cached_cursors setting.
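For a sense of scale, the relative increase is easy to compute. A quick Python back-of-the-envelope, using the CPU_TIME figures from the v$sql output above (v$sql reports CPU_TIME in microseconds):

```python
# CPU_TIME figures (microseconds, as reported by v$sql) for the two
# extreme session_cached_cursors settings from the output above.
cpu_time = {100: 571916, 1000: 702893}

increase = cpu_time[1000] / cpu_time[100] - 1
print(f"{increase:.0%} more CPU time at session_cached_cursors=1000")
```

About a 23% CPU-time penalty just from raising a setting that is usually considered harmless.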

What happened is that 9iR2 was running with session_cached_cursors=1000 and, apparently, it never had any negative effect... until the setting was transferred to 11gR2. It's somewhat hard to speculate about what exactly is happening, but based on my limited test cases it appears that you need to (a) have a UNION ALL in the statement, (b) run with cursor_sharing=force and (c) have literals in your query. Under these circumstances the strange performance degradation can be observed in 11gR2. But I need to do more investigation on that one. For now all I can say is that, after setting session_cached_cursors=100, 11gR2 exhibited the same results as 9iR2.

***** UPDATE 07-Feb-2012 *****

It appears that only 11.2.0.1 and 11.2.0.2 are affected by this issue (see comments for more details).

***** UPDATE 07-Feb-2012 #2 *****

I've found the bug# for this one: "Bug 12345980 high parse time with cursor_sharing=force when session_cached_cursors set".

Wednesday, February 01, 2012

Latch: row cache objects, session_cached_cursors and a database link

What could possibly be common about the above things?

Some time ago I was involved in tuning a large-scale production system which experienced quite a lot of latch contention. In fact the contention was bad enough to render the entire system unusable during peak hours.

A typical AWR report during times with moderate workload looked like this:

Event                      Waits           Time(s) Avg wait (ms)   % DB time  Wait Class
DB CPU                     21,805          36.95  
latch: row cache objects   14,524,462      11,552  1               19.57      Concurrency
db file sequential read    2,697,778       8,988   3               15.23      User I/O
gc current block 3-way     4,202,356       3,599   1               6.10       Cluster
gc current block 2-way     3,670,293       2,330   1               3.95       Cluster

(in reality it was a 7-node RAC cluster but it is not really relevant to our subject).

It is not hard to spot latch: row cache objects in second place, consuming almost 20% of DB time. This event escalated rapidly as load increased, quickly bringing the entire cluster into an unusable state as far as any front-end applications were concerned. The total number of gets during an hour (the interval with which this particular AWR report had been made) was 1,281,167,103, which averages out to a whopping 355,879 gets per second!
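A quick sanity check on that per-second average (plain Python, figures taken from the AWR report above):

```python
# Gets on the row cache objects latch over the one-hour AWR interval,
# averaged per second (figures from the report above).
total_gets = 1_281_167_103
interval_seconds = 3600

print(f"{total_gets // interval_seconds:,} gets per second")
```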

The first step was to take a look at the Dictionary Cache statistics and see if there were any obvious deviations:
Cache                   Get Requests    Pct Miss
dc_awr_control          72              9.72
dc_database_links       124,452         0.05
dc_files                46,900          17.14
dc_global_oids          555,599         0.07
dc_histogram_data       8,962,576       1.01
dc_histogram_defs       412,885,094     0.02
dc_object_grants        52,475          1.13
dc_objects              2,860,222       0.64
dc_profiles             61,189          0.01
dc_rollback_segments    392,885         0.00
dc_segments             657,653         7.72
dc_sequences            1,324           67.67
dc_table_scns           3,206           3.31
dc_tablespaces          1,253,322       0.06
dc_users                5,475,824       0.02
global database name    65,092          0.01
kqlsubheap_object       10,910          0.46
outstanding_alerts      721             95.98

dc_histogram_defs really stands out but so far provides no direct clues, though the entire dictionary cache seems to be quite busy. Let's take a look at the instance activity statistics:
                  Per Second   Per Transaction  Per Exec        Per Call
DB Time(s):       16.4         0.5              0.00            0.00
DB CPU(s):        6.1          0.2              0.00            0.00
Redo size:        87,543.5     2,398.6
Logical reads:    336,555.7    9,221.2
Block changes:    449.3        12.3
Physical reads:   1,018.3      27.9
Physical writes:  36.8         1.0
User calls:       41,250.4     1,130.2
Parses:           6,264.4      171.6
Hard parses:      27.1         0.7
W/A MB processed: 33,769,202.4 925,233.0
Logons:           18.3         0.5
Executes:         14,230.5     389.9
Rollbacks:        3.4          0.1
Transactions:     36.5

A couple of things to notice:
  1. The number of executions is not that high (at least for an IBM box with 56 CPU cores, that is).
  2. The execute-to-parse ratio is neither particularly good nor bad (6,264.4 parses per second compared to 14,230.5 executions).
  3. None of the above seems to be enough to justify 355,879 latch: row cache objects gets per second, even when you consider 27.1 hard parses/sec.

There had to be more going on. Looking at the SQL ordered by Parse Calls section of the report revealed that all the top parsing queries were coming in over a database link. That prompted me to ask additional questions...

It turned out that the system had a somewhat weird architecture. A 7-node RAC cluster served as a back end to numerous front-end Oracle databases which were essentially acting as PL/SQL engines, accessing all the data through the database link while doing all the processing. The architectural feasibility of such a solution is debatable (to say the least) but let's not steer away from the main topic of our discussion.

I geared up to find out whether there was some special case which could cause latch: row cache objects to shoot through the roof when a lot of queries are executed through a database link. After a number of experiments I eventually found the cause, which explains the title of this blog post.

Take a look at the following example:

SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  5  
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 10  
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
 
79
 
PL/SQL procedure successfully completed

Accessing a remote table 1000 times through a database link only caused 79 row cache objects gets on the destination database. However, exactly the same example can produce quite a different result:

SQL> alter session set session_cached_cursors=0;
 
Session altered
 
SQL> 
SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  5  
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 10  
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
 
3082
 
PL/SQL procedure successfully completed

Now we jumped from 79 gets to 3082, which is a 39x increase! What's really astonishing is that it's a cursor cache miss on the source which wreaks havoc on the destination. Evidently, the solution to the problem had nothing to do with the database where the problem was observed. The "PL/SQL databases" had an insufficient session_cached_cursors setting and once that was corrected the issue disappeared.

It is somewhat ironic that the problem would never have appeared had the data and PL/SQL been allowed to stay in the same database (where they belong).

Tuesday, April 05, 2011

Oracle GoldenGate and undocumented OCI redo log APIs

Last time I did a quick write-up on the new APIs that appeared in Oracle GoldenGate 11G, which can be used against Oracle 10.2.0.5 and 11.2.0.2 to improve how GG handles redo on ASM.

Some of the items still required follow up, namely:
  • What are the names of these new API functions?
  • What kind of security privileges do you need to call these?
I was especially interested in the last item since the ability to read Oracle's redo log gives you extremely powerful access to database information.

New OCI APIs

Discovering these is pretty straightforward: all we need to do is take a stack dump of a running GG Extract process:
[oracle@quadro.com ~]$ pstack 4390
...
#10 0x00002acc23443ef0 in knxOGGRedoLogRead () from /u01/app/oracle/ggs_11g/libclntsh.so.11.1
#11 0x00002acc232a9b3b in OCIPOGGRedoLogRead () from /u01/app/oracle/ggs_11g/libclntsh.so.11.1
#12 0x00000000008fa92c in ASMReader2::readFile(unsigned int, unsigned int, unsigned int*, char*, char*) ()
...
Once we know the name of the read function (OCIPOGGRedoLogRead) we can find the rest of them:
[oracle@quadro.com ~]$ nm /u01/app/oracle/ggs_11g/libclntsh.so.11.1 | grep OCIPOGG
00000000010e3954 T OCIPOGGRedoLogClose
00000000010e3c90 T OCIPOGGRedoLogOpen
00000000010e3a2a T OCIPOGGRedoLogRead
As you can see, these functions are indeed exposed directly through the libclntsh library. We can even see the Extract process "in action" calling these functions with the help of gdb:
[oracle@quadro.com ~]$ ps -aef | grep extract
oracle    4390  4386  0 19:36 ?        00:00:00 /u01/app/oracle/ggs_11g/extract PARAMFILE /u01/app/oracle/ggs_11g/dirprm/11g_ext.prm REPORTFILE /u01/app/oracle/ggs_11g/dirrpt/11G_EXT.rpt PROCESSID 11G_EXT USESUBDIRS
oracle    4985  4114  0 19:51 pts/2    00:00:00 grep extract
[oracle@quadro.com ~]$ gdb /u01/app/oracle/ggs_11g/extract -p 4390
...
(gdb) break OCIPOGGRedoLogClose
Breakpoint 1 at 0x2acc232a9958
(gdb) break OCIPOGGRedoLogOpen
Breakpoint 2 at 0x2acc232a9c94
(gdb) break OCIPOGGRedoLogRead
Breakpoint 3 at 0x2acc232a9a2e
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00002acc232a9958 
2       breakpoint     keep y   0x00002acc232a9c94 
3       breakpoint     keep y   0x00002acc232a9a2e 
After we've set up our breakpoints we can let the Extract run and hit one of them:
(gdb) continue
Continuing.

Breakpoint 3, 0x00002acc232a9a2e in OCIPOGGRedoLogRead ()
   from /u01/app/oracle/ggs_11g/libclntsh.so.11.1

(gdb) bt
#0  0x00002acc232a9a2e in OCIPOGGRedoLogRead ()
   from /u01/app/oracle/ggs_11g/libclntsh.so.11.1
#1  0x00000000008fa92c in ASMReader2::readFile (this=0xa408100,
    offset=, bytes_to_read=1024000,
    bytes_read=0x7fffb8c73ad4, buffer=0xabc0000 "\001\"", errtext=0xc457c0 "")
    at /home/ecloud/workspace/Build_FBO_OpenSys_r11.1.1.0.0_078_[34086]/perforce/src/app/er/redo/oracle/asm.c:798
...
On top of the stack we can see GoldenGate making a call to OCIPOGGRedoLogRead. We can also get a sense of parameters by looking at the calling function (ASMReader2::readFile). The function specifies how many bytes it would like to read (bytes_to_read=1024000), a pointer to a variable which will hold the actual amount read (bytes_read=0x7fffb8c73ad4), a pointer to a buffer to store the returned redo data (buffer=0xabc0000) and the error text, if any (errtext=0xc457c0).

Now it is time to see how the other two functions are used. What I did was switch the redo logs in my database, disable breakpoint number 3 (the one which points to OCIPOGGRedoLogRead) and let the Extract process continue:
(gdb) disable 3

(gdb) continue
Continuing.

Breakpoint 2, 0x00002acc232a9c94 in OCIPOGGRedoLogOpen ()
   from /u01/app/oracle/ggs_11g/libclntsh.so.11.1
(gdb) bt
#0  0x00002acc232a9c94 in OCIPOGGRedoLogOpen ()
   from /u01/app/oracle/ggs_11g/libclntsh.so.11.1
#1  0x00000000008fa859 in ASMReader2::openFile (this=0xa408100,
    filename=0xa408350 "+DATA/ora11gr2/onlinelog/group_3.258.743164441",
    lblksize=, blksize=0x0, tot_blks=0x0,
    filetype=, errtext=0xc457c0 "")
    at /home/ecloud/workspace/Build_FBO_OpenSys_r11.1.1.0.0_078_[34086]/perforce/src/app/er/redo/oracle/asm.c:762
...
Interestingly enough we never made a call to OCIPOGGRedoLogClose. Maybe I should open an Oracle SR and complain :)

As before, we can get a glimpse of the parameters, which include the redo log file name as well as a bunch of output parameters specifying the block size and so on (which looks somewhat similar to what you get from a dbms_diskgroup.open call).

Security

Looking at how all this stuff is exposed through OCI, I became even more eager to find out what kind of privileges you require on the database side in order to be able to use these. The first step was to try and run the Extract process with stripped-down privileges, which thankfully resulted in the following error:
2011-04-05 20:10:36  ERROR   OGG-00446  Opening ASM file +DATA/ora11gr2/onlinelog/group_3.258.743164441 in DBLOGREADER mode: (1031) ORA-01031: insufficient privilegesNot able to establish initial position for sequence 398, rba 1040.
At least there is something, but what exactly is it? GoldenGate's documentation (should I add "as usual"?) doesn't mention anything in that regard so I had to figure that one out on my own.

After some back and forth I was able to eventually discover the privilege which makes it all possible -- it's select any transaction. Alas it is not documented to play this role but now you know anyway.

Saturday, March 05, 2011

Oracle GoldenGate, ASM and DBLOGREADER option

Some time ago I did a write-up detailing how Oracle GoldenGate interacts with Oracle ASM. The technology stack, where the Extract process establishes a connection with the ASM instance and then reads file contents using the dbms_diskgroup package, is quite cumbersome. Compared to how GG reads redo from a filesystem (detailed here), the way GG has to deal with ASM is quite inefficient, to say the least. In fact, as part of my original article, the recommendation was to use a bequeath connection when possible to cut down on the amount of overhead. Nowadays the use of a bequeath connection has become an official recommendation from Oracle, outlined in the Oracle GG 11G Release Notes.

DBLOGREADER

I was quite excited to see that Oracle made some progress in its newer release of GoldenGate (11.1). Here is what the documentation says about the new option:

A new DBLOGREADER option was added to TRANLOGOPTIONS to enable Extract to use a
newer API for capturing from an Oracle ASM instance. This feature is available as of Oracle
10.2.0.5.

This was indeed promising, however, what exactly has changed remained to be discovered.

The first change you're going to notice when using DBLOGREADER is that the Extract process establishes a connection with the RDBMS instance (not an ASM instance). If you trace the relevant server-side process then you'll observe the following event being emitted each time it reads from the redo:

WAIT #0: nam='log file sequential read' ela= 3 log#=0 block#=102653 blocks=2000 obj#=-1 tim=1299345636565911

One of the advantages of the new option is the ability to use a much larger buffer (the old mechanism was limited to something around 28K, which is abysmal for any system with even a moderate redo generation rate). As we can see from the above, we had 2000x512=1000K worth of data read in one shot, so at least that part works as advertised. But how is it done?
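The arithmetic behind that figure, assuming the usual 512-byte redo log block size on this platform, is simply:

```python
# One 'log file sequential read' above reports blocks=2000 redo blocks;
# with a 512-byte redo block size (typical on Linux) that amounts to:
blocks = 2000
redo_block_size = 512  # bytes; platform dependent

bytes_per_read = blocks * redo_block_size
print(f"{bytes_per_read} bytes (~1000K) per read")
```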

A stack dump reveals the following:

[oracle@ora11gr2 ggs_11g]$ pstack 4740
#1  0x00000000090bae87 in sskgpwwait ()
#2  0x00000000090b9f9a in skgpwwait ()
#3  0x0000000008d29f93 in ksliwat ()
#4  0x0000000008d29541 in kslwaitctx ()
#5  0x0000000008d269ab in kslwait ()
#6  0x00000000071f8566 in knloggRedoRead ()
#7  0x00000000071f9e05 in knloggmain ()
#8  0x0000000008eae67a in opiodr ()
#9  0x00000000090467f9 in ttcpip ()
#10 0x0000000001722f86 in opitsk ()
#11 0x0000000001727c26 in opiino ()
#12 0x0000000008eae67a in opiodr ()
#13 0x000000000171eedc in opidrv ()
#14 0x0000000001d95f5f in sou2o ()
#15 0x0000000000a07935 in opimai_real ()
#16 0x0000000001d9b408 in ssthrdmain ()
#17 0x0000000000a078a1 in main ()

Look at the functions in lines #6 and #7 (knloggmain() and knloggRedoRead()). These functions follow right after the OPI (Oracle Program Interface) layer, which makes this code path look rather efficient. No need to deal with the compilation (KK) and execution (KX) layers. Just a direct and straightforward call to the functions we need. Indeed, when I did some quick and dirty tests (using 11.2.0.2), the new code path was able to achieve about 90% of the performance of just reading the redo from an ext3 filesystem.

The only question I have about all this is the security paradigm behind knloggmain() calls. This function appears to be directly exposed through the OPI layer and, while this helps achieve greater efficiency, how does it control who can and who cannot call it? You may not necessarily want somebody connecting to your database and reading your redo stream, as you may get all kinds of sensitive information right there. I haven't spent any time figuring that one out yet though.

Sunday, December 19, 2010

Oracle GoldenGate Trail File Size

When it comes to estimating how big your trail files will be, Oracle documentation suggests to use the following formula:
[log volume in one hour] x [number of hours downtime] x .4 = trail disk space
It is also described as being a conservative estimate so you are likely to get better mileage. What I found is that there are some corner cases which can produce "anomalous" results.
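Expressed as code, the documented estimate is trivial (a Python sketch; the function name is mine):

```python
def trail_disk_space(log_volume_per_hour, downtime_hours, factor=0.4):
    """Oracle's documented trail-space estimate:
    [log volume in one hour] x [number of hours downtime] x .4"""
    return log_volume_per_hour * downtime_hours * factor

# e.g. 10 GB of redo per hour and 8 hours of planned downtime:
estimate = trail_disk_space(10 * 1024**3, 8)
print(f"{estimate / 1024**3:.0f} GB")
```

Keep that 0.4 factor in mind; the test case below shows how far off it can be.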

Test Case

I'll start with specially crafted schema and table names, as well as data:
create user uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu identified by "u";

User created.

SQL> grant resource to uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu;

Grant succeeded.

SQL> create table uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
  2  (
  3     n number primary key
  4  );

Table created.

SQL> alter table uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
  2 add supplemental log data (primary key) columns;

Table altered.
I've just created a new trail file which is currently 974 bytes in size:
[oracle@gg1 dirdat]$ ls -l aa000000
-rw-rw-rw- 1 oracle oinstall 974 Dec 19 11:50 aa000000
What I'm going to do is insert some data into the table and then measure how much redo as well as trail data was generated:
SQL> set autot traceonly stat
SQL> insert into uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
      select power(10,10)+(level-1)*power(10,10)
        from dual
        connect by level <= 10000;  2    3    4

10000 rows created.


Statistics
----------------------------------------------------------
        0  recursive calls
      329  db block gets
       31  consistent gets
        0  physical reads
     364612  redo size
      821  bytes sent via SQL*Net to client
      917  bytes received via SQL*Net from client
        3  SQL*Net roundtrips to/from client
        2  sorts (memory)
        0  sorts (disk)
      10000  rows processed

SQL> commit;

Commit complete.
Let's take a look at the trail file size:
[oracle@gg1 dirdat]$ ls -l aa000000
-rw-rw-rw- 1 oracle oinstall 1619890 Dec 19 13:40 aa000000
That is roughly 4.4 times bigger than our redo size and 11 times bigger than Oracle's "conservative" estimate. Your storage provisioning might be in for a surprise.
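Those ratios follow directly from the measured figures (a quick Python check; I subtract the 974-byte file header to get the actual trail growth):

```python
# Figures from the test above: redo size reported by autotrace, and the
# trail file growth (final size minus the 974-byte file header).
redo_bytes = 364612
trail_bytes = 1619890 - 974

vs_redo = trail_bytes / redo_bytes
vs_estimate = trail_bytes / (redo_bytes * 0.4)  # the documented 0.4 factor
print(f"{vs_redo:.1f}x the redo size, {vs_estimate:.0f}x the estimate")
```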

Trail File

To understand the key factors which resulted in our trail file being of such an outrageous size, let's use logdump and take a look inside:
Logdump 11 >open ./dirdat/aa000000
Current LogTrail is /u01/app/oracle/ggs/dirdat/aa000000
Logdump 12 >ghdr on
Logdump 13 >next 2

2010/12/19 13:39:16.632.818 FileHeader           Len   966 RBA 0
Name: *FileHeader*
 3000 01b6 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
 0002 3200 0004 2000 0000 3300 0008 02f1 bdfb 2d3e | ..2... ...3.......->
 74f2 3400 0028 0026 7572 693a 6767 313a 7175 6164 | t.4..(.&uri:gg1:quad
 726f 3a63 6f6d 3a3a 7530 313a 6170 703a 6f72 6163 | ro:com::u01:app:orac
 6c65 3a67 6773 3600 0025 0023 2f75 3031 2f61 7070 | le:ggs6..%.#/u01/app
 2f6f 7261 636c 652f 6767 732f 6469 7264 6174 2f61 | /oracle/ggs/dirdat/a
 6130 3030 3030 3037 0000 0101 3800 0004 0000 0000 | a0000007....8.......

___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :    19  (x0013)   IO Time    : 2010/12/19 13:40:32.000.000
IOType     :     5  (x05)     OrigNode   :   255  (xff)
TransInd   :     .  (x00)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :         45       AuditPos   : 37458516
Continued  :     N  (x00)     RecCount   :     1  (x01)

2010/12/19 13:40:32.000.000 Insert               Len    19 RBA 974
Name: UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
After  Image:                                             Partition 4   G  b
 0000 000f 0000 000b 3130 3030 3030 3030 3030 30   | ........10000000000
The first record is a standard trail file header and is of little interest to us. We're going to take a look at the second record. First of all, as you can see, we've got a fully qualified table name stored in there. We can see the inserted value as well. Now if all that stuff got stored in plain text, that could explain it... let's check it out:
[oracle@gg1 dirdat]$ grep --binary-files=text -o -i \
> uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt \
> aa000000 \
> | wc -l
10000
Indeed, we've got the fully qualified table name appearing 10 thousand times in our trail file! The longer your schema and/or table names are, the bigger your trail file will be. The main reason Oracle Streams (as well as Oracle's redo) uses object identifiers is to avoid getting into exactly this kind of trouble. Storing the name makes GoldenGate configuration easier, as it doesn't have to store the mapping information on the target database, but you also pay the price with every row modification which gets captured.

The story doesn't end there, however. Let's take a look at the inserted data. How do you think that got stored?
[oracle@gg1 dirdat]$ grep --binary-files=text -o -i \
> 0000000000 \
> aa000000 \
> | wc -l
10000
That's right - all numbers got stored in plain text. The fundamental problem here is that the trail file does not store data types. Everything just gets converted to strings. This helps when dealing with heterogeneous data sources but also makes the storage of certain data types very inefficient. Again, you have to pay the price somewhere. If you gzip the above trail file you'll get a whopping 25x compression ratio.
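That compression ratio is easy to approximate outside GoldenGate: highly repetitive content (a 61-character table name plus plain-text numbers, once per row) compresses extremely well. A rough model using Python's zlib -- this is not the real trail record format, just an illustration of why the ratio is so high:

```python
import zlib

# Simulate a trail file dominated by a repeated fully qualified table
# name and plain-text numeric data (a rough model, not the real format).
record = b"UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 10000000000\n"
data = record * 10000

ratio = len(data) / len(zlib.compress(data))
print(f"compression ratio: ~{ratio:.0f}x")
```

The ratio comfortably exceeds the 25x observed with gzip on the actual trail file.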

You may ask yourself how column names are stored. The answer is that there are no column names in the trail file, just column positions. That's why GoldenGate requires the column order to match between source and target databases, otherwise you'll get your data mapped to the wrong columns. Add the lack of data types and you might be in for some very interesting results. If the column order doesn't match then you'll have to use defgen to get your data across properly.

In a nutshell, watch out for tables/schemas with long names and lots of numeric data.

Monday, July 19, 2010

ASM Mirroring and Disk Partnership

Let's say you have a RAID10 disk array with 50 disks. I'm offering you the following game. You'll spin a wheel of fortune divided into 50 equal slots with numbers ranging from 1 to 50. After you've spun the wheel for the first time you'll write the resulting number down. After that I'll offer you to spin a second wheel of fortune equally divided into 49 slots, where the number you've just pulled is missing (so you can't pull the same number twice).

After that you'll end up with two (mutually exclusive) random numbers in the range from 1 to 50. What I'm going to ask you to do next is pull the disks with the corresponding numbers out of your RAID10 array. What are the odds of your entire array going down?

In a classical RAID10 setup, where every drive is mirrored by a single other drive, you can calculate the probability in a straightforward fashion. After you've pulled the first drive out (it doesn't matter which one) there is only one specific disk out of 49 which you have to pull in order for the entire array to go bust. So your odds of taking the entire array down are 1/49, or roughly 2%.
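In code, the classical RAID10 case looks like this (a trivial Python sketch of the reasoning above):

```python
# Classical RAID10 with 50 disks: each disk has exactly one mirror,
# so after the first pull only 1 of the remaining 49 disks is fatal.
disks = 50
fatal = 1

p = fatal / (disks - 1)
print(f"{p:.1%}")
```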

Now let's say that instead of a regular RAID10 array you've got a normal redundancy ASM disk group with two failure groups (25 disks each). What are the odds now?

ASM Mirroring

Before we answer the above question we need to realize that ASM does not mirror disks the same way traditional RAID10 does. In fact it doesn't mirror disks at all. It mirrors extents instead. All you know is that the extents from the disk you've just pulled out won't be mirrored in the same failure group. So that leaves us with 24 disks as safe. But what about the other 25 disks from the other failure group? How many of these disks are unsafe and will result in your normal redundancy disk group going south?

Disk Partnership


When mirroring extents ASM uses a concept called Disk Partnership. Every disk in a normal redundancy disk group has one or more partners which are used to mirror primary extents from that disk. This also means that the loss of any of the partner disks is fatal to the disk group, as you'll nuke both the primary extent and its mirror copy (keep in mind that we're talking about pulling both disks out before the disk group is able to rebalance). At least now we know what we need next in order to solve the puzzle. We need to find out how many partners each of the disks in our array has.

Disk Partners


Let's say that the first disk we pulled out was disk number 0. The following query can be used to find all partners for disk number 0 in the first disk group:

SQL> select p.number_kfdpartner, d.FAILGROUP
        from x$kfdpartner p, v$asm_disk d
        where p.disk=0
                and p.grp=1
                and p.grp=group_number
                and p.number_kfdpartner=d.disk_number;  2    3    4    5    6

NUMBER_KFDPARTNER FAILGROUP
----------------- ------------------------------
               25 FG2
               26 FG2
               27 FG2
               29 FG2
               38 FG2
               46 FG2
               48 FG2
               49 FG2

8 rows selected.
Pulling any of the above disks out at the same time as disk number 0 will be fatal for our normal redundancy disk group. In other words, once we pull the first disk out, there are another 8 disks out of 49 which are unsafe. That brings our odds up (or down, depending on which outcome you're interested in :) to 8/49, or a little bit more than 16%.
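The same back-of-the-envelope math for the ASM case:

```python
# ASM normal redundancy, 50 disks, 8 partner disks per disk: any of
# those 8 (out of the remaining 49) pulled together with disk 0 is fatal.
disks = 50
partners = 8

p = partners / (disks - 1)
print(f"{p:.1%}")
```

Eight times the odds of the classical RAID10 setup, purely because of the wider disk partnership.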

You can confirm that every disk has exactly 8 partners by running the following query:

SQL> select min(cnt), max(cnt) from (
select number_kfdpartner disk_number, count(*) cnt
        from x$kfdpartner
        where grp=1
        group by number_kfdpartner);  2    3    4    5

  MIN(CNT)   MAX(CNT)
---------- ----------
         8          8

Partner Disk Count

Is there any way to control the number of partner disks which ASM uses for extent mirroring? It turns out that there is. Note that it's a completely unsupported operation so you shouldn't be playing with it.

The parameter which controls the maximum number of partner disks is called _asm_partner_target_disk_part. In 11GR2 that parameter has a default value of 8. I didn't have a chance to check it in previous versions but supposedly its default value there is 10(1). So at least we know that Oracle itself sometimes changes it between releases.

All you need to do after changing this parameter is to rebalance the disk group:

SQL> alter system set "_asm_partner_target_disk_part"=2;

System altered.

SQL> alter diskgroup data rebalance;

Diskgroup altered.

SQL> select p.number_kfdpartner, d.FAILGROUP
        from x$kfdpartner p, v$asm_disk d
        where p.disk=0
                and p.grp=1
                and p.grp=group_number
                and p.number_kfdpartner=d.disk_number;  2    3    4    5    6
NUMBER_KFDPARTNER FAILGROUP
----------------- ------------------------------
               25 FG2
               38 FG2
So there it goes: disk 0 is now left with only two partners!

References
Oracle Automatic Storage Management: Under-the-Hood & Practical Deployment Guide
ASM Metadata and Internals

All tests were performed with Oracle Grid Infrastructure 11.2.0.1

Tuesday, June 08, 2010

11GR2 Result Cache Scalability

Result Cache in 11GR1

Two years ago I wrote a series of posts where I explained some of the dynamics around the Result Cache latch. To recap: the result cache memory in 11GR1 is protected by a single RC latch. That in itself wouldn't be so much of an issue (at least relative to what we've got in reality) had the latch allowed shared-mode gets in the case where all you need to do is read from the result cache memory.

Alas, the latch turned out to have no shared-mode gets. It almost goes without saying that, as concurrency levels increased, that single latch behaved more and more like a hand brake (link to a test I did back then on an 8-way Itanium 2).

Back to the future

When 11GR2 was released, I knew that at some point I would need to go back and revisit this subject. I did a couple of quick-and-dirty runs which came back confirming the same single latch and no shared-mode gets, so it didn't look like anything had really changed. At that point I decided to revisit it a bit later. That "bit later" happened just recently.

How bad can it get?

What I wanted to do is get an UltraSPARC T2 and pit it against a Core i7 980X at different concurrency levels in order to see how bad it can get. The T2 requires quite a lot of parallelism in order to keep up with even a single i7 core. But since all we've got is a single RC latch, I expected the T2 to choke on it quite fast: not only would there be a lot of processes competing for the same latch, the slow single-threaded performance would cause the latch to be held for much longer periods of time. Performance degradation would be dire.

Result Cache in 11GR2

I used the same test described here, as it is targeted at exploiting the RC latch weakness and gives me the ability to compare with the old results. I used 250K lookup iterations. Performance was measured as the total number of lookups performed per second, and RC latch statistics were captured for analysis.

Since the 980X has 6 cores and 12 threads, the tests were done with 1 to 12 processes running at the same time, which also gave an opportunity to see how well hyper-threading scales. Note that I plan to do some further testing on the T2 with up to 64 threads, but for now I've tested up to 12 threads only, as I couldn't get a big enough test window.

UltraSPARC T2 Results

# of processes  Buffer Cache  % linear  Result Cache  % linear
             1          4426    100             4555    100
             2          8930    100.88          9124    100.15
             3         13465    101.41         13731    100.48
             4         17886    101.03         18179     99.77
             5         22290    100.72         22715     99.74
             6         26615    100.22         27012     98.84
             7         30659     98.96         30804     96.61
             8         34347     97            34910     95.8
             9         38389     96.37         39029     95.2
            10         42772     96.64         43126     94.68
            11         46840     96.21         46936     93.68
            12         50667     95.4          50590     92.55

When I saw these numbers for the first time, I was quite surprised by just how good they are! The UltraSPARC T2 ended up being far from choking and, as a matter of fact, the only position where the Result Cache had to give up is the last one. If you reference the results I obtained on the 8-way Itanium 2 in 11GR1, you'll see that the Result Cache gave up much earlier and scaled a lot worse.

This certainly looks promising, so let's take a look at the RC latch statistics:

# of processes      Gets    Misses  Sleeps  Wait Time
             1    500001         0       0          0
             2   1000002     40253       1          0
             3   1500003     50404       0          0
             4   2000004    165116       9        464
             5   2500005    211559       5        182
             6   3000006    437898       8       6877
             7   3500007    805752      52      16556
             8   4000008   1214762      20       2980
             9   4500009   1775372     188       3140
            10   5000010   2244964     491      29568
            11   5500011   2552323     664      28011
            12   6000012   3019903    1226      60005


There is one astonishing fact about the above numbers. Let's get some efficiency metrics in place for a comparison between these numbers and the ones I got in 11GR1. I'll use the data point with eight parallel processes, as it's the highest reference point I can get across both data sets.

First of all, the number of gets per execution remained the same: two gets per exec. If we calculate % miss per get, we get 28.62% in 11GR1 and 50.33% in 11GR2. In other words, roughly every second get request resulted in a miss in 11GR2 and every third in 11GR1. It may appear as if this got worse, but it's really a consequence of something else.

If we calculate % sleep per miss, we get 31.36% in 11GR1 but only 0.04% in 11GR2! In other words, the number of times a process had to go to sleep has drastically decreased. In almost all of the cases the process was able to acquire the latch during a spin without going to sleep. This also explains why % miss per get in 11GR2 went up, and shows that lowered efficiency on a single metric does not necessarily indicate a problem: it might happen because some other correlated metric has in fact improved.

This is certainly the sign of a great improvement, but what is it? Most likely it is related to optimizing how long the latch needs to be held. The hold time became so small that, in most cases, a process is able to acquire the latch while spinning, before it is required to go to sleep (i.e. in fewer than _spin_count iterations).

Core i7 980X Results

# of processes  Buffer Cache  % linear  Result Cache  % linear
             1         40064    100            43554    100
             2         78989     98.58         84602     97.12
             3        121753    101.3         127768     97.79
             4        159490     99.52        166667     95.67
             5        194704     97.2         204583     93.94
             6        229709     95.56        240770     92.13
             7        231788     82.65        244755     80.28
             8        233918     72.98        246305     70.69
             9        250836     69.57        260718     66.51
            10        267094     66.67        275330     63.22
            11        280326     63.61        290084     60.55
            12        290416     60.41        293830     56.22


Here the Result Cache won across all the positions. We need about 10 processes running on the UltraSPARC T2 in order to beat a single process running on the i7 980X. Performance gains declined rapidly once we got over six concurrent processes, but we were still able to realize some additional throughput, with 12 threads being about 22% faster than 6 threads.
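The scaling figures are easy to reproduce from the raw lookups-per-second numbers. A small sketch (my own arithmetic, not part of the original benchmark harness) recomputing the "% linear" column and the 12-vs-6-thread gain:

```python
# i7 980X result cache throughput (lookups/sec) from the table above.
rc = {1: 43554, 6: 240770, 12: 293830}

# "% linear" = actual throughput as a percentage of perfect linear
# scaling from the single-process result.
pct_linear_12 = rc[12] / (rc[1] * 12) * 100
print(round(pct_linear_12, 2))  # 56.22, matching the last table row

# Going from 6 to 12 concurrent processes still buys some throughput.
gain_pct = (rc[12] / rc[6] - 1) * 100
print(round(gain_pct, 1))  # ~22% faster
```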

Latch statistics:

# of processes      Gets    Misses  Sleeps  Wait Time
             1    500001         0       0          0
             2   1000002     40456       0          0
             3   1500003    117893       5         71
             4   2000004    209399       0          0
             5   2500005    381160       0          0
             6   3000006    517745      11        179
             7   3500007    913125      20        555
             8   4000008   1355226      26      11914
             9   4500009   1834112      13       1017
            10   5000010   2602801      42       1607
            11   5500011   3196415     145       3451
            12   6000012   3730467     184     123954


Essentially we're looking at the same phenomenon, with the number of sleeps being significantly lower compared to what we observed in 11GR1. With six concurrent processes, % miss per get is 17.26% and % sleep per miss is 0.002%! This allowed the Result Cache to stay ahead with up to (and including) 12 concurrent processes running.
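The two efficiency metrics quoted above come straight out of the latch statistics table; a tiny sketch (my own arithmetic, not part of the original post) shows how:

```python
# Row for 6 concurrent processes on the i7 980X.
gets, misses, sleeps = 3000006, 517745, 11

# A miss means the latch could not be taken on the first attempt;
# a sleep means spinning did not help and the process had to wait.
miss_per_get = misses / gets * 100
sleep_per_miss = sleeps / misses * 100

print(round(miss_per_get, 2))    # 17.26
print(round(sleep_per_miss, 3))  # 0.002
```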

UltraSPARC T2 vs i7 980X

We'll wrap up with a nice graph showing result cache performance on both UltraSPARC T2 and Core i7 980X:



The i7 980X starts almost where 12 UltraSPARC T2 processes end. Would the T2 be able to narrow the gap with more parallel threads? I'll certainly find out.

Conclusion

There is an enormous improvement when it comes to Result Cache scalability in 11GR2. It's still slower than it would be with shared-mode gets (or multiple child latches or, even better, both), but it gets very, very close.

Monday, June 07, 2010

Row cache objects latch contention

A data loading process was running on an UltraSPARC T2 CPU. To take advantage of the platform architecture (or, I'd rather say, to avoid its limitations) the loading process had been designed to run with a massive number of parallel query slaves in order to extract the maximum throughput from the CMT architecture.

Every time this data loading process executed, it experienced strange slowdowns at seemingly random points in time. The performance drops were quite substantial, which prompted further investigation. Upon closer examination of ASH data, it turned out that all the slowdowns were due to latch: row cache objects contention.

The row cache objects latch protects the dictionary cache. The first thing was to figure out whether most of the contention was contributed by one particular row cache objects child latch:
SQL> select latch#, child#, sleeps
2 from v$latch_children
3 where name='row cache objects'
4 and sleeps > 0
5 order by sleeps desc;

    LATCH#     CHILD#     SLEEPS
---------- ---------- ----------
       270          1   24241645
       270          5        523
       270          4         52
The first child certainly doesn't look good given the number of sleeps it experienced compared to all the other child latches. Once we have identified the troublesome child latch, we can move on and see which type of dictionary cache it protects:
SQL> select distinct s.kqrstcln latch#,r.cache#,r.parameter name,r.type,r.subordinate#
  2  from v$rowcache r,x$kqrst s
  3  where r.cache#=s.kqrstcid
  4  order by 1,4,5;

LATCH# CACHE# NAME TYPE SUBORDINATE#
------- ------ --------------------------------- ----------- ------------
1 3 dc_rollback_segments PARENT
2 1 dc_free_extents PARENT
3 4 dc_used_extents PARENT
4 2 dc_segments PARENT
5 0 dc_tablespaces PARENT
6 5 dc_tablespace_quotas PARENT
7 6 dc_files PARENT
8 10 dc_users PARENT
8 7 dc_users SUBORDINATE 0
8 7 dc_users SUBORDINATE 1
8 7 dc_users SUBORDINATE 2
9 8 dc_objects PARENT
9 8 dc_object_grants SUBORDINATE 0
10 17 dc_global_oids PARENT
11 12 dc_constraints PARENT
12 13 dc_sequences PARENT
13 16 dc_histogram_defs PARENT
13 16 dc_histogram_data SUBORDINATE 0
13 16 dc_histogram_data SUBORDINATE 1
14 32 kqlsubheap_object PARENT
15 19 dc_table_scns PARENT
15 19 dc_partition_scns SUBORDINATE 0
16 18 dc_outlines PARENT
17 14 dc_profiles PARENT
18 47 realm cache PARENT
18 47 realm auth SUBORDINATE 0
19 48 Command rule cache PARENT
20 49 Realm Object cache PARENT
20 49 Realm Subordinate Cache SUBORDINATE 0
21 46 Rule Set Cache PARENT
22 34 extensible security user and rol PARENT
23 35 extensible security principal pa PARENT
24 37 extensible security UID to princ PARENT
25 36 extensible security principal na PARENT
26 33 extensible security principal ne PARENT
27 38 XS security class privilege PARENT
28 39 extensible security midtier cach PARENT
29 44 event map PARENT
30 45 format PARENT
31 43 audit collector PARENT
32 15 global database name PARENT
33 20 rule_info PARENT
34 21 rule_or_piece PARENT
34 21 rule_fast_operators SUBORDINATE 0
35 23 dc_qmc_ldap_cache_entries PARENT
36 52 qmc_app_cache_entries PARENT
37 53 qmc_app_cache_entries PARENT
38 27 qmtmrcin_cache_entries PARENT
39 28 qmtmrctn_cache_entries PARENT
40 29 qmtmrcip_cache_entries PARENT
41 30 qmtmrctp_cache_entries PARENT
42 31 qmtmrciq_cache_entries PARENT
43 26 qmtmrctq_cache_entries PARENT
44 9 qmrc_cache_entries PARENT
45 50 qmemod_cache_entries PARENT
46 24 outstanding_alerts PARENT
47 22 dc_awr_control PARENT
48 25 SMO rowcache PARENT
49 40 sch_lj_objs PARENT
50 41 sch_lj_oids PARENT

60 rows selected.
The first child protects dc_rollback_segments. We can confirm it by referencing data in v$rowcache:
SQL> select parameter, gets
2 from v$rowcache
3 order by gets desc;

PARAMETER GETS
-------------------------------- ----------
dc_rollback_segments 310995555
dc_tablespaces 76251831
dc_segments 3912096
dc_users 2307601
dc_objects 1460725
dc_users 608659
dc_histogram_defs 250666
global database name 67475
dc_histogram_data 43098
dc_histogram_data 14364
dc_global_oids 14320
outstanding_alerts 2956
dc_profiles 2555
dc_awr_control 1925
dc_object_grants 745
dc_files 532
dc_constraints 201
sch_lj_oids 158
dc_sequences 156
dc_table_scns 20
sch_lj_objs 18
dc_qmc_ldap_cache_entries 0
qmc_app_cache_entries 0
qmc_app_cache_entries 0
qmtmrcin_cache_entries 0
qmtmrctn_cache_entries 0
qmtmrcip_cache_entries 0
qmtmrctp_cache_entries 0
qmtmrciq_cache_entries 0
qmtmrctq_cache_entries 0
qmrc_cache_entries 0
qmemod_cache_entries 0
SMO rowcache 0
dc_users 0
dc_partition_scns 0
dc_users 0
realm auth 0
Realm Subordinate Cache 0
rule_or_piece 0
rule_info 0
audit collector 0
format 0
event map 0
extensible security midtier cach 0
XS security class privilege 0
extensible security principal ne 0
extensible security principal na 0
extensible security UID to princ 0
extensible security principal pa 0
extensible security user and rol 0
Rule Set Cache 0
Realm Object cache 0
Command rule cache 0
realm cache 0
dc_outlines 0
kqlsubheap_object 0
dc_tablespace_quotas 0
dc_used_extents 0
rule_fast_operators 0
dc_free_extents 0

60 rows selected
The next step is to see whether latch miss source can give us some more hints regarding the issue:
SQL> select "WHERE", sleep_count, location
2 from v$latch_misses
3 where parent_name='row cache objects'
4 and sleep_count > 0;

WHERE                 SLEEP_COUNT LOCATION
-------------------- ------------ ------------------------------
kqrpre: find obj         20612167 kqrpre: find obj
kqrpup                          7 kqrpup
kqrcmt: while loop              1 kqrcmt: while loop
kqrcmt: clear flag              1 kqrcmt: clear flag
kqreqd                    1026837 kqreqd
kqreqd: reget             2602576 kqreqd: reget

6 rows selected
Now, if you take kqrpre: find obj and plug it into a search on My Oracle Support, you'll quickly arrive at Bug 5749075 "High Requests on dc_rollback_segments". Among other things, this note points at an unusually high number of undo segments being created because cleanup was not able to work properly...
SQL> select count(*) from dba_rollback_segs;

COUNT(*)
----------
14838
...and this seems to be the case. The only difference is that the issue was observed on 11GR2 while the bug had been filed against older versions, though it was still worth checking in case we were seeing a regression. Indeed, after getting rid of that many undo segments by simply recreating the undo tablespace, the issue, though not completely vanished, manifested itself a lot less, making its impact relatively insignificant to the process throughput.

Wednesday, May 12, 2010

The case of a slow lookup

Take a look at the runtime difference for the same anonymous PL/SQL block between 11GR1 and 11GR2:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
  2    for i in 1 .. 1000000
  3    loop
  4      for cur in (select n from z_t where n=mod(i,10000)+1)
  5      loop
  6        null;
  7      end loop;
  8    end loop;
  9  end;
 10  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:31.03

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
  2    for i in 1 .. 1000000
  3    loop
  4      for cur in (select n from z_t where n=mod(i,10000)+1)
  5      loop
  6        null;
  7      end loop;
  8    end loop;
  9  end;
 10  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:46.23
11GR2 ended up being about 50% slower.

A little background

I had been running some tests and the numbers I was getting back just weren't making any sense. I was using 11GR2, but all I was getting were marginally better results compared to tests I had done a couple of years ago using 11GR1 on much less powerful hardware. Something definitely didn't look right, so I geared up to find out what it was. The above test case is what I arrived at: simple, but it demonstrates the issue. It was run using the same hardware, so the only difference was the release of the Oracle database being used.

tkprof results

To see whether there were any low-hanging fruit, I traced the same loop with 250K iterations (to cut down a bit on the trace file size):

11GR1
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 249995      2.51       2.24          0          0          0           0
Fetch   249995      1.54       1.68          0     499990          0      249995
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   499991      4.06       3.92          0     499990          0      249995

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 54 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 61973)

11GR2
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute 249999      4.07       4.06          0          0          0           0
Fetch   249999      2.62       2.67          0     499998          0      249999
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   499999      6.69       6.74          0     499998          0      249999

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 57 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 69265)

These statistics look almost identical, but with one big difference: 11GR2 is about 70% slower.
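The "about 70%" figure comes straight from the two tkprof totals; a quick computation (mine, for illustration):

```python
# Total elapsed time from the tkprof outputs above (250K lookups each).
elapsed_11gr1 = 3.92
elapsed_11gr2 = 6.74

slowdown_pct = (elapsed_11gr2 / elapsed_11gr1 - 1) * 100
print(round(slowdown_pct))  # 72, i.e. roughly 70% slower
```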

What's going on?

Certainly you wouldn't expect the newest release of the Oracle database to be so much slower than the previous version. As a first step I traced both loops, which led to the discovery of some timing improvements I blogged about just recently. But the additional overhead introduced by the times syscalls was not even remotely enough to justify the difference. As a next step I compared session statistics between 11GR1 and 11GR2, which led to the discovery of the session cached cursors overflow issue. Apart from that, the only difference was a significantly greater amount of CPU time consumed by 11GR2 (as evident from the trace file as well).

Now what?

Clearly some more diagnostics were required. I took a couple of stack dumps while the loop was running to see whether something unusual would show up... and there was something indeed:
[oracle@ora11gr2 trace]$ pstack 1052
#0 0x0000003b83c99247 in times () from /lib64/libc.so.6
#1 0x00000000085556c7 in sltrgatime64 ()
#2 0x0000000005521a44 in auddft ()
#3 0x0000000008438ee3 in audStatement ()
#4 0x00000000083fc299 in opiexe ()
#5 0x0000000008405f57 in opipls ()
#6 0x00000000083f78b9 in opiodr ()
#7 0x00000000084892af in __PGOSF141_rpidrus ()
#8 0x00000000085ee820 in skgmstack ()
#9 0x000000000848a759 in rpiswu2 ()
#10 0x0000000008489e9d in rpidrv ()
#11 0x0000000008476438 in psddr0 ()
#12 0x0000000008476048 in psdnal ()
#13 0x000000000288b3aa in pevm_BFTCHC ()
#14 0x000000000287ab84 in pfrinstr_BFTCHC ()
#15 0x0000000008645105 in pfrrun_no_tool ()
#16 0x00000000086439ea in pfrrun ()
#17 0x000000000864a7d6 in plsql_run ()
#18 0x0000000002870d07 in peicnt ()
#19 0x0000000002867649 in kkxexe ()
#20 0x00000000083fe337 in opiexe ()
#21 0x00000000019da690 in kpoal8 ()
#22 0x00000000083f78b9 in opiodr ()
#23 0x0000000008583187 in ttcpip ()
#24 0x0000000001702661 in opitsk ()
#25 0x0000000001707282 in opiino ()
#26 0x00000000083f78b9 in opiodr ()
#27 0x00000000016fe708 in opidrv ()
#28 0x0000000001b7183f in sou2o ()
#29 0x00000000009d3f05 in opimai_real ()
#30 0x0000000001b76ace in ssthrdmain ()
#31 0x00000000009d3e71 in main ()
Look at lines #2 and #3. Are these auditing functions getting in the way? I verified the audit_trail parameter, which turned out to be set to DB:
SQL> show parameter audit_trail

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
audit_trail                          string      DB

I didn't have any auditing enabled for any of the statements, let alone for the table I was selecting from. But after I confirmed that audit_trail was set to NONE on the 11GR1 database, disabling auditing completely looked like something worth trying.

After I repeated the test with audit_trail set to NONE, I ended up with virtually identical results compared to 11GR1:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
  2    for i in 1 .. 1000000
  3    loop
  4      for cur in (select n from z_t where n=mod(i,10000)+1)
  5      loop
  6        null;
  7      end loop;
  8    end loop;
  9  end;
 10  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:32.42
...and the two audit functions have disappeared from the stack dump as well.

It looks like the impact of having auditing enabled ended up being high enough to draw attention in this particular case, even though it wasn't set up to capture anything about our test. On a side note, this can be considered an invitation to instrument the auditing functionality as well.

Tuesday, May 11, 2010

Timing improvements in Oracle 11GR2 trace

I've been doing some research recently (I might blog about it a bit later if something worthwhile comes out of it) which resulted in a byproduct I found interesting to blog about.

In a nutshell, in one of the steps I've run the following PL/SQL block to trace the results of a very tight loop:
begin
  dbms_monitor.session_trace_enable(waits => false, binds => false);
  for i in 1 .. 100000
  loop
    for cur in (select null from dual)
    loop
      null;
    end loop;
  end loop;
  dbms_monitor.session_trace_disable;
end;

While comparing raw trace files between 11.1.0.7 and 11.2.0.1 I've noticed something interesting. Here is an excerpt from 11.1.0.7 raw trace:
PARSING IN CURSOR #2 len=21 dep=1 uid=54 oct=3 lid=54 tim=1273621434711518
hv=2339989301 ad='86044a88' sqlid='bg2ng0u5rkttp'
SELECT NULL FROM DUAL
END OF STMT
EXEC #2:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621434711518
FETCH #2:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=1,og=1,plh=1388734953,tim=1273621434711518
CLOSE #2:c=0,e=0,dep=1,type=3,tim=1273621434711518
Notice how tim= values are the same across parse, exec, fetch and close; and e= values are zero for fetch, close and execute.

Now take a look at a similar excerpt from 11GR2:
PARSING IN CURSOR #3 len=21 dep=1 uid=57 oct=3 lid=57 tim=1273621440328909
hv=2339989301 ad='844d5a20' sqlid='bg2ng0u5rkttp'
SELECT NULL FROM DUAL
END OF STMT
EXEC #3:c=0,e=9,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621440328908
FETCH #3:c=0,e=6,p=0,cr=0,cu=0,mis=0,r=1,dep=1,og=1,plh=1388734953,tim=1273621440328947
CLOSE #3:c=0,e=1,dep=1,type=3,tim=1273621440328968
Notice how tim= value changes from line to line and e= has a corresponding value associated with it.

Before we move on, you may ask how 11.1.0.7 accounts for the time if all the values appear to be zero. The answer is that every so often you'll see a line like this:
FEXEC #5:c=1000,e=20960,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621434873334
Surely we didn't spend that much time on a single call. The value comes from the fact that all the time associated with multiple preceding executions has been dumped into this single line.

Now, for 11GR2 to calculate the time with that additional precision requires a heck of a lot more times syscalls. To confirm this I ran strace on processes in both 11GR1 and 11GR2, and here is what I came up with:

11GR1
[oracle@ora11gr1 tmp]$ grep -c times 4279.out
116
11GR2
[oracle@ora11gr2 tmp]$ grep -c times 5318.out
200149
Indeed, 11GR2 ended up calling times twice per iteration, while 11GR1 was doing one call per roughly a thousand iterations. In total, 11GR2 ended up doing roughly 200K more syscalls. Next I became curious about how much overhead all these additional syscalls introduce, and in order to find out I ran a small program:
[oracle@ora11gr2 tmp]$ cat do_times.c
#include <sys/times.h>

int main()
{
    int i;
    struct tms t;
    clock_t c;

    for (i = 1; i <= 200000; i++)
    {
        c = times(&t);
    }

    return 0;
}
[oracle@ora11gr2 tmp]$ gcc do_times.c
[oracle@ora11gr2 tmp]$ time ./a.out

real 0m0.096s
user 0m0.037s
sys 0m0.059s
We're talking about less than 0.1 seconds for 200K times syscalls, in other words: not much.
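Put differently, if we assume (my assumption, for a rough estimate) that the measured real time is dominated by the syscalls themselves, the per-call cost is well under a microsecond:

```python
# Timing of the 200K-iteration times() loop above.
calls = 200_000
real_seconds = 0.096

per_call_microseconds = real_seconds / calls * 1e6
print(round(per_call_microseconds, 2))  # 0.48
```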

It's definitely nice to see all these small improvements being done.

Thursday, August 06, 2009

How to install Oracle Grid Control Agents on a Windows failover cluster with no downtime

Metalink Note 464191.1 describes the steps required to configure an Oracle Grid Control agent in a Windows failover cluster environment. Unfortunately, as part of the configuration, the cluster disk containing the virtual agent's state information has to be moved to the node where the agent is being deployed.

The agent state directory has to "follow" the virtual agent when a failover occurs, hence the requirement for it to be on a cluster disk resource. And since the cluster disk resource is visible on the active node only, you cannot deploy a virtual agent on any of the passive nodes without first moving the group containing the disk with the state information.

The above is not a big deal when you're installing on a brand new or development system, but what if you have to deal with a production cluster, where any potential downtime associated with moving the group across the nodes had better be avoided?

Let's say you have an Oracle FailSafe configuration and you intend to use Oracle Grid Control to monitor your Oracle database. In this case your virtual agent will be a part of the same cluster group where your Oracle database is. Failing over your database across all the nodes for the sake of deploying a virtual agent may not necessarily be what you want to do.

Of course, the easy workaround is to add another disk (LUN) to the cluster, use it to deploy the agents and, once the deployment is done, add it to the same group as your database. But what if you do not have any spare disks and have to share the same cluster disk with your Oracle database?

I gave this problem a bit of research and, as it turned out, there is a really simple workaround which may come in handy if you're ever faced with the same problem.

I'll use the following configuration as an example:
ORA01A -- first (active) node.
ORA01B -- second (passive) node.
ORA01V -- Oracle Database VIP.
c:\oracle\product\10.2.0\agent10g -- Oracle Grid Control agent home.

Let's say that each system has a local drive C: and the deployment will be done on a cluster drive D:.

Deploy virtual agent on the active node

This is where you follow exactly what the Metalink note tells you to do:
C:\>emctl deploy agent -n OracleAgentORA01V d:\agent10g ORA01V:1830 ORA01A:1830
Creating shared install...
Source location: C:\oracle\product\10.2.0\agent10g
Destination (shared install) : d:\agent10g
DeployMode : agent

Creating directories...
Creating targets.xml...
Creating emctl control program...
Creating emtgtctl control program...
Setting log and trace files locations for Agent ...
Secure agent found. New agent should be configured for secure mode

Source Agent operating in secure mode.
Run "d:\agent10g/bin/emctl secure agent" to secure agent
Service "OracleAgentORA01V" create SUCCESS
The above will create a virtual agent service named OracleAgentORA01V which is "bound" to the ORA01V virtual IP and uses d:\agent10g as the location for the virtual agent's state files. Note that I'm using port 1830, since port 3872 is used by the "real" agent. You can specify AgentListenOnAllNICs=FALSE in your emd.properties file (for all agents in the cluster) if you want the virtual and real agents to share the same port, as this stops the agents from trying to listen on all network adapters on the node.

Secure the agent in case your OMS is running in the secure mode:
C:\>d:\agent10g/bin/emctl secure agent
Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.
Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved.
Agent is already stopped... Done.
Securing agent... Started.
Enter Agent Registration Password :
Securing agent... Successful.
Deploy virtual agent on the passive node

The same deployment command won't work on the passive node simply because drive D: is not there. As a workaround that allows us to create the virtual agent service on the passive node, we will use local drive C: for the initial deployment:
C:\>emctl deploy agent -n OracleAgentORA01V c:\agent10g ORA01V:1830 ORA01B:1830
Creating shared install...
Source location: C:\oracle\product\10.2.0\agent10g
Destination (shared install) : c:\agent10g
DeployMode : agent

Creating directories...
Creating targets.xml...
Creating emctl control program...
Creating emtgtctl control program...
Setting log and trace files locations for Agent ...
Secure agent found. New agent should be configured for secure mode

Source Agent operating in secure mode.
Run "c:\agent10g/bin/emctl secure agent" to secure agent
Service "OracleAgentORA01V" create SUCCESS
However, this is not what we want, as all virtual agents should share the same cluster drive D:. To fix the location of the agent state directory, launch regedit.exe and navigate to the
HKLM\SOFTWARE\ORACLE\SYSMAN\OracleAgentORA01V
registry key. Under that key you'll find the EMSTATE value, set to c:\agent10g. Change it to d:\agent10g. You can remove the original c:\agent10g folder as well.

Done! I found that the virtual agent becomes fully operational once the group fails over to the passive node (don't forget to create a cluster resource for the virtual agent) and uses the shared state directory.

Saturday, February 07, 2009

Consistent gets from cache (fastpath) 2

Not so long ago I wrote an article about an interesting optimization in 11G which appears as consistent gets from cache (fastpath). One thing I pointed out there is that this optimization can only operate if we're accessing the same block over and over again. This leads to some interesting observations about how certain types of queries behave in 11G.

HASH GROUP BY vs SORT GROUP BY

Consider the following example:
SQL> create table dept
2 (
3 dept_id number primary key,
4 dept_name varchar2(100)
5 ) organization index;

Table created

SQL> insert /*+ append */ into dept
2 select level, dbms_random.string('x', 100)
3 from dual
4 connect by level <= 10000;

10000 rows inserted

SQL> create table emp
2 (
3 emp_id number primary key,
4 dept_id references dept (dept_id),
5 emp_name varchar2(100)
6 );

Table created

SQL> insert /*+ append */ into emp
2 select level, trunc(dbms_random.value(1, 10000)), dbms_random.string('x', 100)
3 from dual
4 connect by level <= 100000;

100000 rows inserted

SQL> commit;

Commit complete

SQL> exec dbms_stats.gather_table_stats(user, 'dept');

PL/SQL procedure successfully completed

SQL> exec dbms_stats.gather_table_stats(user, 'emp');

PL/SQL procedure successfully completed
Let's say we want to output department names along with how many employees each department has:
SQL> select /*+ gather_plan_statistics */ count(*)
2 from (
3 select /*+ no_merge */ d.dept_name, count(*) cnt
4 from emp e, dept d
5 where e.dept_id=d.dept_id
6 group by d.dept_name
7 );

COUNT(*)
----------
9999

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
SQL_ID djysxbcmwwxj3, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from ( select /*+
no_merge */ d.dept_name, count(*) cnt from emp e, dept d where
e.dept_id=d.dept_id group by d.dept_name )

Plan hash value: 1432452646

----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.08 | 11636 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.08 | 11636 | | | |
| 2 | VIEW | | 1 | 9999 | 9999 |00:00:00.06 | 11636 | | | |
| 3 | HASH GROUP BY | | 1 | 9999 | 9999 |00:00:00.06 | 11636 | 2058K| 999K| 2497K (0)|
| 4 | NESTED LOOPS | | 1 | 9999 | 9999 |00:00:00.02 | 11636 | | | |
| 5 | VIEW | VW_GBC_10 | 1 | 9999 | 9999 |00:00:00.02 | 1635 | | | |
| 6 | HASH GROUP BY | | 1 | 9999 | 9999 |00:00:00.02 | 1635 | 1207K| 1207K| 2496K (0)|
| 7 | TABLE ACCESS FULL| EMP | 1 | 100K| 100K|00:00:00.01 | 1635 | | | |
|* 8 | INDEX UNIQUE SCAN | SYS_IOT_TOP_15648 | 9999 | 1 | 9999 |00:00:00.01 | 10001 | | | |
----------------------------------------------------------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

8 - access("ITEM_1"="D"."DEPT_ID")


27 rows selected.
Note that 11G applied the "Group By Placement" transformation automatically (Jonathan Lewis wrote an article about it) and our query performed 11636 consistent gets.
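For illustration, the optimizer's rewrite is logically equivalent to the following hand-written query (a sketch only; the inline view plays the role the optimizer names VW_GBC_10, and the outer group by corresponds to plan line 3):

```sql
-- A hand-written equivalent of the "Group By Placement" rewrite:
-- aggregate EMP first, then join the much smaller result set to DEPT.
select count(*)
from (
  select d.dept_name, sum(e.cnt) cnt
  from (select dept_id, count(*) cnt
        from emp
        group by dept_id) e,   -- plays the role of VW_GBC_10
       dept d
  where e.dept_id = d.dept_id
  group by d.dept_name
);
```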

Now, take a look at what happens if we rewrite the query to use a sort group by instead:
SQL> select /*+ gather_plan_statistics */ count(*)
2 from (
3 with e as
4 (
5 select dept_id, count(*) cnt
6 from emp e
7 group by dept_id
8 order by dept_id
9 )
10 select /*+ no_merge */ d.dept_name, e.cnt
11 from e, dept d
12 where e.dept_id=d.dept_id
13 );

COUNT(*)
----------
9999

SQL> select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------
SQL_ID 2utq5vammnwa4, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from ( with e as (
select dept_id, count(*) cnt from emp e group by dept_id order
by dept_id ) select /*+ no_merge */ d.dept_name, e.cnt from e, dept
d where e.dept_id=d.dept_id )

Plan hash value: 2732217545

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.10 | 2624 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.10 | 2624 | | | |
| 2 | VIEW | | 1 | 9999 | 9999 |00:00:00.08 | 2624 | | | |
| 3 | NESTED LOOPS | | 1 | 9999 | 9999 |00:00:00.08 | 2624 | | | |
| 4 | VIEW | | 1 | 9999 | 9999 |00:00:00.08 | 1635 | | | |
| 5 | SORT GROUP BY | | 1 | 9999 | 9999 |00:00:00.08 | 1635 | 549K| 549K| 487K (0)|
| 6 | TABLE ACCESS FULL| EMP | 1 | 100K| 100K|00:00:00.01 | 1635 | | | |
|* 7 | INDEX UNIQUE SCAN | SYS_IOT_TOP_15648 | 9999 | 1 | 9999 |00:00:00.01 | 989 | | | |
---------------------------------------------------------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

7 - access("E"."DEPT_ID"="D"."DEPT_ID")


27 rows selected.
What I did there is perform the same transformation myself, but replace the hash group by with a sort group by. Note the more than fourfold decrease in consistent gets. By looking at the plan statistics, you can tell why. Although both group bys performed the same number of consistent gets (1635), it is the nested loops join with dept that made all the difference -- 10001 for the hash group by versus only 989 for the sort group by.

Unless you've heard about the consistent gets from cache (fastpath) optimization, the above results may come as quite a surprise. Note that due to the sorting our second query has to perform, the first query is still faster overall. However, in environments suffering from cache buffers chains (CBC) latch contention, the savings in consistent gets have the potential to outweigh the additional sorting cost. In other words, the cheaper the sort of the outer result set, the more appealing this could be.
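If you want to check whether the savings really come from the short code path against pinned buffers, you can snapshot the session-level statistics before and after each query. A minimal sketch (the statistic names below are from 11g and may vary between versions):

```sql
-- "consistent gets from cache (fastpath)" counts gets that took the
-- short code path; "buffer is pinned count" counts visits to buffers
-- that were kept pinned across nested loops iterations.
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  ms.statistic# = sn.statistic#
and    sn.name in ('consistent gets',
                   'consistent gets from cache (fastpath)',
                   'buffer is pinned count');
```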

Of course, before you even consider this as an optimization opportunity, keep in mind that you would be relying on a specific internal feature, which may tie you to specific dot releases (or even patches), as it may change (or disappear entirely) in future versions.

Sunday, February 01, 2009

AE enqueue

Just a quick follow up from my previous post.

The relationship between the AE enqueue and editions has been mentioned a couple of times around the internet already. Any user session connected to the database holds the AE enqueue in shared mode...
SQL> select type, id1, lmode, sys_context('userenv', 'current_edition_id') edition_id
2 from v$lock
3 where type='AE'
4 and sid=sys_context('userenv', 'sid');

TYPE ID1 LMODE EDITION_ID
---- ---------- ---------- --------------------------------------------------------------------------------
AE 100 4 100
...and ID1 appears to be the session's current_edition_id. I would guess the lock needs to be upgraded to exclusive mode during edition alterations.
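For what it's worth, in releases where editions are exposed to users (11gR2 onwards), the operation that would presumably drive such a lock conversion is the edition switch itself. A hypothetical illustration (not available in 11.1, where the feature is still hidden):

```sql
-- Switch the session to a different edition; presumably the session's
-- shared AE lock is released and re-acquired with the new edition's id1.
alter session set edition = ora$base;

select sys_context('userenv', 'current_edition_name') edition_name
from dual;
```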

Here is another interesting thing -- it looks like installing the 11.1.0.7 patchset increments the current_edition_id. The value in 11.1.0.6 seems to be 99, but 11.1.0.7 changes it to 100 (you can observe this in the sys.edition$ table as well). Does that mean Oracle plans to perform patchset installation through edition-based redefinition (install the patchset online, with a short downtime required only while switching to the upgraded edition), or is it simply a way to represent a version change?

Updated the same day: it looks like if your database was upgraded from a previous release, the edition_id for ORA$BASE will be some other number, as it is simply ORA$BASE's object_id. This also means my initial assumption about the patchset installation changing the edition_id was incorrect: the value is just whatever object_id happens to be available at the time. Before the ORA$BASE edition is created, 11.1.0.7 creates one more object than 11.1.0.6 does (the index I_SYN2), which explains the bump in edition_id.

However, the idea seems to be interesting anyway...

Saturday, January 31, 2009

Edition-based redefinition (a speculation)

As you can see from this link, there is a feature called Edition-based redefinition mentioned in the 11GR1 documentation. Unfortunately, all it says is that the feature is currently unavailable, so there is really not much you can tell about it.

A speculation

I'm usually not a great fan of guesswork; however, in this case it might be an interesting exercise to observe some bits and pieces present in the current release of 11GR1 to see whether they give us clues about what to expect from this new feature and what the potential underpinnings might look like, as well as to prepare for some of the implications...

SYS.OBJ$

Perhaps this is the first thing you'll notice where 11G's data dictionary is concerned. For example, this is how the dba_synonyms view is defined in 10GR2:
create or replace view dba_synonyms
(owner, synonym_name, table_owner, table_name, db_link)
as
select u.name, o.name, s.owner, s.name, s.node
from sys.user$ u, sys.syn$ s, sys.obj$ o
where o.obj# = s.obj#
and o.type# = 5
and o.owner# = u.user#;
Now, take a look at the same view's definition in 11GR1:
create or replace view dba_synonyms
(owner, synonym_name, table_owner, table_name, db_link)
as
select u.name, o.name, s.owner, s.name, s.node
from sys.user$ u, sys.syn$ s, sys."_CURRENT_EDITION_OBJ" o
where o.obj# = s.obj#
and o.type# = 5
and o.owner# = u.user#;
As you can see, the reference to SYS.OBJ$ was replaced with SYS."_CURRENT_EDITION_OBJ". This replacement occurs all over the data dictionary, which alone makes it interesting enough to attract a bit of attention.
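You can get a feel for how pervasive the replacement is by asking the dictionary itself. A rough check (dba_dependencies tracks view-to-view dependencies):

```sql
-- Count dictionary views that now reference the new view instead of obj$.
select count(*)
from   dba_dependencies
where  referenced_owner = 'SYS'
and    referenced_name = '_CURRENT_EDITION_OBJ';
```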

SYS._CURRENT_EDITION_OBJ

What is _CURRENT_EDITION_OBJ? It's a view:
SQL> select object_type
2 from dba_objects
3 where object_name='_CURRENT_EDITION_OBJ';

OBJECT_TYPE
-------------------
VIEW
Let's take a look at this view's definition (I've omitted the column list for the sake of clarity):
select ...
from obj$ o, user$ u
where o.owner# = u.user#
and ( /* non-versionable object */
( o.type# not in (4,5,7,8,9,10,11,12,13,14,22,87,88)
or bitand(u.spare1, 16) = 0)
/* versionable object visible in current edition */
or ( o.type# in (4,5,7,8,9,10,11,12,13,14,22,87)
and ( (u.type# <> 2 and
sys_context('userenv', 'current_edition_name') = 'ORA$BASE')
or (u.type# = 2 and
u.spare2 = sys_context('userenv', 'current_edition_id'))
or exists (select 1 from obj$ o2, user$ u2
where o2.type# = 88
and o2.dataobj# = o.obj#
and o2.owner# = u2.user#
and u2.type# = 2
and u2.spare2 =
sys_context('userenv', 'current_edition_id'))
)
)
);
First of all, it refers to (A) versionable objects and (B) the current edition, which is referenced through the session context sys_context('userenv', 'current_edition_id').

Versionable objects

Only objects of the following types seem to be versionable:
  • VIEW
  • SYNONYM
  • PROCEDURE
  • FUNCTION
  • PACKAGE
  • PACKAGE BODY
  • TRIGGER
  • TYPE
  • TYPE BODY
  • LIBRARY
  • ASSEMBLY
Object type# = 88 seems to be something special, as I couldn't find what it is from sql.bsq (nor dba_objects), and it is referenced in a somewhat special way: if an object's obj# matches the dataobj# of some entry with type# = 88, and that entry is part of the current edition, the object is considered part of the current edition as well. Note that the join itself permits this special object to exist in some other schema, though I don't know whether that has any practical meaning.
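A simple way to hunt for such entries is to query obj$ directly (on a fresh 11.1 install this may well return no rows, since no versioned objects exist yet):

```sql
-- List type# = 88 entries together with their owners and the obj#
-- of the object each points back to via dataobj#.
select u.name owner, o.name, o.obj#, o.dataobj#
from   sys.obj$ o, sys.user$ u
where  o.type# = 88
and    o.owner# = u.user#;
```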

New userenv attributes

There are two (at least) new userenv context attributes:
SQL> select sys_context('userenv', 'current_edition_name')
2 current_edition_name from dual;

CURRENT_EDITION_NAME
--------------------------------------------------------------------------------
ORA$BASE
SQL> select sys_context('userenv', 'current_edition_id')
2 current_edition_id from dual;

CURRENT_EDITION_ID
--------------------------------------------------------------------------------
100
Changing these will change the data returned by the _CURRENT_EDITION_OBJ view as well, and given that this view is referenced instead of obj$, it looks like these context attributes will be the primary mechanism responsible for referencing one version (edition) of an object or another.

New user type

Again, looking at the view definition, we can spot that versionable objects come into play only when user$.type# = 2. As a reminder, 0 is a role and 1 is a regular user, so this is something new; perhaps the feature can be enabled on a per-user level and/or versioned objects will be owned by some other "ghost" schema (my comment about the type# = 88 object seems to allow for this, at least technically).
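Checking which type# values are actually present in user$ on a vanilla 11.1 database is straightforward (expect only 0 and 1 until the feature materializes):

```sql
-- 0 = role, 1 = user; type# = 2 appears reserved for the new,
-- edition-related kind of user$ entry.
select type#, count(*)
from   sys.user$
group  by type#
order  by type#;
```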

Some interesting clues can be gleaned from one of _CURRENT_EDITION_OBJ's column definitions:
when (u.type# = 2) then
(select eo.name from obj$ eo where eo.obj# = u.spare2)
else
'ORA$BASE'
end
The objects owned by this new user type will have the edition_id as their object_id and will be named after the edition name. That also means that object_id is no longer unique:
create unique index i_obj1 on obj$(obj#, owner#, type#) (11GR1)
create unique index i_obj1 on obj$(obj#) (10GR2)
(from sql.bsq)
I wonder whether this has a chance of breaking legacy code that references sys.obj$ by obj# and expects at most one row to be returned...

Does all that, again, point to object versions being kept under this special user type and linked back using data_object_id? Is this also the reason why objects that do have a "proper" data_object_id are not versionable?

A speculation (again)

Please note that everything in this post is based on (somewhat random) observations of preliminary data available in the current release (11.1.0.7). Some (if not all) of it may be altered or become completely invalid by the time this feature is officially announced.

However, one thing seems clear -- in order to support this new feature, some serious alterations to the data dictionary are taking place, and you should keep an eye on them...