Sunday, December 19, 2010

Oracle GoldenGate Trail File Size

When it comes to estimating how big your trail files will be, Oracle documentation suggests using the following formula:
[log volume in one hour] x [number of hours downtime] x .4 = trail disk space
It is also described as a conservative estimate, so you are likely to achieve better mileage. What I found is that there are some corner cases which can produce "anomalous" results.
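Plugged into a quick example (with purely hypothetical inputs of 500 MB of redo per hour and 8 hours of downtime to cover), the formula works out like this:

-- hypothetical inputs: 500 MB of redo per hour, 8 hours of downtime
select 500 * 8 * 0.4 estimated_trail_mb   -- = 1600 MB of trail space
  from dual;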

Test Case

I'll start with specially crafted schema and table names, as well as data:
create user uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu identified by "u";

User created.

SQL> grant resource to uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu;

Grant succeeded.

SQL> create table uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
  2  (
  3     n number primary key
  4  );

Table created.

SQL> alter table uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
  2 add supplemental log data (primary key) columns;

Table altered.
I've just created a new trail file which is currently 974 bytes in size:
[oracle@gg1 dirdat]$ ls -l aa000000
-rw-rw-rw- 1 oracle oinstall 974 Dec 19 11:50 aa000000
What I'm going to do is insert some data into the table and then measure how much redo and trail data got generated:
SQL> set autot traceonly stat
SQL> insert into uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt
      select power(10,10)+(level-1)*power(10,10)
        from dual
        connect by level <= 10000;  2    3    4

10000 rows created.


Statistics
----------------------------------------------------------
        0  recursive calls
      329  db block gets
       31  consistent gets
        0  physical reads
     364612  redo size
      821  bytes sent via SQL*Net to client
      917  bytes received via SQL*Net from client
        3  SQL*Net roundtrips to/from client
        2  sorts (memory)
        0  sorts (disk)
      10000  rows processed

SQL> commit;

Commit complete.
Let's take a look at the trail file size:
[oracle@gg1 dirdat]$ ls -l aa000000
-rw-rw-rw- 1 oracle oinstall 1619890 Dec 19 13:40 aa000000
That is roughly 4.4 times bigger than our redo size and 11 times bigger than Oracle's "conservative" estimate. Your storage provisioning might be in for a surprise.
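Just to show where those multipliers come from, here is the arithmetic based on the measured values above (364,612 bytes of redo and a 1,619,890 byte trail file):

-- trail size vs. redo size and vs. the documented [redo] x 0.4 estimate
select round(1619890 / 364612, 1)         trail_to_redo,      -- ~4.4x
       round(1619890 / (364612 * 0.4), 1) trail_to_estimate   -- ~11.1x
  from dual;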

Trail File

To understand the key factors which resulted in our trail file being of such an outrageous size, let's use logdump and take a look inside:
Logdump 11 >open ./dirdat/aa000000
Current LogTrail is /u01/app/oracle/ggs/dirdat/aa000000
Logdump 12 >ghdr on
Logdump 13 >next 2

2010/12/19 13:39:16.632.818 FileHeader           Len   966 RBA 0
Name: *FileHeader*
 3000 01b6 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
 0002 3200 0004 2000 0000 3300 0008 02f1 bdfb 2d3e | ..2... ...3.......->
 74f2 3400 0028 0026 7572 693a 6767 313a 7175 6164 | t.4..(.&uri:gg1:quad
 726f 3a63 6f6d 3a3a 7530 313a 6170 703a 6f72 6163 | ro:com::u01:app:orac
 6c65 3a67 6773 3600 0025 0023 2f75 3031 2f61 7070 | le:ggs6..%.#/u01/app
 2f6f 7261 636c 652f 6767 732f 6469 7264 6174 2f61 | /oracle/ggs/dirdat/a
 6130 3030 3030 3037 0000 0101 3800 0004 0000 0000 | a0000007....8.......

___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :    19  (x0013)   IO Time    : 2010/12/19 13:40:32.000.000
IOType     :     5  (x05)     OrigNode   :   255  (xff)
TransInd   :     .  (x00)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :         45       AuditPos   : 37458516
Continued  :     N  (x00)     RecCount   :     1  (x01)

2010/12/19 13:40:32.000.000 Insert               Len    19 RBA 974
Name: UUUUUUUUUUUUUUUUUUUUUUUUUUUUUU.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
After  Image:                                             Partition 4   G  b
 0000 000f 0000 000b 3130 3030 3030 3030 3030 30   | ........10000000000
The first record is a standard trail file header and is of little interest to us. We're going to take a look at the second record. First of all, as you can see, we've got the fully qualified table name stored in there. We can see the inserted value as well. Now if all that stuff gets stored in plain text, that could explain it... let's check it out:
[oracle@gg1 dirdat]$ grep --binary-files=text -o -i \
> uuuuuuuuuuuuuuuuuuuuuuuuuuuuuu.tttttttttttttttttttttttttttttt \
> aa000000 \
> | wc -l
10000
Indeed, we've got the fully qualified table name appearing ten thousand times in our trail file! The longer your schema and/or table names are, the bigger your trail file will be. The main reason Oracle Streams (as well as Oracle's own redo) uses object identifiers is to avoid getting into exactly this kind of trouble. Storing the names makes GoldenGate configuration easier, as it doesn't have to keep mapping information on the target database, but you pay the price with every row modification that gets captured.
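To put a rough number on that overhead: the 61-character fully qualified name alone, stored once per row, accounts for over a third of the trail file (a back-of-the-envelope figure which ignores record headers and other per-record metadata):

-- 30 + 1 + 30 = 61 bytes of object name per row, 10000 rows,
-- against a 1619890 byte trail file
select 10000 * 61                           name_bytes,     -- 610000
       round(10000 * 61 / 1619890 * 100, 1) pct_of_trail    -- ~37.7%
  from dual;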

The story doesn't end there, however. Let's take a look at the inserted data. How do you think that got stored?
[oracle@gg1 dirdat]$ grep --binary-files=text -o -i \
> 0000000000 \
> aa000000 \
> | wc -l
10000
That's right - all numbers got stored in plain text. The fundamental problem here is that the trail file does not store data types. Everything just gets converted to strings. This helps when dealing with heterogeneous data sources but also makes the storage of certain data types very inefficient. Again, you have to pay the price somewhere. If you gzip the above trail file you'll get a whopping 25x compression ratio.

You may ask yourself how column names are stored. The answer is that there are no column names in the trail file, just column positions. That's why GoldenGate requires the column order to match between the source and target databases, otherwise your data will get mapped to the wrong columns. Add the lack of data types and you might be in for some very interesting results. If the column order doesn't match then you'll have to use defgen to get your data across properly.

In a nutshell, watch out for tables/schemas with long names and lots of numeric data.

Monday, July 19, 2010

ASM Mirroring and Disk Partnership

Let's say you have a RAID10 disk array with 50 disks. I'm offering you the following game. You'll spin a wheel of fortune divided into 50 equal slots with numbers ranging from 1 to 50. After you've spun the wheel for the first time you'll write the resulting number down. After that I'll offer you a spin of a second wheel of fortune, equally divided into 49 slots, where the number you've just pulled is missing (so you can't pull the same number twice).

After that you'll end up with two (mutually exclusive) random numbers in the range from 1 to 50. What I'm going to ask you to do next is pull the disks with the corresponding numbers out of your RAID10 array. What are the odds of your entire array going down?

In a classical RAID10 setup, where every drive is mirrored by a single other drive, you can calculate the probability in a straightforward fashion. After you've pulled the first drive out (it doesn't matter which one) there is only one specific disk out of the remaining 49 which you have to pull in order for the entire array to go bust. So your odds of taking the entire array down are 1/49, or roughly 2%.

Now let's say that instead of a regular RAID10 array you've got a normal redundancy ASM disk group with two failure groups (25 disks each). What are the odds now?

ASM Mirroring

Before we answer the above question we need to realize that ASM does not mirror disks the same way traditional RAID10 does. In fact, it doesn't mirror disks at all. It mirrors extents instead. All you know is that the extents from the disk you've just pulled out won't be mirrored in the same failure group. So that leaves us with 24 disks that are safe. But what about the other 25 disks from the other failure group? How many of these disks are unsafe and will result in your normal redundancy disk group going south?

Disk Partnership


When mirroring extents ASM uses a concept called Disk Partnership. Every disk in a normal redundancy disk group has one or more partners which are used to mirror primary extents from that disk. This also means that the loss of any of the partner disks is fatal to the disk group, as you'll nuke both the primary extent and its mirror copy (keep in mind that we're talking about pulling both disks out before the disk group is able to rebalance). At least now we know what we need next in order to solve the puzzle: we need to find out how many partners each disk in our array has.

Disk Partners


Let's say that the first disk we've pulled out was disk number 0. The following query can be used to find all partners for disk number 0 in the first disk group:

SQL> select p.number_kfdpartner, d.FAILGROUP
        from x$kfdpartner p, v$asm_disk d
        where p.disk=0
                and p.grp=1
                and p.grp=group_number
                and p.number_kfdpartner=d.disk_number;  2    3    4    5    6

NUMBER_KFDPARTNER FAILGROUP
----------------- ------------------------------
               25 FG2
               26 FG2
               27 FG2
               29 FG2
               38 FG2
               46 FG2
               48 FG2
               49 FG2

8 rows selected.
Pulling any of the above disks out at the same time as disk number 0 will be fatal for our normal redundancy disk group. In other words, once we pull the first disk out, there are 8 other disks out of 49 which are unsafe. That brings our odds up (or down, depending on which outcome you're interested in :) to 8/49, or a little bit more than 16%.
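The two probabilities side by side, as a trivial sanity check of the figures above:

-- classic RAID10: exactly 1 fatal disk out of the remaining 49
-- ASM normal redundancy with 8 partners: 8 fatal disks out of 49
select round(1/49 * 100, 1) raid10_pct,   -- ~2%
       round(8/49 * 100, 1) asm_pct       -- ~16.3%
  from dual;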

You can confirm that every disk has exactly 8 partners by running the following query:

SQL> select min(cnt), max(cnt) from (
select number_kfdpartner disk_number, count(*) cnt
        from x$kfdpartner
        where grp=1
        group by number_kfdpartner);  2    3    4    5

  MIN(CNT)   MAX(CNT)
---------- ----------
         8          8

Partner Disk Count

Is there any way to control the number of partner disks which ASM uses for extent mirroring? It turns out that there is. Note that it's a completely unsupported operation, so you shouldn't be playing with it.

The parameter which controls the maximum number of partner disks is called _asm_partner_target_disk_part. In 11GR2 that parameter has a default value of 8. I didn't have a chance to check it in previous versions, but supposedly its default value there is 10(1). So at least we know that Oracle itself sometimes changes it between releases.

All you need to do after changing this parameter is to rebalance the disk group:

SQL> alter system set "_asm_partner_target_disk_part"=2;

System altered.

SQL> alter diskgroup data rebalance;

Diskgroup altered.

SQL> select p.number_kfdpartner, d.FAILGROUP
        from x$kfdpartner p, v$asm_disk d
        where p.disk=0
                and p.grp=1
                and p.grp=group_number
                and p.number_kfdpartner=d.disk_number;  2    3    4    5    6
NUMBER_KFDPARTNER FAILGROUP
----------------- ------------------------------
               25 FG2
               38 FG2
So here it goes!

References
Oracle Automatic Storage Management: Under-the-Hood & Practical Deployment Guide
ASM Metadata and Internals

All tests were performed with Oracle Grid Infrastructure 11.2.0.1

Tuesday, June 08, 2010

11GR2 Result Cache Scalability

Result Cache in 11GR1

Two years ago I wrote a series of posts where I explained some of the dynamics around the Result Cache latch. To recap, the result cache memory in 11GR1 is backed by a single RC latch. That in itself wouldn't be so much of an issue (at least relative to what we've got in reality) had the latch allowed shared mode gets in cases where all you have to do is read from the result cache memory.

Alas, the latch turned out not to support shared mode gets. It almost goes without saying that, as concurrency levels increased, that single latch behaved more and more like a hand brake (link to a test I've done back then on an 8-way Itanium 2).

Back to the future

When 11GR2 was released I knew that at some point I'd need to go back and revisit this subject. What I did was a couple of quick and dirty runs, which came back confirming the same single latch and no shared mode gets, so it didn't look like anything had really changed. At that point I decided to revisit it a bit later. That "bit later" happened just recently.

How bad can it get?

What I wanted to do is take an UltraSPARC T2 and pit it against a Core i7 980X at different concurrency levels in order to see how bad it can get. The T2 requires quite a lot of parallelism in order to keep up with even a single i7 core. But since all we've got is a single RC latch, I expected the T2 to choke on it quite fast: not only would there be a lot of processes competing for the same latch, the slow single-threaded performance would cause the latch to be held for much longer periods of time. Performance degradation would be dire.

Result Cache in 11GR2

I used the same test described here, as it is targeted at exploiting the RC latch weakness and gives me the ability to compare with the old results. I used 250K lookup iterations. Performance was measured as the total number of lookups performed per second, and RC latch statistics were captured for analysis.

Since the 980X has 6 cores and 12 threads, the tests were done with 1 to 12 processes running at the same time, which also gave an opportunity to see how well HT would scale. Note that I plan to do some further testing on the T2 with up to 64 threads, but for now I've tested up to 12 threads only as I couldn't get a big enough test window.
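For reference, the latch statistics shown in the tables below can be captured with a query along these lines (a sketch; the exact latch name varies between versions, hence the LIKE):

-- snapshot the Result Cache latch counters before and after a test run
select name, gets, misses, sleeps, wait_time
  from v$latch
 where name like 'Result Cache%';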

UltraSPARC T2 Results

# of processes   Buffer Cache   % linear   Result Cache   % linear
--------------   ------------   --------   ------------   --------
             1           4426        100           4555        100
             2           8930     100.88           9124     100.15
             3          13465     101.41          13731     100.48
             4          17886     101.03          18179      99.77
             5          22290     100.72          22715      99.74
             6          26615     100.22          27012      98.84
             7          30659      98.96          30804      96.61
             8          34347         97          34910       95.8
             9          38389      96.37          39029       95.2
            10          42772      96.64          43126      94.68
            11          46840      96.21          46936      93.68
            12          50667       95.4          50590      92.55

When I saw these numbers for the first time I was quite surprised at just how good these results are! The UltraSPARC T2 ended up being far from choking and, as a matter of fact, the only position where the Result Cache had to give up is the last one. If you reference the results I obtained on the 8-way Itanium 2 in 11GR1 you'll see that the Result Cache gave up much earlier and scaled a lot worse.

This certainly looks promising, so let's take a look at the RC latch statistics:
# of processes      Gets    Misses   Sleeps   Wait Time
--------------   -------   -------   ------   ---------
             1    500001         0        0           0
             2   1000002     40253        1           0
             3   1500003     50404        0           0
             4   2000004    165116        9         464
             5   2500005    211559        5         182
             6   3000006    437898        8        6877
             7   3500007    805752       52       16556
             8   4000008   1214762       20        2980
             9   4500009   1775372      188        3140
            10   5000010   2244964      491       29568
            11   5500011   2552323      664       28011
            12   6000012   3019903     1226       60005


There is one astonishing fact about the above numbers. Let's get some efficiency metrics in place for a comparison between these numbers and the ones I got in 11GR1. I'll use the data point with eight parallel processes as it's the highest reference point I can get across both data sets.

First of all, the number of gets per execution remained the same and equals two gets per exec. If we calculate % miss per get we get 28.62% in 11GR1 and 50.33% in 11GR2. In other words, roughly every second get request resulted in a miss in 11GR2 and every third in 11GR1. It may appear as if this got worse, but it's really a consequence of something else.

If we calculate % sleep per miss we get 31.36% in 11GR1 but only 0.04% in 11GR2! In other words, the number of times a process had to go to sleep has drastically decreased. In almost all cases the process was able to acquire the latch during a spin without going to sleep. This also explains why % miss per get in 11GR2 went up, and shows that a drop in efficiency on a single metric does not necessarily indicate a problem; it might happen because some other correlated metric has in fact improved.

This is certainly a sign of a great improvement, but what is it? Most likely the improvement is related to optimizing how long the latch needs to be held. The time required to hold the latch became so small that, in most cases, a process is able to acquire it while spinning, before being required to go to sleep (i.e. in less than _spin_count iterations).

Core i7 980X Results

# of processes   Buffer Cache   % linear   Result Cache   % linear
--------------   ------------   --------   ------------   --------
             1          40064        100          43554        100
             2          78989      98.58          84602      97.12
             3         121753      101.3         127768      97.79
             4         159490      99.52         166667      95.67
             5         194704       97.2         204583      93.94
             6         229709      95.56         240770      92.13
             7         231788      82.65         244755      80.28
             8         233918      72.98         246305      70.69
             9         250836      69.57         260718      66.51
            10         267094      66.67         275330      63.22
            11         280326      63.61         290084      60.55
            12         290416      60.41         293830      56.22


Here the Result Cache won across all positions. We need about 10 processes running on the UltraSPARC T2 in order to beat a single process running on the i7 980X. Performance gains declined rapidly once we got over six concurrent processes, but we were still able to realize some additional performance, with 12 threads being about 22% faster than 6 threads.

Latch statistics:

# of processes      Gets    Misses   Sleeps   Wait Time
--------------   -------   -------   ------   ---------
             1    500001         0        0           0
             2   1000002     40456        0           0
             3   1500003    117893        5          71
             4   2000004    209399        0           0
             5   2500005    381160        0           0
             6   3000006    517745       11         179
             7   3500007    913125       20         555
             8   4000008   1355226       26       11914
             9   4500009   1834112       13        1017
            10   5000010   2602801       42        1607
            11   5500011   3196415      145        3451
            12   6000012   3730467      184      123954


Essentially we're looking at the same phenomenon, with the number of sleeps being significantly lower compared to what we observed in 11GR1. With six concurrent processes % miss per get is 17.26% and % sleep per miss is 0.002%! This allowed the Result Cache to stay ahead with up to (and including) 12 concurrent processes running.
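Those two percentages are simple ratios over the table rows; for the six-process data point above:

-- % miss per get and % sleep per miss for the 6-process Core i7 run
select round(517745 / 3000006 * 100, 2) pct_miss_per_get,    -- ~17.26
       round(11 / 517745 * 100, 3)      pct_sleep_per_miss   -- ~0.002
  from dual;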

UltraSPARC T2 vs i7 980X

We'll wrap up with a nice graph showing result cache performance on both UltraSPARC T2 and Core i7 980X:

[Graph: lookups per second versus the number of concurrent processes for the Result Cache on UltraSPARC T2 and Core i7 980X]

The i7 980X starts almost where 12 UltraSPARC T2 processes end. Would the T2 be able to narrow the gap with more parallel threads? I'll certainly find out.

Conclusion

There is an enormous improvement when it comes to Result Cache scalability in 11GR2. It's still slower than if we had shared mode gets (or multiple child latches or, even better, both) but it gets very, very close.

Monday, June 07, 2010

Row cache objects latch contention

A data loading process was running on an UltraSPARC T2 CPU. To take advantage of the platform architecture (or, I'd rather say, to avoid its limitations) the loading process had been designed to run with a massive number of parallel query slaves in order to extract the maximum throughput from the CMT architecture.

Every time this data loading process executed, it experienced strange slowdowns at seemingly random points in time. The performance drops were quite substantial, which prompted additional investigation. Upon a closer examination of ASH data, it turned out that all slowdowns were due to latch: row cache objects contention.

The row cache objects latch protects the dictionary cache. The first thing was to figure out whether most of the contention was contributed by a particular row cache objects child latch:
SQL> select latch#, child#, sleeps
2 from v$latch_children
3 where name='row cache objects'
4 and sleeps > 0
5 order by sleeps desc;

LATCH# CHILD# SLEEPS
---------- ---------- ----------
270 1 24241645
270 5 523
270 4 52
The first child certainly doesn't look good given the number of sleeps it has experienced compared to all the other child latches. Once we have the troublesome child latch identified, we can move on and see which type of dictionary cache it protects:
SQL> select distinct s.kqrstcln latch#,r.cache#,r.parameter name,r.type,r.subordinate#
from v$rowcache r,x$kqrst s
where r.cache#=s.kqrstcid
order by 1,4,5; 2 3 4

LATCH# CACHE# NAME TYPE SUBORDINATE#
------- ------ --------------------------------- ----------- ------------
1 3 dc_rollback_segments PARENT
2 1 dc_free_extents PARENT
3 4 dc_used_extents PARENT
4 2 dc_segments PARENT
5 0 dc_tablespaces PARENT
6 5 dc_tablespace_quotas PARENT
7 6 dc_files PARENT
8 10 dc_users PARENT
8 7 dc_users SUBORDINATE 0
8 7 dc_users SUBORDINATE 1
8 7 dc_users SUBORDINATE 2
9 8 dc_objects PARENT
9 8 dc_object_grants SUBORDINATE 0
10 17 dc_global_oids PARENT
11 12 dc_constraints PARENT
12 13 dc_sequences PARENT
13 16 dc_histogram_defs PARENT
13 16 dc_histogram_data SUBORDINATE 0
13 16 dc_histogram_data SUBORDINATE 1
14 32 kqlsubheap_object PARENT
15 19 dc_table_scns PARENT
15 19 dc_partition_scns SUBORDINATE 0
16 18 dc_outlines PARENT
17 14 dc_profiles PARENT
18 47 realm cache PARENT
18 47 realm auth SUBORDINATE 0
19 48 Command rule cache PARENT
20 49 Realm Object cache PARENT
20 49 Realm Subordinate Cache SUBORDINATE 0
21 46 Rule Set Cache PARENT
22 34 extensible security user and rol PARENT
23 35 extensible security principal pa PARENT
24 37 extensible security UID to princ PARENT
25 36 extensible security principal na PARENT
26 33 extensible security principal ne PARENT
27 38 XS security class privilege PARENT
28 39 extensible security midtier cach PARENT
29 44 event map PARENT
30 45 format PARENT
31 43 audit collector PARENT
32 15 global database name PARENT
33 20 rule_info PARENT
34 21 rule_or_piece PARENT
34 21 rule_fast_operators SUBORDINATE 0
35 23 dc_qmc_ldap_cache_entries PARENT
36 52 qmc_app_cache_entries PARENT
37 53 qmc_app_cache_entries PARENT
38 27 qmtmrcin_cache_entries PARENT
39 28 qmtmrctn_cache_entries PARENT
40 29 qmtmrcip_cache_entries PARENT
41 30 qmtmrctp_cache_entries PARENT
42 31 qmtmrciq_cache_entries PARENT
43 26 qmtmrctq_cache_entries PARENT
44 9 qmrc_cache_entries PARENT
45 50 qmemod_cache_entries PARENT
46 24 outstanding_alerts PARENT
47 22 dc_awr_control PARENT
48 25 SMO rowcache PARENT
49 40 sch_lj_objs PARENT
50 41 sch_lj_oids PARENT

60 rows selected.
The first child protects dc_rollback_segments. We can confirm this by referencing the data in v$rowcache:
SQL> select parameter, gets
2 from v$rowcache
3 order by gets desc;

PARAMETER GETS
-------------------------------- ----------
dc_rollback_segments 310995555
dc_tablespaces 76251831
dc_segments 3912096
dc_users 2307601
dc_objects 1460725
dc_users 608659
dc_histogram_defs 250666
global database name 67475
dc_histogram_data 43098
dc_histogram_data 14364
dc_global_oids 14320
outstanding_alerts 2956
dc_profiles 2555
dc_awr_control 1925
dc_object_grants 745
dc_files 532
dc_constraints 201
sch_lj_oids 158
dc_sequences 156
dc_table_scns 20
sch_lj_objs 18
dc_qmc_ldap_cache_entries 0
qmc_app_cache_entries 0
qmc_app_cache_entries 0
qmtmrcin_cache_entries 0
qmtmrctn_cache_entries 0
qmtmrcip_cache_entries 0
qmtmrctp_cache_entries 0
qmtmrciq_cache_entries 0
qmtmrctq_cache_entries 0
qmrc_cache_entries 0
qmemod_cache_entries 0
SMO rowcache 0
dc_users 0
dc_partition_scns 0
dc_users 0
realm auth 0
Realm Subordinate Cache 0
rule_or_piece 0
rule_info 0
audit collector 0
format 0
event map 0
extensible security midtier cach 0
XS security class privilege 0
extensible security principal ne 0
extensible security principal na 0
extensible security UID to princ 0
extensible security principal pa 0
extensible security user and rol 0
Rule Set Cache 0
Realm Object cache 0
Command rule cache 0
realm cache 0
dc_outlines 0
kqlsubheap_object 0
dc_tablespace_quotas 0
dc_used_extents 0
rule_fast_operators 0
dc_free_extents 0

60 rows selected
The next step is to see whether the latch miss source can give us some more hints regarding the issue:
SQL> select "WHERE", sleep_count, location
2 from v$latch_misses
3 where parent_name='row cache objects'
4 and sleep_count > 0;

WHERE SLEEP_COUNT LOCATION
------------------- ----------- ------------------------------
kqrpre: find obj 20612167 kqrpre: find obj
kqrpup 7 kqrpup
kqrcmt: while loop 1 kqrcmt: while loop
kqrcmt: clear flag 1 kqrcmt: clear flag
kqreqd 1026837 kqreqd
kqreqd: reget 2602576 kqreqd: reget

6 rows selected
Now if you take kqrpre: find obj and plug it into a search on My Oracle Support you'll quickly find Bug 5749075 High Requests on dc_rollback_segments. Among other things, this note points at an unusually high number of undo segments being created because cleanup is not able to work properly...
SQL> select count(*) from dba_rollback_segs;

COUNT(*)
----------
14838
...and this seems to be the case. The only difference is that the issue was observed on 11GR2 while the bug was filed against older versions, though it was still worth checking in case we were seeing a regression. Indeed, after getting rid of that many undo segments by simply recreating the undo tablespace, the issue, though not completely vanished, manifested itself a lot less, making its impact relatively insignificant to the process throughput.
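Recreating the undo tablespace is a simple sequence along these lines (a sketch only; the tablespace names and datafile sizing are placeholders and assume Oracle Managed Files):

-- create a fresh undo tablespace, switch to it and drop the bloated one
create undo tablespace undotbs2 datafile size 4g autoextend on;
alter system set undo_tablespace = undotbs2;
-- once no active transactions reference the old undo segments:
drop tablespace undotbs1 including contents and datafiles;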

Wednesday, May 12, 2010

The case of a slow lookup

Take a look at the runtime difference for the same anonymous PL/SQL block between 11GR1 and 11GR2:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:31.03

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:46.23
11GR2 ended up being about 50% slower.

A little background

I've been running some tests and the numbers I was getting back just weren't making any sense. I was using 11GR2, but all I was getting were marginally better results compared to the tests I had done a couple of years ago using 11GR1 on much less powerful hardware. Something definitely didn't look right, so I geared up to find out what it was. The above test case is something I arrived at which is simple but demonstrates the issue. It was run using the same hardware, so the only difference was the release of the Oracle database being used.

tkprof results

To see whether there were any low hanging fruits, I traced the same loop with 250K iterations (to cut down a bit on the trace file size):

11GR1
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 249995 2.51 2.24 0 0 0 0
Fetch 249995 1.54 1.68 0 499990 0 249995
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 499991 4.06 3.92 0 499990 0 249995

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 54 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 61973)

11GR2
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 249999 4.07 4.06 0 0 0 0
Fetch 249999 2.62 2.67 0 499998 0 249999
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 499999 6.69 6.74 0 499998 0 249999

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 57 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 69265)

These statistics look almost identical but with one big difference: 11GR2 is about 70% slower.

What's going on?

Certainly you wouldn't expect the newest release of the Oracle database to be so much slower than the previous version. As a first step I traced both loops, which resulted in discovering some timing improvements I blogged about just recently. But the additional overhead introduced by the times syscalls was not even remotely enough to justify the difference. As a next step I compared session statistics between 11GR1 and 11GR2, which led to the discovery of the session cached cursors overflow issue. Apart from that, the only difference was a significantly greater amount of CPU time consumed by 11GR2 (as evident from the trace files as well).
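The session statistics comparison itself can be as simple as snapshotting v$mystat before and after a run; a minimal sketch (the statistics listed are just the ones of interest here):

-- a few statistics of interest for the current session
select n.name, s.value
  from v$mystat s, v$statname n
 where s.statistic# = n.statistic#
   and n.name in ('CPU used by this session',
                  'session cursor cache hits',
                  'session cursor cache count');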

Now what?

Certainly some more diagnostics were required. What I did was take a couple of stack dumps while the loop was running to see whether there would be something unusual... and there was something indeed:
[oracle@ora11gr2 trace]$ pstack 1052
#0 0x0000003b83c99247 in times () from /lib64/libc.so.6
#1 0x00000000085556c7 in sltrgatime64 ()
#2 0x0000000005521a44 in auddft ()
#3 0x0000000008438ee3 in audStatement ()

#4 0x00000000083fc299 in opiexe ()
#5 0x0000000008405f57 in opipls ()
#6 0x00000000083f78b9 in opiodr ()
#7 0x00000000084892af in __PGOSF141_rpidrus ()
#8 0x00000000085ee820 in skgmstack ()
#9 0x000000000848a759 in rpiswu2 ()
#10 0x0000000008489e9d in rpidrv ()
#11 0x0000000008476438 in psddr0 ()
#12 0x0000000008476048 in psdnal ()
#13 0x000000000288b3aa in pevm_BFTCHC ()
#14 0x000000000287ab84 in pfrinstr_BFTCHC ()
#15 0x0000000008645105 in pfrrun_no_tool ()
#16 0x00000000086439ea in pfrrun ()
#17 0x000000000864a7d6 in plsql_run ()
#18 0x0000000002870d07 in peicnt ()
#19 0x0000000002867649 in kkxexe ()
#20 0x00000000083fe337 in opiexe ()
#21 0x00000000019da690 in kpoal8 ()
#22 0x00000000083f78b9 in opiodr ()
#23 0x0000000008583187 in ttcpip ()
#24 0x0000000001702661 in opitsk ()
#25 0x0000000001707282 in opiino ()
#26 0x00000000083f78b9 in opiodr ()
#27 0x00000000016fe708 in opidrv ()
#28 0x0000000001b7183f in sou2o ()
#29 0x00000000009d3f05 in opimai_real ()
#30 0x0000000001b76ace in ssthrdmain ()
#31 0x00000000009d3e71 in main ()
Look at lines #2 and #3. Are these some auditing functions getting in the way? I verified the audit_trail parameter, which turned out to be set to DB:
SQL> show parameter audit_trail

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
audit_trail string DB

I didn't have any auditing enabled for any of the statements, let alone for the table I was selecting from. But after I confirmed that audit_trail was set to NONE on the 11GR1 database, disabling auditing completely looked like something worth trying.
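For completeness, confirming that no statement or object level audit options were in effect, and then turning the audit trail off, looks roughly like this:

-- no statement or object level auditing configured
select count(*) from dba_stmt_audit_opts;
select count(*) from dba_obj_audit_opts;

-- audit_trail is a static parameter, hence scope=spfile and an instance bounce
alter system set audit_trail = none scope = spfile;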

After I repeated the test with audit_trail set to NONE, I ended up getting virtually identical results to 11GR1:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:32.42
...and the two audit functions have disappeared from the stack dump as well.

It looks like the impact of having auditing enabled ended up being high enough to draw some attention in this particular case, even though it wasn't set to capture anything about our test. On a side note, this can be considered an invitation to instrument the auditing functionality as well.

Tuesday, May 11, 2010

Timing improvements in Oracle 11GR2 trace

I've been doing some research recently (I might blog about it a bit later if something worthwhile comes out of it) which resulted in a byproduct I found interesting to blog about.

In a nutshell, in one of the steps I've run the following PL/SQL block to trace the results of a very tight loop:
begin
dbms_monitor.session_trace_enable(waits => false, binds => false);
for i in 1 .. 100000
loop
for cur in (select null from dual)
loop
null;
end loop;
end loop;
dbms_monitor.session_trace_disable;
end;

While comparing raw trace files between 11.1.0.7 and 11.2.0.1 I've noticed something interesting. Here is an excerpt from 11.1.0.7 raw trace:
PARSING IN CURSOR #2 len=21 dep=1 uid=54 oct=3 lid=54 tim=1273621434711518
hv=2339989301 ad='86044a88' sqlid='bg2ng0u5rkttp'
SELECT NULL FROM DUAL
END OF STMT
EXEC #2:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621434711518
FETCH #2:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=1,og=1,plh=1388734953,tim=1273621434711518
CLOSE #2:c=0,e=0,dep=1,type=3,tim=1273621434711518
Notice how tim= values are the same across parse, exec, fetch and close; and e= values are zero for fetch, close and execute.

Now take a look at a similar excerpt from 11GR2:
PARSING IN CURSOR #3 len=21 dep=1 uid=57 oct=3 lid=57 tim=1273621440328909
hv=2339989301 ad='844d5a20' sqlid='bg2ng0u5rkttp'
SELECT NULL FROM DUAL
END OF STMT
EXEC #3:c=0,e=9,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621440328908
FETCH #3:c=0,e=6,p=0,cr=0,cu=0,mis=0,r=1,dep=1,og=1,plh=1388734953,tim=1273621440328947
CLOSE #3:c=0,e=1,dep=1,type=3,tim=1273621440328968
Notice how tim= value changes from line to line and e= has a corresponding value associated with it.

Before we move on you may ask how 11.1.0.7 calculates the time if all the values appear to be zero. The answer is that every so often you'll see a line like this:
FETCH #5:c=1000,e=20960,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=1388734953,tim=1273621434873334
Surely we didn't spend that much time fetching a single row. This value comes from the fact that all the time associated with multiple executions just got dumped into this single line.

Now, for 11GR2 to calculate the time with that additional precision requires a heck of a lot more times syscalls. To confirm this I ran strace against the processes in both 11GR1 and 11GR2 and here is what I came up with:

11GR1
[oracle@ora11gr1 tmp]$ grep -c times 4279.out
116
11GR2
[oracle@ora11gr2 tmp]$ grep -c times 5318.out
200149
Indeed, 11GR2 ended up calling times twice per iteration while 11GR1 was doing about one call per thousand iterations. In total 11GR2 ended up doing roughly 200K more syscalls. Next I became curious about how much overhead all these additional syscalls introduce, and in order to find out I ran a small program:
[oracle@ora11gr2 tmp]$ cat do_times.c
#include <sys/times.h>

int main()
{
int i;
struct tms t;
clock_t c;
for (i=1;i<=200000;i++)
{
c=times(&t);
}

return 0;
}
[oracle@ora11gr2 tmp]$ gcc do_times.c
[oracle@ora11gr2 tmp]$ time ./a.out

real 0m0.096s
user 0m0.037s
sys 0m0.059s
We're talking about a tenth of a second for 200K times syscalls, in other words - not much.
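Using the 'real' figure above, that works out to well under a microsecond per call:

-- ~0.096s of wall clock time spread over 200,000 times() calls
select round(0.096 / 200000 * 1000000, 2) usec_per_call   -- ~0.48 microseconds
  from dual;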

It's definitely nice to see all these small improvements being done.