Wednesday, May 12, 2010

The case of a slow lookup

Take a look at the runtime difference for the same anonymous PL/SQL block between 11GR1 and 11GR2:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:31.03

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:46.23
11GR2 end up being about 50% slower.

A little background

I've been running some tests and the numbers I was getting back just weren't making any sense. I've been using 11GR2 but all I was getting were marginally better results compared to the tests I've done a couple of years ago using 11GR1 on a much less powerful hardware. Something definitely didn't look right so I geared up to find out what it is. The above test case is something I arrived at, which is simple but demonstrates the issue. This test case was run using the same hardware so the only difference was release of Oracle database being used.

tkprof results

To see whether there are any low hanging fruits I've traced the same loop with 250K iterations (to cut a bit on a trace file size):

11GR1
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 249995 2.51 2.24 0 0 0 0
Fetch 249995 1.54 1.68 0 499990 0 249995
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 499991 4.06 3.92 0 499990 0 249995

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 54 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 61973)

11GR2
SQL ID: 85f6s9ndbynfc
Plan Hash: 3532178995
SELECT N
FROM
Z_T WHERE N=MOD(:B1 ,10000)+1


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 249999 4.07 4.06 0 0 0 0
Fetch 249999 2.62 2.67 0 499998 0 249999
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 499999 6.69 6.74 0 499998 0 249999

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 57 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 INDEX UNIQUE SCAN PK_Z_T (cr=2 pr=0 pw=0 time=0 us cost=1 size=13 card=1)(object id 69265)

These statistics looks almost identical but with one big difference. 11GR2 is about 70% slower.

What's going on?

Certainly you wouldn't expect the newest release of Oracle database to be so much slower compared to the previous version. As a first step I've traced both loops which resulted in discovering some timing improvements which I've blogged about just recently. But the additional overhead introduced by times syscalls was not even remotely enough to justify the difference. As a next step I've compared session statistics between 11GR1 and 11GR2 which lead to session cached cursors overflow issue discovery. Apart from that the only difference was significantly greater amount of CPU time consumed by 11GR2 (as evident by the trace file as well).

Now what?

Certainly some more diagnostic were required. What I did is a couple of stack dumps while the loop was running to see whether there will be something unusual... and there were something indeed:
[oracle@ora11gr2 trace]$ pstack 1052
#0 0x0000003b83c99247 in times () from /lib64/libc.so.6
#1 0x00000000085556c7 in sltrgatime64 ()
#2 0x0000000005521a44 in auddft ()
#3 0x0000000008438ee3 in audStatement ()

#4 0x00000000083fc299 in opiexe ()
#5 0x0000000008405f57 in opipls ()
#6 0x00000000083f78b9 in opiodr ()
#7 0x00000000084892af in __PGOSF141_rpidrus ()
#8 0x00000000085ee820 in skgmstack ()
#9 0x000000000848a759 in rpiswu2 ()
#10 0x0000000008489e9d in rpidrv ()
#11 0x0000000008476438 in psddr0 ()
#12 0x0000000008476048 in psdnal ()
#13 0x000000000288b3aa in pevm_BFTCHC ()
#14 0x000000000287ab84 in pfrinstr_BFTCHC ()
#15 0x0000000008645105 in pfrrun_no_tool ()
#16 0x00000000086439ea in pfrrun ()
#17 0x000000000864a7d6 in plsql_run ()
#18 0x0000000002870d07 in peicnt ()
#19 0x0000000002867649 in kkxexe ()
#20 0x00000000083fe337 in opiexe ()
#21 0x00000000019da690 in kpoal8 ()
#22 0x00000000083f78b9 in opiodr ()
#23 0x0000000008583187 in ttcpip ()
#24 0x0000000001702661 in opitsk ()
#25 0x0000000001707282 in opiino ()
#26 0x00000000083f78b9 in opiodr ()
#27 0x00000000016fe708 in opidrv ()
#28 0x0000000001b7183f in sou2o ()
#29 0x00000000009d3f05 in opimai_real ()
#30 0x0000000001b76ace in ssthrdmain ()
#31 0x00000000009d3e71 in main ()
Look at the lines #2 and #3. Are these some auditing functions getting in the way? I verified audit_trail parameter which turned out to be sat to DB:
SQL> show parameter audit_trail

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
audit_trail string DB

Though I didn't have any auditing enabled for any of the statements let alone the table I was selecting from. But after I've confirmed that audit_trail was set to NONE on 11GR1 database, disabling audit completely looked like something worth trying.

After I've repeated the test with audit_trail sat to NONE, I end up getting virtually identical results compared to 11GR1:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set timing on
SQL> begin
for i in 1 .. 1000000
loop
for cur in (select n from z_t where n=mod(i,10000)+1)
loop
null;
end loop;
end loop;
end; 2 3 4 5 6 7 8 9
10 /

PL/SQL procedure successfully completed.

Elapsed: 00:00:32.42
...and two audit functions has disappeared from the stack dump as well.

It looks like the impact of having auditing enabled end up being high enough to draw some attention in this particular case even though it wasn't sat to capture anything about our test. On a side note this can be considered an invitation to instrument some auditing functionality as well.

1 comment:

  1. Anonymous4:30 AM

    Bug 9182633: PERFORMANCE REGRESSION IN 11.2

    ReplyDelete