Some time ago I was involved in tuning a large-scale production system which experienced quite a lot of latch contention. In fact, the contention was bad enough to render the entire system unusable during peak hours.
A typical AWR report during times with moderate workload looked like this:
Event                        Waits        Time(s)  Avg wait (ms)  % DB time  Wait Class
DB CPU                                     21,805                     36.95
latch: row cache objects     14,524,462    11,552              1      19.57  Concurrency
db file sequential read       2,697,778     8,988              3      15.23  User I/O
gc current block 3-way        4,202,356     3,599              1       6.10  Cluster
gc current block 2-way        3,670,293     2,330              1       3.95  Cluster
(in reality it was a 7-node RAC cluster but it is not really relevant to our subject).
It is not hard to spot latch: row cache objects in second place, consuming almost 20% of DB time. This event escalated rapidly as load increased, quickly bringing the entire cluster to an unusable state as far as any front-end application was concerned. The total number of gets during one hour (the interval covered by this particular AWR report) was 1,281,167,103, which averages out to a whopping 355,879 gets per second!
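The figures above are deltas between AWR snapshots; outside of a report, the same latch can be eyeballed live from v$latch (a minimal sketch against the standard view, per instance; sample it twice over a known interval to turn the counters into a rate):

select name, gets, misses, sleeps
  from v$latch
 where name = 'row cache objects';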
The first step was to take a look at the Dictionary Cache statistics and see if there were any obvious deviations:
Cache                    Get Requests   Pct Miss
dc_awr_control                     72       9.72
dc_database_links             124,452       0.05
dc_files                       46,900      17.14
dc_global_oids                555,599       0.07
dc_histogram_data           8,962,576       1.01
dc_histogram_defs         412,885,094       0.02
dc_object_grants               52,475       1.13
dc_objects                  2,860,222       0.64
dc_profiles                    61,189       0.01
dc_rollback_segments          392,885       0.00
dc_segments                   657,653       7.72
dc_sequences                    1,324      67.67
dc_table_scns                   3,206       3.31
dc_tablespaces              1,253,322       0.06
dc_users                    5,475,824       0.02
global database name           65,092       0.01
kqlsubheap_object              10,910       0.46
outstanding_alerts                721      95.98
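For what it's worth, a similar breakdown can be pulled outside of AWR from v$rowcache (a rough sketch using the standard view; dictionary cache entries may be split across subordinate caches, hence the aggregation, and the miss ratio corresponds to the Pct Miss column above):

select parameter,
       sum(gets)                                              total_gets,
       sum(getmisses)                                         total_misses,
       round(100 * sum(getmisses) / nullif(sum(gets), 0), 2)  pct_miss
  from v$rowcache
 group by parameter
 order by total_gets desc;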
dc_histogram_defs really stands out, but so far provides no direct clues, although the entire dictionary cache seems to be quite busy. Let's take a look at instance activity statistics:
                      Per Second   Per Transaction   Per Exec   Per Call
DB Time(s):                 16.4               0.5       0.00       0.00
DB CPU(s):                   6.1               0.2       0.00       0.00
Redo size:              87,543.5           2,398.6
Logical reads:         336,555.7           9,221.2
Block changes:             449.3              12.3
Physical reads:          1,018.3              27.9
Physical writes:            36.8               1.0
User calls:             41,250.4           1,130.2
Parses:                  6,264.4             171.6
Hard parses:                27.1               0.7
W/A MB processed:   33,769,202.4         925,233.0
Logons:                     18.3               0.5
Executes:               14,230.5             389.9
Rollbacks:                   3.4               0.1
Transactions:               36.5
A couple of things to notice:
- The number of executions is not that high (at least not for an IBM box with 56 CPU cores).
- The Execute to Parse ratio is neither particularly good nor bad (6,264.4 parses per second compared to 14,230.5 executions).
- None of the above seems to be enough to justify 355,879 latch: row cache objects gets per second, even when you consider 27.1 hard parses/sec.
There had to be more going on. Looking at the SQL ordered by Parse Calls section of the report revealed that all the top parsing queries were coming in over a database link, which prompted me to ask additional questions...
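For reference, the same information that AWR shows under SQL ordered by Parse Calls can be pulled from the shared pool directly; a hedged sketch against v$sqlarea (top 20 statements by parse calls) looks something like this:

select *
  from (select parse_calls,
               executions,
               sql_id,
               sql_text
          from v$sqlarea
         order by parse_calls desc)
 where rownum <= 20;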
It turned out that the system had a somewhat weird architecture. A 7-node RAC cluster served as a back end to numerous front-end Oracle databases which essentially acted as PL/SQL engines, accessing all the data through database links while doing all the processing. The architectural feasibility of such a solution is debatable (to say the least), but let's not steer away from the main topic of our discussion.
I set out to find whether there was some special case which could cause latch: row cache objects to shoot through the roof when a lot of queries are executed through a database link. After a number of experiments I eventually found the cause, which explains the title of this blog post.
Take a look at the following example:
SQL> declare
  l_gets number;
begin
  select gets into l_gets from v$latch@dblink where name='row cache objects';

  for i in 1 .. 1000
  loop
    for cur in (select null from dual@dblink) loop null; end loop;
  end loop;

  select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
  dbms_output.put_line(to_char(l_gets));
end;
/
79

PL/SQL procedure successfully completed
Accessing a remote table 1000 times through a database link only caused 79 row cache objects gets on the destination database. However, exactly the same example can produce quite a different result:
SQL> alter session set session_cached_cursors=0;

Session altered

SQL> declare
  l_gets number;
begin
  select gets into l_gets from v$latch@dblink where name='row cache objects';

  for i in 1 .. 1000
  loop
    for cur in (select null from dual@dblink) loop null; end loop;
  end loop;

  select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
  dbms_output.put_line(to_char(l_gets));
end;
/
3082

PL/SQL procedure successfully completed
Now we jumped from 79 gets to 3,082, which is a 39x increase! What's really astonishing is that it is a cursor cache miss on the source which wreaks havoc on the destination. Evidently, the solution to the problem had nothing to do with the database where the problem was observed. The "PL/SQL databases" had an insufficient setting for session_cached_cursors, and once that was corrected the issue disappeared.
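For completeness, this is roughly how one can check whether the session cursor cache is doing its job on the front-end databases and then raise the limit. A sketch only: the value of 200 is an arbitrary example, and the right number depends on the variety of statements the application submits.

-- how many parses are being satisfied from the session cursor cache
select name, value
  from v$sysstat
 where name in ('parse count (total)',
                'session cursor cache hits',
                'session cursor cache count');

-- session_cached_cursors is not system-modifiable online (at least on the
-- versions involved here), so a permanent change goes into the spfile and
-- takes effect after a restart; the value 200 is just an illustration
alter system set session_cached_cursors = 200 scope=spfile;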
It is somewhat ironic that the problem would never have appeared had the data and the PL/SQL been allowed to stay in the same database (where they belong).
Alex,
Did you happen to SQL trace a session that was hanging during the issue time? It would have shown the recursive SQL populating the row cache.
Also, was any stats gathering running on big tables with a lot of partitions during the issue time? Stats gathering on tables/indexes with many partitions tends to first load all histogram data into the row cache before cleaning it out. This could cause contention for loading the histograms of other tables.
regards
srivenu
Hi Alex
Good post. How did you determine that you had to set session_cached_cursors to 0 to reproduce the issue?
Did you compare the init.ora of the source and destination, find this, and then try making them the same?
Thank you
Kumar,
In reality the source systems had session_cached_cursors set to the default value (which in 10g happens to be 50), but I didn't go by comparing initialization parameters anyway. What I did know was that remote executions were hitting row cache objects latches like there was no tomorrow, but I did not know why. It took me some time of what was essentially trial and error to discover the cause. The default value of 50 just happened to be not enough due to the vast number of different queries submitted through the database links. I set session_cached_cursors=0 in my blog post for the sake of demonstrating the issue.
Alex, forgot one more point: is it an HP server?
Srivenu,
you will find the answer in the blog post; I mention that it's an IBM box. But the issue is not platform-related and has nothing to do with the vendor.
A very interesting post, I'll be following your blog from now on. Keep up the good work!
Best regards,
Nikolay