call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     5001      0.53      53.71          0          0          0      500000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     5003      0.53      53.71          0          0          0      500000
...
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  latch free                                      1       53.20         53.20
What's unusual about it is a single latch wait for 53 seconds without consuming any CPU time. This is something different from the usual get -> miss -> spin -> sleep latch acquisition algorithm we are generally used to.
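To picture that usual algorithm, here is a small C sketch under obvious assumptions: try_get_latch() is a hypothetical stand-in for an atomic test-and-set, 2000 mirrors the default _spin_count, and the back-off values are made up. The point to notice is that the sleep step is purely timer-driven:

/* Illustrative sketch of the usual get -> miss -> spin -> sleep loop;
 * try_get_latch() and the constants are hypothetical stand-ins. */
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

static atomic_flag latch = ATOMIC_FLAG_INIT;

/* hypothetical stand-in for a single immediate latch-get attempt */
static bool try_get_latch(void)
{
    return !atomic_flag_test_and_set(&latch);
}

void get_latch(void)
{
    long sleep_ns = 1000000;                  /* 1 ms initial back-off */
    for (;;) {
        if (try_get_latch())                  /* get */
            return;
        for (int i = 0; i < 2000; i++)        /* miss -> spin (_spin_count) */
            if (try_get_latch())
                return;
        struct timespec ts = { 0, sleep_ns }; /* sleep: purely time-based, */
        nanosleep(&ts, NULL);                 /* nobody posts us           */
        if (sleep_ns < 100000000)             /* exponential back-off, capped */
            sleep_ns *= 2;
    }
}

A 53-second wait with zero CPU doesn't fit this picture at all: spinning would burn CPU, and timed sleeps would show up as many short waits rather than one long one.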
Sequence of events
To remind you, the sequence of events goes like this:
- First session: run select /*+ result_cache */ * from t, which places a single big result into the Result Cache memory.
- First session: run select count(*) from v$result_cache_memory, which grabs the Result Cache latch for a long time.
- Second session: run select /*+ result_cache */ * from t, which forces us to wait until the Result Cache latch is freed up by the first session.
Let's take a look at v$latch right after we execute the above three steps:
SQL> select gets, misses, sleeps
  2    from v$latch
  3   where name='Result Cache: Latch';

      GETS     MISSES     SLEEPS
---------- ---------- ----------
     53494          1          1

As we can see, the session indeed went to sleep after a single miss. The first thing that may come to your mind is that I'm running this on a single-CPU box, where we don't spin if we miss a latch (due to _spin_count=1 being the default on single-CPU boxes). This is not the case:
SQL> show parameter cpu_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2

And my DB's _spin_count has a default value of 2000 as well.
Let's look a bit further:
SQL> select location, sleep_count, wtr_slp_count, longhold_count
  2    from v$latch_misses
  3   where parent_name='Result Cache: Latch'
  4     and sleep_count+wtr_slp_count+longhold_count > 0;

LOCATION                       SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT
------------------------------ ----------- ------------- --------------
Result Cache: Serialization12            0             1              0
Result Cache: Serialization16            1             0              1

Note that the LONGHOLD_COUNT column equals 1 for Result Cache: Serialization16. That means we've held the latch for the entire duration of someone else's sleep, which, again, conforms to our observation. But what is the underlying mechanism for this?
Under the hood
Here is an excerpt from our second session's strace output:
...
getrusage(RUSAGE_SELF, {ru_utime={0, 728889}, ru_stime={1, 317799}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={0, 728889}, ru_stime={1, 317799}, ...}) = 0
semop(98304, 0x7fbfff5f58, 1) = 0 -- enters the sleep here
times({tms_utime=72, tms_stime=131, tms_cutime=0, tms_cstime=0}) = 429496789
getrusage(RUSAGE_SELF, {ru_utime={0, 729889}, ru_stime={1, 317799}, ...}) = 0
...
And the first session:

...
mmap(0x2a972e7000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 9, 0) = 0x2a972e7000
mmap(0x2a972f7000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 9, 0) = 0x2a972f7000
semctl(98304, 31, SETVAL, 0x7f00000001) = 0 -- issued upon completion, posting the waiter
getrusage(RUSAGE_SELF, {ru_utime={206, 80670}, ru_stime={2, 686591}, ...}) = 0
getrusage(RUSAGE_SELF, {ru_utime={206, 81670}, ru_stime={2, 686591}, ...}) = 0
...

In other words:
- Second session: upon discovering that the latch is busy, it goes to sleep by issuing a
  semop(98304, 0x7fbfff5f58, 1) call.
- First session: upon completion, it issues
  semctl(98304, 31, SETVAL, 0x7f00000001), which wakes our second session up.
Though we do sleep under the normal latch acquisition algorithm as well, that sleep is based on time, not on the latch holder posting us.
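If you want to see the bare mechanism in isolation, here is a minimal sketch using the same System V semaphore calls (the key, the semaphore number, and the two-second hold are, of course, made up; Oracle's actual post-wait layer is more involved):

/* Minimal post-wait sketch: the waiter blocks inside semop() consuming
 * no CPU, and the holder wakes it with semctl(SETVAL) once it's done.
 * Key, semaphore number and timings are made up for illustration. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

union semun { int val; };                   /* as required by semctl() */

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (fork() == 0) {
        /* "second session": latch is busy, block until posted --
         * the analogue of semop(98304, 0x7fbfff5f58, 1) above */
        struct sembuf wait_op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
        semop(semid, &wait_op, 1);          /* sleeps here, zero CPU */
        printf("waiter: posted by the latch holder\n");
        _exit(0);
    }

    sleep(2);                               /* "first session" holds the latch */

    union semun arg = { .val = 1 };         /* post the waiter -- the analogue   */
    semctl(semid, 0, SETVAL, arg);          /* of semctl(98304, 31, SETVAL, ...) */

    wait(NULL);
    semctl(semid, 0, IPC_RMID);             /* clean up the semaphore set */
    return 0;
}

Strace the child process and you'll see a single long semop() with no CPU consumed in between, which is exactly the profile of the 53-second latch free wait above.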
Which algorithm?
The interesting question to answer is how the waiter determines which algorithm to use. It could be (A) a special "longhold" mode get for the latch, or (B) the waiter checking how long the latch has already been held. So far I lean towards (A) (a special get mode or a separate flag somewhere), because if we execute the above two sessions concurrently we always get an unusually large number of sleeps, even if we place only a single one-row result into the Result Cache memory (thus making sure there are no significant delays selecting from v$result_cache_memory).
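To make the two options concrete, the waiter-side decision could look something like the following; every name here is purely hypothetical, invented only to illustrate the difference:

/* Purely hypothetical sketch of the two candidate waiter strategies;
 * none of these names exist in Oracle. */
typedef struct {
    int  longhold_flag;   /* (A): set by a special "longhold" mode get  */
    long held_since;      /* (B): timestamp of when the latch was taken */
} latch_t;

enum wait_mode { TIMED_SLEEP, POST_WAIT };

/* Hypothesis (A): the holder flags the get itself; the waiter just reads it */
enum wait_mode choose_wait_a(const latch_t *l)
{
    return l->longhold_flag ? POST_WAIT : TIMED_SLEEP;
}

/* Hypothesis (B): the waiter measures how long the latch has been held */
enum wait_mode choose_wait_b(const latch_t *l, long now, long threshold)
{
    return (now - l->held_since > threshold) ? POST_WAIT : TIMED_SLEEP;
}

The observation that sleeps go straight to post-wait even when v$result_cache_memory returns instantly is what points towards (A): under (B), a near-zero hold time should never cross the threshold.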
Maybe some of you could share other ideas as well.
Not sure why my pingback didn't appear, but here's a reply post:
http://blog.tanelpoder.com/2009/01/20/reliable-latch-waits-and-a-new-blog/
Tanel, thanks for investigating this further. Much appreciated!
I've fixed the trackback thing, thanks.
My trackback did not appear either.
The waits you discovered are common:
http://andreynikolaev.wordpress.com/2010/12/16/hidden-latch-wait-revolution-that-we-missed/