Two years ago I wrote a series of posts explaining some of the dynamics around the Result Cache latch. To recap, the result cache memory in 11gR1 is protected by a single RC latch. That in itself wouldn't be much of an issue (at least relative to what we actually got) had the latch allowed shared mode gets when all you need to do is read from the result cache memory.
Alas, the latch turned out to have no shared mode gets. It almost goes without saying that, as concurrency levels increased, that single latch behaved more and more like a hand brake (link to a test I did back then on an 8-way Itanium 2).
Back to the future
When 11gR2 was released I knew that at some point I would need to go back and revisit this subject. What I did was a couple of quick and dirty runs, which came back confirming the same single latch and no shared mode gets, so it didn't look like anything had really changed. At that point I decided to revisit it a bit later. That "a bit later" happened just recently.
How bad can it get?
What I wanted to do is get an UltraSPARC T2 and pit it against a Core i7 980X at different concurrency levels in order to see how bad it can get. The T2 requires quite a lot of parallelism just to keep up with a single i7 core. But since all we've got is a single RC latch, I expected the T2 to choke on it quite fast: not only would there be a lot of processes competing for the same latch, the slow single-threaded performance would also cause the latch to be held for much longer periods of time. The performance degradation would be dire.
Result Cache in 11gR2
I used the same test described here, as it is targeted at exploiting the RC latch weakness and lets me compare with the old results. I used 250K lookup iterations. Performance was measured as the total number of lookups performed per second, and RC latch statistics were captured for analysis.
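The actual test script is in the linked post rather than reproduced here; for readers who just want the shape of it, the sketch below is an illustrative approximation only (the table T and its ID/PAD columns are made-up names), assuming a small, pre-created lookup table:

-- Illustrative sketch only: T, ID and PAD are hypothetical names,
-- not the actual script from the linked post.
DECLARE
  l_pad VARCHAR2(100);
BEGIN
  FOR i IN 1 .. 250000 LOOP
    -- after the first execution every lookup is answered from the result cache,
    -- so the run spends most of its time acquiring the RC latch
    SELECT /*+ RESULT_CACHE */ pad
      INTO l_pad
      FROM t
     WHERE id = 1;
  END LOOP;
END;
/

-- RC latch statistics can be captured from v$latch around each run, e.g.:
SELECT gets, misses, sleeps, wait_time
  FROM v$latch
 WHERE name LIKE 'Result Cache%';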
Since the 980X has 6 cores and 12 threads, the tests were done with 1 to 12 processes running at the same time, which also gave an opportunity to see how well HT would scale. Note that I plan to do some further testing on the T2 with up to 64 threads, but for now I've tested up to 12 threads only, as I couldn't get a big enough test window.
UltraSPARC T2 Results
# of processes | Buffer Cache (lookups/sec) | % linear | Result Cache (lookups/sec) | % linear |
---|---|---|---|---|
1 | 4426 | 100 | 4555 | 100 |
2 | 8930 | 100.88 | 9124 | 100.15 |
3 | 13465 | 101.41 | 13731 | 100.48 |
4 | 17886 | 101.03 | 18179 | 99.77 |
5 | 22290 | 100.72 | 22715 | 99.74 |
6 | 26615 | 100.22 | 27012 | 98.84 |
7 | 30659 | 98.96 | 30804 | 96.61 |
8 | 34347 | 97 | 34910 | 95.8 |
9 | 38389 | 96.37 | 39029 | 95.2 |
10 | 42772 | 96.64 | 43126 | 94.68 |
11 | 46840 | 96.21 | 46936 | 93.68 |
12 | 50667 | 95.4 | 50590 | 92.55 |
This certainly looks promising, so let's take a look at the RC latch statistics:
# of processes | Gets | Misses | Sleeps | Wait Time (μs) |
---|---|---|---|---|
1 | 500001 | 0 | 0 | 0 |
2 | 1000002 | 40253 | 1 | 0 |
3 | 1500003 | 50404 | 0 | 0 |
4 | 2000004 | 165116 | 9 | 464 |
5 | 2500005 | 211559 | 5 | 182 |
6 | 3000006 | 437898 | 8 | 6877 |
7 | 3500007 | 805752 | 52 | 16556 |
8 | 4000008 | 1214762 | 20 | 2980 |
9 | 4500009 | 1775372 | 188 | 3140 |
10 | 5000010 | 2244964 | 491 | 29568 |
11 | 5500011 | 2552323 | 664 | 28011 |
12 | 6000012 | 3019903 | 1226 | 60005 |
There is one astonishing fact about the above numbers. Let's put some efficiency metrics in place to compare these numbers with the ones I got in 11gR1. I'll use the data point with eight parallel processes, as it's the highest reference point I can get across both data sets.
First of all, the number of gets per execution remained the same and equals two gets per exec. If we calculate % miss per get, we get 28.62% in 11gR1 and 50.33% in 11gR2. In other words, roughly every second get request resulted in a miss in 11gR2 and every third in 11gR1. It may appear as if this got worse, but it's really a consequence of something else.
If we calculate % sleep per miss, we get 31.36% in 11gR1 but only 0.04% in 11gR2! In other words, the number of times a process had to go to sleep has decreased drastically. In almost all cases the process was able to acquire the latch during a spin without going to sleep. This also explains why % miss per get in 11gR2 went up, and shows that a drop in efficiency for a single metric does not necessarily indicate a problem; it might happen because some other correlated metric has in fact improved.
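For completeness, both ratios are simple derivations from the cumulative v$latch counters shown above (in practice you would diff two snapshots taken around the run); a query along these lines produces them directly:

-- % miss per get and % sleep per miss for the result cache latch,
-- computed from cumulative v$latch counters.
SELECT name,
       ROUND(100 * misses / NULLIF(gets, 0), 2)   AS miss_per_get_pct,
       ROUND(100 * sleeps / NULLIF(misses, 0), 2) AS sleep_per_miss_pct
  FROM v$latch
 WHERE name LIKE 'Result Cache%';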
This is certainly a sign of a great improvement, but what is it? Most likely the improvement is related to reducing how long the latch needs to be held. The hold time became so small that, in most cases, a process is able to acquire the latch while spinning, before being required to go to sleep (i.e. within _spin_count iterations).
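If you want to see the spin limit on your own system, the hidden _spin_count parameter can be read with the usual x$ksppi/x$ksppcv join (as SYS); this is just a convenience query, not something the test depends on:

-- Read the hidden _spin_count parameter (connect as SYS).
SELECT n.ksppinm  AS parameter,
       v.ksppstvl AS value
  FROM x$ksppi  n,
       x$ksppcv v
 WHERE n.indx = v.indx
   AND n.ksppinm = '_spin_count';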
Core i7 980X Results
# of processes | Buffer Cache (lookups/sec) | % linear | Result Cache (lookups/sec) | % linear |
---|---|---|---|---|
1 | 40064 | 100 | 43554 | 100 |
2 | 78989 | 98.58 | 84602 | 97.12 |
3 | 121753 | 101.3 | 127768 | 97.79 |
4 | 159490 | 99.52 | 166667 | 95.67 |
5 | 194704 | 97.2 | 204583 | 93.94 |
6 | 229709 | 95.56 | 240770 | 92.13 |
7 | 231788 | 82.65 | 244755 | 80.28 |
8 | 233918 | 72.98 | 246305 | 70.69 |
9 | 250836 | 69.57 | 260718 | 66.51 |
10 | 267094 | 66.67 | 275330 | 63.22 |
11 | 280326 | 63.61 | 290084 | 60.55 |
12 | 290416 | 60.41 | 293830 | 56.22 |
Here the Result Cache won across all the positions. We need about 10 processes running on the UltraSPARC T2 in order to beat a single process running on the i7 980X. Performance gains declined rapidly once we got past six concurrent processes, but we were still able to realize some additional performance, with 12 threads being about 22% faster than 6 threads.
Latch statistics:
# of processes | Gets | Misses | Sleeps | Wait Time (μs) |
---|---|---|---|---|
1 | 500001 | 0 | 0 | 0 |
2 | 1000002 | 40456 | 0 | 0 |
3 | 1500003 | 117893 | 5 | 71 |
4 | 2000004 | 209399 | 0 | 0 |
5 | 2500005 | 381160 | 0 | 0 |
6 | 3000006 | 517745 | 11 | 179 |
7 | 3500007 | 913125 | 20 | 555 |
8 | 4000008 | 1355226 | 26 | 11914 |
9 | 4500009 | 1834112 | 13 | 1017 |
10 | 5000010 | 2602801 | 42 | 1607 |
11 | 5500011 | 3196415 | 145 | 3451 |
12 | 6000012 | 3730467 | 184 | 123954 |
Essentially we're looking at the same phenomenon, with the number of sleeps being significantly lower compared to what we observed in 11gR1. With six concurrent processes, % miss per get is 17.26% and % sleep per miss is 0.002%! This allowed the Result Cache to stay ahead with up to (and including) 12 concurrent processes running.
UltraSPARC T2 vs i7 980X
We'll wrap up with a nice graph showing result cache performance on both UltraSPARC T2 and Core i7 980X:
The i7 980X starts almost where 12 UltraSPARC T2 processes end. Would the T2 be able to narrow the gap with more parallel threads? I'll certainly find out.
Conclusion
There is an enormous improvement when it comes to Result Cache scalability in 11gR2. It is still slower than it would be with shared mode gets (or multiple child latches or, even better, both), but it gets very, very close.
Result cache, real cool feature in 11g!
Your blog was recommended by my Google Reader, and I must say I really like this post!!
Keep up the technical content!!!
Regards,
Magnus Fagertun
hi Alex,
Loved your post. Really liked your older one too. I need to know if I can run these tests from an application where it's using Hibernate for ORM.
Also, can you share the scripts you used for your test runs at hemkant.c@gmail.com?
Regards
Hemkant Chavan
As far as I know (and was able to try), Oracle gets the RC latch in shared mode while a process is selecting data in 11gR2 (at least in 11.2.0.3).
Regards
Pavol Babel
Yes you are correct -- I did another post on the subject later (http://afatkulin.blogspot.ca/2012/05/result-cache-latch-in-11gr2-shared-mode.html)
I found that blog after I had written my post. Using "oradebug call kslgetsl_w" is a nice trick, however it was not usable for me since I'm mainly on AIX and Solaris64 (AIX never permitted using oradebug call with function names, and the Solaris 64-bit port introduced a bug where oradebug call corrupts input arguments). I've used my dummy old script, however Tanel's latchprofx would be much smarter nowadays...
SELECT /*+ ORDERED USE_NL(lh) */
gen1.sample_id, t1.hsecs, t2.hsecs, lh.*
FROM (SELECT hsecs FROM v$timer) t1,
(SELECT rownum sample_id
FROM DUAL
CONNECT BY LEVEL <= 1000) gen1,
X$KSUPRLAT lh,
(SELECT hsecs FROM v$timer) t2
/
One more note: it seems Oracle grabs the "RC latch" only once per fetch (aka RC cache access), at least in 11.2.0.3 again.
The RC latch in shared mode is a great improvement, however shared latch contention can have a great impact, too. If you rewrote your test case using RESULT_CACHE at the PL/SQL function level, the scalability would be under 10% on most platforms (since there is no context switch between PL/SQL and SQL, each iteration is much faster; on the other hand, shared latch contention is much higher).
Regards
Pavol Babel
OCM 10g/11g
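For readers who want to try the variation Pavol describes above, here is a minimal sketch of a result-cached PL/SQL function driving the same kind of lookup loop; the table T, its columns and the function name are illustrative only, not taken from the original test case:

-- Hypothetical example: T, ID, PAD and GET_PAD are made-up names.
CREATE OR REPLACE FUNCTION get_pad(p_id IN NUMBER)
  RETURN VARCHAR2
  RESULT_CACHE
IS
  l_pad VARCHAR2(100);
BEGIN
  SELECT pad INTO l_pad FROM t WHERE id = p_id;
  RETURN l_pad;
END;
/

-- Driving the loop through the function removes the per-iteration
-- SQL/PLSQL context switch, so the latch is hit at a much higher rate.
DECLARE
  l_pad VARCHAR2(100);
BEGIN
  FOR i IN 1 .. 250000 LOOP
    l_pad := get_pad(1);
  END LOOP;
END;
/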