Saturday, July 15, 2017

Exadata 12c PX Adaptive Offloading

Here is yet another case when you may not see as much offloading on your Exadata as you expect.

I was recently investigating a simple select count(*) query producing a lot of buffered read events:
select count(*) from whs.trans partition (sys_p10236650) f

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  cell single block physical read                41        0.00          0.04
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           24        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.06
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.08          1.57
  PX Deq: Signal ACK EXT                         24        0.01          0.01
  PX Deq: Slave Session Stats                    24        0.00          0.00
  cursor: pin S wait on X                        23        0.01          0.18
  PX Deq: Execution Msg                         362        0.25          5.12
  cell multiblock physical read                3402        0.03          9.07
  gc current grant busy                           5        0.00          0.00
  latch: gc element                              60        0.00          0.00
  gc cr grant 2-way                             136        0.00          0.03
  gc cr multi block request                     131        0.00          0.06
  cell smart table scan                        7714        0.02          2.27
  library cache: mutex X                         15        0.00          0.00
  library cache pin                               3        0.00          0.00
  latch: cache buffers lru chain                 13        0.00          0.00
  latch: ges resource hash list                   1        0.00          0.00
  gc current block 2-way                          1        0.00          0.00
  cell list of blocks physical read               2        0.00          0.01
  gc cr grant congested                           2        0.00          0.00
  latch: object queue header operation            2        0.00          0.00
  latch: gcs resource hash                        3        0.00          0.00
********************************************************************************
As you can see we've got 3402 cell multiblock physical read and 41 cell single block physical read alongside 7714 cell smart table scan events. I looked at all the usual smart scan preventing suspects but came up with nothing.

Tracing

In order to further understand the source of buffered reads I've enabled the following traces:
alter session set events 'trace[nsmtio]';
alter session set "_px_trace" = high,granule,high,execution;
What I discovered in the trace was quite interesting. For smart scans the following sections were present in the trace:
2017-07-15 13:54:17.777305*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:1 offload enabled:1 ratio:0.800000 cellrate:0.000000 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[DirectRead]: Buffer Cache Consumption rate of granule is less and direct read enabled.
...
WAIT #140442514780840: nam='cell smart table scan' ela= 208 cellhash#=459971463 p2=0 p3=0 obj#=36324502 tim=13309302766869
WAIT #140442514780840: nam='cell smart table scan' ela= 224 cellhash#=83369134 p2=0 p3=0 obj#=36324502 tim=13309302767638
WAIT #140442514780840: nam='cell smart table scan' ela= 157 cellhash#=3883045144 p2=0 p3=0 obj#=36324502 tim=13309302768329
...
And for buffered reads the trace had the following:
2017-07-15 13:54:18.151520*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:0 offload enabled:1 ratio:0.800000 cellrate:1320.421053 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[GRANULE_AFFINITIZED]: Either size is small OR medium and can be cached.
...
WAIT #140442514780840: nam='cell multiblock physical read' ela= 2057 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303143810
WAIT #140442514780840: nam='cell multiblock physical read' ela= 3162 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303151338
WAIT #140442514780840: nam='cell multiblock physical read' ela= 1383 cellhash#=2689352169 diskhash#=3123344858 bytes=450560 obj#=36324502 tim=13309303156489
...
The above sections were alternating for each granule received. After receiving a granule the entire granule was processed using either smart scans or buffered reads. It also appeared that the decision was driven by the kxfxContAdaptOff (Control Adaptive Offloading?) function. Notice kxfxContAdaptOff emitting cellrate statistic into the trace as well. This got me thinking that it might be checking for storage cells workload (or something similar) and using it as one of the inputs to arrive at the decision which way to process a granule.

PX Adaptive Offloading

With a little bit of outside help kxfxContAdaptOff function lead us to two underscore parameters:
SQL> select ksppinm, ksppstvl, ksppstdfl, ksppdesc
  2   from x$ksppi x, x$ksppcv y
  3   where (x.indx = y.indx)
  4    and ksppinm like '_px_adaptive_offload%';
KSPPINM                         KSPPSTVL KSPPSTDFL KSPPDESC
------------------------------- -------- --------- --------------------------------------------------------------------------------
_px_adaptive_offload_threshold  10       10        threshold (GB/s) for PQ adaptive offloading of granules
_px_adaptive_offload_percentage 80       30        percentage for PQ adaptive offloading of granules
The description of these parameters appeared to match the behavior I was seeing. It appeared that there was a threshold and a percentage controlling how much granules were subjected to adaptive offloading.

_px_adaptive_offload_threshold

I have repeated my test using _px_adaptive_offload_threshold=0 and here are the events I've got:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  library cache pin                               2        0.00          0.00
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           39        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.02
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.23          4.46
  PX Deq: Signal ACK EXT                         24        0.06          0.07
  PX Deq: Slave Session Stats                    24        0.00          0.00
  library cache lock                              1        0.00          0.00
  PX Deq: Execution Msg                         362        0.47         14.54
  cell multiblock physical read               14536        0.04         44.74
  gc current grant busy                          24        0.00          0.01
  gc cr grant 2-way                             626        0.00          0.11
  gc cr multi block request                     553        0.00          0.18
  cell single block physical read              3459        0.02          2.69
  latch: gcs resource hash                       29        0.00          0.01
  latch: gc element                             149        0.00          0.02
  latch: cache buffers lru chain                 42        0.00          0.01
  latch: cache buffers chains                     4        0.00          0.00
  latch: object queue header operation            3        0.00          0.00
  gc cr grant congested                           1        0.00          0.00
********************************************************************************
Notice complete absence of smart scans -- everything got processed using buffered reads! This confirmed that I was on the right track. And here are the events with _px_adaptive_offload_threshold=1000:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: PS - contention                           29        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.03
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.09          1.37
  PX Deq: Signal ACK EXT                         24        0.08          0.09
  PX Deq: Slave Session Stats                    24        0.00          0.00
  PX Deq: Execution Msg                         362        0.45         16.56
  cell multiblock physical read                2018        0.10          7.04
  gc current grant busy                           5        0.00          0.00
  gc cr grant 2-way                              57        0.00          0.00
  gc cr multi block request                      41        0.00          0.01
  cell smart table scan                        8641        0.10          5.23
  cell single block physical read               602        0.02          0.58
  library cache pin                               2        0.00          0.00
  latch: cache buffers lru chain                  2        0.00          0.00
  latch: gc element                               4        0.00          0.00
  latch: object queue header operation            1        0.00          0.00
  library cache: mutex X                          1        0.00          0.00
********************************************************************************
Notice that while I've got more smart scans the buffered reads were still there as well. Interestingly enough I could not fully eliminate buffered reads no matter how high I've set this parameter to. This likely means that there are more variables at play here and/or that _px_adaptive_offload_percentage dictates that a certain amount of granules always gets processed using buffered reads.

I'm not entire sure what _px_adaptive_offload_percentage does at the moment as setting it to one value or the other did not appear to make any changes. While this is a guess on my part the default value (80) matches ratio:0.800000 reported by kxfxContAdaptOff and it appears to be hard coded in the function.

Final thoughts

Clearly 12c got some new code introduce which can dynamically make a decision on how to process each PX granule. The unfortunate part is none of this appears to be documented and it looks like quite a big change which warrants some sort of an explanation.

Updated a little bit later...

It appears that setting _px_adaptive_offload_threshold=0 disables Adaptive Offloading after all. The reason I've got only buffered reads when I set it to 0 was due to the system running with Auto DOP so In-Memory PQ kicked in. When set together with _parallel_cluster_cache_policy=adaptive it reverted everything to the expected behavior -- everything got processed using smart scans.

1 comment:

  1. Mike Eubanks9:14 a.m.

    Excellent evaluation. We are certainly seeing a much larger impact of cell single block reads on performance since moving to 12c.

    ReplyDelete