I was recently investigating a simple select count(*) query producing a lot of buffered read events:
select count(*) from whs.trans partition (sys_p10236650) f

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  cell single block physical read                41        0.00          0.04
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           24        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.06
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.08          1.57
  PX Deq: Signal ACK EXT                         24        0.01          0.01
  PX Deq: Slave Session Stats                    24        0.00          0.00
  cursor: pin S wait on X                        23        0.01          0.18
  PX Deq: Execution Msg                         362        0.25          5.12
  cell multiblock physical read                3402        0.03          9.07
  gc current grant busy                           5        0.00          0.00
  latch: gc element                              60        0.00          0.00
  gc cr grant 2-way                             136        0.00          0.03
  gc cr multi block request                     131        0.00          0.06
  cell smart table scan                        7714        0.02          2.27
  library cache: mutex X                         15        0.00          0.00
  library cache pin                               3        0.00          0.00
  latch: cache buffers lru chain                 13        0.00          0.00
  latch: ges resource hash list                   1        0.00          0.00
  gc current block 2-way                          1        0.00          0.00
  cell list of blocks physical read               2        0.00          0.01
  gc cr grant congested                           2        0.00          0.00
  latch: object queue header operation            2        0.00          0.00
  latch: gcs resource hash                        3        0.00          0.00
********************************************************************************

As you can see we've got 3402 cell multiblock physical read and 41 cell single block physical read events alongside 7714 cell smart table scan events. I looked at all the usual smart scan preventing suspects but came up with nothing.
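As a side note (this is not from the original trace output, just a sanity check worth mentioning), the offload-related columns in v$sql are a quick way to confirm that the cursor is at least eligible for offload; the sql_id below is a placeholder for the actual count(*) statement:

select sql_id,
       io_cell_offload_eligible_bytes,
       io_cell_offload_returned_bytes,
       io_interconnect_bytes
  from v$sql
 where sql_id = '&sql_id';  -- substitute the real sql_id here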
Tracing
In order to further understand the source of buffered reads I've enabled the following traces:
alter session set events 'trace[nsmtio]';
alter session set "_px_trace" = high,granule,high,execution;

What I discovered in the trace was quite interesting. For smart scans the following sections were present in the trace:
2017-07-15 13:54:17.777305*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule(): Receiving 1 granules
...
kxfxContAdaptOff: offload granule:1 offload enabled:1 ratio:0.800000 cellrate:0.000000 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[DirectRead]: Buffer Cache Consumption rate of granule is less and direct read enabled.
...
WAIT #140442514780840: nam='cell smart table scan' ela= 208 cellhash#=459971463 p2=0 p3=0 obj#=36324502 tim=13309302766869
WAIT #140442514780840: nam='cell smart table scan' ela= 224 cellhash#=83369134 p2=0 p3=0 obj#=36324502 tim=13309302767638
WAIT #140442514780840: nam='cell smart table scan' ela= 157 cellhash#=3883045144 p2=0 p3=0 obj#=36324502 tim=13309302768329
...

And for buffered reads the trace had the following:
2017-07-15 13:54:18.151520*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule(): Receiving 1 granules
...
kxfxContAdaptOff: offload granule:0 offload enabled:1 ratio:0.800000 cellrate:1320.421053 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[GRANULE_AFFINITIZED]: Either size is small OR medium and can be cached.
...
WAIT #140442514780840: nam='cell multiblock physical read' ela= 2057 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303143810
WAIT #140442514780840: nam='cell multiblock physical read' ela= 3162 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303151338
WAIT #140442514780840: nam='cell multiblock physical read' ela= 1383 cellhash#=2689352169 diskhash#=3123344858 bytes=450560 obj#=36324502 tim=13309303156489
...

The above sections were alternating for each granule received: after a granule was received, the entire granule was processed using either smart scans or buffered reads. The decision appeared to be driven by the kxfxContAdaptOff (Control Adaptive Offloading?) function. Notice that kxfxContAdaptOff emits a cellrate statistic into the trace as well. This got me thinking that it might be checking the storage cell workload (or something similar) and using it as one of the inputs when deciding how to process each granule.
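In case you want to reproduce the tracing yourself, the nsmtio and _px_trace output ends up in the regular trace files; their location can be pulled from v$diag_info (keep in mind that with PX the query coordinator and each slave write their own trace file under the same directory):

select name, value
  from v$diag_info
 where name in ('Diag Trace', 'Default Trace File');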
PX Adaptive Offloading
With a little bit of outside help the kxfxContAdaptOff function led us to two underscore parameters:
SQL> select ksppinm, ksppstvl, ksppstdfl, ksppdesc
  2    from x$ksppi x, x$ksppcv y
  3   where (x.indx = y.indx)
  4     and ksppinm like '_px_adaptive_offload%';

KSPPINM                         KSPPSTVL KSPPSTDFL KSPPDESC
------------------------------- -------- --------- --------------------------------------------------------------------------------
_px_adaptive_offload_threshold  10       10        threshold (GB/s) for PQ adaptive offloading of granules
_px_adaptive_offload_percentage 80       30        percentage for PQ adaptive offloading of granules

The description of these parameters appeared to match the behavior I was seeing. It appeared that there was a threshold and a percentage controlling how many granules were subjected to adaptive offloading.
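For the tests below I simply changed the threshold and re-ran the same count(*) query. At the session level that would look something like the following (the usual caveats about setting underscore parameters apply):

alter session set "_px_adaptive_offload_threshold" = 0;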
_px_adaptive_offload_threshold
I have repeated my test using _px_adaptive_offload_threshold=0 and here are the events I've got:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  library cache pin                               2        0.00          0.00
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           39        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.02
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.23          4.46
  PX Deq: Signal ACK EXT                         24        0.06          0.07
  PX Deq: Slave Session Stats                    24        0.00          0.00
  library cache lock                              1        0.00          0.00
  PX Deq: Execution Msg                         362        0.47         14.54
  cell multiblock physical read               14536        0.04         44.74
  gc current grant busy                          24        0.00          0.01
  gc cr grant 2-way                             626        0.00          0.11
  gc cr multi block request                     553        0.00          0.18
  cell single block physical read              3459        0.02          2.69
  latch: gcs resource hash                       29        0.00          0.01
  latch: gc element                             149        0.00          0.02
  latch: cache buffers lru chain                 42        0.00          0.01
  latch: cache buffers chains                     4        0.00          0.00
  latch: object queue header operation            3        0.00          0.00
  gc cr grant congested                           1        0.00          0.00
********************************************************************************

Notice the complete absence of smart scans -- everything got processed using buffered reads! This confirmed that I was on the right track. And here are the events with _px_adaptive_offload_threshold=1000:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: PS - contention                           29        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.03
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.09          1.37
  PX Deq: Signal ACK EXT                         24        0.08          0.09
  PX Deq: Slave Session Stats                    24        0.00          0.00
  PX Deq: Execution Msg                         362        0.45         16.56
  cell multiblock physical read                2018        0.10          7.04
  gc current grant busy                           5        0.00          0.00
  gc cr grant 2-way                              57        0.00          0.00
  gc cr multi block request                      41        0.00          0.01
  cell smart table scan                        8641        0.10          5.23
  cell single block physical read               602        0.02          0.58
  library cache pin                               2        0.00          0.00
  latch: cache buffers lru chain                  2        0.00          0.00
  latch: gc element                               4        0.00          0.00
  latch: object queue header operation            1        0.00          0.00
  library cache: mutex X                          1        0.00          0.00
********************************************************************************

Notice that while I got more smart scans, the buffered reads were still there as well. Interestingly enough, I could not fully eliminate buffered reads no matter how high I set this parameter. This likely means that there are more variables at play here and/or that _px_adaptive_offload_percentage dictates that a certain percentage of granules always gets processed using buffered reads.
I'm not entirely sure what _px_adaptive_offload_percentage does at the moment, as setting it to one value or another did not appear to make any difference. While this is a guess on my part, the value reported by the query above (80) matches the ratio:0.800000 emitted by kxfxContAdaptOff, and the ratio appears to be hard coded in the function.
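On a related note, the smart scan versus buffered read mix can also be quantified from session statistics instead of wait event counts -- something along these lines run in the query coordinator session right after the query completes (the PX slave statistics should get rolled up into the coordinator's session, which is what the PX Deq: Slave Session Stats waits above are about):

select n.name, s.value
  from v$mystat s, v$statname n
 where s.statistic# = n.statistic#
   and n.name in ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan',
                  'physical read total bytes');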
Final thoughts
Clearly 12c introduced some new code which can dynamically decide how to process each PX granule. The unfortunate part is that none of this appears to be documented, and it looks like quite a big change which warrants some sort of explanation.
Updated a little bit later...
It appears that setting _px_adaptive_offload_threshold=0 disables Adaptive Offloading after all. The reason I got only buffered reads when I set it to 0 was that the system was running with Auto DOP, so In-Memory PQ kicked in. When set together with _parallel_cluster_cache_policy=adaptive the query reverted to the expected behavior -- everything got processed using smart scans.
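In other words, to actually get consistent smart scans on this system the combination looked roughly like this (a session-level sketch, with the same underscore parameter caveats as before):

alter session set "_px_adaptive_offload_threshold" = 0;
alter session set "_parallel_cluster_cache_policy" = adaptive;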