Saturday, July 15, 2017

Exadata 12c PX Adaptive Offloading

Here is yet another case where you may not see as much offloading on your Exadata as you expect.

I was recently investigating a simple select count(*) query producing a lot of buffered read events:
select count(*) from whs.trans partition (sys_p10236650) f

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  cell single block physical read                41        0.00          0.04
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           24        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.06
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.08          1.57
  PX Deq: Signal ACK EXT                         24        0.01          0.01
  PX Deq: Slave Session Stats                    24        0.00          0.00
  cursor: pin S wait on X                        23        0.01          0.18
  PX Deq: Execution Msg                         362        0.25          5.12
  cell multiblock physical read                3402        0.03          9.07
  gc current grant busy                           5        0.00          0.00
  latch: gc element                              60        0.00          0.00
  gc cr grant 2-way                             136        0.00          0.03
  gc cr multi block request                     131        0.00          0.06
  cell smart table scan                        7714        0.02          2.27
  library cache: mutex X                         15        0.00          0.00
  library cache pin                               3        0.00          0.00
  latch: cache buffers lru chain                 13        0.00          0.00
  latch: ges resource hash list                   1        0.00          0.00
  gc current block 2-way                          1        0.00          0.00
  cell list of blocks physical read               2        0.00          0.01
  gc cr grant congested                           2        0.00          0.00
  latch: object queue header operation            2        0.00          0.00
  latch: gcs resource hash                        3        0.00          0.00
********************************************************************************
As you can see we've got 3402 cell multiblock physical read and 41 cell single block physical read events alongside 7714 cell smart table scan events. I looked at all the usual smart-scan-preventing suspects but came up with nothing.
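One way to quantify how much of a scan was actually offloaded is to compare the bytes eligible for predicate offload with the bytes returned by smart scan. A quick sanity-check query against the current session (statistic names as they appear on Exadata systems):

```sql
-- Compare offload-eligible bytes vs. bytes actually returned by smart scan
-- for the current session; a large gap hints at buffered/passthrough reads.
select n.name, s.value
  from v$mystat s, v$statname n
 where s.statistic# = n.statistic#
   and n.name in ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan');
```

In my case these counters confirmed that only part of the segment was being processed via smart scan, which is what prompted the tracing below.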

Tracing

To further understand the source of the buffered reads I enabled the following traces:
alter session set events 'trace[nsmtio]';
alter session set "_px_trace" = high,granule,high,execution;
What I discovered in the trace was quite interesting. For smart scans the following sections were present in the trace:
2017-07-15 13:54:17.777305*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:1 offload enabled:1 ratio:0.800000 cellrate:0.000000 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[DirectRead]: Buffer Cache Consumption rate of granule is less and direct read enabled.
...
WAIT #140442514780840: nam='cell smart table scan' ela= 208 cellhash#=459971463 p2=0 p3=0 obj#=36324502 tim=13309302766869
WAIT #140442514780840: nam='cell smart table scan' ela= 224 cellhash#=83369134 p2=0 p3=0 obj#=36324502 tim=13309302767638
WAIT #140442514780840: nam='cell smart table scan' ela= 157 cellhash#=3883045144 p2=0 p3=0 obj#=36324502 tim=13309302768329
...
And for buffered reads the trace had the following:
2017-07-15 13:54:18.151520*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:0 offload enabled:1 ratio:0.800000 cellrate:1320.421053 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[GRANULE_AFFINITIZED]: Either size is small OR medium and can be cached.
...
WAIT #140442514780840: nam='cell multiblock physical read' ela= 2057 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303143810
WAIT #140442514780840: nam='cell multiblock physical read' ela= 3162 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303151338
WAIT #140442514780840: nam='cell multiblock physical read' ela= 1383 cellhash#=2689352169 diskhash#=3123344858 bytes=450560 obj#=36324502 tim=13309303156489
...
The above sections alternated for each granule received: after a granule was received, the entire granule was processed using either smart scans or buffered reads. The decision appeared to be driven by the kxfxContAdaptOff (Control Adaptive Offloading?) function. Notice that kxfxContAdaptOff also emits a cellrate statistic into the trace. This got me thinking that it might be measuring the storage cell workload (or something similar) and using it as one of the inputs when deciding how to process a granule.

PX Adaptive Offloading

With a little bit of outside help the kxfxContAdaptOff function led us to two underscore parameters:
SQL> select ksppinm, ksppstvl, ksppstdfl, ksppdesc
  2   from x$ksppi x, x$ksppcv y
  3   where (x.indx = y.indx)
  4    and ksppinm like '_px_adaptive_offload%';
KSPPINM                         KSPPSTVL KSPPSTDFL KSPPDESC
------------------------------- -------- --------- --------------------------------------------------------------------------------
_px_adaptive_offload_threshold  10       10        threshold (GB/s) for PQ adaptive offloading of granules
_px_adaptive_offload_percentage 80       30        percentage for PQ adaptive offloading of granules
The descriptions of these parameters appeared to match the behavior I was seeing: a threshold and a percentage control how many granules are subjected to adaptive offloading.
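For experimentation, both parameters can be changed at the session level. These are unsupported underscore parameters, so this is strictly for test systems (ideally under Oracle Support guidance):

```sql
-- Unsupported underscore parameters; test systems only.
-- Note the double quotes required around names starting with "_".
alter session set "_px_adaptive_offload_threshold" = 0;    -- threshold (GB/s)
alter session set "_px_adaptive_offload_percentage" = 80;  -- percentage
```

This is how I varied the threshold for the tests below.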

_px_adaptive_offload_threshold

I repeated my test with _px_adaptive_offload_threshold=0 and here are the events I got:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  library cache pin                               2        0.00          0.00
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           39        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.02
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.23          4.46
  PX Deq: Signal ACK EXT                         24        0.06          0.07
  PX Deq: Slave Session Stats                    24        0.00          0.00
  library cache lock                              1        0.00          0.00
  PX Deq: Execution Msg                         362        0.47         14.54
  cell multiblock physical read               14536        0.04         44.74
  gc current grant busy                          24        0.00          0.01
  gc cr grant 2-way                             626        0.00          0.11
  gc cr multi block request                     553        0.00          0.18
  cell single block physical read              3459        0.02          2.69
  latch: gcs resource hash                       29        0.00          0.01
  latch: gc element                             149        0.00          0.02
  latch: cache buffers lru chain                 42        0.00          0.01
  latch: cache buffers chains                     4        0.00          0.00
  latch: object queue header operation            3        0.00          0.00
  gc cr grant congested                           1        0.00          0.00
********************************************************************************
Notice the complete absence of smart scans -- everything was processed using buffered reads! This confirmed that I was on the right track. And here are the events with _px_adaptive_offload_threshold=1000:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: PS - contention                           29        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.03
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.09          1.37
  PX Deq: Signal ACK EXT                         24        0.08          0.09
  PX Deq: Slave Session Stats                    24        0.00          0.00
  PX Deq: Execution Msg                         362        0.45         16.56
  cell multiblock physical read                2018        0.10          7.04
  gc current grant busy                           5        0.00          0.00
  gc cr grant 2-way                              57        0.00          0.00
  gc cr multi block request                      41        0.00          0.01
  cell smart table scan                        8641        0.10          5.23
  cell single block physical read               602        0.02          0.58
  library cache pin                               2        0.00          0.00
  latch: cache buffers lru chain                  2        0.00          0.00
  latch: gc element                               4        0.00          0.00
  latch: object queue header operation            1        0.00          0.00
  library cache: mutex X                          1        0.00          0.00
********************************************************************************
Notice that while I got more smart scans, the buffered reads were still there as well. Interestingly enough, I could not fully eliminate buffered reads no matter how high I set this parameter. This likely means that there are more variables at play here and/or that _px_adaptive_offload_percentage dictates that a certain share of granules always gets processed using buffered reads.

I'm not entirely sure what _px_adaptive_offload_percentage does at the moment, as setting it to different values did not appear to make any difference. While this is a guess on my part, the parameter's value on this system (80) matches the ratio:0.800000 reported by kxfxContAdaptOff, which appears to be hard coded in the function.

Final thoughts

Clearly 12c introduced some new code that can dynamically decide how to process each PX granule. The unfortunate part is that none of this appears to be documented, and it looks like quite a big change that warrants some sort of explanation.

Updated a little bit later...

It appears that setting _px_adaptive_offload_threshold=0 disables Adaptive Offloading after all. The reason I got only buffered reads when I set it to 0 was that the system was running with Auto DOP, so In-Memory PQ kicked in. Setting it together with _parallel_cluster_cache_policy=adaptive reverted everything to the expected behavior -- all granules were processed using smart scans.
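For reference, the combination that restored full offloading in my environment was (again, unsupported underscore parameters -- test systems only):

```sql
-- Disable PX adaptive offloading and keep In-Memory PQ out of the picture
alter session set "_px_adaptive_offload_threshold" = 0;
alter session set "_parallel_cluster_cache_policy" = adaptive;
```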

Friday, July 07, 2017

12.2 ACFS compression, part I

One of the new 12.2 features is the ability to transparently compress ACFS filesystems.

Compression is enabled using the acfsutil utility:
[root@raca1 ~]# acfsutil -h compress on
Usage: acfsutil [-h] compress on [-a <algorithm>] <mount_point>
        [-a <algorithm>]      - Set default compression algorithm
                                Currently only 'lzo' available
        <mount_point>         - Enable compression on volume
Clearly there is scope for more compression algorithms to be added in the future, but right now only lzo is supported.

Let's go ahead and enable compression on the ACFS filesystem I have:
[root@raca1 ~]# acfsutil compress on /u02/oradata
Compression status can be checked using acfsutil info fs:
[root@raca1 ~]# acfsutil info fs /u02/oradata
/u02/oradata
    ACFS Version: 12.2.0.1.0
    on-disk version:       46.0
    compatible.advm:       12.2.0.0.0
    ACFS compatibility:    12.2.0.0.0
    flags:        MountPoint,Available
    mount time:   Fri Jul  7 12:53:39 2017
    mount sequence number: 0
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   8589934592  (   8.00 GB )
    total free:   6935588864  (   6.46 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/data-95
        label:
        state:                 Available
        major, minor:          250, 48641
        logical sector size:   4096
        size:                  8589934592  (   8.00 GB )
        free:                  6935588864  (   6.46 GB )
        metadata read I/O count:         23833
        metadata write I/O count:        39103
        total metadata bytes read:       257896448  ( 245.95 MB )
        total metadata bytes written:    421969920  ( 402.42 MB )
        ADVM diskgroup:        DATA
        ADVM resize increment: 67108864
        ADVM redundancy:       unprotected
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: ENABLED
As a quick test I created a new empty database inside the ACFS filesystem. We can now check compressed file sizes:
[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/sysaux01.dbf
Compression Unit size: 8192
Disk storage used:   (  88.70 MB )
Disk storage saved:  ( 461.31 MB )
Storage used is 16% of what the uncompressed file would use.
File is not scheduled for asynchronous compression.
[root@raca1 ora12cr2]# ls -l /u02/oradata/ora12cr2/sysaux01.dbf
-rw-r----- 1 oracle dba 576724992 Jul  7 13:45 /u02/oradata/ora12cr2/sysaux01.dbf
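The numbers are self-consistent: the file is 576724992 bytes (about 550 MB) according to ls, which matches the 88.70 MB used plus 461.31 MB saved, and the used portion works out to the reported 16%:

```sql
-- Cross-checking acfsutil's "16%" figure from the reported MB values
select round(100 * 88.70 / (88.70 + 461.31), 1) as pct_used from dual;  -- ~16.1
```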
It is worth noting that not all file types are compressed. For example, redo logs are left uncompressed:
[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/redo01.log
The file /u02/oradata/ora12cr2/redo01.log is not compressed.
So far so good -- the feature indeed allows you to save some space by compressing the files. I will continue exploring this capability in the next part.

Monday, May 22, 2017

Oracle SuperCluster M7 SLOB LIO Tests vs Intel Xeon E5-2699 V4

Here are some SLOB LIO figures from a DB zone configured with 16 threads running on Oracle SuperCluster M7 hardware. For comparison I've also included numbers from an Intel Xeon E5-2699 V4 CPU.

It makes sense to mention that this is not exactly a fair comparison: a single SPARC M7 core has 8 threads associated with it, so my zone is able to utilize a total of two SPARC M7 cores (16 threads total with 8 threads per core). The E5-2699 V4 is currently Intel's top model when it comes to core count, packing 22 cores. So we're comparing two SPARC M7 cores vs 16 E5-2699 cores. It does, however, help answer the question: if you're running on a certain number of Intel cores, what kind of performance can you expect when you move your workload "as is" over to a heavily threaded M7?

Below are the results:



The first thing worth mentioning is that the M7 still exhibits a large deficit when it comes to single-threaded performance -- a single SPARC M7 thread delivers about 60% of the performance of an E5-2699 V4 core (which is not even a top-bin CPU when it comes to frequency). Throughput is a different story -- two M7 cores are almost able to match four E5-2699 V4 cores thanks to the heavily threaded design.