RAC local parallel query internals
RAC tuning tips | January 10, 2015
One of the greatest benefits of running parallel SQL statements in an Oracle RAC environment is the ability to distribute the parallel slaves over multiple nodes. However, doing so incurs more overhead than processing all of the parallel slaves on the local node. Sometimes a SQL statement executes more quickly with multiple nodes at work, and sometimes it executes more quickly on a single node, avoiding that extra overhead. Every SQL statement is different, as is the system configuration it runs on, so parallel SQL statements should be tested to determine whether single-node or multi-node execution works best in the specific environment. You should also consider the impact of multiple parallel processes all running on the same node, where they compete with other processes for the same resources.
The parallel_force_local initialization parameter controls whether parallel SQL in an Oracle RAC environment can be distributed over nodes other than the local, originating host. The default value for this parameter is FALSE, and it is a best practice to leave it at the default. If this parameter is set to TRUE, all parallel operations run only on the local node. If this parameter does need to be set, it is best to do so at the session level with the following command.
alter session set parallel_force_local=true;
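To verify where the slaves actually land, one option (a sketch, assuming you can query the GV$ views from a second session while the parallel statement is running) is gv$px_session:
-- Sketch: count the PX slaves on each instance for each query
-- coordinator. server_group is null for the QC row itself, so it is
-- filtered out; qcinst_id is the coordinator's instance.
select inst_id,
       qcinst_id,
       count(*) as px_slaves
from   gv$px_session
where  server_group is not null
group  by inst_id, qcinst_id
order  by inst_id;
With parallel_force_local set to TRUE, every row should report the same inst_id as the query coordinator.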
To illustrate whether forcing the parallel slaves to a single node is beneficial, consider the following query.
select /*+ parallel */
       o.order_id,
       max(d.order_date) as max_date,
       count(*) as num_details
  from orders o
  join order_details d
    on o.order_id = d.order_id
 group by o.order_id;
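The statement was traced both when running across the cluster and when all of the slaves were forced local. The exact tracing commands are not shown here; a minimal sketch using DBMS_MONITOR to capture wait events for the current session, followed by tkprof, might look like this:
-- Sketch: enable wait-event tracing for the current session,
-- run the parallel statement, then turn tracing back off.
exec dbms_monitor.session_trace_enable(waits => true, binds => false);
-- ... run the parallel query here ...
exec dbms_monitor.session_trace_disable;
-- Then format the trace file from the diagnostic trace directory:
-- tkprof <tracefile>.trc <report>.txt waits=yes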
The degree of parallelism (DOP) for this statement is 8. The statement completed in 35 seconds when executed across the cluster and in 20 seconds when forced to run locally. The wait events from the cross-cluster execution, as captured by tkprof from the session trace file, are shown below.
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
reliable message                                2        0.01          0.02
enq: KO - fast object checkpoint                2        0.00          0.00
KJC: Wait for msg sends to complete             4        0.00          0.00
PX Deq: reap credit                           620        0.01          0.02
PX Deq: Join ACK                               12        0.00          0.01
IPC send completion sync                        8        0.00          0.00
PX Deq: Parse Reply                             8       12.35         12.35
resmgr:cpu quantum                              2        0.01          0.01
SQL*Net message to client                    1315        0.00          0.00
PX Deq: Execute Reply                         250        1.13         19.77
SQL*Net message from client                  1314        0.02          2.40
PX Deq: Signal ACK EXT                          8        0.00          0.00
PX Deq: Slave Session Stats                     8        0.00          0.00
enq: PS - contention                            4        0.00          0.00
When executed entirely within a single node, the wait events look like the following.
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
reliable message                                1        0.00          0.00
enq: KO - fast object checkpoint                1        0.00          0.00
KJC: Wait for msg sends to complete             4        0.00          0.00
PX Deq: Join ACK                                4        0.00          0.00
enq: PS - contention                            1        0.00          0.00
PX Deq: Parse Reply                             4        0.00          0.00
SQL*Net message to client                    1315        0.00          0.00
PX Deq: Execute Reply                         115        2.52         20.57
resmgr:cpu quantum                              1        0.00          0.00
SQL*Net message from client                  1314        0.01          2.93
PX Deq: Signal ACK EXT                          4        0.00          0.00
PX Deq: Slave Session Stats                     4        0.00          0.00
Notice that the PX Deq: Execute Reply wait events have very similar timings; this event is essentially the time the query coordinator (QC) waits for the slave processes to finish their work. The PX Deq: Parse Reply wait event, however, has a total time of effectively zero when executed locally and a total wait time of 12.35 seconds when executed across the nodes, which accounts for almost all of the 15-second difference in runtimes. This wait event occurs while the QC waits for the parallel slaves to receive and parse their work from the QC. When the processes are all local, the wait time is hardly noticeable, and it makes sense that when some slave processes are on another node, this operation takes additional time to complete.
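To see how the work was actually spread across instances, one option (a sketch; v$pq_tqstat is populated only in the session that ran the statement, and only for its most recent parallel statement) is:
-- Sketch: rows and bytes handled by each PX process, along with
-- the instance each process ran on.
select dfo_number, tq_id, server_type, process, instance, num_rows, bytes
from   v$pq_tqstat
order  by dfo_number, tq_id, server_type, process;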
If the local node has plenty of CPU resources and I/O bandwidth available, you may want the parallel SQL statement to run locally; doing so is no different than running a parallel SQL statement on a single-instance database. If the statement needs the scale-out capability of multiple nodes, the overhead of distributing the parallel slaves can be outweighed by the quicker completion time of each parallel slave. In the end, the only way to know whether running the parallel SQL statement locally or across the cluster is faster is to run the statement both ways.
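A simple way to run that comparison from a single session (a sketch, using the SQL*Plus timing facility) is:
-- Sketch: time the statement with the slaves forced local, then
-- again with distribution across the cluster, and compare.
set timing on
alter session set parallel_force_local = true;
-- ... run the parallel query above; note the elapsed time ...
alter session set parallel_force_local = false;
-- ... run it again and compare ...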