Sunday, September 26, 2010

Identifying CPU Bottlenecks with vmstat

Tasks competing for CPU resources appear in the UNIX vmstat command output under the kthr (kernel thread state change) heading (see Listing 2-1). Tasks that are ready to run appear in the run queue (“r”) column, the first column, while tasks that are waiting on a resource are placed in the wait queue (“b”), the second column.

In short, the server is experiencing a CPU bottleneck when “r” is greater than the number of CPUs on the server. The command for displaying the number of CPUs depends on the UNIX dialect in use.

Remember that we need to know the number of CPUs on our server because the vmstat run queue value should not exceed the number of CPUs. A run queue value of 32 is perfectly acceptable for a 36-CPU server, while the same value would be a serious problem for a 24-CPU server.
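The comparison can be sketched in a few lines of Python. This is only an illustration: the function name is_cpu_bottleneck is hypothetical, and os.cpu_count() is assumed to be an acceptable stand-in for the CPU count (it reports logical CPUs and can return None on some platforms).

```python
import os

def is_cpu_bottleneck(run_queue, cpu_count=None):
    """Return True when the run queue exceeds the number of CPUs."""
    if cpu_count is None:
        # Assumption: logical CPUs are the right count; fall back to 1 if unknown
        cpu_count = os.cpu_count() or 1
    return run_queue > cpu_count

print(is_cpu_bottleneck(32, 36))  # False: 36-CPU server
print(is_cpu_bottleneck(32, 24))  # True: 24-CPU server
```

Passing the CPU count explicitly, as in the two calls above, keeps the check usable when the vmstat output was collected on a different server.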

In the example below, we run the vmstat utility with a five-second sampling interval. For our purposes, we are interested in the first two columns: the run queue “r” column and the kthr wait “b” column. In the listing below, we see an average of about nine tasks in the run queue across the samples (the “r” column), while five other tasks are waiting on resources (the “b” column). A persistently nonzero value in the “b” column may also indicate a bottleneck.

root> vmstat 5 5

kthr  memory      page                faults          cpu
----- ----------- ------------------- --------------- -----------
 r  b    avm  fre re pi po  fr  sr cy   in    sy   cs us sy id wa
 7  5 220214  141  0  0  0  42  53  0 1724 12381 2206 19 46 28  7
 9  5 220933  195  0  0  1 216 290  0 1952 46118 2712 27 55 13  5
13  5 220646  452  0  0  1  33  54  0 2130 86185 3014 30 59  8  3
 6  5 220228  672  0  0  0   0   0  0 1929 25068 2485 25 49 16 10
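The averages for the “r” and “b” columns can be checked with a short script. The following is a minimal Python sketch, not part of vmstat itself: the function name parse_r_b is hypothetical, and the data rows are copied from the listing above.

```python
# Data rows copied from the vmstat listing
LISTING = """\
7 5 220214 141 0 0 0 42 53 0 1724 12381 2206 19 46 28 7
9 5 220933 195 0 0 1 216 290 0 1952 46118 2712 27 55 13 5
13 5 220646 452 0 0 1 33 54 0 2130 86185 3014 30 59 8 3
6 5 220228 672 0 0 0 0 0 0 1929 25068 2485 25 49 16 10
"""

def parse_r_b(text):
    """Return (r, b) pairs from vmstat data rows, skipping header rows."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows consist only of integers; header and rule lines do not
        if len(fields) >= 2 and all(f.isdigit() for f in fields):
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs

pairs = parse_r_b(LISTING)
avg_r = sum(r for r, _ in pairs) / len(pairs)
avg_b = sum(b for _, b in pairs) / len(pairs)
print(avg_r, avg_b)  # 8.75 5.0
```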

The rule for identifying a server with CPU resource problems is quite simple. Whenever the value of the run queue “r” column exceeds the number of CPUs on the server, tasks are forced to wait for execution. There are several solutions for managing CPU overload, presented here in order of desirability:

1. Add more processors (CPUs) to the server.

2. Load balance the system tasks by rescheduling large batch tasks to execute during off-peak hours.

3. Adjust the dispatching priorities (nice values) of existing tasks.

To understand how dispatching priorities work, we must remember that incoming tasks are placed in the execution queue according to their nice value. Tasks with a low nice value are scheduled for execution ahead of tasks with a higher nice value, so raising the nice value of a task lowers its priority. Now that we can see when the CPUs are overloaded, let’s look into vmstat further and see how we can tell when the CPUs are running at full capacity.
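As an illustration of adjusting a nice value, the sketch below starts a child process at a lower priority than its parent. It assumes a POSIX system and Python 3.7 or later; the helper name run_with_nice is hypothetical, and the child command merely reports its own nice value.

```python
import os
import subprocess
import sys

def run_with_nice(cmd, increment):
    """Run cmd with its nice value raised by `increment` (lower priority)."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )

# The child prints its own nice value; os.nice(0) reads it without changing it
child = run_with_nice([sys.executable, "-c", "import os; print(os.nice(0))"], 5)
print("parent nice:", os.nice(0), "child nice:", child.stdout.strip())
```

An unprivileged user can normally only raise a nice value, which is why the sketch lowers the priority of the child rather than boosting it.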

Identifying High CPU Usage with vmstat

We can also easily detect when we are experiencing a busy CPU on the Oracle database server. Whenever the sum of the “us” (user) column and the “sy” (system) column approaches 100%, the CPUs are operating at full capacity.

Please note that it is not uncommon to see the CPU approach 100 percent even when the server is not overwhelmed with work. This is because the UNIX internal dispatchers will always attempt to keep the CPUs as busy as possible. This maximizes task throughput, but it can be misleading for a neophyte.

Remember, it is not a cause for concern when the user + system CPU values approach 100 percent. This just means that the CPUs are working to their full potential. The only metric that identifies a CPU bottleneck is when the run queue (“r” value) exceeds the number of CPUs on the server.

root> vmstat 5 1

kthr  memory      page                faults          cpu
----- ----------- ------------------- --------------- -----------
 r  b    avm  fre re pi po  fr  sr cy   in    sy   cs us sy id wa
 0  0 217485  386  0  0  0   4  14  0  202   300  210 20 75  3  2
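The distinction between a busy CPU and a CPU bottleneck can be sketched as a small classifier. This is an illustration only: the function name cpu_status, the 90 percent threshold for “busy”, and the CPU count of 4 are assumptions, not values from the listing.

```python
def cpu_status(r, us, sy, cpu_count, busy_threshold=90):
    """Classify one vmstat sample using the run queue and CPU percentages."""
    if r > cpu_count:
        # Only a run queue longer than the CPU count marks a bottleneck
        return "bottleneck"
    if us + sy >= busy_threshold:
        # High user + system time alone just means the CPUs are fully used
        return "busy"
    return "ok"

# Sample from the listing above: r=0, us=20, sy=75 (CPU count is assumed)
print(cpu_status(r=0, us=20, sy=75, cpu_count=4))  # busy
```

Here user plus system time is 95 percent, yet the run queue is empty, so the sample is reported as busy rather than as a bottleneck.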

The approach of capturing server information along with Oracle information provides the Oracle9iAS administrator with a complete picture of the operation of the system.

Monitoring RAM Memory Consumption

In the UNIX environment, RAM memory is automatically managed by the operating system. In a system with “virtual” memory, a special disk area called swap is used to hold chunks of memory that cannot fit within the available RAM on the server. In this fashion, a virtual memory server can allow tasks to allocate memory above the RAM capacity of the server. As the server is used, the operating system will move some memory pages out to the swap disk in case the server exceeds its physical capacity. This is called a page-out operation. Remember, page-out operations occur even when the database server has not exceeded its RAM capacity.

RAM memory shortages are evidenced by page-in operations. Page-in operations cause Oracle9iAS slowdowns because tasks must wait until their memory region is moved back into RAM from the swap disk. There are several remedies for overloaded RAM memory:

  • Add RAM - Add additional RAM to the server

  • Reduce Oracle9iAS RAM - Reduce the size of the RAM regions by adjusting the parameters for each Oracle9iAS component
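The page-in rule can be applied to vmstat samples with a short script. In this minimal sketch the function name paging_summary is hypothetical, and the (pi, po) pairs are taken from the “pi” and “po” columns of the first vmstat listing in this article.

```python
# (pi, po) pairs from the pi and po columns of the first vmstat listing
SAMPLES = [(0, 0), (0, 1), (0, 1), (0, 0)]

def paging_summary(samples):
    """Count samples with page-ins and page-outs; page-ins signal a shortage."""
    page_ins = sum(1 for pi, _ in samples if pi > 0)
    page_outs = sum(1 for _, po in samples if po > 0)
    return {
        "page_in_samples": page_ins,
        "page_out_samples": page_outs,
        "ram_shortage": page_ins > 0,  # page-outs alone are not a shortage
    }

print(paging_summary(SAMPLES))
```

For these samples the summary shows two intervals with page-outs but none with page-ins, so no RAM shortage is reported.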

Next, let’s move on and take a look at how to build an easy UNIX server monitor by extending the Oracle STATSPACK tables.

Now that we see how to monitor the Oracle9iAS servers, let’s examine how we can use this data to perform server load balancing.