Sizing

Sizing post #4: How much pain are we willing to tolerate?

We have gone through 3 posts already and have learned how to standardize ASH data for sizing, reviewed some basic statistics like mean, median, maximum and minimum and the use of percentiles to calculate the CPU requirement for a single instance database.

In this post we want to show a way to figure out what is the appropriate percentile that we can use to size our database workload.

We have been using a single instance Oracle Database for our examples and we are going to continue to use single instance database workload to keep things simple for now. The example we are going to use is a different database workload with a lot less active sessions to make it even simpler to understand.

Table: CPU_STANDARD_EX2 (stands for example 2)

Lets grab some statistics on this database workload.

SELECT MIN(sample_time) START_TIME, MAX(sample_time) END_TIME, ROUND(MAX(sample_time) - MIN(sample_time)) Retention
FROM cpu_standard_ex2;

START_TIM END_TIME    RETENTION
--------- --------- -----------
04-AUG-16 19-AUG-16          15

We had our retention period set to 15 days thus we have data collected for 15 days starting Aug 4th and ending Aug 19th.

SELECT MIN(cpu_data) MINIMUM, MAX(cpu_data) MAXIMUM
FROM   cpu_standard_ex2;

    MINIMUM     MAXIMUM
----------- -----------
          0           9

This database has a minimum CPU utilization of 0 and a maximum of 9 active sessions.

SELECT AVG(cpu_data) AVERAGE, MEDIAN(cpu_data) MEDIAN
FROM   cpu_standard_ex2;

    AVERAGE      MEDIAN
----------- -----------
.3778628907           0

The average is not even 1 active session and the median is 0. The median is the 50th percentile and we can state that the most popular value in this database workload is 0.

To have a visualization we can summarize the data into hourly data points with the maximum value per hour.

SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

THE_HOUR         CPU_DATA
------------- -----------
2016-08-04T06           1
2016-08-04T07           1
2016-08-04T08           1
2016-08-04T09           1
...
...
2016-08-18T21           2
2016-08-18T22           3
2016-08-18T23           2
2016-08-19T00           2
2016-08-19T01           2
2016-08-19T02           1
2016-08-19T03           1
2016-08-19T04           1

359 rows selected.

We create a table with the hourly cpu data to create a line chart in excel as explained in Sizing post #2: The not so good mean, the bad median, and the ugly minimum and maximum.

CREATE TABLE cpu_standard_ex2_hourly AS
SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

Here is the chart for hourly data points for our example 2 database workload:

hourly_ex2

Because we are summarizing with the maximum, most of the data points will not be 0 but at least 1, however, remember that it does not mean that the whole hour was 1 active session. Lets take a look at some detail on a specific hour.

CREATE TABLE cpu_standard_ex2_aug_16 AS
SELECT   TO_CHAR(sample_time, 'YYYY-MM-DD"T"HH24:MI:SS') THE_HOUR, cpu_data
FROM     cpu_standard_ex2
WHERE    TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') = '2016-08-16T05' 
ORDER BY 1;

example2_10s

See how the majority of our data points are 0 with some at 1.

Lets get the hour where we had our maximum of 9.

SELECT * FROM
(
SELECT   TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM     cpu_standard_ex2
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <=1;

THE_HOUR         CPU_DATA
------------- -----------
2016-08-06T05           9

Generating table and chart:

CREATE TABLE cpu_standard_ex2_aug_06 AS
SELECT   TO_CHAR(sample_time, 'YYYY-MM-DD"T"HH24:MI:SS') THE_HOUR, cpu_data
FROM     cpu_standard_ex2
WHERE    TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') = '2016-08-06T05' 
ORDER BY 1;

example2_spike

There is a spike in CPU utilization up to 9 active sessions on 2016-08-06 at 05:31:20. This is a one 10 second sample that had 9 active sessions, around that time the cpu utilization was 1 and 2.

It won’t be realistic to size with 9 because the workload is usually 0, 1 or 2 and it would be too much resource to size to the maximum of 9.

If we remember from our previous post

97th percentile -> longest wait is 1.8 min.
95th percentile -> longest wait is  3  min.
90th percentile -> longest wait is  6  min.
80th percentile -> longest wait is 12  min.

If we use our 95th percentile because we are willing to tolerate a maximum of 3 min wait, the calculation will be:

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) "95th PERCENTILE"
FROM   cpu_standard_ex2
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      95th PERCENTILE
------------- ---------------
2016-08-04T23               2

A 90th percentile would result in 1.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY cpu_data)) "90th PERCENTILE"
FROM   cpu_standard_ex2
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      90th PERCENTILE
------------- ---------------
2016-08-04T22               1

In fact using 97th percentile we also get 2 CPUs, same as with 95th percentile. If we get the results for various percentile values we can build a table like the following:

100th percentile ->          no wait         -> 9 CPUs
 99th percentile -> longest wait is  36 sec. -> 4 CPUs
 97th percentile -> longest wait is 1.8 min. -> 2 CPUs
 95th percenitle -> longest wait is  3  min. -> 2 CPUs
 90th percentile -> longest wait is  6  min. -> 1 CPU
 80th percentile -> longest wait is 12  min. -> 1 CPU

Here is where we come back to our question of what percentile to use.

It depends on how much pain (wait) we are willing to tolerate.

If we choose 4 CPUs for our system our potentially longest wait on our worst hour would be 36 seconds; if we are willing to tolerate a wait of 1.8 min in our worst hour we can choose the 2 CPUs. For this particular example we observe that going from 36 seconds to 1.8 min. will save 2 CPUs out of 4 CPUs, meaning I would have to double the amount of resources to cover for the extra coverage.

Lets say we pick the 2 CPUs and now we want to know how many times the CPU utilization went above those 2 CPUs. We can build a table to hold the number of samples above a specific number of CPUs. We need a procedure to populate such a table and a query to show the result:

CREATE TABLE samples_above
(cpu_required      NUMBER,
 end_date          DATE,
 samples_above     NUMBER
);

CREATE OR REPLACE PROCEDURE get_samples_above
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
BEGIN
   INSERT INTO samples_above
   SELECT p_cpu_req, TRUNC(sample_time, 'HH24'), COUNT(cpu_data) samples_above
   FROM cpu_standard_ex2
   WHERE cpu_data > p_cpu_req
   GROUP BY TRUNC(sample_time, 'HH24');
   DBMS_OUTPUT.PUT_LINE(' '); 
   DBMS_OUTPUT.PUT_LINE('get_samples_above: CPU req ' || p_cpu_req);
   COMMIT;
END;
/

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(2)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           2 2016-08-16T09             6
           2 2016-08-17T22             4
           2 2016-08-16T23             3
           2 2016-08-08T22             3
           2 2016-08-07T00             3
           2 2016-08-12T22             2
           2 2016-08-18T22             2
           2 2016-08-15T00             2
           2 2016-08-13T17             1
           2 2016-08-14T00             1
           2 2016-08-10T23             1
           2 2016-08-10T22             1
           2 2016-08-14T17             1
           2 2016-08-09T22             1
           2 2016-08-07T11             1
           2 2016-08-06T06             1
           2 2016-08-06T16             1
           2 2016-08-06T05             1
           2 2016-08-06T14             1
           2 2016-08-05T22             1
           2 2016-08-12T23             1
           2 2016-08-08T01             1
           2 2016-08-08T07             1
           2 2016-08-06T09             1
           2 2016-08-06T02             1
           2 2016-08-15T22             1
           2 2016-08-07T09             1
           2 2016-08-13T18             1

28 rows selected.

How do we interpret the output?

The busiest hour is 2016-08-16T09 with 6 samples above 2. we are referring to standardized ASH/AWR samples, the ones that are 10 seconds equidistant. There are 360 samples in one hour. If we have 6 samples above 2 CPUs and we assume that those samples are continuous, our longest wait on that hour would be 60 seconds.

If the samples above are not continuous then the waits would be less time. Lets find out if our samples above were continuous or spread within the hour.

CREATE TABLE how_long
(cpu_required    NUMBER,
 end_date        DATE,
 how_long_sec    NUMBER,
 how_long_min    NUMBER,
 max_sample      NUMBER
);

CREATE OR REPLACE PROCEDURE get_how_long
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
  CURSOR c_10s_t IS
    SELECT sample_time, cpu_data 
    FROM cpu_standard_ex2
    ORDER BY sample_time; 
  v_end_date            cpu_standard_ex2.sample_time%TYPE;  -- to fetch date
  v_store_end_date      cpu_standard_ex2.sample_time%TYPE;  -- to store the final list of dates where utilization went above #CPUs
  v_on_cpu              cpu_standard_ex2.cpu_data%TYPE;     -- to fetch on_cpu
  v_count               NUMBER := 0;  -- to count how many samples the CPU utilization stayed above #CPUs
  v_max_sample          NUMBER := 0;  -- to hold the max sample in each entry of org6_how_long_t
BEGIN
  OPEN c_10s_t;
  LOOP
    FETCH c_10s_t into v_end_date, v_on_cpu;
    EXIT WHEN c_10s_t%NOTFOUND;
    IF v_on_cpu > p_cpu_req THEN
      v_store_end_date := v_end_date;
      v_count      := 1;
      v_max_sample := v_on_cpu;  
      LOOP
        FETCH c_10s_t into v_end_date, v_on_cpu;
        EXIT WHEN c_10s_t%NOTFOUND;
        IF v_on_cpu > p_cpu_req THEN
          v_count := v_count + 1;
          IF v_on_cpu > v_max_sample THEN
            v_max_sample := v_on_cpu;
          END IF;
        ELSE
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
      END LOOP;
    END IF;
  END LOOP;
  CLOSE c_10s_t; 
  DBMS_OUTPUT.PUT_LINE(' ');
  DBMS_OUTPUT.PUT_LINE('get_how_long: CPU req ' || p_cpu_req);
  DBMS_OUTPUT.PUT_LINE('get_how_long: Data in mn_how_long_t ');
  COMMIT;
END;
/

TRUNCATE TABLE how_long;
EXEC get_how_long(2);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME",
how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", 
max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           2 2016-08-16T09:47:40          20         .33           5
           2 2016-08-17T22:00:30          20         .33           4
           2 2016-08-12T22:00:40          20         .33           4
           2 2016-08-06T05:31:20          10         .17           9
           2 2016-08-16T09:49:30          10         .17           5
           2 2016-08-16T09:50:30          10         .17           4
           2 2016-08-16T09:48:10          10         .17           4
           2 2016-08-16T09:51:20          10         .17           4
           2 2016-08-07T00:58:10          10         .17           3
           2 2016-08-07T09:16:50          10         .17           3
...
...

Each row in the table tells us the time the samples were above the number of CPUs and how long it took to go down again below the number of CPUs.

Our assumption was not true, the samples above 2 were not continuous, otherwise, we would have seen 60 seconds at the top but we see 20 seconds instead. At most we had 2 continuous samples.

We got the number of samples above 2 CPUs and how long it takes up above those 2 CPUs.

We can get the same information for 1 CPU:

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(1)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           1 2016-08-09T22            35
           1 2016-08-17T23            26
           1 2016-08-12T22            25
           1 2016-08-06T06            24
           1 2016-08-05T23            23
           1 2016-08-08T22            23
           1 2016-08-08T01            22
           1 2016-08-05T01            22
...
...
           1 2016-08-12T03             1
           1 2016-08-18T06             1
           1 2016-08-14T04             1
           1 2016-08-06T02             1

154 rows selected.

The first row shows that there are 35 samples above 1 CPU. There are 360 samples in one hour, thus 35 samples represent 9.7%, very close to 10% which is 6 minutes.

Tying back to our percentiles table:

100th percentile ->          no wait         -> 9 CPUs
 99th percentile -> longest wait is  36 sec. -> 4 CPUs
 97th percentile -> longest wait is 1.8 min. -> 2 CPUs
 95th percenitle -> longest wait is  3  min. -> 2 CPUs
 90th percentile -> longest wait is  6  min. -> 1 CPU   <<<<<<  80th percentile -> longest wait is 12  min. -> 1 CPU

The 1 CPU is when we use 90th percentile, and potentially our longest wait could be 6 min.

Having the system not able to provide enough CPU resource for 6 minutes seems too long (we are not saying that a session would have to wait 6 min before it gets connected or have a response back, just that we are requesting more CPUs than CPUs available), therefore, we would not say that 1 CPU is enough to support this database workload. We can investigate if the samples above 1 are continuous.

TRUNCATE TABLE how_long;
EXEC get_how_long(1);

SELECT * FROM
(
SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC
)
WHERE ROWNUM <=10;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           1 2016-08-09T22:00:30         180           3           3
           1 2016-08-17T22:00:10         100        1.67           4
           1 2016-08-17T23:03:00          80        1.33           2
           1 2016-08-13T06:02:00          70        1.17           2
           1 2016-08-14T06:08:50          60           1           2
           1 2016-08-17T23:33:50          60           1           2
           1 2016-08-05T01:20:40          60           1           2
           1 2016-08-12T22:00:30          50         .83           4
           1 2016-08-06T16:00:50          50         .83           3
           1 2016-08-05T01:52:00          50         .83           2

10 rows selected.

The 35 samples above are not continuous, the maximum number of continuous CPU utilization above 1 is 18 samples (180 seconds).

In real life obviously we do not have a physical server with 1 or 2 CPUs to run and Oracle Database instance but the example is to understand that the percentile used depends on how much pain (wait) we are willing to tolerate.

If we had a VMWare cluster where we have Virtual Machines we could certainly accommodate this example database workload. Because we want to leave some headroom for unexpected spikes in CPU utilization we could start assigning 3 vCPUs to this database workload or even 4 to account also for background processes for VM management. The advantage when using VMs is that the resources can be reallocated.

Les try another example.

The data for the next example has been placed in table CPU_STANDARD_EX3

Lets grab some statistics on this database workload.

SELECT MIN(sample_time) START_TIME, MAX(sample_time) END_TIME, ROUND(MAX(sample_time) - MIN(sample_time)) Retention
FROM cpu_standard_ex3;

START_TIM END_TIME    RETENTION
--------- --------- -----------
03-AUG-16 18-AUG-16          15

We had our retention period set to 15 days thus we have data collected for 15 days starting Aug 3th and ending Aug 18th.

SELECT MIN(cpu_data) MINIMUM, MAX(cpu_data) MAXIMUM
FROM cpu_standard_ex3;

    MINIMUM     MAXIMUM
----------- -----------
          0          21

This database has a minimum CPU utilization of 0 and a maximum of 21 active sessions.

SELECT AVG(cpu_data) AVERAGE, MEDIAN(cpu_data) MEDIAN
FROM cpu_standard_ex3;

    AVERAGE      MEDIAN
----------- -----------
.3324691358           0

CREATE TABLE cpu_standard_ex3_hourly
AS SELECT TO_CHAR(TRUNC(sample_time,'HH24'), 'YYYY-MM-DD"T"HH24') THE_HOUR, MAX(cpu_data) CPU_DATA
FROM cpu_standard_ex3
GROUP BY TRUNC(sample_time,'HH24')
ORDER BY 1;

Using the hourly data to create a line chart:

example3_hourly

In order to find out the 95th percentile:

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_data)) "95th PERCENTILE"
FROM   cpu_standard_ex3
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      95th PERCENTILE
------------- ---------------
2016-08-14T07               4

Using that query we can build the following table:

100th percentile ->          no wait        -> 21 CPUs
 99th percentile -> longest wait is  36 sec. -> 7 CPUs
 97th percentile -> longest wait is 1.8 min. -> 5 CPUs
 95th percenitle -> longest wait is  3  min. -> 4 CPUs
 90th percentile -> longest wait is  6  min. -> 2 CPUs
 80th percentile -> longest wait is 12  min. -> 2 CPUs

Lets pick the 95th percentile of 4 and get the samples above 4 CPUs.

Obtaining the samples above:

CREATE OR REPLACE PROCEDURE get_samples_above
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
BEGIN
   INSERT INTO samples_above
   SELECT p_cpu_req, TRUNC(sample_time, 'HH24'), COUNT(cpu_data) samples_above
   FROM cpu_standard_ex3
   WHERE cpu_data > p_cpu_req
   GROUP BY TRUNC(sample_time, 'HH24');
   DBMS_OUTPUT.PUT_LINE(' '); 
   DBMS_OUTPUT.PUT_LINE('get_samples_above: CPU req ' || p_cpu_req);
   COMMIT;
END;
/

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(4)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           4 2016-08-13T07            12
           4 2016-08-06T07             8
           4 2016-08-09T13             6
           4 2016-08-07T07             5
           4 2016-08-10T07             5
           4 2016-08-08T19             5
           4 2016-08-14T07             5
           4 2016-08-16T07             5
           4 2016-08-05T07             5
           4 2016-08-17T07             5
           4 2016-08-10T05             4
           4 2016-08-11T01             4
...
...

How about 90th percentile?

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(2)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           2 2016-08-14T07            33
           2 2016-08-06T07            31
           2 2016-08-11T01            27
           2 2016-08-07T07            27
           2 2016-08-08T19            25
           2 2016-08-12T01            25
           2 2016-08-13T07            25
           2 2016-08-05T07            24
           2 2016-08-09T01            23
           2 2016-08-08T07            22
...
...

It seems too busy, lets try 3 CPUs

TRUNCATE TABLE samples_above;

SET SERVEROUTPUT ON
EXEC get_samples_above(3)

SELECT   cpu_required, TO_CHAR(end_date, 'YYYY-MM-DD"T"HH24') end_date_hour, samples_above
FROM     samples_above
ORDER BY cpu_required, samples_above DESC;

CPU_REQUIRED END_DATE_HOUR SAMPLES_ABOVE
------------ ------------- -------------
           3 2016-08-14T07            20
           3 2016-08-05T07            17
           3 2016-08-13T07            16
           3 2016-08-06T07            15
           3 2016-08-07T07            15
           3 2016-08-08T07            14
           3 2016-08-10T07            12
           3 2016-08-04T05            11
           3 2016-08-11T01            10
           3 2016-08-12T05            10
           3 2016-08-17T07            10
...
...

The 3 CPUs correspond to the 93rd percentile.

SELECT * FROM
(
SELECT TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24') THE_HOUR
      ,ROUND(PERCENTILE_CONT(0.93) WITHIN GROUP (ORDER BY cpu_data)) "93th PERCENTILE"
FROM   cpu_standard_ex3
GROUP BY TO_CHAR(TRUNC(sample_time,'HH24'),'YYYY-MM-DD"T"HH24')
ORDER BY 2 DESC
)
WHERE ROWNUM <= 1;

THE_HOUR      93th PERCENTILE
------------- ---------------
2016-08-06T07               3

Lets see what we got here. The busiest hour has 20 samples above 3 CPUs, if all 20 samples are continuous the longest wait would be 200 seconds ~ 3 min. and 20 seconds.

In order to see the “how long” results we run the following:

CREATE OR REPLACE PROCEDURE get_how_long
(
 p_cpu_req          IN  samples_above.cpu_required%TYPE
) IS
  CURSOR c_10s_t IS
    SELECT sample_time, cpu_data 
    FROM cpu_standard_ex3
    ORDER BY sample_time; 
  v_end_date            cpu_standard_ex3.sample_time%TYPE;  -- to fetch date
  v_store_end_date      cpu_standard_ex3.sample_time%TYPE;  -- to store the final list of dates where utilization went above #CPUs
  v_on_cpu              cpu_standard_ex3.cpu_data%TYPE;     -- to fetch on_cpu
  v_count               NUMBER := 0;  -- to count how many samples the CPU utilization stayed above #CPUs
  v_max_sample          NUMBER := 0;  -- to hold the max sample in each entry of org6_how_long_t
BEGIN
  OPEN c_10s_t;
  LOOP
    FETCH c_10s_t into v_end_date, v_on_cpu;
    EXIT WHEN c_10s_t%NOTFOUND;
    IF v_on_cpu > p_cpu_req THEN
      v_store_end_date := v_end_date;
      v_count      := 1;
      v_max_sample := v_on_cpu;  
      LOOP
        FETCH c_10s_t into v_end_date, v_on_cpu;
        EXIT WHEN c_10s_t%NOTFOUND;
        IF v_on_cpu > p_cpu_req THEN
          v_count := v_count + 1;
          IF v_on_cpu > v_max_sample THEN
            v_max_sample := v_on_cpu;
          END IF;
        ELSE
          INSERT INTO how_long VALUES(p_cpu_req, v_store_end_date, v_count*10, (v_count*10)/60, v_max_sample);
          EXIT;
        END IF;
      END LOOP;
    END IF;
  END LOOP;
  CLOSE c_10s_t; 
  DBMS_OUTPUT.PUT_LINE(' ');
  DBMS_OUTPUT.PUT_LINE('get_how_long: CPU req ' || p_cpu_req);
  DBMS_OUTPUT.PUT_LINE('get_how_long: Data in mn_how_long_t ');
  COMMIT;
END;
/

TRUNCATE TABLE how_long;
EXEC get_how_long(3);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           3 2016-08-05T07:31:40          50         .83           7
           3 2016-08-07T07:31:40          40         .67           9
           3 2016-08-16T00:41:00          40         .67           7
           3 2016-08-11T05:46:30          40         .67           5
           3 2016-08-13T07:32:10          30          .5           7
           3 2016-08-16T03:31:50          30          .5           7
           3 2016-08-11T07:31:50          30          .5           7
           3 2016-08-10T07:31:50          30          .5           7
           3 2016-08-17T00:41:10          30          .5           6
           3 2016-08-13T07:37:20          30          .5           6
           3 2016-08-16T07:35:50          30          .5           5
           3 2016-08-14T07:44:10          30          .5           5
           3 2016-08-06T01:02:40          30          .5           5
           3 2016-08-04T05:33:10          30          .5           4
...
...
           3 2016-08-18T07:32:20          10         .17           4
           3 2016-08-18T07:35:20          10         .17           4
           3 2016-08-18T07:38:00          10         .17           4
           3 2016-08-18T07:42:50          10         .17           4

574 rows selected.

How about with 4 CPUs.

TRUNCATE TABLE how_long;
EXEC get_how_long(4);

SELECT cpu_required, to_char(end_date, 'YYYY-MM-DD"T"HH24:MI:SS') "TIME", how_long_sec "SECONDS", round(how_long_min, 2) "MINUTES", max_sample "MAX_SAMPLE" 
FROM how_long
ORDER BY cpu_required, seconds DESC, max_sample DESC;

CPU_REQUIRED TIME                    SECONDS     MINUTES  MAX_SAMPLE
------------ ------------------- ----------- ----------- -----------
           4 2016-08-13T07:32:10          30          .5           7
           4 2016-08-07T07:31:40          20         .33           9
           4 2016-08-13T07:32:50          20         .33           9
           4 2016-08-06T07:32:40          20         .33           9
           4 2016-08-10T05:46:40          20         .33           7
           4 2016-08-17T00:41:10          20         .33           6
           4 2016-08-08T07:31:40          20         .33           6
           4 2016-08-12T03:44:50          20         .33           6
           4 2016-08-13T07:37:30          20         .33           6
           4 2016-08-16T07:36:00          20         .33           5
           4 2016-08-14T05:46:50          20         .33           5
           4 2016-08-10T07:37:20          20         .33           5
           4 2016-08-13T05:33:00          20         .33           5
...
...
           4 2016-08-11T05:46:50          10         .17           5
           4 2016-08-11T07:31:50          10         .17           5
           4 2016-08-12T00:41:30          10         .17           5
           4 2016-08-12T05:32:10          10         .17           5

306 rows selected.

The 93rd percentile would be a good option, because the longest wait is not more than 50 seconds, however, there are many times when the system goes up above 3 CPUs (574 times), even with 4 CPUs the system goes up above 4 CPUs 306 times.

We could use the 4 CPUs as the number of CPUs for this database. If we wanted this system to have headroom of 50% we would add 4 more CPUs for a total of 8 CPUs. If we wanted to have this as the starting value of a VM for this database workload we could start with 5 (4 + 1 for VM overhead) and monitor to see if more are needed.

As a side note:
I am using this space to recommend the use of Oracle Resource Manager to control the spikes in CPU utilization.

“High database loads destabilize the server. Many operating system vulnerabilities and bugs are exposed at excessive CPU loads. In addition, excessive CPU loads can starve critical applications and monitoring tools running on the server.”

“One key to a healthy database is maintaining a healthy CPU load. Excessive CPU load can destabilize the server and expose operating system bugs. Excessive CPU load can also prevent critical Oracle background processes from running in a timely manner, resulting in failures such as database instance evictions on a RAC database. At such times, your response time may be so poor that you may be unable to debug the source of the excessive CPU load, for example by identifying and fixing a faulty SQL plan. Using Oracle Database Resource Manager, you can ensure that your database’s CPU load is always healthy, thus avoiding all of these problems. Resource Manager manages your CPU load so that all CPUs are fully utilized yet there is no thrashing since no (or very few) Oracle processes are waiting in the operating system run queue. Using a database feature called Instance Caging, you can configure Resource Manager to manage the CPU load to even lower levels.”

Reference: Using Oracle Database Resource Manager

In this blog post we have learned how to choose a percentile to decide how many CPUs are required for our database workload and review how many times the CPU utilization goes above our number of CPUs and for how long that utilization stays above our specified number of CPUs.

With this information we are able to decide how many CPUs we want to allocate for our database.

It is important to size correctly because if we undersusbscribe we risk the system to come to a halt and if we oversubscribe we are spending more money than needed on hardware and licenses.

If you are migrating to a new system and need to know the number of CPUs in the target system which is different from the actual system you need to translate the processing power of the actual server to the new server. If you want to learn how to do that stay tuned for the next blog post: “Sizing post #5: How to map our existing processing power to new hardware.”

Leave a comment