Optimizer · SQL Tuning

How a Frequency Histogram Works

This post shows how a frequency histogram helps the optimizer come up with a better execution plan. I will first show how the optimizer chooses a sub-optimal plan when it assumes a uniform distribution for data that is actually skewed, and then show how a frequency histogram improves the optimizer's decisions. I am creating all objects used in this example from scratch, so you have everything you need to test on your own.

Let's create our table for testing. We are going to call it MY_HISTOGRAM and load it with 100,000 rows. It will have two columns: ID, which runs from 1 to 100,000, and SKEW, which holds the values 1 through 20 for the first 20 rows and the value 100 for the remaining 99,980.

create table MY_HISTOGRAM as select rownum id, 100 skew 
from dual connect by level <=100000;
update MY_HISTOGRAM set skew = id where rownum<=20;
commit;
select * from MY_HISTOGRAM;
Output:
         ID        SKEW
----------- -----------
          1           1
          2           2
          3           3
          4           4
          5           5
          6           6
          7           7
          8           8
          9           9
         10          10
         11          11
         12          12
         13          13
         14          14
         15          15
         16          16
         17          17
         18          18
         19          19
         20          20
         21         100
         22         100
...
...
      99999         100
     100000         100
100000 rows selected.
select skew, count(*) 
from MY_HISTOGRAM 
group by skew order by skew;
Output:

       SKEW    COUNT(*)
----------- -----------
          1           1
          2           1
          3           1
          4           1
          5           1
          6           1
          7           1
          8           1
          9           1
         10           1
         11           1
         12           1
         13           1
         14           1
         15           1
         16           1
         17           1
         18           1
         19           1
         20           1
        100       99980

21 rows selected.

We are going to gather statistics but we are not going to create a histogram.

exec dbms_stats.gather_table_stats(user,'MY_HISTOGRAM');
col column_name format a25
select column_name, num_distinct, density 
from user_tab_col_statistics 
where table_name='MY_HISTOGRAM';
Output:

COLUMN_NAME               NUM_DISTINCT     DENSITY
------------------------- ------------ -----------
ID                              100000      .00001
SKEW                                21 .0476190476

NUM_DISTINCT is the Number of Distinct Values (NDV) in a column.
When there is no histogram, DENSITY is calculated as 1/NUM_DISTINCT; for the SKEW column that is 1/21 = 0.0476190476.
Density represents the selectivity of a column and its range of possible values is 0 to 1.
The optimizer uses the DENSITY to estimate the number of rows returned by a query using a specific column in the predicate.
The number of rows returned when a specific predicate is applied is called Cardinality.

Cardinality = Selectivity * Num of Rows

Assuming a predicate of skew = 3, the estimated cardinality is (1/21) * 100,000 = 4,761.9, which rounds to 4,762.
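
As a quick cross-check (just a sketch, using the statistics we gathered above), we can reproduce that arithmetic straight from the dictionary views:

-- Estimated cardinality = DENSITY * NUM_ROWS (no histogram in place yet)
select c.column_name, c.density, t.num_rows,
       round(c.density * t.num_rows) estimated_cardinality
from user_tab_col_statistics c, user_tables t
where c.table_name = t.table_name
and c.table_name = 'MY_HISTOGRAM'
and c.column_name = 'SKEW';

For the SKEW column this should return 4762, matching the calculation above.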

select * from MY_HISTOGRAM where skew = 3;
set pages 0
select * from table(dbms_xplan.display_cursor);
set pages 14
Output:
SQL_ID  0fw0cu077y47w, child number 0
-------------------------------------
select * from MY_HISTOGRAM where skew = 3
Plan hash value: 1485722408
------------------------------------------------------------------------------------------
| Id  | Operation                 | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |       |       |    54 (100)|          |
|*  1 |  TABLE ACCESS STORAGE FULL| MY_HISTOGRAM |  4762 | 38096 |    54   (2)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - storage("SKEW"=3)
       filter("SKEW"=3)

We can see the value 4762 in the column Rows, the estimated Cardinality.

We are now going to generate the 10053 optimizer trace for this SQL statement.

alter session set max_dump_file_size = unlimited;
execute DBMS_SQLDIAG.DUMP_TRACE(p_sql_id=>'0fw0cu077y47w',-
p_child_number=>0,p_component=>'Optimizer',-
p_file_id=>'TRACE_10053');

col value format a90
SELECT value FROM v$diag_info WHERE name='Default Trace File';

!vi  ora_130159_TRACE_10053.trc
TRACE info:

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 #Rows: 100000 SSZ: 0 LGR: 0 #Blks: 191 AvgRowLen: 8.00 NEB: 0 ChainCnt: 0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
 #IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
=======================================
SPD: BEGIN context at query block level
=======================================
Query Block SEL$1 (#0)
Return code in qosdSetupDirCtx4QB: NOCTX
=====================================
SPD: END context at query block level
=====================================
Access path analysis for MY_HISTOGRAM
***************************************
SINGLE TABLE ACCESS PATH
 Single Table Cardinality Estimation for MY_HISTOGRAM[MY_HISTOGRAM]
 SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
 Column (#2): SKEW(NUMBER)
 AvgLen: 3 NDV: 21 Nulls: 0 Density: 0.047619 Min: 1.000000 Max: 100.000000
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 Card: Original: 100000.000000 Rounded: 4762 Computed: 4761.904762 Non Adjusted: 4761.904762
 Scan IO Cost (Disk) = 53.000000
 Scan CPU Cost (Disk) = 18360195.040000
 Total Scan IO Cost = 53.000000 (scan (Disk))
 + 0.000000 (io filter eval) (= 0.000000 (per row) * 100000.000000 (#rows))
 = 53.000000
 Total Scan CPU Cost = 18360195.040000 (scan (Disk))
 + 5000000.000000 (cpu filter eval) (= 50.000000 (per row) * 100000.000000 (#rows))
 = 23360195.040000
 Access Path: TableScan
 Cost: 54.059340 Resp: 54.059340 Degree: 0
 Cost_io: 53.000000 Cost_cpu: 23360195
 Resp_io: 53.000000 Resp_cpu: 23360195
 Best:: AccessPath: TableScan
 Cost: 54.059340 Degree: 1 Resp: 54.059340 Card: 4761.904762 Bytes: 0.000000

This is the information we are interested in:

AvgLen: 3 NDV: 21 Nulls: 0 Density: 0.047619 Min: 1.000000 Max: 100.000000
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 Card: Original: 100000.000000 Rounded: 4762 Computed: 4761.904762 Non Adjusted: 4761.904762

Density: 0.047619
Min: 1.000000
Max: 100.000000
Card: Original: 100000.000000
Card: Rounded: 4762

We want to see those values in the data dictionary, and we can select them from USER_TAB_COL_STATISTICS for that purpose:

set pages 9999
set linesize 132
col table_name format a20
col column_name format a20
col low_value format 999999
col high_value format 999999
select column_name, num_distinct, utl_raw.cast_to_number(low_value) low_value, 
utl_raw.cast_to_number(high_value) high_value
from user_tab_col_statistics
where table_name = 'MY_HISTOGRAM';
Output:

COLUMN_NAME          NUM_DISTINCT LOW_VALUE HIGH_VALUE
-------------------- ------------ --------- ----------
ID                         100000         1     100000
SKEW                           21         1        100
select column_name, density, num_nulls, num_buckets, histogram
from user_tab_col_statistics
where table_name = 'MY_HISTOGRAM';
Output:

COLUMN_NAME              DENSITY   NUM_NULLS NUM_BUCKETS       HISTOGRAM
-------------------- ----------- ----------- ----------- ---------------
ID                        .00001           0           1            NONE
SKEW                 .0476190476           0           1            NONE

Going back to our Execution plan for

select * from MY_HISTOGRAM where skew = 3;

------------------------------------------------------------------------------------------
| Id  | Operation                 | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |       |       |    54 (100)|          |
|*  1 |  TABLE ACCESS STORAGE FULL| MY_HISTOGRAM |  4762 | 38096 |    54   (2)| 00:00:01 |
------------------------------------------------------------------------------------------

The Optimizer assumes uniform distribution of data and estimates that the number of rows returned (Cardinality) for skew = 3 is 4762 which comes from (1/21) * 100,000. Formula: Cardinality = Selectivity * NUM_ROWS.
However, we know that we only have 1 row for skew = 3.
Because the optimizer assumes a uniform data distribution, it may choose a sub-optimal execution plan, for example if we had an index on the SKEW column and used skew = 100 as a predicate.

This time we are going to use the predicate skew = 100, which matches 99,980 rows and for which the optimal plan would be a FULL TABLE SCAN rather than an INDEX RANGE SCAN.

Let's create the index on the SKEW column:

create index skew_idx on MY_HISTOGRAM(skew);
exec dbms_stats.gather_index_stats(user,'SKEW_IDX');
select * from MY_HISTOGRAM where skew = 100;
set pages 0
select * from table(dbms_xplan.display_cursor);
set pages 14
Output:
SQL_ID  86tvh52sjd7ka, child number 0
-------------------------------------
select * from MY_HISTOGRAM where skew = 100

Plan hash value: 928367158

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |       |       |    19 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| MY_HISTOGRAM |  4762 | 38096 |    19   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | SKEW_IDX     |  4762 |       |    10   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("SKEW"=100)

Because the optimizer thinks the value skew = 100 appears only 4,762 times in the table, it chose an INDEX RANGE SCAN, when in reality the value appears 99,980 times and a FULL TABLE SCAN would be the better choice.
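
A quick sanity check makes the gap obvious; from the GROUP BY output at the beginning of this post we already know this returns 99,980, a long way from the 4,762 the optimizer estimated:

select count(*) from MY_HISTOGRAM where skew = 100;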

Once we create a histogram, it tells the optimizer how many times a value appears in a column. For our previous examples it would say that skew = 3 appears once and that skew = 100 appears 99,980 times.

In 12c there are 4 types of histograms (frequency, top frequency, height-balanced and hybrid); for now we are going to use the frequency histogram.

Frequency histograms are created when the number of distinct values in the column does not exceed the number of histogram buckets: at most 254 in 11g and up to 2048 in 12c.
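
If you want to be explicit about the maximum number of buckets rather than relying on the default, you can pass SIZE in METHOD_OPT yourself. This is just a sketch; the gather call used later in this post omits SIZE and works just as well for our 21 distinct values:

-- Request up to 254 buckets for SKEW; with only 21 distinct values
-- the result is a frequency histogram.
exec dbms_stats.gather_table_stats(user,'MY_HISTOGRAM',-
method_opt=>'FOR COLUMNS SKEW SIZE 254');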

To better understand how it is created and used, let's walk through it step by step.

First the data is ordered; in our case the SKEW column has the values 1 through 20 followed by the value 100.

select * from MY_HISTOGRAM;
Output:
ID                 SKEW
----------- -----------
          1           1 -> Bucket 1 has Endpoint 1
          2           2 -> Bucket 2 has Endpoint 2
          3           3 -> Bucket 3 has Endpoint 3
          4           4 -> Bucket 4 has Endpoint 4
          5           5 -> Bucket 5 has Endpoint 5
          6           6 -> Bucket 6 has Endpoint 6
          7           7 -> Bucket 7 has Endpoint 7
          8           8 -> Bucket 8 has Endpoint 8
          9           9 -> Bucket 9 has Endpoint 9
         10          10 -> Bucket 10 has Endpoint 10
         11          11 -> Bucket 11 has Endpoint 11
         12          12 -> Bucket 12 has Endpoint 12
         13          13 -> Bucket 13 has Endpoint 13
         14          14 -> Bucket 14 has Endpoint 14
         15          15 -> Bucket 15 has Endpoint 15
         16          16 -> Bucket 16 has Endpoint 16
         17          17 -> Bucket 17 has Endpoint 17
         18          18 -> Bucket 18 has Endpoint 18
         19          19 -> Bucket 19 has Endpoint 19
         20          20 -> Bucket 20 has Endpoint 20
         21         100 -> Bucket 21 has Endpoint 100
         22         100 -> Bucket 22 has Endpoint 100
...
...
      99999         100 -> Bucket 99999 has Endpoint 100
     100000         100 -> Bucket 100000 has Endpoint 100

At this step we can have far more than 2,048 buckets, so the buckets that hold the same value are compressed into the highest bucket with that value; in this case buckets 21 through 100,000 are compressed into bucket 100,000, leaving a total of 21 buckets.
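
We can mimic this compression step in plain SQL, keeping one row per distinct value and carrying the running row count as the endpoint. This is only an illustration of the idea, not how DBMS_STATS builds the histogram internally:

-- One bucket per distinct value; the cumulative row count becomes the endpoint
select skew endpoint_value,
       sum(count(*)) over (order by skew) endpoint_number
from MY_HISTOGRAM
group by skew
order by skew;

The result has 21 rows and matches what we will see in USER_TAB_HISTOGRAMS below.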

Here is a graphical representation of our histogram

[Bar chart: frequency of each SKEW value across the 21 buckets]

We have 21 buckets: the first 20 buckets each have a frequency of 1, which we can barely see in the chart, and bucket 21, holding the data value 100, has a frequency of 99,980.

Let's create a frequency histogram on our SKEW column and review the data dictionary views:

exec dbms_stats.gather_table_stats(user,'MY_HISTOGRAM', method_opt=>'FOR COLUMNS SKEW');
set pagesize 100
select endpoint_number, endpoint_value
from user_tab_histograms 
where table_name='MY_HISTOGRAM' and column_name='SKEW';
Output:

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1              1
              2              2
              3              3
              4              4
              5              5
              6              6
              7              7
              8              8
              9              9
             10             10
             11             11
             12             12
             13             13
             14             14
             15             15
             16             16
             17             17
             18             18
             19             19
             20             20
         100000            100

21 rows selected.

The ENDPOINT_VALUE column is the actual value in the table. ENDPOINT_NUMBER is the cumulative number of rows, which is the highest bucket number we got for each value in the previous step; in other words, it is a cumulative frequency.
With the data in this format, if we need to know the frequency of one particular value we subtract the previous cumulative value from its ENDPOINT_NUMBER. For example, the frequency of the value 100 is 100,000 - 20 = 99,980.
The advantage of storing cumulative frequencies is that range predicates are cheap to evaluate: for example, skew <= 18 maps directly to the ENDPOINT_NUMBER of the value 18, which is 18 rows.
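
You can verify the range behaviour yourself now that the histogram is in place. This is only a sketch (output not reproduced here); with the frequency histogram we would expect the Rows column of the plan to show 18:

select * from MY_HISTOGRAM where skew <= 18;
set pages 0
select * from table(dbms_xplan.display_cursor);
set pages 14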

Using the Oracle analytic function LAG we can derive the frequency of each value:

col value format 99999999
select endpoint_value value,
 endpoint_number "CUMULATIVE FREQUENCY",
 endpoint_number - LAG(endpoint_number,1,0) over (order by endpoint_number) as frequency
from user_tab_histograms
where table_name = 'MY_HISTOGRAM' and column_name = 'SKEW';
Output:

    VALUE CUMULATIVE FREQUENCY   FREQUENCY
--------- -------------------- -----------
        1                    1           1
        2                    2           1
        3                    3           1
        4                    4           1
        5                    5           1
        6                    6           1
        7                    7           1
        8                    8           1
        9                    9           1
       10                   10           1
       11                   11           1
       12                   12           1
       13                   13           1
       14                   14           1
       15                   15           1
       16                   16           1
       17                   17           1
       18                   18           1
       19                   19           1
       20                   20           1
      100               100000       99980

21 rows selected.

Here is the information stored in the data dictionary

select column_name, density, histogram 
from user_tab_col_statistics 
where table_name='MY_HISTOGRAM' and column_name='SKEW';
Output:

COLUMN_NAME              DENSITY       HISTOGRAM
-------------------- ----------- ---------------
SKEW                     .000005       FREQUENCY

Density was 1/NDV before we had a histogram; now it is 1/(2*NUM_ROWS) = 1/(2*100,000) = 0.000005.
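
This density is what the optimizer falls back to for an equality predicate on a value that is not present in the histogram at all. As a sketch, assuming a value such as skew = 50 (which does not exist in our table but lies between the column minimum and maximum), the estimate would be roughly DENSITY * NUM_ROWS = 0.000005 * 100,000 = 0.5, which the optimizer rounds up to 1 row:

select * from MY_HISTOGRAM where skew = 50;
set pages 0
select * from table(dbms_xplan.display_cursor);
set pages 14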

Now that we have gathered statistics with a frequency histogram in place, let's run our query again to see whether the optimizer now chooses the FULL TABLE SCAN. First we flush the shared pool so the SQL gets parsed again; flushing the shared pool is not a good idea on a production system, so make sure you are running this on a test system.

alter system flush shared_pool;
select * from MY_HISTOGRAM where skew = 100;
set pages 0
select * from table(dbms_xplan.display_cursor);
set pages 14
Output:

SQL_ID  86tvh52sjd7ka, child number 0
-------------------------------------
select * from MY_HISTOGRAM where skew = 100

Plan hash value: 1485722408

------------------------------------------------------------------------------------------
| Id  | Operation                 | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |              |       |       |    54 (100)|          |
|*  1 |  TABLE ACCESS STORAGE FULL| MY_HISTOGRAM | 99980 |   781K|    54   (2)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("SKEW"=100)
       filter("SKEW"=100)

With our histogram in place we can see that the estimated number of rows is 99,980, which is correct for our predicate skew = 100, and therefore the optimizer chooses a FULL TABLE SCAN.

Let's take a look at the 10053 optimizer trace:

alter session set max_dump_file_size = unlimited;
execute DBMS_SQLDIAG.DUMP_TRACE(p_sql_id=>'86tvh52sjd7ka',-
p_child_number=>0,p_component=>'Optimizer',-
p_file_id=>'TRACE_10053_100');
col value format a90
SELECT value FROM v$diag_info WHERE name='Default Trace File';
!vi ora_130159_TRACE_10053_100.trc
TRACE info:

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 #Rows: 100000 SSZ: 0 LGR: 0 #Blks: 191 AvgRowLen: 8.00 NEB: 0 ChainCnt: 0.00 SPC: 0 RFL: 0 RNF: 0 CBK: 0 CHR: 0 KQDFLG: 1
 #IMCUs: 0 IMCRowCnt: 0 IMCJournalRowCnt: 0 #IMCBlocks: 0 IMCQuotient: 0.000000
Index Stats::
 Index: SKEW_IDX Col#: 2
 LVLS: 1 #LB: 196 #DK: 21 LB/K: 9.00 DB/K: 8.00 CLUF: 179.00 NRW: 100000.00 SSZ: 0.00 LGR: 0.00 CBK: 0.00 GQL: 0.00 CHR: 0.00 KQDFLG: 1 BSZ: 1
 KKEISFLG: 1
=======================================
SPD: BEGIN context at query block level
=======================================
Query Block SEL$1 (#0)
Return code in qosdSetupDirCtx4QB: NOCTX
=====================================
SPD: END context at query block level
=====================================
Access path analysis for MY_HISTOGRAM
***************************************
SINGLE TABLE ACCESS PATH
 Single Table Cardinality Estimation for MY_HISTOGRAM[MY_HISTOGRAM]
 SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
 Column (#2):
 NewDensity:0.000005, OldDensity:0.000005 BktCnt:100000.000000, PopBktCnt:99980.000000, PopValCnt:1, NDV:21
 Column (#2): SKEW(NUMBER)
 AvgLen: 3 NDV: 21 Nulls: 0 Density: 0.000005 Min: 1.000000 Max: 100.000000
 Histogram: Freq #Bkts: 21 UncompBkts: 100000 EndPtVals: 21 ActualVal: yes
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 Card: Original: 100000.000000 Rounded: 99980 Computed: 99979.500000 Non Adjusted: 99979.500000
 Scan IO Cost (Disk) = 53.000000
 Scan CPU Cost (Disk) = 18360195.040000
 Total Scan IO Cost = 53.000000 (scan (Disk))
 + 0.000000 (io filter eval) (= 0.000000 (per row) * 100000.000000 (#rows))
 = 53.000000
 Total Scan CPU Cost = 18360195.040000 (scan (Disk))
 + 5000000.000000 (cpu filter eval) (= 50.000000 (per row) * 100000.000000 (#rows))
 = 23360195.040000

Access Path: TableScan
 Cost: 54.059340 Resp: 54.059340 Degree: 0
 Cost_io: 53.000000 Cost_cpu: 23360195
 Resp_io: 53.000000 Resp_cpu: 23360195
 ****** Costing Index SKEW_IDX
 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
 SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
 Access Path: index (AllEqRange)
 Index: SKEW_IDX
 resc_io: 375.000000 resc_cpu: 39663990
 ix_sel: 0.999795 ix_sel_with_filters: 0.999795
 Cost: 376.798685 Resp: 376.798685 Degree: 1
 Best:: AccessPath: TableScan
 Cost: 54.059340 Degree: 1 Resp: 54.059340 Card: 99979.500000 Bytes: 0.000000

***************************************

Things to see in the trace

AvgLen: 3 NDV: 21 Nulls: 0 Density: 0.000005 Min: 1.000000 Max: 100.000000
 Histogram: Freq #Bkts: 21 UncompBkts: 100000 EndPtVals: 21 ActualVal: yes
 Table: MY_HISTOGRAM Alias: MY_HISTOGRAM
 Card: Original: 100000.000000 Rounded: 99980 Computed: 99979.500000 Non Adjusted: 99979.500000

Density: 0.000005
Min: 1.000000
Max: 100.000000
Histogram: Freq
#Bkts: 21
UncompBkts: 100000
EndPtVals: 21
Card: Original: 100000.000000
Card Rounded: 99980

Also the Costing section

Costing for the Table Scan

Access Path: TableScan
 Cost: 54.059340 Resp: 54.059340 Degree: 0
 Cost_io: 53.000000 Cost_cpu: 23360195
 Resp_io: 53.000000 Resp_cpu: 23360195

Costing for the index

Access Path: index (AllEqRange)
 Index: SKEW_IDX
 resc_io: 375.000000 resc_cpu: 39663990
 ix_sel: 0.999795 ix_sel_with_filters: 0.999795
 Cost: 376.798685 Resp: 376.798685 Degree: 1

Comparing the cost of the table scan (54) against the cost of using the index (about 377), the best choice is the table scan.

Best:: AccessPath: TableScan
 Cost: 54.059340 Degree: 1 Resp: 54.059340 Card: 99979.500000 Bytes: 0.000000

 
