Oracle 11gR2 Improvements in Managing Extended Stats and Why Looking for “Useful” Column Groups Is Still Relevant.

July 26, 2011

In 11gR2 Oracle introduced DBMS_STATS.SEED_COL_USAGE ( procedure to show us which column groups are used by a workload, so we can add extended stats on those column groups.
Oracle added in another two useful DBMS_STATS subroutines  ( ):

 REPORT_COL_USAGE function  – reports column usage in friendly format, need to set serverout (SQLPlus)

RESET_COL_USAGE procedure – resets column usage statistics to unused

The whole process is nicely explained in

In short, the above techniques and procedures answer the question which column groups are used by a workload.

On the other hand, the procedure I developed tries to find out which column groups would benefit from creating extended statistics. It is an effort to separate the good from the useless extended stats.

Why would we care which column groups will help the optimizer when we know all columns groups that are used?

We should create extended stats on all column groups that are used by the workload?!?

That would be a good start, but there are some things to consider.

First, maintaining extended stats consumes resources and we should not expend resources for things or processes that bring no benefit. In short, we should get rid of the “useless” extended stats.

Second, there is limit of 20 extended stats for a table. Functional indexes and virtual column count towards that limit, so it is possible, particularly for fact tables in DW system, to reach the limit. If that is to happen, we would have to pick “the best” extended stats.

Knowing which extended stats would benefit the optimizer the most can help up to be truly proactive. We can create extended stats not only for column groups that are used by an existing workload, but also for column groups that are not used by a query yet.

I will be enhancing my procedure to take  advantage of this great new Oracle functionality…

Getting the Best of Oracle Extended Statistics

July 5, 2011

Oracle 11g introduced a great new feature – extended statistics. It allows DBAs to give the Oracle optimizer information about combined column selectivity when the assumption of column independence does not apply ( ).

 The DBA is supposed to identify when dependant/correlated columns are the reason for a performance issue and then add extended stats to deal with the identified problem.

How reactive!!!

Oracle Advisors are of limited help here. Granted, SQL Tuning Advisor can identify one or many problem SQL statements and propose profiles for them, but it will not lead you to the core problem – dependant/correlated columns. The approach only manages the symptoms instead of resolving the core issue.

Wouldn’t it be great to solve most of the problems related to dependant/correlated columns once and for all instead of repeatedly going though the “report (by a screaming user :() ->diagnose->fix” cycle?

Well, you can.

We can scan all columns that are at risk of causing dependant/correlated column issues and find out exactly which column pairs are dependant/correlated and could cause the CBO to make wildly incorrect cardinality estimates. The set of column pairs can be further reduced by looking only at column pairs that were actually used by existing queries. Additional custom restrictions can also be applied.

First, let’s start by explaining how to detect dependant/correlated columns and how to quantify the extent of the problem (if one exists).

Oracle implements extended statistics for column groups by creating a virtual column that corresponds to the concatenation of the two columns and then gathering statistics on that virtual column.  The virtual column is defined with the internal function sys_op_combined_hash (column1 , column2). We will use that function to represent the combined column and get basic statistics for it, but we will not be creating extended statistics as part of this process.  If cardcol1,col2  is the number of distinct values of the virtual combined column, and cardcol1 and cardcol2 are the number of distinct values for col1 and col2 respectively, then dependent columns error err can be defined as

                       err=(cardcol1 x cardcol2)/ cardcol1,col2 .

If two columns are independent then err would be 1. The bigger err, the more inadequate CBO’s assumption of independence, the bigger the discrepancy between estimated and actual intermediate data set size, the bigger potential for bad execution plans.

Next, let’s go though the implementation. The method calls for potentially reviewing all column pairs in a table, so for a table with n columns we might need to inspect n(n-1)/2 pairs. Since the resources, mostly CPU, required to do the computation can be significant for large n, we should try to filter out those columns that are not likely to lead us to pairs with high err value.

  • We should exempt columns with low cardinality because they can never produce high err values. The cardinality of the combined column is always bigger than the cardinality of any of the two column cardinalities – cardcol1,col2  > cardcol2 . Then

err=(cardcol1 x cardcol2)/ cardcol1,col2 < (cardcol1 x cardcol2)/ cardcol2= cardcol1.

        If cardcol1 is small (smaller than 10 for instance) then err would be small.

  • We should exempt columns with high cardinality compared to the number of records in the table. Oracle CBO is  aware that we cannot get less than one record from a table/set. When we use a search through a high cardinality column we reduce the cardinality of the set significantly, leaving it with only few records (something like 10 or 20). Since there are so few records left in the set there is limited damage if that number is incorrect (lower) due to dependant columns.

To improve performance, the script could utilize parallelism, provided it is set up at instance level. The script has also the ability to scan only a portion of the tables (SAMPLE), allowing further performance improvement.

Usage Notes:

This procedure should not be run against a production system, but against a non-production DB that is a recent copy of production. Any type of copy (logical or physical) is acceptable. For terabyte size DBs, the process can run multiple hours or even days. Reasonable setting of the input parameters could minimize the execution time. It is assumed that table statistics are current. No DML, DDL or stats gathering should be run while this procedure is running.

Deployment instructions:

  1. Select DB user to run the procedure and create COLUMN_PAIRS table using this script .
  2. Grant execute on DBA_SQL to the selected DB user.
  3. Deploy ext_stats_finder_i package using this scripts
  4. Make sure that the selected DB user has read-only privileges on the tables to be reviewed for correlated columns. Those tables could belong to a different schema.


  • p_db_user           varchar2         – the owner of the tables to be reviewed for correlated columns
  • p_min_card      number           – the minimal number of distinct values required to include a column for consideration. For example, if this value is 5, columns with less than 5 distinct values will not be reviewed at all.
  • p_max_card_percent  number    – the maximum percentage, defined by number of distinct values over number of record in the table,  allowed to include a column for consideration. For example, if this value is 60, then a column that has 78000 distinct values in a table that has 100000 records will not be considered.
  • p_parallelism    number                        – the number of parallel processes that will be used by the procedure. Parallel processing (max_parallel_servers, etc) must be configured on instance level in order for this option to work.
  • p_sample_percentage                            – what percentage of the table will be sampled by the procedure.  Note that the smaller the values the lower the precision of the results.


            All output is written in COLUMN_PAIRS table. The table is written to as the procedure goes though all tables.

            Column in COLUMN_PAIRS:

  • DB_USER_NAME                   – owner of the table
  • TABLE_NAME                         – the name of the table 
  • COLUMN_NAME_1               – the first column of the column pair
  • COLUMN_NAME_2               – the second column of the column pair
  • ERR                                              – dependant/correlated column error
  • CREATED_DATE                    – date when the records was inserted into the table
  • USER_BY_SQL                        – how may SQLs used both of the columns in the column pair. The value is approximate.




Oracle introduced new features related to Oracle extended stats. Please check my recent blog entry about that –“useful”-column-groups-is-still-relevant