Oracle 11g introduced a great new feature – extended statistics. It allows DBAs to give the Oracle optimizer information about combined column selectivity when the assumption of column independence does not apply (http://www.oracle.com/technetwork/articles/sql/11g-manage-101028.html ).
The DBA is supposed to identify when dependant/correlated columns are the reason for a performance issue and then add extended stats to deal with the identified problem.
Oracle Advisors are of limited help here. Granted, SQL Tuning Advisor can identify one or many problem SQL statements and propose profiles for them, but it will not lead you to the core problem – dependant/correlated columns. The approach only manages the symptoms instead of resolving the core issue.
Wouldn’t it be great to solve most of the problems related to dependant/correlated columns once and for all instead of repeatedly going though the “report (by a screaming user ) ->diagnose->fix” cycle?
Well, you can.
We can scan all columns that are at risk of causing dependant/correlated column issues and find out exactly which column pairs are dependant/correlated and could cause the CBO to make wildly incorrect cardinality estimates. The set of column pairs can be further reduced by looking only at column pairs that were actually used by existing queries. Additional custom restrictions can also be applied.
First, let’s start by explaining how to detect dependant/correlated columns and how to quantify the extent of the problem (if one exists).
Oracle implements extended statistics for column groups by creating a virtual column that corresponds to the concatenation of the two columns and then gathering statistics on that virtual column. The virtual column is defined with the internal function sys_op_combined_hash (column1 , column2). We will use that function to represent the combined column and get basic statistics for it, but we will not be creating extended statistics as part of this process. If cardcol1,col2 is the number of distinct values of the virtual combined column, and cardcol1 and cardcol2 are the number of distinct values for col1 and col2 respectively, then dependent columns error err can be defined as
err=(cardcol1 x cardcol2)/ cardcol1,col2 .
If two columns are independent then err would be 1. The bigger err, the more inadequate CBO’s assumption of independence, the bigger the discrepancy between estimated and actual intermediate data set size, the bigger potential for bad execution plans.
Next, let’s go though the implementation. The method calls for potentially reviewing all column pairs in a table, so for a table with n columns we might need to inspect n(n-1)/2 pairs. Since the resources, mostly CPU, required to do the computation can be significant for large n, we should try to filter out those columns that are not likely to lead us to pairs with high err value.
- We should exempt columns with low cardinality because they can never produce high err values. The cardinality of the combined column is always bigger than the cardinality of any of the two column cardinalities - cardcol1,col2 > cardcol2 . Then
err=(cardcol1 x cardcol2)/ cardcol1,col2 < (cardcol1 x cardcol2)/ cardcol2= cardcol1.
If cardcol1 is small (smaller than 10 for instance) then err would be small.
- We should exempt columns with high cardinality compared to the number of records in the table. Oracle CBO is aware that we cannot get less than one record from a table/set. When we use a search through a high cardinality column we reduce the cardinality of the set significantly, leaving it with only few records (something like 10 or 20). Since there are so few records left in the set there is limited damage if that number is incorrect (lower) due to dependant columns.
To improve performance, the script could utilize parallelism, provided it is set up at instance level. The script has also the ability to scan only a portion of the tables (SAMPLE), allowing further performance improvement.
This procedure should not be run against a production system, but against a non-production DB that is a recent copy of production. Any type of copy (logical or physical) is acceptable. For terabyte size DBs, the process can run multiple hours or even days. Reasonable setting of the input parameters could minimize the execution time. It is assumed that table statistics are current. No DML, DDL or stats gathering should be run while this procedure is running.
- Select DB user to run the procedure and create COLUMN_PAIRS table using this script .
- Grant execute on DBA_SQL to the selected DB user.
- Deploy ext_stats_finder_i package using this scripts
- Make sure that the selected DB user has read-only privileges on the tables to be reviewed for correlated columns. Those tables could belong to a different schema.
- p_db_user varchar2 – the owner of the tables to be reviewed for correlated columns
- p_min_card number – the minimal number of distinct values required to include a column for consideration. For example, if this value is 5, columns with less than 5 distinct values will not be reviewed at all.
- p_max_card_percent number – the maximum percentage, defined by number of distinct values over number of record in the table, allowed to include a column for consideration. For example, if this value is 60, then a column that has 78000 distinct values in a table that has 100000 records will not be considered.
- p_parallelism number – the number of parallel processes that will be used by the procedure. Parallel processing (max_parallel_servers, etc) must be configured on instance level in order for this option to work.
- p_sample_percentage – what percentage of the table will be sampled by the procedure. Note that the smaller the values the lower the precision of the results.
All output is written in COLUMN_PAIRS table. The table is written to as the procedure goes though all tables.
Column in COLUMN_PAIRS:
- DB_USER_NAME – owner of the table
- TABLE_NAME – the name of the table
- COLUMN_NAME_1 – the first column of the column pair
- COLUMN_NAME_2 – the second column of the column pair
- ERR - dependant/correlated column error
- CREATED_DATE - date when the records was inserted into the table
- USER_BY_SQL – how may SQLs used both of the columns in the column pair. The value is approximate.
Oracle 22.214.171.124.0 introduced new features related to Oracle extended stats. Please check my recent blog entry about that – http://iiotzov.wordpress.com/2011/07/26/oracle-11gr2-improvements-in-managing-extended-stats-and-why-looking-for-“useful”-column-groups-is-still-relevant