Dependent/correlated sets in Oracle – definition, problems and solutions.

To better understand dependent/correlated sets, let’s take a brief look at dependent/correlated columns.

Oracle works under the assumption that the data in each column is independent. If an equality predicate on column COL1 leaves 10% of the records, and an equality predicate on column COL2 leaves 20% of the records, then by default, the Oracle CBO would assume that predicates on both COL1 and COL2 would leave 2 % (10%x20%) of the records. If the data is COL1 is correlated with the data in COL2 then equality predicates on both COL1 and COL2 would leave significantly more than 2% of the records. That difference in estimated cardinality could cause huge troubles when estimating consequent execution steps.

Oracle recognized the problem and came with a comprehensive solution – extended statistics in Oracle 11g.

Now about dependent/correlated sets – two sets are dependent/correlated if they contain the same or similar records. Oracle assumes that sets are independent. If a predicate like this “COL1 IN (select.. from .. SET1 ) “  leaves 10% of the records, and a similar predicate “COL1 IN (select.. from .. SET2 ) “  leaves 20% of the records, then Oracle assumed that both predicates “COL1 IN (select.. from .. SET1 ) “ AND “COL1 IN (select.. from .. SET2 ) “   would leave 2%(10%x20%) of the records. If SET1 is identical or very similar to SET2, then the “COL1 IN (select.. from .. SET1 ) “ AND “COL1 IN (select.. from .. SET2 ) “ would leave significantly more than 2% of the records. Needless to say, the difference in estimated cardinality could cause huge troubles when estimating consequent execution steps.

The big challenge is that there is nothing out of the box that can help us in the above scenario.

Tables TAB1 and TAB2 would provide us with the correlated sets (SET1,SET2)

create table tab1 as
with generator as (
select
  rownum id
 from dual
 connect by
  rownum <= 4000
)
select
  id col1, 
  mod(id,1024) col2 , 
  mod(id,2) col3
from (
   select
     /*+ no_merge */
     rownum id
   from
     generator,
     generator
   where
     rownum <= 1000000
)
;

create table tab2 as
with generator as (
select
  rownum id
from dual
connect by
  rownum <= 4000
)
select
  id col1, 
  mod(id,1024) + 134 col2 , 
  mod(id,2) col3
from (
   select
     /*+ no_merge */
     rownum id
   from
     generator,
     generator
   where
rownum <= 1000000
)
;

TAB3 would be the table we are going to apply the filters with those correlated sets

create table tab3 as
with generator as (
select
  rownum id
from dual
connect by
  rownum <= 4000
)
select
  id col1
from (
   select
     /*+ no_merge */
     rownum id
   from
     generator,
     generator
   where
rownum <= 1000000000
)

Gather stats:

exec dbms_stats.gather_table_stats(NULL,'TAB1');

exec dbms_stats.gather_table_stats(NULL,'TAB2');

exec dbms_stats.gather_table_stats(NULL,'TAB3');

The query we are interested in is

select
    t3.col1
from
    tab3 t3 ,
    tab1 t1 ,
    tab2 t2
where
    t3.col1 = t1.col1
and t1.col2 in (66,166,316,416,516,616)
and t3.col1 = t2.col1
and t2.col2 in (200,300,450,550,650,750)
and t1.col3 = t2.col3

Please note that the sets that come from TAB1 and TAB2 are identical, so this clause

t3.col1 = t1.col1 and t3.col1 = t2.col1

would eliminate significantly fewer records that Oracle CBO’s estimate.

Due to the described dependent/correlated sets behaviors, the CBO (DBMS_XPLAN.DISPLAY_CURSOR) estimates that only 17 records will be returned by the query, even though the query returns 5861 records.

---------------------------------------------------------
| Id | Operation         |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0  | SELECT STATEMENT  |     |     |      |8182 (100) |
|* 1 | HASH JOIN         |     |17   |510   |8182 (4)   |
|* 2 | TABLE ACCESS FULL |TAB2 |5859 |70308 |657 (4)    |
|* 3 | HASH JOIN         |     |5872 |103K  |7524 (4)   |
|* 4 | TABLE ACCESS FULL |TAB1 |5859 |70308 |653 (4)    |
| 5  | TABLE ACCESS FULL |TAB3 |16M  |91M   |6786 (2)   |
---------------------------------------------------------

Predicate Information (identified by operation id):
—————————————————

1 - access("T3"."COL1"="T2"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))
3 - access("T3"."COL1"="T1"."COL1")
4 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))

Dynamic sampling, even at the max level, actually makes the estimate worse

exec dbms_stats.delete_table_stats(NULL,'TAB1');

exec dbms_stats.delete_table_stats(NULL,'TAB2');

exec dbms_stats.delete_table_stats(NULL,'TAB3');

alter session set optimizer_dynamic_sampling=10

---------------------------------------------------------
| Id | Operation         |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0  | SELECT STATEMENT  |     |     |      |8156 (100) |
|* 1 | HASH JOIN         |     |1    |91    |8156 (4)   |
|* 2 | TABLE ACCESS FULL |TAB2 |5861 |223K  |644 (3)    |
|* 3 | HASH JOIN         |     |5861 |297K  |7511 (4)   |
|* 4 | TABLE ACCESS FULL |TAB1 |5861 |223K  |640 (3)    |
| 5  | TABLE ACCESS FULL |TAB3 |16M  |198M  |6786 (2)   |
---------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("T3"."COL1"="T2"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))
3 - access("T3"."COL1"="T1"."COL1")
4 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))

Note
-----
- dynamic sampling used for this statement (level=10)

One way to resolve this is to force the execution order, so TAB3 is visited before TAB2 or TAB1.

Another solution is to create a virtual column identical to COL1

alter table tab3 add col1_corr generated always as (col1*1 ) not null ;

,gather stats

exec dbms_stats.gather_table_stats(NULL,'TAB1');

exec dbms_stats.gather_table_stats(NULL,'TAB2');

exec dbms_stats.gather_table_stats(NULL,'TAB3');

Then set the number of distinct records for that new virtual column to 1

exec dbms_stats.set_column_stats(NULL,'TAB3','COL1_CORR',distcnt=>1);

and use the new virtual column (COL1_CORR), instead of COL1,  for one of the predicates

select
t3.col1
from 
    tab3 t3 ,
    tab1 t1 ,
    tab2 t2
where
    t3.col1_corr = t1.col1
and t1.col2 in (66,166,316,416,516,616)
and t3.col1 = t2.col1
and t2.col2 in (200,300,450,550,650,750)
and t1.col3 = t2.col3

Now, the CBO expects 2936 records – much better than the original 17 records.

---------------------------------------------------------
| Id | Operation         |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0  | SELECT STATEMENT  |     |     |      |8182 (100) |
|* 1 | HASH JOIN         |     |2936 |103K  |8182 (4)   |
|* 2 | TABLE ACCESS FULL |TAB1 |5859 |70308 |653 (4)    |
|* 3 | HASH JOIN         |     |5872 |137K  |7528 (4)   |
|* 4 | TABLE ACCESS FULL |TAB2 |5859 |70308 |657 (4)    |
| 5  | TABLE ACCESS FULL |TAB3 |16M  |183M  |6786 (2)   |
---------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("COL1"*1="T1"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))
3 - access("T3"."COL1"="T2"."COL1")
4 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))

Setting the number of distinct values for COL1_CORR to 1 forced the CBO to believe that the clause on COL1_CORR would not eliminate any records. That’s the only way (for me anyway) to tell the optimizer that the COL1_CORR clause would not further reduce the number of records.

Interestingly, I get the same result for any value of distcnt …

Advertisements

2 Responses to Dependent/correlated sets in Oracle – definition, problems and solutions.

  1. coskan says:

    below also does the same effect with still having the cardinality wrong.

    Having said that, if the order is the case and if you still can change the code then why note hint it with just leading as it will be more stable.

    select
    t3.col1
    from
    tab3 t3 ,
    tab1 t1 ,
    tab2 t2
    where
    t3.col1 +t2.col1*0 = t1.col1
    and t1.col2 in (66,166,316,416,516,616)
    and t3.col1 = t2.col1
    and t2.col2 in (200,300,450,550,650,750)
    13 and t1.col3 = t2.col3;

    • iiotzov says:

      Thanks for your comment!

      Hinting the query to force execution order is certainly a doable solution – I’ve done it many times.

      The “hinting” path has its challenges though:

      First, I have to review the whole query and come with an execution order for all tables, even though there are only three tables I am interested in. As far as I know, it is not possible to specify a hint for relative execution order, i.e. TAB3 before TAB2.

      Second, hinting is more difficult if the query is dynamically generated. In that case, the optimal execution order may vary depending on the dynamic clauses, so hinting the query to follow certain execution order could be risky.

      Iordan Iotzov

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: