Detecting Connection Imbalance in Oracle RAC

May 8, 2015

For Oracle RAC configurations that rely on spreading the load equally among instances, ensuring that inbound connections are balanced is quite important.

Since I was not able to find a suitable metric in the Oracle Enterprise Manager, I had to create one myself.

As I started thinking about it, some fundamental questions popped up. How much difference in connection count should trigger an alert? Should I measure the percentage difference, the absolute difference (number of sessions), or something else?

Resisting the temptation to start coding convoluted logic, I reviewed what applied statistics has to offer, and this is what I found: the chi-square test!

Here is an example for homogeneity testing and here is one for independence testing.

The way I understand it, you can use that test to see whether the connections are independently (uniformly) distributed across DB instances, or whether some DB instances tend to get more or fewer connections than “expected”.

Another great thing about the chi-square test is that it is already implemented in the Oracle RDBMS.

The built-in Oracle function STATS_CROSSTAB can give us the value of chi-squared (CHISQ_OBS), the degrees of freedom (CHISQ_DF), and the statistical significance (CHISQ_SIG). What we are interested in is the statistical significance: a value less than 0.05 indicates that the data is likely not distributed uniformly.

Here is a query that can detect whether DB users prefer/avoid particular DB instances:
—————————————————————————————–

SELECT COUNT(*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, username, 'CHISQ_SIG') p_value
    FROM gv$session
    )
WHERE p_value < 0.05

—————————————————————————————–
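For intuition, here is a minimal pure-Python sketch of the kind of chi-square independence test that STATS_CROSSTAB performs, run on a small contingency table of connection counts (instances as rows, DB users as columns). The table values are made up for illustration; this is a sketch of the statistic, not Oracle's implementation:

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function (upper-tail p-value), computed via the
    series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    s, hx = df / 2.0, x / 2.0
    # first term of the series: hx**s * exp(-hx) / gamma(s + 1)
    term = math.exp(-hx + s * math.log(hx) - math.lgamma(s + 1))
    total, k = term, 0
    while term > 1e-15 * total:
        k += 1
        term *= hx / (s + k)
        total += term
    return 1.0 - total

def chi2_independence(table):
    """Pearson chi-square test of independence on a contingency table.
    Returns (statistic, degrees of freedom, p-value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of row and column factors
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical connection counts: rows = instances, columns = DB users.
balanced = [[55, 45], [45, 55]]   # mild, plausibly random imbalance
skewed = [[90, 10], [10, 90]]     # a user clearly "prefers" an instance

print(chi2_independence(balanced))  # p-value well above 0.05
print(chi2_independence(skewed))    # p-value far below 0.05
```

As with CHISQ_SIG, a p-value under 0.05 suggests the observed counts are unlikely if connections were being spread uniformly.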

Detecting connection imbalance at the client machine level is a bit more tricky, because each instance receives a few connections from the server it runs on.
That can easily be accounted for by excluding the servers that run the DBs:
—————————————————————————————–

SELECT COUNT (*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, machine, 'CHISQ_SIG') p_value
     FROM gv$session
     WHERE machine NOT LIKE '%dbp%.newsamerica.com'
     )
WHERE p_value < 0.05

—————————————————————————————–

These monitoring queries work without modification for a RAC cluster of any size. Adding or removing nodes is handled without issues, apart from a temporary imbalance that may follow adding a node.

“Orphaned” LOB Segments

October 31, 2014

Recently, after doing a reorganization in an Oracle 11.2.0.2.0 DB, I came upon a LOB segment that did not link to a LOB.
That is, there was an entry in USER_SEGMENTS and USER_OBJECTS, but no entry in USER_TABLES, USER_LOBS, or USER_INDEXES.
It turned out that the dropped object was in the recycle bin (USER_RECYCLEBIN). It appears that the DROP command did not “remove” all references to the dropped object.
The moral of the story is that if something is missing, look for it in the trash (the recycle bin, that is). Some things never change…


An Oracle Distributed Query CBO Bug Finally Fixed (After 7 Years)

April 9, 2014

Optimizing distributed queries in Oracle is inherently more difficult. The CBO not only has to account for the additional resources associated with distributed processing, such as networking, but also has to get reliable table/column statistics for the remote objects.

It is well documented that Oracle has (or had) trouble passing histogram information for distributed queries (http://jonathanlewis.wordpress.com/2013/08/19/distributed-queries-3/).

In addition, Oracle was not able to pass selectivity information for “IS NULL/IS NOT NULL” filters via a DB link, even though the number of NULL records is already recorded in the NUM_NULLS column of DBA_TAB_COLUMNS.
As a result of this bug, every query with an IS NULL predicate against a remote table ended up with a cardinality of 1, even if the table contained many NULL records.
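The estimate the CBO would normally derive from those column statistics is simple arithmetic. A hypothetical sketch (the function name and the sample numbers are mine, not Oracle's):

```python
def null_predicate_cardinality(num_rows, num_nulls, negated=False):
    """CBO-style cardinality estimate for a NULL-related predicate:
    IS NULL selects roughly the NUM_NULLS rows,
    IS NOT NULL roughly the remainder."""
    return num_rows - num_nulls if negated else num_nulls

# With 10,000 rows of which 4,000 have NULL in the filtered column,
# IS NULL should be estimated near 4,000 rows, not 1:
print(null_predicate_cardinality(10000, 4000))                # 4000
print(null_predicate_cardinality(10000, 4000, negated=True))  # 6000
```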

PLAN_TABLE_OUTPUT
SQL_ID  djpaw3d54d5uq, child number 0
-------------------------------------
select 
       * 
from 
       tab1@db_link_loop a , dual  
where 
       a.num_nullable is null

Plan hash value: 3027949496

-------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |     8 (100)|          |        |      |
|   1 |  NESTED LOOPS      |      |     1 |    11 |     8   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| DUAL |     1 |     2 |     2   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | TAB1 |     1 |     9 |     6   (0)| 00:00:01 | DB_LI~ | R->S |
-------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT ID,NUM,NUM_NULLABLE FROM TAB1 A 
WHERE NUM_NULLABLE IS  NULL (accessing 'DB_LINK_LOOP')

The behavior was due to MOS Bug 5702977, “Wrong cardinality estimation for ‘is NULL’ predicate on a remote table”.

Fortunately, the bug is fixed in 12c and 11.2.0.4. A patch is available for 11.2.0.3 on certain platforms.


What is the Deal with AVG_LEAF_BLOCKS_PER_KEY (DBA_INDEXES) !?!

December 17, 2013

The AVG_LEAF_BLOCKS_PER_KEY column in DBA_INDEXES should show the average number of leaf blocks in which each distinct value in the index appears. Common sense and Christian Antognini’s book Troubleshooting Oracle Performance (pg. 136) suggest that the value of AVG_LEAF_BLOCKS_PER_KEY should be derived from the values of LEAF_BLOCKS and DISTINCT_KEYS.

Well, that does not appear to be the case in Oracle 11g!

This query can help you find the indexes that do not conform to the rule:

select 
       * 
from 
       dba_indexes
where 
       DISTINCT_KEYS > 0
and 
       AVG_LEAF_BLOCKS_PER_KEY  > 3*LEAF_BLOCKS/DISTINCT_KEYS
and 
       LEAF_BLOCKS > DISTINCT_KEYS

The discrepancy is certainly not a rounding error. I have an index where the value of AVG_LEAF_BLOCKS_PER_KEY is 150 times larger than LEAF_BLOCKS/DISTINCT_KEYS.
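The sanity check in the query above boils down to simple arithmetic; here is a hedged Python sketch of the same rule (the 3x threshold is the same arbitrary margin used in the query, and the sample numbers are illustrative):

```python
def suspicious_avg_leaf_blocks_per_key(leaf_blocks, distinct_keys,
                                       avg_leaf_blocks_per_key):
    """Flag an index whose reported AVG_LEAF_BLOCKS_PER_KEY exceeds 3x
    what LEAF_BLOCKS / DISTINCT_KEYS implies (only for indexes with more
    leaf blocks than distinct keys, where the ratio is meaningful)."""
    if distinct_keys <= 0 or leaf_blocks <= distinct_keys:
        return False
    return avg_leaf_blocks_per_key > 3 * leaf_blocks / distinct_keys

# Derived ratio: 1000 / 100 = 10 leaf blocks per key on average.
print(suspicious_avg_leaf_blocks_per_key(1000, 100, 11))    # False
print(suspicious_avg_leaf_blocks_per_key(1000, 100, 1500))  # True (150x off)
```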

The problem seems to happen only when stats are gathered (DBMS_STATS) with AUTO sampling. Computing stats (estimate_percent => NULL) resolves the issue.

Ironically, DBMS_STATS AUTO sampling delivers good estimates for LEAF_BLOCKS and DISTINCT_KEYS. If only Oracle computed AVG_LEAF_BLOCKS_PER_KEY by using LEAF_BLOCKS and DISTINCT_KEYS, everything would have been great…


Same SQL_ID, same execution plan (SQL_PLAN_HASH_VALUE), greatly different elapsed time – a simple way to troubleshoot this problem with ASH

November 27, 2012

A query runs fine in one environment but is terrible in another. The usual suspects (different execution plan, hardware issues, etc.) are ruled out.

The next step is to look at the actual execution statistics and see where the difference is. Re-executing the statements with the GATHER_PLAN_STATISTICS hint was not an option. SQL Monitoring was running, but the query was already past the retention period.

This is where Oracle Active Session History (ASH), a separately licensed option, comes into play.

Using the sql_plan_line_id column of the dba_hist_active_sess_history ASH view to get a load profile is not a new technique (http://eastcoastoracle.org/PDF_files/2011/Presentations/BarunSQL_Performance_Tuning_with_Oracle_ASH_AWR_Real_World_Use_Cases_Public.pdf).

A similar technique can be used to easily find out at which plan step the two executions start to differ.

At DB1:
select sql_plan_line_id , count(*)
from dba_hist_active_sess_history
where 
    sql_id = 'fmhbn1tn0c54z'
and sample_time
         between to_date('11/26/2012:10:00:00','MM/DD/YYYY:HH24:MI:SS')
         and to_date('11/26/2012:11:00:00','MM/DD/YYYY:HH24:MI:SS')
group by sql_plan_line_id
order by 1

"SQL_PLAN_LINE_ID" "COUNT(*)"
34                 135
35                 5
36                 2
37                 1
"" 2"",            2

At DB2:
select sql_plan_line_id , count(*)
from dba_hist_active_sess_history
where 
    sql_id = 'fmhbn1tn0c54z'
and sample_time
        between to_date('11/25/2012:09:00:00','MM/DD/YYYY:HH24:MI:SS')
        and to_date('11/25/2012:10:00:00','MM/DD/YYYY:HH24:MI:SS')
group by sql_plan_line_id
order by 1

"SQL_PLAN_LINE_ID" "COUNT(*)"
33                  467 <—— Deviation
34                  135
35                  5
36                  2
37                  1
"" 2"",             2

We see that the difference starts at line 33 of the execution plan, and we can focus on finding the root cause.
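The comparison itself can be automated. A small sketch, assuming the ASH counts have already been fetched into dicts keyed by sql_plan_line_id (the divergence rule and 2x ratio are my own illustrative choices):

```python
def first_divergent_line(profile_a, profile_b, ratio=2.0):
    """Given two {sql_plan_line_id: sample_count} ASH load profiles for
    the same plan, return the first plan line where they diverge: a line
    sampled in only one profile, or whose counts differ by more than
    `ratio`. Returns None if the profiles match."""
    for line in sorted(set(profile_a) | set(profile_b)):
        a, b = profile_a.get(line, 0), profile_b.get(line, 0)
        if min(a, b) == 0 or max(a, b) / min(a, b) > ratio:
            return line
    return None

# The two profiles from the queries above (NULL row omitted):
db1 = {34: 135, 35: 5, 36: 2, 37: 1}
db2 = {33: 467, 34: 135, 35: 5, 36: 2, 37: 1}
print(first_divergent_line(db1, db2))  # 33
```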

June 7, 2013:

Great note by Kerry Osborne: http://kerryosborne.oracle-guy.com/2013/06/sql-gone-bad-but-plan-not-changed/. Identical PLAN_HASH_VALUEs are not a guarantee that the underlying execution plans are identical. This does not change the original post in any way; it is just something to be aware of.