How to Find the Optimal Configuration for Your Virtualized Environment

December 16, 2015

Thanks to the New York Oracle User Group for allowing me to present and to all who attended for their valuable questions and comments.
The presentation showcases a framework for finding optimal server configurations that can be useful to anyone who is dealing with sever consolidation. Those working with Oracle Enterprise Manager/Grid Control can get some practical tips as well.

The PDF version of the presentation is here.
The Power Point version of the presentation is here. Many PowerPoint slides have notes.

Advertisements

Detecting Connection Imbalance in Oracle RAC

May 8, 2015

For Oracle RAC configurations that rely on spreading the load equally among instances, ensuring that inbound connections are balanced is quite important.

Since I was not able to find a suitable metric in the Oracle Enterprise Manager, I had to create one myself.

As I started thinking about it, some fundamental questions started popping up. How much difference in connection count should trigger an alert? Should I measure the percentage difference, the actual difference (number of sessions) , or something else?

Resisting the temptation to start coding convoluted logic, I reviewed what applied statistics has to offer and this is what I found – the Chi Square test !

Here is an example for homogeneity testing and here is one for independence testing.

The way I understand it, you can use that test to see if the connections are independently (uniformly) distributed across DB instances, or if there are some DB instances tend to get more or fewer connections than “expected”.

Another great thing about Chi Square test is that it is already implemented in the Oracle RDBMS.

The build-in Oracle function (STATS_CROSSTAB) can give us the value of Chi-squared (CHISQ_OBS) , the degree of freedom (CHISQ_DF), and the statistical significance (CHISQ_SIG).  What we are interested in is the statistical significance. A number less that 0.05 indicates that the data is likely not distributed uniformly.

Here is the query that can detect if a DB user prefers/avoids a DB instance:
—————————————————————————————–

SELECT COUNT(*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, username, 'CHISQ_SIG') p_value
    FROM gv$session
    )
WHERE p_value < 0.05

—————————————————————————————–

Detecting connection imbalance at client machine level is bit more tricky because each instance received a few connections from the server it is on.
That can be easily accounted for my excluding the servers that run the DBs:
—————————————————————————————–

SELECT COUNT (*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, machine, 'CHISQ_SIG') p_value
     FROM gv$session
     WHERE machine NOT LIKE '%dbp%.newsamerica.com'
     )
WHERE p_value < 0.05

—————————————————————————————–

These monitoring queries work without modification for any size RAC cluster. Adding or removing nodes is handled without issues, apart from a temporary imbalance that may come with adding nodes.


What to Do about Tools Hoarding Parallel Slaves in Oracle

November 26, 2014

Most GUI tools use cursors with relatively small fetch size (around 50) to retrieve data from Oracle. They open the cursor, fetch some data, show it and then wait on user input. All resources related to the connection and the open session are held while the tool waits on the user. While those resources are usually trivial for serial SELECT statements, they can be significant for parallel SELECT statements.

Each parallel statement gets assigned a number of parallel slaves and those slaves are not freed until their respective cursor is closed, regardless of the amount of work the slaves do. Since there are limited number of parallel slaves available in an instance (PARALLEL_MAX_SERVERS init.ora parameter) the hoarding of parallel slaves can prevent future statements to be executed in parallel, severely impacting the performance of those statements.

This following blog post describes the situation quite well:
http://structureddata.org/2012/02/29/pitfalls-of-using-parallel-execution-with-sql-developer/

Since this behavior is not a bug, i.e. there is never going to be a fix, we need to find a way to manage it.

One solution is to disable parallelism for all sessions coming from GUI tools that use cursors and therefore could cause this problem. This is a radical step that would deprive the users from running anything in parallel.

The second option is to educate the users of the problems associated with open cursors and ask them to close all cursor as soon as possible. This approach is ideal when executed diligently, but in reality not all users are compliant.

The approach I would like to propose is to allow parallelism for all, but monitor those who do not close their open cursors. Here is the query that I use to monitor:

select count(*) from 
  (
      select * 
      from gv$px_session px_QC   
      where px_QC.qcinst_id IS NULL
      minus
      select * from gv$px_session px_QC   
      where px_QC.qcinst_id IS NULL
      and  exists
            (select * 
             from gv$px_session  px_Slaves , gv$session sess
             where px_QC.qcsid = px_Slaves.qcsid
             and px_Slaves.sid = sess.sid
             and (sess.wait_class = 'Idle'
                 or ( sess.seconds_in_wait < 600
                      and sess.wait_class = 'Idle'
                     )
                 )
             )
   )

This query returns zero if the parallel slaves are actively used and greater than zero if there is a set of parallel slaves that have been idle for 600 second. This query can be used to get the offending session and “talk” with the end user. It could be integrated with OEM using Metric Extensions or it could be part of a monitoring script that kills the offending sessions. The possibilities are endless.


Some more useful undocumented OEM 12c repository tables/views

June 28, 2013

As mentioned in a previous post, MGMT_ECM_HW contains a wealth of information about hosts in OEM.

The problem is that CLOCK_FREQ_IN_MHZ column in MGMT_ECM_HW is not properly populated. All hosts had the value of 100 there – an obvious mistake.

That prompted me to do a search through the OEM repository objects. This is what I found:

MGMT_ECM_HW_CPU is similar to MGMT_ECM_HW, but FREQ_IN_MHZ column is properly populated. No need to parse descriptions to get the CPU’s speed

MGMT_ECM_HW_NIC – lots of information about network cards

MGMT_ECM_OS – OS related information

MGMT_ECM_HW_IOCARD – information about IO peripheral devices

 


Hotsos 2013 and some more useful OEM queries

January 31, 2013

Busy preparing for Hotsos 2013. My presentation this year is “Working with Confidence: How Sure Is the Oracle CBO about Its Cardinality Estimates, and Why Does It Matter?” I know it is a mouthful. The presentation/white paper would be up to the point with lots of pictures though 🙂

The only thing I have for the blog now is a revised version of a query I posted the last time. The original version was way too crude.

select   a.rollup_timestamp ,

         count(*) ,

         sum(c.cpu_count) ,

         sum(a.average*c.cpu_count)/ sum(c.cpu_count) ,

         sum(a.maximum*c.cpu_count)/ sum(c.cpu_count) ,

         sum((a.average+3*a.standard_deviation)*c.cpu_count) / sum(c.cpu_count)

from

         mgmt$metric_hourly a ,

         mgmt$target b ,

         sysman.MGMT_ECM_HW c

where    a.metric_name = 'Load'

and      a.column_label = 'CPU Utilization (%)'

and      a.target_guid = b.target_guid

and      b.target_name like 'dev%'

and      c.hostname||'.'||c.domain = b.target_name

and      c.vendor_name = 'Intel Based Hardware'

group by a.rollup_timestamp

order by 5 desc

The query now takes into account the number of CPUs on a server. I use the undocumented MGMT_ECM_HW view – it contains a wealth of information.

The query assumes that all CPUs are identical – not perfect, but much better than the original.


More Useful OEM Repository Queries

December 31, 2012

I already included a number of queries against the OEM repository in my presentation.
Now, I would like to add two more.

The first query shows an approximation of the average and max CPU utilization for a group of servers (names starting with “dev”).
Since the query assumed that each server has the same “weight”, the results are not completely correct for most environments.
They are a good start for a consolidation and/or virtualization projects though.

select
      a.rollup_timestamp           time, 
      count(*)                     number_of_servers,
      sum(a.average)/count(*)      average_load ,
      sum(a.maximum)/count(*)      max_recorded,
      sum(a.average+3*a.standard_deviation)/count(*) max_derived
from
      mgmt$metric_hourly a ,
      mgmt$target b
where
      a.metric_name = 'Load'
  and a.column_label = 'CPU Utilization (%)'
  and a.target_guid = b.target_guid
  and b.target_name like 'dev%'
group by a.rollup_timestamp
order by max_derived desc

1/31/2013 – The query above is quite crude – I just posted a better query  – http://wp.me/p1DHW2-5J

The second query shows all databases that belong to the TEST OEM group and are in ARCHIVELOG mode. The query can be modified to look at all databases or to any subset.

select *
from 
sysman.MGMT$GROUP_DERIVED_MEMBERSHIPS O ,
    sysman.MGMT$TARGET T ,
    sysman.MGMT$AVAILABILITY_CURRENT st
where    
     o.composite_target_name = 'TEST'
and  o.member_target_type IN ('oracle_database', 'rac_database')
and  ( t.target_type ='rac_database'
       or (t.target_type ='oracle_database'
           and t.type_qualifier3 != 'RACINST'))
and  o. member_target_guid = t.target_guid
and  t.target_guid = st.target_guid
and  st.availability_status = 'Target Up'
and  (not exists 
      (select
             *
       from
            sysman.mgmt$metric_current i
       where 
             i.target_guid = o.member_target_guid 
       and   metric_name   = 'archFull'
       and   metric_Column = 'archTotal'
       and   metric_label  = 'Archive Area'
       and   column_label  = 'Total Archive Area (KB)'
       and   key_value     = 'NOARCHIVELOG'
    ));

This query is not strictly based on the documentation, but it work in OEM 12c.


How to Use the New OEM 12c Metric Extensions to Enforce Enterprise-Wide Custom Policies

August 8, 2012

Oracle Enterprise Manager’s User-Defined Metrics (UDMs) were a very suitable vehicle for enforcing enterprise-wide policies. My NYOUG 2011/VirtaThon presentation and white paper contain a comprehensive example – monitoring forced logging status of all production databases.

Since UDMs are gone in OEM 12c, we’ll see how to use Metric Extensions, the successor of UDMs, to accomplish the same goal.

Metric Extensions (MEs) represent a major improvement over UDMs in terms of code maintenance and ability to deploy to large number of targets. MEs, however, do not allow us to customize the SQL query for a specific target. Even though there is a good reason behind the behavior, it requires us to substantially change the architecture we have previously used.

In the original presentation, I used one UDM ( ForcedLogging ) that had different definitions for different targets.

In OEM 12c, we’ll need to have different Metric Extensions for each of the definitions and target types.

For most single instance DB targets, we’ll use the standard “ForcedLogging” ME that is defined with this query:


select force_logging from v$database

For most RAC DB targets, we’ll use “ForcedLoggingRAC” ME that is defined with the same query:


select force_logging from v$database

And if we want to ignore the “force logging” status of a database, then we can use ForcedLoggingIgnore” ME defined with this query:


select 'YES' from v$database

The query/Metalized View that actually enforces the enterprise-wide force logging policy gets a little more complex:


CREATE MATERIALIZED VIEW "SYSMAN"."UDM_FORCED_LOGGING" ("CNT") ORGANIZATION HEAP PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT) TABLESPACE "MGMT_TABLESPACE" BUILD IMMEDIATE USING INDEX REFRESH COMPLETE ON DEMAND START WITH sysdate+0 NEXT SYSDATE + 1/96 USING DEFAULT LOCAL ROLLBACK SEGMENT USING ENFORCED CONSTRAINTS DISABLE QUERY REWRITE
AS
SELECT COUNT( member_target_guid ) cnt
FROM SYSMAN.MGMT$GROUP_DERIVED_MEMBERSHIPS O ,
SYSMAN.MGMT$TARGET T ,
sysman.MGMT$AVAILABILITY_CURRENT st
WHERE o.composite_target_name = 'PRODUCTION'
AND o.member_target_type     IN ('oracle_database', 'rac_database')
AND ( t.target_type           ='rac_database'
OR (t.target_type             ='oracle_database'
AND t.type_qualifier3        != 'RACINST'))
AND o. member_target_guid     = t.target_guid
AND t.target_guid             = st.target_guid
AND st.availability_status    = 'Target Up'
AND (NOT EXISTS
(SELECT *
FROM sysman.mgmt$metric_current i
WHERE I.TARGET_GUID      = O.MEMBER_TARGET_GUID
AND metric_name          = 'ME$ForcedLogging'
AND column_label         = 'ForcedLogging'
AND Metric_Column        = 'ForcedLogging'
AND collection_timestamp > sysdate - 30/1440
AND value                = 'YES'
)
AND NOT EXISTS
(SELECT *
FROM sysman.mgmt$metric_current i
WHERE i.target_guid      = o.member_target_guid
AND metric_name          = 'ME$ForcedLoggingRAC'
AND column_label         = 'ForcedLogging'
AND Metric_Column        = 'ForcedLogging'
AND collection_timestamp > sysdate - 30/1440
AND value                = 'YES'
)
AND NOT EXISTS
(SELECT *
FROM sysman.mgmt$metric_current i
WHERE i.target_guid      = o.member_target_guid
AND metric_name          = 'ME$ForcedLoggingIgnore'
AND column_label         = 'ForcedLogging'
AND Metric_Column        = 'ForcedLogging'
AND collection_timestamp > sysdate - 30/1440
AND value                = 'YES'
));