How to Find the Optimal Configuration for Your Virtualized Environment

December 16, 2015

Thanks to the New York Oracle User Group for allowing me to present and to all who attended for their valuable questions and comments.
The presentation showcases a framework for finding optimal server configurations that can be useful to anyone dealing with server consolidation. Those working with Oracle Enterprise Manager/Grid Control can pick up some practical tips as well.

The PDF version of the presentation is here.
The PowerPoint version of the presentation is here. Many of the PowerPoint slides have notes.


Detecting Connection Imbalance in Oracle RAC

May 8, 2015

For Oracle RAC configurations that rely on spreading the load equally among instances, ensuring that inbound connections are balanced is quite important.

Since I was not able to find a suitable metric in the Oracle Enterprise Manager, I had to create one myself.

As I started thinking about it, some fundamental questions popped up. How much difference in connection count should trigger an alert? Should I measure the percentage difference, the absolute difference (number of sessions), or something else?

Resisting the temptation to start coding convoluted logic, I reviewed what applied statistics has to offer, and this is what I found: the Chi-square test!

Here is an example of homogeneity testing and here is one of independence testing.

The way I understand it, you can use the test to see whether the connections are distributed uniformly across DB instances, or whether some DB instances tend to get more or fewer connections than “expected”.

Another great thing about the Chi-square test is that it is already implemented in the Oracle RDBMS.

The built-in Oracle function (STATS_CROSSTAB) can give us the value of chi-squared (CHISQ_OBS), the degrees of freedom (CHISQ_DF), and the statistical significance (CHISQ_SIG). What we are interested in is the statistical significance. A number less than 0.05 indicates that the data is likely not distributed uniformly.
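To make the mechanics concrete, here is a small Python sketch of the chi-square computation that STATS_CROSSTAB performs under the hood. The contingency table and its values are made up for illustration; rows stand for DB instances, columns for DB users, and cells for connection counts.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    rows, cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    grand = sum(row_tot)
    stat = 0.0
    for i in range(rows):
        for j in range(cols):
            # Expected count if connections were spread uniformly
            expected = row_tot[i] * col_tot[j] / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, (rows - 1) * (cols - 1)

# Two instances, two users: balanced vs. one user piling onto instance 1.
balanced = [[50, 30], [50, 30]]
skewed   = [[90, 30], [10, 30]]

print(chi_square(balanced))  # (0.0, 1) -- perfectly uniform
stat, df = chi_square(skewed)
# The critical value for df=1 at alpha=0.05 is about 3.84 (standard chi-square
# table); a statistic above it corresponds to CHISQ_SIG < 0.05.
print(stat > 3.84)           # True -- the skewed table signals imbalance
```

The SQL query below gets all of this (including the conversion of the statistic to a p-value) from STATS_CROSSTAB in one call.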

Here is the query that can detect if a DB user prefers/avoids a DB instance:
—————————————————————————————–

SELECT COUNT(*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, username, 'CHISQ_SIG') p_value
    FROM gv$session
    )
WHERE p_value < 0.05

—————————————————————————————–

Detecting connection imbalance at the client machine level is a bit trickier because each instance receives a few connections from the server it runs on.
That can easily be accounted for by excluding the servers that run the DBs:
—————————————————————————————–

SELECT COUNT (*) cnt
FROM
    (SELECT STATS_CROSSTAB(inst_id, machine, 'CHISQ_SIG') p_value
     FROM gv$session
     WHERE machine NOT LIKE '%dbp%.newsamerica.com'
     )
WHERE p_value < 0.05

—————————————————————————————–

These monitoring queries work without modification for a RAC cluster of any size. Adding or removing nodes is handled without issues, apart from the temporary imbalance that may come with adding nodes.


What to Do about Tools Hoarding Parallel Slaves in Oracle

November 26, 2014

Most GUI tools use cursors with relatively small fetch size (around 50) to retrieve data from Oracle. They open the cursor, fetch some data, show it and then wait on user input. All resources related to the connection and the open session are held while the tool waits on the user. While those resources are usually trivial for serial SELECT statements, they can be significant for parallel SELECT statements.

Each parallel statement is assigned a number of parallel slaves, and those slaves are not freed until their respective cursor is closed, regardless of how much work the slaves do. Since there is a limited number of parallel slaves available in an instance (the PARALLEL_MAX_SERVERS init.ora parameter), hoarding parallel slaves can prevent future statements from being executed in parallel, severely impacting the performance of those statements.

The following blog post describes the situation quite well:
http://structureddata.org/2012/02/29/pitfalls-of-using-parallel-execution-with-sql-developer/

Since this behavior is not a bug, i.e. there is never going to be a fix, we need to find a way to manage it.

One solution is to disable parallelism for all sessions coming from GUI tools that use cursors and could therefore cause this problem. This is a radical step that would deprive the users of the ability to run anything in parallel.

The second option is to educate the users about the problems associated with open cursors and ask them to close all cursors as soon as possible. This approach is ideal when executed diligently, but in reality not all users are compliant.

The approach I would like to propose is to allow parallelism for all, but monitor those who do not close their open cursors. Here is the query that I use to monitor:

select count(*)
from
  (
    select *
    from gv$px_session px_QC
    where px_QC.qcinst_id IS NULL
    minus
    -- remove every QC that still has an active (or only briefly idle) slave
    select *
    from gv$px_session px_QC
    where px_QC.qcinst_id IS NULL
    and exists
        (select *
         from gv$px_session px_Slaves, gv$session sess
         where px_QC.qcsid = px_Slaves.qcsid
         and px_Slaves.sid = sess.sid
         and (sess.wait_class != 'Idle'
              or (sess.seconds_in_wait < 600
                  and sess.wait_class = 'Idle'
                 )
             )
        )
  )

This query returns zero if the parallel slaves are actively used, and greater than zero if there is a set of parallel slaves that has been idle for 600 seconds. The query can be used to identify the offending sessions and “talk” with the end users. It could be integrated with OEM using Metric Extensions, or it could be part of a monitoring script that kills the offending sessions. The possibilities are endless.
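The set logic of the query above can be sketched in a few lines of Python on hypothetical session data (the column names mirror GV$PX_SESSION/GV$SESSION; the session values are made up): a query coordinator (QC) is “hoarding” when every one of its slave sessions has been idle for at least 600 seconds.

```python
IDLE_LIMIT = 600  # seconds, same threshold as in the SQL

def hoarding_qcs(slaves):
    """slaves: list of dicts with 'qcsid', 'wait_class', 'seconds_in_wait'."""
    qcsids = {s['qcsid'] for s in slaves}
    # Remove any QC that still has an active (or only briefly idle) slave,
    # mirroring the MINUS branch of the SQL.
    busy = {s['qcsid'] for s in slaves
            if s['wait_class'] != 'Idle' or s['seconds_in_wait'] < IDLE_LIMIT}
    return qcsids - busy

sessions = [
    {'qcsid': 101, 'wait_class': 'Idle',     'seconds_in_wait': 1200},
    {'qcsid': 101, 'wait_class': 'Idle',     'seconds_in_wait': 900},
    {'qcsid': 202, 'wait_class': 'User I/O', 'seconds_in_wait': 2},
]
print(hoarding_qcs(sessions))  # {101}: all of its slaves idle > 600s
```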


Measuring the Benefits of Database Virtualization/Thin Provisioning Solutions

May 14, 2013

Overview

Database virtualization and thin provisioning are new and powerful ways to share/reuse storage among non-production database copies.

Solutions for reducing the disk footprint of non-production Oracle databases range from complete database virtualization (Delphix, etc.), to solutions built around storage vendors’ features (EMC, NetApp, etc.), to purely software solutions (CloneDB, etc.).

Sharing read-only tablespaces – a “poor man’s” database virtualization – is a lesser-known approach to saving disk space and improving refresh times in non-production environments. Here is a link to the presentation I gave at NYOUG (2012) on this topic.

What are we measuring and why?

In general, most solutions are quite effective, both in terms of storage and refresh time, in delivering a copy of the production database.

Once the copy is delivered, it is opened for users and batch jobs, so it starts changing. Those changes belong to the new database and therefore cannot be shared with the other non-production databases. That means they consume “real” disk space. The more the DB changes after a refresh, the lower the benefits of the DB virtualization/thin provisioning solution.

Measuring the amount of change after a refresh is important for understanding how much disk space the non-production Oracle environments will need once DB virtualization/thin provisioning is in place. The number is essential for computing the ROI of the project.

How are we measuring it?

One way to measure the change is to run an incremental RMAN backup and see how big the resulting backup files are. In many cases, however, that is not doable.

The method described here works only on 11gR2. It only requires the SCN of the DB at the time of refresh.

For databases created by RMAN DUPLICATE, the information can be found in V$DATABASE (RESETLOGS_CHANGE#).

If the data was transferred with Data Pump or another similar method, a suitable SCN can be found in V$ARCHIVED_LOG (FIRST_CHANGE#, NEXT_CHANGE#).

The new DBVerify feature allows us to use the HIGH_SCN parameter to find out how many blocks were changed since HIGH_SCN.

Let’s see how many blocks were modified after SCN 69732272706.

It is really easy:

dbv userid=sys/?????????? file=/tmp/test.dbf HIGH_SCN=69732272706

Page 64635 SCN 1012797077 (16.1012797077) exceeds highest scn to check 1012795970 (16.1012795970)
Page 66065 SCN 1012796687 (16.1012796687) exceeds highest scn to check 1012795970 (16.1012795970)
Page 66187 SCN 1012796687 (16.1012796687) exceeds highest scn to check 1012795970 (16.1012795970)
Page 87759 SCN 1012796692 (16.1012796692) exceeds highest scn to check 1012795970 (16.1012795970)

DBVERIFY - Verification complete

Total Pages Examined : 93440
Total Pages Processed (Data) : 62512
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 9399
Total Pages Failing (Index): 0
Total Pages Processed (Other): 6119
Total Pages Processed (Seg) : 1705
Total Pages Failing (Seg) : 0
Total Pages Empty : 13705
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Total Pages Encrypted : 0
Total Pages Exceeding Highest Block SCN Specified: 39
Highest block SCN : 1012797131 (16.1012797131)

In this case, only 39 blocks (out of 93,440) were modified after SCN 69732272706.

Please note that even though DBVerify (dbv) can work with open DB files, it will not account for changed blocks that have not yet been written to the file.

Also, per MOS document 985284.1 – “Why HIGH_SCN Option Doesn’t Work While Running DBV Against ASM Files?” – the HIGH_SCN flag works only for files that are not stored in ASM. Hope they fix that problem soon.
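Turning the dbv output into a space estimate is simple arithmetic. The sketch below assumes the common 8 KB database block size; check DB_BLOCK_SIZE for your own database before using the number in an ROI calculation.

```python
# Back-of-the-envelope space estimate from the dbv output above.
block_size = 8192     # bytes; assumed -- check DB_BLOCK_SIZE for your DB
changed_blocks = 39   # "Total Pages Exceeding Highest Block SCN Specified"
total_blocks = 93440  # "Total Pages Examined"

changed_bytes = changed_blocks * block_size
changed_pct = 100.0 * changed_blocks / total_blocks
print(f"{changed_bytes} bytes changed ({changed_pct:.3f}% of the file)")
# 319488 bytes changed (0.042% of the file)
```

Summed across every data file of the copy, this gives the “real” disk space the thinly provisioned database has consumed since its refresh.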


Hotsos 2013 and some more useful OEM queries

January 31, 2013

Busy preparing for Hotsos 2013. My presentation this year is “Working with Confidence: How Sure Is the Oracle CBO about Its Cardinality Estimates, and Why Does It Matter?” I know it is a mouthful. The presentation/white paper will be to the point, with lots of pictures though 🙂

The only thing I have for the blog now is a revised version of a query I posted last time. The original version was way too crude.

select   a.rollup_timestamp,
         count(*),
         sum(c.cpu_count),
         sum(a.average*c.cpu_count) / sum(c.cpu_count),
         sum(a.maximum*c.cpu_count) / sum(c.cpu_count),
         sum((a.average+3*a.standard_deviation)*c.cpu_count) / sum(c.cpu_count)
from     mgmt$metric_hourly a,
         mgmt$target b,
         sysman.MGMT_ECM_HW c
where    a.metric_name = 'Load'
and      a.column_label = 'CPU Utilization (%)'
and      a.target_guid = b.target_guid
and      b.target_name like 'dev%'
and      c.hostname||'.'||c.domain = b.target_name
and      c.vendor_name = 'Intel Based Hardware'
group by a.rollup_timestamp
order by 5 desc

The query now takes into account the number of CPUs on a server. I use the undocumented MGMT_ECM_HW view – it contains a wealth of information.

The query assumes that all CPUs are identical – not perfect, but much better than the original.
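The difference the CPU-count weighting makes is easy to see on made-up numbers: a big server’s utilization should count in proportion to its CPU count, not one vote per server.

```python
# Hypothetical hourly averages: (average CPU %, cpu_count) per server.
servers = [
    (90.0, 2),   # small but busy box
    (10.0, 16),  # large, mostly idle box
]

# One vote per server, as in the original crude query.
unweighted = sum(avg for avg, _ in servers) / len(servers)
# Weighted by CPU count, as in the revised query above.
weighted = (sum(avg * cpus for avg, cpus in servers)
            / sum(cpus for _, cpus in servers))

print(unweighted)  # 50.0 -- each server counted equally
print(weighted)    # (90*2 + 10*16) / 18, i.e. about 18.9 -- closer to the truth
```

The same weighting is applied to the maximum and to the derived maximum (average + 3 standard deviations) in the query.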


More Useful OEM Repository Queries

December 31, 2012

I already included a number of queries against the OEM repository in my presentation.
Now, I would like to add two more.

The first query shows an approximation of the average and max CPU utilization for a group of servers (names starting with “dev”).
Since the query assumes that each server has the same “weight”, the results are not completely correct for most environments.
They are a good start for consolidation and/or virtualization projects though.

select
      a.rollup_timestamp           time, 
      count(*)                     number_of_servers,
      sum(a.average)/count(*)      average_load ,
      sum(a.maximum)/count(*)      max_recorded,
      sum(a.average+3*a.standard_deviation)/count(*) max_derived
from
      mgmt$metric_hourly a ,
      mgmt$target b
where
      a.metric_name = 'Load'
  and a.column_label = 'CPU Utilization (%)'
  and a.target_guid = b.target_guid
  and b.target_name like 'dev%'
group by a.rollup_timestamp
order by max_derived desc

1/31/2013 – The query above is quite crude – I just posted a better query  – http://wp.me/p1DHW2-5J

The second query shows all databases that belong to the TEST OEM group and are in ARCHIVELOG mode. The query can be modified to look at all databases or to any subset.

select *
from sysman.MGMT$GROUP_DERIVED_MEMBERSHIPS o,
     sysman.MGMT$TARGET t,
     sysman.MGMT$AVAILABILITY_CURRENT st
where o.composite_target_name = 'TEST'
and   o.member_target_type in ('oracle_database', 'rac_database')
and   (t.target_type = 'rac_database'
       or (t.target_type = 'oracle_database'
           and t.type_qualifier3 != 'RACINST'))
and   o.member_target_guid = t.target_guid
and   t.target_guid = st.target_guid
and   st.availability_status = 'Target Up'
and   not exists
      (select *
       from sysman.mgmt$metric_current i
       where i.target_guid = o.member_target_guid
       and   metric_name   = 'archFull'
       and   metric_column = 'archTotal'
       and   metric_label  = 'Archive Area'
       and   column_label  = 'Total Archive Area (KB)'
       and   key_value     = 'NOARCHIVELOG');

This query is not strictly based on the documentation, but it works in OEM 12c.


NYOUG Sept. 2012 – Reduce Your Disk Footprint by Sharing Read-Only Tablespaces

September 13, 2012

Thanks to all who attended my presentation (Reduce Your Disk Footprint by Sharing Read-Only Tablespaces)

at NYOUG.

We had a great discussion!

The PowerPoint slides are here – I added some animation, so this is the best way to view the presentation. Here is the PDF version of the slides. The white paper is here – it is much more detailed than the presentation.

If you think that the non-supported universal method can save you disk space, feel free to call Oracle and ask them to take a look at it.