Importance of Data Clustering when Deleting in Batches

July 27, 2015

The inspiration for this post comes from  the OTN discussion “Bulk Deletes from a Large Table”  where I volunteered the idea that we can improve the performance by taking into the account the clustering of the data to be deleted. Since my statement was rather general, I decided to create this post to fill in all the details.

First, we start with the table definition:

CREATE TABLE test
(
dt DATE ,
st NUMBER ,
other_num NUMBER ,
other_str VARCHAR(100)
)

The the first column of the table is a DATE, the second one status – a number between 0 and 100.

We want to populate the table with test data with special data clustering characteristics. We want to simulate the data distribution (clustering) we would get if users insert data in chronological order – i.e. the older data (the smallest dt value) gets inserted first, then slightly newer data, and so forth.
The following code fragment will not only fill the table with data, but it will also get us the desired data clustering.

BEGIN
  FOR i IN 1..200
  LOOP
    INSERT INTO test
  WITH v1 AS
    (SELECT rownum n FROM dual CONNECT BY level <= 10000
    )
  SELECT trunc(sysdate - i ),
    mod(rownum , 100) ,
    rownum ,
    'BLAHBLAH123'
  FROM v1,
    v1
  WHERE rownum <= 5000;
  COMMIT;
END LOOP;
END;
/

The data in the TEST table is clustered by DT. Records with the same DT are likely to be in the same block. That cannot be said for records with the same ST. Those records are scattered all over the table.

Now, let’s create indexes that would support equally well each of the purge methods and gather stats:

CREATE INDEX idx1 ON test
  (dt , st
  )
CREATE INDEX idx2 ON test
  (
    st ,
    dt
  )
  EXEC dbms_stats.gather_table_stats
  (
    '??????',
    'TEST'
  )

To better simulate memory pressure, let create a procedure that would flush the shared pool.
As SYS:

CREATE
PROCEDURE flush_bc
AS
BEGIN
  EXECUTE immediate 'alter system flush buffer_cache';
END;
/
GRANT EXECUTE ON flush_bc TO ??????;

The first purge technique accesses the data via DT. Since the data is clustered on DT, it is expected this technique to be faster.

BEGIN
  FOR i IN 100..200
  LOOP
    DELETE TEST WHERE DT BETWEEN TRUNC(SYSDATE - i ) AND TRUNC(SYSDATE - i +1 ) ;
    COMMIT;
    sys.flush_bc ;
  END LOOP;
END;
/ 

It takes 20263 physical reads.

The second one (please rebuild the TEST table before retrying) accesses the data via ST. The data for a ST is spread across many blocks, so this technique is expected to be slower.

BEGIN
  FOR i IN 0..101
  LOOP
    DELETE TEST WHERE ST = i AND DT < TRUNC(SYSDATE - 98 ) ;
    COMMIT;
    sys.flush_bc;
  END LOOP;
END;
/

took 239100 physical reads – more than 10 times the first one.

This test clearly shows that the first technique is better than the second one. The memory pressure in the test scenario is significant, so it is likely that the difference between the techniques would not be as great is most real world settings.

Advertisements