deryaoktay's weblog: Removing duplicate rows from a table

Monday, May 5, 2008

Removing duplicate rows from a table

Removing duplicate rows from a table has several ways. I prefer below method, because it is easy to implement. For table, schema1.table1 having columns col1, col2, col3 which will be primary key or unique key columns after removing duplicate entries will be accomplished by below SQL:

DELETE FROM schema1.table1 a
WHERE a.ROWID >
ANY (SELECT b.ROWID
FROM schema1.table1 b
WHERE a.col1 = b.col1
AND a.col2 = b.col2
AND a.col3 = b.col3);

1 comment:

H.Tonguç Yılmaz said...: Performans bakış açısı ile bir örnek;

SELECT /*+ parallel */
year_month, COUNT(subscriber_id), COUNT(DISTINCT subscriber_id)
FROM eul.dm_subscriber_history_01 t
GROUP BY year_month;

200711 60791328 60261922

avrupa@oracle $ cat tmp.sql
spool tmp.log

conn eul/eul

SET LINESIZE 2000
SET AUTOTRACE TRACEONLY
SET TIME ON
SET TIMING ON

ALTER SESSION ENABLE PARALLEL DDL ;
ALTER SESSION ENABLE PARALLEL DML ;

alter session set max_dump_file_size=unlimited ;
alter session set timed_statistics = true ;
alter session set STATISTICS_LEVEL = ALL ;
alter session set "_rowsource_execution_statistics" = true ;
alter session set tracefile_identifier = duplicates ;

SELECT /*+ parallel */
COUNT(*)
FROM eul.dm_subscriber_history_01
WHERE ROWID NOT IN (SELECT /*+ parallel */ a.rid
FROM (SELECT /*+ parallel */
MAX(ROWID) rid, year_month, subscriber_id, COUNT(1)
FROM eul.dm_subscriber_history_01
GROUP BY year_month, subscriber_id
HAVING COUNT(1) > 1) a);

/*
Elapsed: 00:08:37.55

1886131 consistent gets
2005144 physical reads
*/

SELECT /*+ parallel */
COUNT(*)
FROM eul.dm_subscriber_history_01
WHERE ROWID IN (SELECT /*+ parallel */ rid
FROM (SELECT /*+ parallel */
ROWID rid,
row_number() over(PARTITION BY year_month, subscriber_id ORDER BY ROWID) rn
FROM eul.dm_subscriber_history_01)
WHERE rn 1);

/*
Elapsed: 00:03:12.03

1472472 consistent gets
1067141 physical reads
*/

exit;

Silinecek kayıt miktarı tüm kayıtların bence ~%10 civarına ulaşınca direkt CTAS nologging paralel girmek daha doğru olabilir.

Kaynakça : http://www.oracle.com/technology/oramag/oracle/04-jul/o44asktom.html; May 5, 2008 at 10:06 AM

Monday, May 5, 2008

Removing duplicate rows from a table

1 comment:

Windows (powershell) counterparts of Linux commands