Deleting Duplicate Data From SQL Server Table

Duplicate data is a very well known and frequently occurring issue in any DBMS and dealing with deletion of duplicate data is something not that easy for one specially one who is dealing with crucial and huge data.

There are 2 scenarios which can come while deleting duplicate data -

1. when we have an identity column in our table containing duplicates.

The problem becomes much easier here and we can just keep the data with MIN or MAX keys removing all other.

--Create New Table

create table Person

(

ID int identity(1,1),

Name varchar(30),

age int

)

--Insert Data into Table with multiple Duplicate entries

insert into Person(Name,age) values('Ashish',24)

insert into Person(Name,age) values('Gaurav',26)

insert into Person(Name,age) values('Nayan',25)

--Check the data before deleting duplicates

select * from Person

DELETE FROM Person

WHERE ID NOT IN

(

SELECT MIN(ID) FROM Person

GROUP BY Name,age HAVING COUNT(1) > 1

UNION

SELECT MIN(ID) FROM Person

GROUP BY Name,age HAVING COUNT(1) = 1

)

--Check the data After deleting duplicates

select * from Person

Use Below Query --

2. when we don't have any unique identifier in our table and it's a bit tricky here -

there are 2 ways of doing it here again --

(i) using the Row_Number() function comes with only SQL Server 2005 and later versions

--Check if Table already exists

IF EXISTS(select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME = 'Person')

BEGIN

--Drop if already exists

Drop Table Person

END

--Create New Table

create table Person

(

Name varchar(30),

age int

)

--Insert Data into Table with multiple Duplicate entries

insert into Person(Name,age) values('Ashish',24)

insert into Person(Name,age) values('Gaurav',26)

insert into Person(Name,age) values('Nayan',25)

--Check the data before deleting duplicates

select * from Person

;WITH CTE (Name,Age, DuplicateCount)

(

SELECT Name,Age,

ROW_NUMBER() OVER(PARTITION BY Name,Age ORDER BY Name) AS DuplicateCount

FROM Person

)

DELETE

FROM CTE

WHERE DuplicateCount > 1

--Check the data After deleting duplicates

select * from Person

(ii) Here is some piece of code which i will suggest as best to deal with this and can be used anywhere in even the older version of SQL Server like 2000 without using any special functions.

--Check if Table already exists

IF EXISTS(select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME = 'Person')

BEGIN

--Drop if already exists

Drop Table Person

END

--Create New Table

create table Person

(

Name varchar(30),

age int

)

--Insert Data into Table with multiple Duplicate entries

insert into Person(Name,age) values('Ashish',24)

insert into Person(Name,age) values('Gaurav',26)

insert into Person(Name,age) values('Nayan',25)

--Check the data before deleting duplicates

select * from Person

--Declare 2 variables as @name (which is duplicate key) and @duplicateCount as Number of Duplicates

DECLARE @name varchar(30), @duplicateCount int

--Declare a Cursor to iterate through each duplicate name

DECLARE Cursor_Person CURSOR FOR

select Name,(count(1)-1) cnt

from Person

group by Name

having count(1) > 1

OPEN Cursor_Person

FETCH NEXT FROM Cursor_Person INTO @name, @duplicateCount

WHILE @@FETCH_STATUS = 0

BEGIN

--delete all duplicate entries for each duplicate key keeping only one row

delete top (@duplicateCount) from Person where Name = @name

FETCH NEXT FROM Cursor_Person INTO @name, @duplicateCount

END

CLOSE Cursor_Person

DEALLOCATE Cursor_Person

--Check the data After deleting duplicates

select * from Person

Please do comment if you find this post useful, suggestion are appreciated.
Happy Coding !!

Search This Blog

Ashish Agarwal

Deleting Duplicate Data From SQL Server Table

Comments

Post a Comment

Popular posts from this blog

SSIS vs. T-SQL

List Directory Files using T-SQL xp_cmdShell Stored Procedure