Unix Technical Forum

Cleaning data - getting rid of duplicate rows

This is a discussion on Cleaning data - getting rid of duplicate rows within the SQL Server forums, part of the Microsoft SQL Server category; --> Data Cleansing: In the example (SQL Server DDL below) there are two tables - ExampleCustomer, and ExampleCar. ExampleCar is ...


Go Back   Unix Technical Forum > Database Server Software > Microsoft SQL Server > SQL Server

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 03-01-2008, 03:39 PM
teddysnips@hotmail.com
 
Posts: n/a
Default Cleaning data - getting rid of duplicate rows

Data Cleansing:

In the example (SQL Server DDL below) there are two tables -
ExampleCustomer, and ExampleCar.

ExampleCar is a lookup table. The ExampleCustomer table has a foreign
key to the ExampleCar table. There should be two rows in there, "Ford
Focus", and "Ford Galaxy". This table populates a drop-down list in
the application, ordered by CarID. So a user is adding a new Customer
record, selects a car from the list, and hey Presto!

Someone came along and messed with the data, so now there are two rows
for each car. Yes, I know I should have set the CarType column
unique, but I didn't.

The user has done what users always do - work around the problem. So
she has added some customers using the first occurrence of the car in
the drop-down list, except for the last record when she unaccountably
selected the second occurrence.

I have been tasked with cleaning up the data. The object is to delete
all duplicate rows, but without leaving any orphaned references. At
the end of the exercise the rows in the tables should be:

CarID CarType
1 Ford Focus
3 Ford Galaxy

PersonName CarID
Adam Smith 1
Ben Smith 1
Colin Smith 3
Dave Smith 3
Edward Smith 3
Fred Smith 3

Note that the CarID for Fred Smith has been updated to the
predominating row for the Ford Galaxy type.

Is this clear? Can anyone see a way of doing this in a small number
of queries? I could think of a way of doing it using cursors etc.,
but the prevailing wisdom is that there's ALWAYS a way to do it
without using cursors.

Thanks

Edward


DDL
=========================
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCar]
GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCustomer]
GO

CREATE TABLE [dbo].[ExampleCar] (
[CarID] [int] IDENTITY (1, 1) NOT NULL ,
[CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ExampleCustomer] (
[PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL ,
[CarID] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Adam Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Ben Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Colin Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Dave Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Edward Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Fred Smith', 4)

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 03-01-2008, 03:39 PM
jhofmeyr@googlemail.com
 
Posts: n/a
Default Re: Cleaning data - getting rid of duplicate rows

On 20 Aug, 12:47, teddysn...@hotmail.com wrote:
> Data Cleansing:
>
> In the example (SQL Server DDL below) there are two tables -
> ExampleCustomer, and ExampleCar.
>
> ExampleCar is a lookup table. The ExampleCustomer table has a foreign
> key to the ExampleCar table. There should be two rows in there, "Ford
> Focus", and "Ford Galaxy". This table populates a drop-down list in
> the application, ordered by CarID. So a user is adding a new Customer
> record, selects a car from the list, and hey Presto!
>
> Someone came along and messed with the data, so now there are two rows
> for each car. Yes, I know I should have set the CarType column
> unique, but I didn't.
>
> The user has done what users always do - work around the problem. So
> she has added some customers using the first occurrence of the car in
> the drop-down list, except for the last record when she unaccountably
> selected the second occurrence.
>
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.
>
> Is this clear? Can anyone see a way of doing this in a small number
> of queries? I could think of a way of doing it using cursors etc.,
> but the prevailing wisdom is that there's ALWAYS a way to do it
> without using cursors.
>
> Thanks
>
> Edward
>
> DDL
> =========================
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCar]
> GO
>
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCustomer]
> GO
>
> CREATE TABLE [dbo].[ExampleCar] (
> [CarID] [int] IDENTITY (1, 1) NOT NULL ,
> [CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
> NULL
> ) ON [PRIMARY]
> GO
>
> CREATE TABLE [dbo].[ExampleCustomer] (
> [PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
> NULL ,
> [CarID] [int] NULL
> ) ON [PRIMARY]
> GO
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Adam Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Ben Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Colin Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Dave Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Edward Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Fred Smith', 4)



SELECT
ecust.PersonName
, (
SELECT MIN(CarID)
FROM [ExampleCar]
WHERE [CarType] IN (
SELECT [CarType]
FROM [ExampleCar]
WHERE CarID = ecust.CarID
)
) AS CarID
FROM [ExampleCustomer] ecust

Not pretty but that should get you started.

Good Luck
J

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 03-01-2008, 03:40 PM
Erland Sommarskog
 
Posts: n/a
Default Re: Cleaning data - getting rid of duplicate rows

(teddysnips@hotmail.com) writes:
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.


UPDATE ExampleCustomer
SET CarID = Mincar.CarID
FROM ExampleCustomer C
JOIN ExampleCar Car ON C.CarID = Car.CarID
JOIN (SELECT CarType, CarID = MIN(CarID)
FROM ExampleCar
GROUP BY CarType) AS Mincar ON Mincar.CarType = Car.CarType

DELETE ExampleCar
FROM ExampleCar C
WHERE EXISTS (SELECT *
FROM ExampleCar C1
WHERE C1.CarType = C.CarType
AND C.CarID > C1.CarID)


There is no foreign declared in the example, but I hope that there is
in the real case.

Of course, risk is that your uzer have spelt the duplicates differently.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pro...ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinf...ons/books.mspx
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 01:12 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com