Find Duplicates

Question

Find Duplicates

Tom-386516

Mr or Mrs. 500

Points: 506
More actions
June 7, 2007 at 11:57 am

#62662

Hello,
I am working with sql 2000, and I am faced with a project that requires myself to find duplicate entries. To begin I will give an example of what I need to finish with and hopfully someone can tell me where to begin. I have been asked to find all the duplicate entries within a table of 18099597 rows of people and give the percentage of duplicates within that table. So duplicate people is what needs to be found along with the percentage. And with these people are of course identifying fields such as passport numbers, gender, data of birth etc. Where should I begin?

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Aaron Ingold SSCrazy Eights Points: 9070 More actions · Answer 1

I'd begin by deciding what columns have to match in order to consider two rows a duplicate of each other. Also ensure you have a candidate key in the table which doesn't include those columns that are in your must-match list. Then you can perform a self-join on the table specifying your matching criteria as the join criteria and a not-equal against the candidate key.

Tom-386516 Mr or Mrs. 500 Points: 506 More actions · Answer 2

I've come up with this query so far and it tells me some differences, but not the percentage. I read that I just have to devide the number to rows I get back from the query by the number of original rows in the table then multiply by 100. Can anyone confirm this?

select Last_Name, First_name, Date_of_Birth, Id_Num, count(*) AS NumOfTimes

from people_info

group by Last_name, First_Name, Date_of_Birth, Id_Num

having count(*) > 1

order by count(*) desc, Last_Name, First_Name

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

Close... I'd check just for dupe ID_Num's first... the rest of the columns are ancillary in nature...

Or, just take out the ID_Num to see if the same person has been registered twice...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

bnordberg SSCertifiable Points: 5379 More actions · Answer 4

I agree, the fewer number of fields you check, the less likely you are to exclude someone due to a spelling (or name change ...) error. If you feel the ID_num is a reliable field than just do your grouping on it, get your count and yes divide by the total number of records for a percentage.

Now for the real kick. This will tell you the number of exact dupes. What about inexact dupes (id_num misentered...)? I call thes fuzzy or inexact dupes and they are much harder to find and much harder to quantify. I should have an article out on how to use the fuzzy grouping task in SSIS to discover fuzzy dupes soon.

But you could also join the table back to itself on a soundex of the first name, lastname ... This will likely overexaggerate your dups, so its not my favorite.

Stefan Solender Grasshopper Points: 17 More actions · Answer 5

Try this:

SELECT NULL

FROM

b where

b. =

.

and b. =

.

GROU BY b., b.

HAVING

. > min(b.))

ORDER BY

.