Copying and comparing data

  • I have a query that I run daily against a database that is dumped to me nightly. I want to be able to populate a new database called Newclients and avoid duplicates. This is the query that I'm using:

    insert into msbtotal.dbo.newclients

    SELECT tcms_members.dbo.memberdata2.* FROM tcms_members.dbo.memberdata2 left outer join

    msbtotal.dbo.memberdata on tcms_members.dbo.memberdata2.id =

    msbtotal.dbo.memberdata.id where msbtotal.dbo.memberdata.id is

    null

    GO

    SET QUOTED_IDENTIFIER OFF

    GO

    SET ANSI_NULLS ON

    GO

    Here is some of my sample data:

    6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL

    6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL

    6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL

    6008272 Doctor A Doctor A, MD CHP UT SOUTHWESTERN PSYCHIATRY RES (512) 333-3333 (512) 333-3333 Sep 25 1977 12:00AM 1212 MockingBird Lane Austin TX 78737 2010-10-01 09:52:38.000 2005-06-06 00:00:00.000 NULL

    6009837 Some Person Some Person, MD, MPH P UT SOUTHWESTERN PSYCHIATRY RES (512) 555-5555 (512) 555-5555 Jun 26 1975 12:00AM 3501 Residence Ave. Austin TX 78731 2010-09-29 11:53:30.000 2006-06-09 00:00:00.000 2007-11-28 00:00:00.000

    6009837 Some Person Some Person, MD, MPH P UT SOUTHWESTERN PSYCHIATRY RES (512) 555-5555 (512) 555-5555 Jun 26 1975 12:00AM 3501 Residence Ave. Austin TX 78731 2010-09-29 11:53:30.000 2006-06-09 00:00:00.000 2007-11-28 00:00:00.000

    As you can see, there is duplicate data, and I need only one copy of each "new" record in my database. Can someone please assist me on this.

    Doug

  • several options

    1) change the data source to provide clean data to begin with (I know, most probably unlikely to happen, but that's the ideal scenario 😉 )

    2) SELECT DISTINCT tcms_members.dbo.memberdata2.* instead of SELECT tcms_members.dbo.memberdata2.*

    3) use GROUP BY your_list_of_columns

    I'd go with option 2 ...

    As a side note: instead of using INSERT INTO target SELECT * you should reference the target and source column names. This is a little more work at the beginning but it'll pay off later on (e.g. if either the source or the target will have additional columns or even a different order of columns)...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • maybe

    SELECT distinct tcms_members.dbo.memberdata2.* FROM tcms_members.dbo.memberdata2 left outer join

    msbtotal.dbo.memberdata on tcms_members.dbo.memberdata2.id =

    msbtotal.dbo.memberdata.id where msbtotal.dbo.memberdata.id is

    null

    or maybe if they aren't full duplicates something like

    select ... from

    (

    SELECT tcms_members.dbo.memberdata2.*, seq = rownumber() over(partition by tcms_members.dbo.memberdata2.id order by tcms_members.dbo.memberdata2.id )

    FROM tcms_members.dbo.memberdata2

    left outer join msbtotal.dbo.memberdata

    on tcms_members.dbo.memberdata2.id = msbtotal.dbo.memberdata.id

    where msbtotal.dbo.memberdata.id is null

    ) where seq = 1


    Cursors never.
    DTS - only when needed and never to control.

  • Thank you both for the suggestions. I am going to go with the SELECT DISTINCT option. Now the last question I have is how do I delete all the duplicates that I already have in my table?

  • Do you have any column in your table to identify a single row? (e.g. an identity column?)

    If so, you could use a subquery to get either the min or max id per group and delete all other rows.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply