How to avoid duplicating rows in query results when joining 2 tables

  • The below scenario represents my question in it's simplest form.

    I am trying to join an Item History table to an Invoice History table. Unfortunately the columns I have available to me in the Item History table for join criteria do not always define a unique record in the Invoice History table. I get repeating results included when there are multiple invoices generated for the same item against the same order and line for the same quantity on the same day. Each Invoice History transaction REALLY only corresponds to ONE Item History transaction, but I have insufficient information in Item History to be able to identify the specific corresponding Invoice History transaction

    How can I limit the rows returned by this query so that the Item History quantity change is only represented once even if there are multiple Invoice History transactions that meet the join criteria? It seems as though there should be a simple solution, but I am having a difficult time finding it.

    create table dbo.ItemHistory

    (Item varchar(30)

    ,OrderNo numeric(9,0)

    ,OrderLine numeric(4,0)

    ,TxDate datetime

    ,QtyChg numeric(10,0)

    )

    go

    create table dbo.InvoiceHistory

    (Item varchar(30)

    ,OrderNo numeric(9,0)

    ,OrderLine numeric(4,0)

    ,InvoiceNo numeric(9,0)

    ,InvoiceDate datetime

    ,InvoiceQty numeric(10,0)

    )

    go

    insert into

    dbo.ItemHistory

    values

    ('123',1,5,'2012-01-01',10)

    ,('123',1,5,'2012-01-01',10)

    ,('124',2,5,'2012-01-01',5)

    ,('125',3,5,'2012-01-01',25)

    go

    insert into

    dbo.InvoiceHistory

    values

    ('123',1,5, 1,'2012-01-01',10)

    ,('123',1,5, 2, '2012-01-01',10)

    ,('124',2,5, 3, '2012-01-01',5)

    ,('125',3,5, 4, '2012-01-01',25)

    go

    Here is the query I am struggling with. Note that the problem is with Item 123.

    select

    itm.Item

    ,itm.OrderNo

    ,itm.OrderLine

    ,CONVERT(varchar, itm.TxDate, 101) as TxDate

    ,itm.QtyChg

    ,inv.InvoiceNo

    ,CONVERT(varchar, inv.InvoiceDate, 101) as InvoiceDate

    ,inv.InvoiceQty

    from

    dbo.ItemHistory itm

    join

    dbo.InvoiceHistory inv

    on

    inv.Item = itm.Item

    and inv.OrderNo = itm.OrderNo

    and inv.OrderLine = itm.OrderLine

    and inv.InvoiceDate = itm.TxDate

    and inv.InvoiceQty = itm.QtyChg

    With the columns available to me for joining, the first two Item History rows each match to the first two Invoice History rows, giving me the cartesian product result of 4 rows returned. Rows 5 and 6 represent a one to one correspondence between the 2 tables and are not part of the problem.

    The result I am after is to have Invoice Numbers 1 and 2 each represented only once.

    ItemOrderNoOrderLineTxDateQtyChgInvoiceNoInvoiceDateInvoiceQty

    1231501/01/201210101/01/201210

    1231501/01/201210101/01/201210

    1231501/01/201210201/01/201210

    1231501/01/201210201/01/201210

    1242501/01/20125301/01/20125

    1253501/01/201225401/01/201225

  • Thank you for providing tables and data - this takes out a lot of the guesswork.

    Here is a relatively new and efficient way to solve this kind of problem.

    To your query I added

    selectitm.Item, itm.OrderNo ,itm.OrderLine... as rownum

    that adds a row, called rownum, that contains a sequence that re-starts with 1 whenever the Item, OrderNo and InvoiceNo grouping changes.

    Then I wrapped it in an outer query that selects only rows that have rownum = 1.

    Put all that together and here's the query that solves your problem:

    select Item, OrderNo, OrderLine, TxDate, QtyChg, InvoiceNo, InvoiceDate, InvoiceQty, rownum

    from (

    selectitm.Item, itm.OrderNo ,itm.OrderLine

    ,CONVERT(varchar, itm.TxDate, 101) as TxDate

    ,itm.QtyChg, inv.InvoiceNo

    ,CONVERT(varchar, inv.InvoiceDate, 101) as InvoiceDate

    ,inv.InvoiceQty

    ,ROW_NUMBER() OVER (PARTITION BY itm.Item, itm.OrderNo, inv.InvoiceNo

    ORDER BY itm.Item, itm.OrderNo, inv.InvoiceNo ) as rownum

    from dbo.ItemHistory itm join dbo.InvoiceHistory inv on

    (inv.Item = itm.Item

    and inv.OrderNo = itm.OrderNo

    and inv.OrderLine = itm.OrderLine

    and inv.InvoiceDate = itm.TxDate

    and inv.InvoiceQty = itm.QtyChg) ) X

    WHERE rownum = 1

    This solution uses windows functions (because the ROW_NUMBER, OVER, and PARTITION BY are applied the the result set, or "window", after the data has been retrieved).

    This and other ways of solving this problem can be found here:

    http://www.simple-talk.com/content/print.aspx?article=646

    The solution used here is near the bottom of the article in the section titled, "New Techniques for Removing Duplicate Rows in SQL Server 2005".

    Google "sql server windows functions" for more info on this solution.

    - victor di leo

  • Oops.... where I wrote

    To your query I added

    select itm.Item, itm.OrderNo ,itm.OrderLine... as rownum

    I actually meant to write

    To your query I added

    ,ROW_NUMBER() OVER (PARTITION BY ... as rownum

  • Victor,

    I believe this will absolutely resolve my issue. I will test it out shortly.

    I have been working with sql for a few months now and have learned one thing so far. There is ALWAYS more to learn.

    Thank you very much for the assistance.

    Brian Miller

  • If you omit the rownum column from the result, the query plan can use the ANY aggregate:

    SELECT

    X.Item,

    X.OrderNo,

    X.OrderLine,

    X.TxDate,

    X.QtyChg,

    X.InvoiceNo,

    X.InvoiceDate,

    X.InvoiceQty

    FROM

    (

    SELECT

    itm.Item,

    itm.OrderNo,

    itm.OrderLine,

    TxDate = CONVERT(char(10), itm.TxDate, 101),

    itm.QtyChg,

    inv.InvoiceNo,

    InvoiceDate = CONVERT(char(10), inv.InvoiceDate, 101),

    inv.InvoiceQty,

    rownum =

    ROW_NUMBER() OVER (

    PARTITION BY itm.Item, itm.OrderNo, inv.InvoiceNo

    ORDER BY itm.Item, itm.OrderNo, inv.InvoiceNo)

    FROM dbo.ItemHistory itm

    JOIN dbo.InvoiceHistory inv ON

    inv.Item = itm.Item

    AND inv.OrderNo = itm.OrderNo

    AND inv.OrderLine = itm.OrderLine

    AND inv.InvoiceDate = itm.TxDate

    AND inv.InvoiceQty = itm.QtyChg

    ) AS X

    WHERE

    X.rownum = 1;

    You could also just add a DISTINCT to the original query. Both of these approaches are imperfect though; the real solution is to fix the schema. Using a properly relational design will help avoid these issues arising in the first place.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply