Filter excel file with single-column file.

  • I have 2 excel spreadsheets, which have a common column, "drawing number". One file has 51 columns and over 100,000 records and the other has just one column and 68,000 rows. The single-column file has all unique records, whereas some drawing numbers are repeated in the large file.

    I would like to generate a third file which is filtered using the single column, which has 51 colums and 68,000 rows.

    How can I filter the large file with the smaller, so that only the first occurrance of the drawing number is output to the new file. (All subsequent ones can be ignored)?

  • Import both files into staging tables in SQL Server, perform a join with MIN and GROUP BY, then export to your new file. Alternatively, if you use SSIS, you can do the comparisons on the fly without importing into SQL Server. You will want to try both methods to see which performs better.

    John

  • John, I have imported the tables into SSMS. Could you tell me what SQL query to run?

  • Unfortunately I don't have the structure of your tables, nor any of the data in them, so there's not much I can do. As I suggested above, your query is likely to look something like this:

    SELECT x, MIN(y)

    FROM a JOIN b ON a.n = b.n

    GROUP BY x

    Have a go at writing something, and post back if you're still struggling. We'll need table DDL in the form of CREATE TABLE statements, sample data in the form of INSERT statement, your expected output, and what you've already tried.

    John

  • This is as far as I've got:

    select min(dbo.Largefile."Drawing Number")

    from dbo.Largefile join dbo.Smallfile on dbo.Largefile."Drawing Number" = dbo.Smallfile."Drawing Number"

    Here are simplified samples of the tables:

    dbo.Largefile:

    Size of drawingISO CodeDrawing Number

    2 1 2

    4 A4 429

    3 A3 2836

    4 A4 2836

    3 A4 5017

    4 A4 5017

    4 A4 5017

    4 A4 4566

    dbo.Smallfile:

    Drawing Number

    2

    429

    2836

    5017

    4566

  • Still need DDL, sample data in the form of INSERT statements and expected output. Here's a hint: use GROUP BY in the way that I suggested. If it doesn't work, please tell us how the output differs from what you expect.

    John

  • This is what I've tried:

    SELECT [Drawing Number], MIN([Desc])

    FROM dbo.Largeone JOIN dbo.Smallone ON dbo.Largeone.[Drawing Number] = dbo.Smallone.[Drawing Number]

    GROUP BY [Drawing Number]

    but I get an error "Ambiguous column name - Drawing Number" on line 3

  • You just need to qualify your Drawing Number column in the GROUP BY clause by putting the table name in front of it, in the same way as you did in the line above. Since you're doing an INNER JOIN on an equality, it doesn't make any difference which one you choose.

    John

  • Success! Filter is working great now, thanks John

    Tom

  • Here is the command that I'm using which gives me 2 columns, Drawing Number and Project Number.

    SELECT dbo.Largefile.[Drawing Number], MIN([Project Number])

    FROM dbo.Largefile

    JOIN dbo.Smallfile ON dbo.Largefile.[Drawing Number] = dbo.Smallfile.[Drawing Number]

    GROUP BY dbo.Largefile.[Drawing Number]

    How can I now add the other columns from "Largefile" to the result?

  • Now I'm confused, because your latest query contains a column (Project Number) that wasn't in the sample data you posted. This is why it's important for you to provide CREATE TABLE and INSERT statements at the very beginning - so that we don't have to guess what you're working with.

    What you need to do now is join your result set back to the original table. Do Drawing Number and Project Number together uniquely identify the rows in the table? If they do, your task is easy:

    SELECT <column list>

    FROM LargeTable l

    JOIN (<Your latest query here - be sure to alias the MIN column>) q

    ON l.DrawingNumber = q.DrawingNumber

    AND l.ProjectNumber = q.ProjectNumber

    If they don't, you'll need to decide how to choose which row to return for each Drawing Number and then put another GROUP BY in the outer query.

    John

  • John Mitchell-245523 (11/29/2011)


    Now I'm confused, because your latest query contains a column (Project Number) that wasn't in the sample data you posted. This is why it's important for you to provide CREATE TABLE and INSERT statements at the very beginning - so that we don't have to guess what you're working with.

    What you need to do now is join your result set back to the original table. Do Drawing Number and Project Number together uniquely identify the rows in the table? If they do, your task is easy:

    SELECT <column list>

    FROM LargeTable l

    JOIN (<Your latest query here - be sure to alias the MIN column>) q

    ON l.DrawingNumber = q.DrawingNumber

    AND l.ProjectNumber = q.ProjectNumber

    If they don't, you'll need to decide how to choose which row to return for each Drawing Number and then put another GROUP BY in the outer query.

    John

    A couple of things I'm not sure about:

    -When you say "your latest query here" - do I pop in the whole query here, in brackets?

    -The "q" character at the end - what is the purpose of this?

    -"Alias the MIN" - please explain

    Tom

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply