Trouble with large set of data

  • Hi,

    I'm having a lot of trouble getting some data from a table containing 145 million entries.

    I have 2 queries:

    Query 1:

    SELECT item_code
    FROM TBL_ITEMS
    WHERE item_desc LIKE '%shoes%'

    Query 2:

    SELECT store_code, SUM(items_sold)
    FROM TBL_SALES
    WHERE date_sold > '2009-01-01'
    AND item_code IN ('tennis-shoes-123','golf-shoes-456')
    GROUP BY store_code

    Query 1 gives me all item_codes for items that are shoes. Query 2 gives me total number of items sold per store for 'tennis-shoes-123' and 'golf-shoes-456' since the start of the year.

    The problem is, I can run both queries fine, but when I substitute the entire Query 1 into Query 2, the query takes EXTREMELY LONG to execute. In effect I have:

    Query 3:

    SELECT store_code, SUM(items_sold)
    FROM TBL_SALES
    WHERE date_sold > '2009-01-01'
    AND item_code IN (SELECT item_code FROM TBL_ITEMS WHERE item_desc LIKE '%shoes%')
    GROUP BY store_code

    TBL_SALES is the table with 145 million entries, and it sits on a different server than TBL_ITEMS. I've never let the query run to completion because it takes too long, and cancelling the execution takes 10 minutes on average.

    Any ideas or suggestions?

    Just a suggestion: have you run these queries in SSMS and displayed the actual/estimated execution plans? And since your request is about improving performance, please post your details, following as closely as you can the articles referenced in my signature block. In other words, help us help you.

    You may also wish to read this article on SCC

    http://qa.sqlservercentral.com/columnists/jsack/570Xch28FINAL.pdf


    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us help you - before posting a question please read
    Before posting a performance problem please read

    I've seen similar queries that hit multiple servers with the same number of rows or more take days to execute. You are better off working on smaller chunks of data at a time, so the plan doesn't do a clustered index scan.

  • SQL Noob (11/18/2009)


    I've seen similar queries that hit multiple servers with the same number of rows or more take days to execute. You are better off working on smaller chunks of data at a time, so the plan doesn't do a clustered index scan.

    I don't think that would help at all...

    As long as there is a condition like "LIKE '%shoes%'" it will always perform a table/index scan, because the leading wildcard prevents an index seek. It's like looking up the phone numbers of everyone with the first name Adam in a phone book...

    Another option might be full-text search, if this type of wildcard/free-text selection is used frequently. (Not something I've had to use so far, though...)
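
    For illustration only (I haven't used it myself): with a full-text index the search can use CONTAINS instead of a leading-wildcard LIKE. The catalog name below is made up, and it assumes TBL_ITEMS has a unique key index called PK_TBL_ITEMS. Note that CONTAINS matches whole words, so it isn't an exact substitute for '%shoes%'.

    CREATE FULLTEXT CATALOG ftcat_items;               -- made-up catalog name
    CREATE FULLTEXT INDEX ON TBL_ITEMS (item_desc)
        KEY INDEX PK_TBL_ITEMS                         -- assumes this unique index exists
        ON ftcat_items;

    SELECT item_code
    FROM TBL_ITEMS
    WHERE CONTAINS(item_desc, '"shoes"');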



    Lutz
    A pessimist is an optimist with experience.


  • Just another suggestion: avoid the IN clause...

    try replacing it with an EXISTS clause.

    SELECT store_code, SUM(items_sold)
    FROM TBL_SALES
    WHERE date_sold > '2009-01-01'
    AND EXISTS (SELECT 1 FROM TBL_ITEMS WHERE TBL_SALES.item_code = TBL_ITEMS.item_code AND TBL_ITEMS.item_desc LIKE '%shoes%')
    GROUP BY store_code

    And if you know the "groups" you're going to be searching for (%shoes%, %socks%, ...), try having a table that holds those groups and an Id, and use that Id on the TBL_ITEMS table instead of searching the text.

    Obviously, if the LIKE clause comes from a user's input then you can't do this...
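
    As a rough sketch (the table and column names here are invented), with the groups in their own table and a group id stored on TBL_ITEMS, the filter becomes an equality join instead of a text search:

    -- hypothetical lookup table
    CREATE TABLE TBL_ITEM_GROUPS (
        group_id   int          PRIMARY KEY,
        group_name varchar(50)  NOT NULL      -- 'shoes', 'socks', ...
    );

    -- assuming TBL_ITEMS carries a group_id column, the filter becomes:
    SELECT i.item_code
    FROM TBL_ITEMS i
    INNER JOIN TBL_ITEM_GROUPS g ON g.group_id = i.group_id
    WHERE g.group_name = 'shoes';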

    Pedro



    If you need to work better, try working less...

  • Thank you for your replies.

    I did run the estimated execution plans, and it is clear that 100% of the cost is going into the remote query, which must be the reason for the extreme execution time. The plans also revealed that one of my local tables is missing an index; hopefully adding it will make some sort of difference.

    Concerning the LIKE clause, the queries that I've posted are dumbed-down versions of the actual 2 queries that I'm using, just to illustrate the point. In the real queries there aren't any LIKE clauses; I know exactly what values I need. However, I am using the IN clause and have tried replacing it with EXISTS, but with no luck.

    My next step will be to try and bring back a smaller subset of data from the remote table and query on that.
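
    Something roughly like this, using the dummy table names from my first post (REMOTESRV and SalesDB below are just stand-ins for the actual linked server and database):

    -- pull only the filtered rows across the link into a local temp table first
    SELECT item_code, store_code, items_sold
    INTO #sales_subset
    FROM REMOTESRV.SalesDB.dbo.TBL_SALES
    WHERE date_sold > '2009-01-01'

    -- then aggregate locally against the temp table
    SELECT s.store_code, SUM(s.items_sold) AS items_sold
    FROM #sales_subset s
    WHERE s.item_code IN (SELECT item_code FROM TBL_ITEMS WHERE item_desc LIKE '%shoes%')
    GROUP BY s.store_code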

    I haven't tested this on SQL 2008, but on 2005 the IN clause was quite heavy.

    A query with "col IN (1)" was much heavier than "col = 1", even though there's only one value in the IN clause...

    If you know the values returned by each IN clause are distinct, you could try replacing "col IN (1, 2, ..)" with ".... col = 1 UNION ALL ... col = 2 ....". Use the UNION as a subquery just to return the desired ids and then join with the "main" table to get the remaining data.
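
    On the dummy query from the first post, that rewrite would look roughly like this (just a sketch to show the shape; whether it helps depends on the plan):

    SELECT s.store_code, SUM(s.items_sold) AS items_sold
    FROM TBL_SALES s
    INNER JOIN (
        SELECT 'tennis-shoes-123' AS item_code
        UNION ALL
        SELECT 'golf-shoes-456'
    ) codes ON codes.item_code = s.item_code
    WHERE s.date_sold > '2009-01-01'
    GROUP BY s.store_code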

    Pedro



    If you need to work better, try working less...

  • It's going to be difficult for people to help if the query posted doesn't bear any resemblance to the query you're having problems with.

    So, the real query uses a linked server connection? Is the remote table much bigger than the local table? I'd suggest OPENQUERY might be a good approach, as it executes the query on the remote server rather than the local one.

  • Ok, well maybe if I post the exact queries I'm using you'll be able to spot something I didn't.

    Query1:

    DECLARE @Style_IDs TABLE (
    style_id char(12))

    INSERT INTO @Style_IDs (style_id)
    SELECT DISTINCT(purchase_order_plan.style_id)
    FROM TBL_RANGE_ASSORTMENT_PLAN assortment_plan
    INNER JOIN TBL_RANGE_PURCHASE_ORDER_PLAN purchase_order_plan
    ON (purchase_order_plan.subclass_id + purchase_order_plan.subclass_group_code = assortment_plan.subclass_id + assortment_plan.subclass_group_code
    AND purchase_order_plan.block_id = assortment_plan.block_id
    AND purchase_order_plan.style_no = assortment_plan.style_no)
    WHERE assortment_plan.subclass_id + assortment_plan.subclass_group_code = '5001040300'
    AND assortment_plan.block_id = '911'
    AND assortment_plan.allocation_qty > 0

    Query2:

    SELECT grading.grade_no, SUM(sales.sales) units_sold
    FROM PKB.MDSP.dbo.style_branch_sales_weekly sales
    INNER JOIN TBL_RANGE_ASSORTMENT_BRANCH grading
    ON (grading.branch_no = sales.branch_no
    AND grading.subclass_id = SUBSTRING(sales.style_id,1,8)
    AND grading.block_id = (SELECT natural_block FROM PKB_LOC.dbo.TBL_CAL_MONTH WHERE sales.week_end BETWEEN startdate AND enddate))
    WHERE sales.sales > 0
    AND sales.week_end < DATEADD(wk,8,(SELECT MIN(week_end) FROM PKB.MDSP.dbo.style_branch_sales_weekly WHERE style_id = sales.style_id AND branch_no = sales.branch_no AND sales > 0))
    AND sales.style_id in ('500104039101','500104039102')
    GROUP BY grading.grade_no
    ORDER BY grading.grade_no;

    So Query 1 gets the Style IDs for a particular Assortment Plan, while Query 2 finds the total number of units sold for the Style IDs found in Query 1. The remote table containing 145 million sales entries is in Query 2: PKB.MDSP.dbo.style_branch_sales_weekly.

    The problem comes in when I substitute "AND sales.style_id IN ('500104039101','500104039102')" with "AND sales.style_id IN (SELECT style_id FROM @Style_IDs)".

    Any ideas?

    Ok, what's probably happening is that when you've got actual values in the IN, it's able to bring back only the rows that meet that criteria, but the optimiser can't do the same when the filter comes from a temporary table in the WHERE clause. So it brings the entire style_branch_sales_weekly table back to the local server and then filters it locally, which is obviously massively inefficient.

    You could try converting the IN to a join to see if it's more intelligent in its approach, e.g. INNER JOIN (SELECT DISTINCT style_id FROM @Style_IDs) Styles ON Styles.style_id = sales.style_id. Also I'd be interested to see if it does the same thing with a persisted table rather than a declared one.
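
    In context, that would make the problem query look something like this (a sketch only; it assumes it runs in the same batch as the @Style_IDs declaration from Query 1):

    SELECT grading.grade_no, SUM(sales.sales) AS units_sold
    FROM PKB.MDSP.dbo.style_branch_sales_weekly sales
    INNER JOIN (SELECT DISTINCT style_id FROM @Style_IDs) Styles
        ON Styles.style_id = sales.style_id
    INNER JOIN TBL_RANGE_ASSORTMENT_BRANCH grading
        ON grading.branch_no = sales.branch_no
        AND grading.subclass_id = SUBSTRING(sales.style_id, 1, 8)
        AND grading.block_id = (SELECT natural_block FROM PKB_LOC.dbo.TBL_CAL_MONTH
                                WHERE sales.week_end BETWEEN startdate AND enddate)
    WHERE sales.sales > 0
    AND sales.week_end < DATEADD(wk, 8, (SELECT MIN(week_end) FROM PKB.MDSP.dbo.style_branch_sales_weekly
                                         WHERE style_id = sales.style_id AND branch_no = sales.branch_no AND sales > 0))
    GROUP BY grading.grade_no
    ORDER BY grading.grade_no;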

    Otherwise, believe it or not, you'd probably get better performance from building dynamic SQL to change the remote query to:

    SELECT * FROM OPENQUERY(PKB, 'select * from MDSP.dbo.style_branch_sales_weekly where style_id in (''500104039101'',''500104039102'')')
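
    Something like this if you wrap it in dynamic SQL (a rough sketch; @ids and @sql are just illustrative names, and it assumes it runs in the same batch as the @Style_IDs table variable from Query 1):

    DECLARE @ids nvarchar(max), @sql nvarchar(max)

    -- build the quoted, comma-separated style_id list, e.g. ''500104039101'',''500104039102''
    SELECT @ids = COALESCE(@ids + ',', '') + '''''' + style_id + ''''''
    FROM @Style_IDs

    SET @sql = N'SELECT * FROM OPENQUERY(PKB, ''select * from MDSP.dbo.style_branch_sales_weekly where style_id in (' + @ids + ')'')'

    EXEC (@sql)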

  • Thank you for your reply.

    I have now tried joining on the result set of style ids from the temporary @Style_IDs table, but it did not work. I also tried inserting the results of @Style_IDs into a permanent table (I assume this is what you meant by a persisted table) and joining on that in my second query; no luck either.

    My last resort would be to try the OPENQUERY solution. If I understand correctly, I need to build a SQL query string containing everything needed to fetch the required results, and pass that with an OPENQUERY statement to be executed on the remote SQL Server?

  • Thanks HowardW, the OpenQuery did the trick!

