How do I speed up this DISTINCT statement

  • Is there any way to speed up this query it takes almost 3 minutes to complete? I am running this on a dev server (2-way, 1GB RAM, all data files on 1 drive, 90GB database). I know....let's try to shoot a watermelon through a straw!!!

    PS_KK_ACTIVITY_LOG = 12 million rows

    PS_KK_SOURCE_HDR = 1.5 million rows

    Query:

    SELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A , PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT = A.KK_TRAN_DT

    IO Statistics:

    Table 'PS_KK_SOURCE_HDR'. Scan count 1, logical reads 7371, physical reads 2, read-ahead reads 7397.

    Table 'PS_KK_ACTIVITY_LOG'. Scan count 1, logical reads 95857, physical reads 3, read-ahead reads 96598.

  • What are the indexes on the 2 tables ?

  • Not sure any indexes is going to do much good.

    The query is a join without further filtering, which means you'll have to scan both tables anyway since the requested result is a distinct list of all units + groups..?

    Wihtout the distinct I'd assume the result would be all 12 million rows from the childtable.

    /Kenneth

  • Here are the indexes. I have run out of ideas to fix.

    PS_KK_ACTIVITY_LOG:

    nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN

    nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, LEDGER

    nonclustered located on PRIMARY - LEDGER, KK_TRAN_DT, BUSINESS_UNIT, FISCAL_YEAR, ACCOUNTING_PERIOD, ACCOUNT, BUDGET_REF

    nonclustered located on PRIMARY - LEDGER, DEPTID, FUND_CODE, ACCOUNT, BUDGET_REF

    nonclustered located on PRIMARY - LEDGER, PROJECT_ID, DEPTID, FUND_CODE, ACCOUNT, BUDGET_REF

    nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN, BALANCING_LINE, BUSINESS_UNIT, LEDGER_GROUP, BUDGET_REF

    clustered, unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN, REFERENCED_BUDGET, BALANCING_LINE, LEDGER_GROUP, LEDGER, FISCAL_YEAR, ACCOUNTING_PERIOD, SEQNBR, RVRSL_FLG

    PS_KK_SOURCE_HDR:

    nonclustered located on PRIMARY - KK_PROC_INSTANCE, KK_PROCESS_STATUS, KK_SOURCE_TRAN, SEQUENCE_NBR_9

    nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, REQ_ID

    nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, PO_ID

    nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, VOUCHER_ID

    nonclustered located on PRIMARY -  - KK_SOURCE_TRAN, BUSINESS_UNIT, JOURNAL_ID, JOURNAL_DATE

    nonclustered located on PRIMARY - KK_PROC_INSTANCE, KK_PROCESS_STATUS, KK_SOURCE_TRAN, SEQUENCE_NBR_9, BUSINESS_UNIT, VOUCHER_ID, KK_TRAN_ID, KK_TRAN_DT

    nonclustered located on PRIMARY - DT_TIMESTAMP, KK_TRAN_ID, KK_TRAN_DT

    nonclustered located on PRIMARY - BUSINESS_UNIT, PO_ID, KK_TRAN_ID, KK_TRAN_DT

    nonclustered located on PRIMARY - BUSINESS_UNIT, VOUCHER_ID, KK_TRAN_ID, KK_TRAN_DT

    clustered, unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT

  • Well I re-ran a DBCC REINDEXDB on the two tables. Got the time down to 1 minute 15 seconds. Looking at perfmon I am CPU and DISK bound (queue lengths all are above 4).

  • We also need to see the query execution plan. Put this before the SQL and post the results:

    Set Showplan_text on

    go

    >>The query is a join without further filtering, which means you'll have to scan both tables anyway

    I disagree with this. You want to see scanning of the smaller PS_KK_SOURCE_HDR table, but for each attempt to join 1 of these rows to the 12 million details, you'd want an index or clustered-index seek.

    I suspect the biggest issue here is the number of columns in the clustered index on PS_KK_ACTIVITY_LOG.

    This makes the index huge, and don't forget, this also gets replicated in each non-clustered index of the table for use in bookmark lookups.

    eg, run sp_spaceused on PS_KK_ACTIVITY_LOG. Note the results.

    Now change the clustered index:

    clustered, non-unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT

    Re-run your DBCC DbReindex and re-run sp_spaceused - how does the size usage change ? How does the query plan change ?

  • StmtText                                                                                                                                                                

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    SELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A ,

     PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT

     = A.KK_TRAN_DT

    (1 row(s) affected)

    StmtText                                                                                                                                                                                              

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      |--Hash Match(Aggregate, HASH[H].[BUSINESS_UNIT], [A].[LEDGER_GROUP]), RESIDUAL[H].[BUSINESS_UNIT]=[H].[BUSINESS_UNIT] AND [A].[LEDGER_GROUP]=[A].[LEDGER_GROUP]))

           |--Merge Join(Inner Join, MERGE[H].[KK_TRAN_ID], [H].[KK_TRAN_DT])=([A].[KK_TRAN_ID], [A].[KK_TRAN_DT]), RESIDUAL[A].[KK_TRAN_ID]=[H].[KK_TRAN_ID] AND [A].[KK_TRAN_DT]=[H].[KK_TRAN_DT]))

                |--Clustered Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_SOURCE_HDR].[PS_KK_SOURCE_HDR] AS [H]), ORDERED FORWARD)

                |--Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_ACTIVITY_LOG].[PSBKK_ACTIVITY_LOG] AS [A]), ORDERED FORWARD)

    (4 row(s) affected)

    StmtText                

    ------------------------

    set statistics io off

    (1 row(s) affected)

     

  • Looks like the index size grew.

    1. clustered, unique

    name                                       rows        reserved           data               index_size       unused            

    --------------------------------------- ----------- ------------------ ------------------ ------------------ -----------

    PS_KK_ACTIVITY_LOG                        12556668    10376000 KB        4018136 KB         6357624 KB         240 KB

    2. clustered, non-unique

    name                                       rows        reserved           data               index_size         unused            

    --------------------------------------- ----------- ------------------ ------------------ ------------------ -----------

    PS_KK_ACTIVITY_LOG                        12556668    10685552 KB        4018136 KB         6667176 KB         240 KB

     

    Here is the showplan:

    ELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A ,

     PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT

     = A.KK_TRAN_DT

    (1 row(s) affected)

    StmtText                                                                                                                                                                                              

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      |--Hash Match(Aggregate, HASH[H].[BUSINESS_UNIT], [A].[LEDGER_GROUP]), RESIDUAL[H].[BUSINESS_UNIT]=[H].[BUSINESS_UNIT] AND [A].[LEDGER_GROUP]=[A].[LEDGER_GROUP]))

           |--Merge Join(Inner Join, MERGE[H].[KK_TRAN_ID], [H].[KK_TRAN_DT])=([A].[KK_TRAN_ID], [A].[KK_TRAN_DT]), RESIDUAL[A].[KK_TRAN_ID]=[H].[KK_TRAN_ID] AND [A].[KK_TRAN_DT]=[H].[KK_TRAN_DT]))

                |--Clustered Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_SOURCE_HDR].[PS_KK_SOURCE_HDR] AS [H]), ORDERED FORWARD)

                |--Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_ACTIVITY_LOG].[PSBKK_ACTIVITY_LOG] AS [A]), ORDERED FORWARD)

    (4 row(s) affected)

     

  • I have a similar, but slightly smaller data set that I was able to test with. An Account table (500K rows) and transactions on each account (5 million rows). Selecting a Distinct column pair, with no WHERE clause that hits every record but summarises down to 3K rows, my query will always clustered index seek on the larger table and return in less than 10 seconds.

    Depending on the relative selectivity of your columns, you could try changing the column order in the clustered index - are there multiple KK_TRAN_ID's per KK_TRAN_DT value or vice versa ?

    You could also try creating a non-clustered covering index - since the only 3 columns needed from your larger table are KK_TRAN_ID, KK_TRAN_DT and LEDGER_GROUP, you could create a 3-column index on these 3 columns, and the query would be able to use the index only without needing bookmark lookups to the data pages.

     

  • I added the covering key same results. Besides the disk queue length is stuck at between 4 and 12. Until then....waiting on hardware. Thanks for your help.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply