Grouping on CASE statements = slow query?

  • Hi,

    I have a rather large table (2-3 mill rows) and im creating a view to put on top of this table.

    This view will transform some foreign keys to a prefixed string using case statements and then grouping by these values.

    Example of such a construct:

    CASE WHEN a.SourceSystem_FK = 13 THEN 'i' + pAccHier.Prefix ELSE pAccHier.Prefix + CASE WHEN LEFT(a.AccountNumber, 1) IN ('3', '4') THEN c.ID + '_' ELSE '' END END + a.AccountNumber AS Account

    If i do this select without the grouping, then the results starts showing up almost instantly.

    But once i start grouping by these constructs, then preformance is drasticly slowed down. I have 6 such constructs and grouping by them will increase processing time from 1 sec to 3,5 mins.

    Do you clever folks know any way i could optimize this? From the execution plan it looks like most of the time is spent sorting the result.

  • Could you please provide the full query at least, please. Grouping is relatively expensive operation as well as string concatination. It's impossible to advise on optimisation without having underlying object structure. If you could provide DDL for tables and the rest as per link in my signature, I believe you will find a lot of good advices posted.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • It will definitely have some impact because now the the QP has to apply the case and then do the aggregation(s) which are may need ordering in addition to grouping. For example consider these 3 queries in AdventureWorks2008:

    SELECT

    CASE WHEN LEFT(P.LastName, 1) = 'A' THEN 1

    ELSE 2

    END

    FROM

    person.Person AS P

    GROUP BY

    P.LastName

    SELECT

    CASE WHEN LEFT(P.LastName, 1) = 'A' THEN 1

    ELSE 2

    END

    FROM

    person.Person AS P

    GROUP BY

    CASE WHEN LEFT(P.LastName, 1) = 'A' THEN 1

    ELSE 2

    END

    SELECT

    CASE WHEN LEFT(P.LastName, 1) = 'A' THEN 1

    ELSE 2

    END

    FROM

    person.Person AS P

    The first query is faster than the second because the ordering can be pushed to the index scan and a fast stream aggregate can be used for the aggregation and the compute scalar for the CASE can be done last. In the second query the same index is scanned but it can't be ordered because of the function, the CASE must be evaluated before the grouping and a more expensive HASH MATCH must be used for the aggregation since the results aren't ordered. The last query is the least expensive because there is no aggregation just an index scan and a compute scalar for the CASE.

    IF you posted DDL and the full query we might be able to offer some optimization advice. You can't really compare a query with aggregation and a simple select because the aggregation will add to the cost.

  • Thanks for both your answers and sorry for my late reply.

    First of. Here is DDL of the main selected table "PostingYTD", its indexes and the cross joined functions that create prefixes. Further down you will find the complete select statement.

    I have not provided the DDL of the DimensionBase tables, they are all just regular dimensions and is not the subject of my question.

    Thank you for all helpful suggestions. Im running out of ideas here. I cant think of any way to speed it up, so hopefully you guys have trick hidden up your sleave.

    DDL of PostingYTD, indexes and Function:

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [Fact2012Base].[PostingYTD](

    [Year] [int] NOT NULL,

    [Month] [int] NULL,

    [Period] [nvarchar](50) NULL,

    [SourceSystem_FK] [int] NOT NULL,

    [TransactionVersion_FK] [int] NOT NULL,

    [Supplier_FK] [int] NULL,

    [Customer_FK] [int] NULL,

    [Vessel_FK] [int] NULL,

    [SFICode_FK] [int] NULL,

    [ProductGroupGeoteam_FK] [int] NULL,

    [CurrencyOriginal_FK] [int] NULL,

    [DynamicsDocumentType_FK] [int] NULL,

    [StaffCategory_FK] [int] NULL,

    [Currency_FK] [int] NOT NULL,

    [ExchangeRateType_FK] [int] NOT NULL,

    [Account_FK] [int] NOT NULL,

    [TransactionType_FK] [int] NOT NULL,

    [Department_FK] [int] NOT NULL,

    [CounterPartCompany_FK] [int] NULL,

    [Project_FK] [int] NULL,

    [Company_FK] [int] NOT NULL,

    [Amount_YTD] [float] NULL,

    [Amount_EUR_YTD] [float] NULL,

    [Updated] [datetime] NULL

    ) ON [PRIMARY]

    GO

    CREATE NONCLUSTERED INDEX [IX_PostingYTD_Main] ON [Fact2012Base].[PostingYTD]

    (

    [Year] ASC,

    [SourceSystem_FK] ASC,

    [TransactionVersion_FK] ASC,

    [TransactionType_FK] ASC,

    [Amount_YTD] ASC

    )

    INCLUDE ( [Month],

    [Period],

    [Currency_FK],

    [ExchangeRateType_FK],

    [Account_FK],

    [Department_FK],

    [CounterPartCompany_FK],

    [Project_FK],

    [Company_FK]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    CREATE NONCLUSTERED INDEX [IX_PostingYTD_Source] ON [Fact2012Base].[PostingYTD]

    (

    [SourceSystem_FK] ASC

    )

    INCLUDE ( [Year],

    [Month],

    [TransactionVersion_FK],

    [Currency_FK],

    [Account_FK],

    [Department_FK],

    [CounterPartCompany_FK],

    [Company_FK],

    [Amount_YTD]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    CREATE NONCLUSTERED INDEX [IX_YTD_Currency] ON [Fact2012Base].[PostingYTD]

    (

    [Currency_FK] ASC,

    [Company_FK] ASC,

    [Year] ASC,

    [SourceSystem_FK] ASC,

    [TransactionType_FK] ASC

    )

    INCLUDE ( [Month],

    [Period],

    [TransactionVersion_FK],

    [ExchangeRateType_FK],

    [Account_FK],

    [Department_FK],

    [CounterPartCompany_FK],

    [Project_FK],

    [Amount_YTD]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    CREATE FUNCTION [Masterdata].[GetHierarchyPrefixRow]

    (

    -- Add the parameters for the function here

    @hierarchy_FK int,

    @hierarchyLevel smallint

    )

    RETURNS TABLE

    AS

    RETURN

    (

    Select Masterdata.GetHierarchyPrefix( @hierarchy_FK, @hierarchyLevel ) as Prefix

    )

    GO

    CREATE FUNCTION [Masterdata].[GetHierarchyPrefix]

    (

    -- Add the parameters for the function here

    @hierarchy_FK int,

    @hierarchyLevel smallint

    )

    RETURNS nvarchar(50)

    AS

    BEGIN

    DECLARE @resultPrefix nvarchar(21)

    DECLARE @hierarchyPrefix nvarchar(5)

    DECLARE @levelPrefix nvarchar(5)

    DECLARE @dimesionPrefix nvarchar(5)

    DECLARE @showHierarchyPrefix bit

    DECLARE @showLevelPrefix bit

    DECLARE @showDimesionPrefix bit

    SELECT

    @hierarchyPrefix = Masterdata.Hierarchy.Prefix,

    @levelPrefix = Masterdata.HierarchyLevel.Prefix,

    @dimesionPrefix = Masterdata.Dimension.Prefix,

    @showHierarchyPrefix = Masterdata.HierarchyLevel.ShowHierarchyPrefix,

    @showLevelPrefix = Masterdata.HierarchyLevel.ShowHierarchyLevelPrefix,

    @showDimesionPrefix = Masterdata.HierarchyLevel.ShowDimensionPrefix

    FROM Masterdata.Hierarchy INNER JOIN

    Masterdata.HierarchyLevel ON Masterdata.Hierarchy.Hierarchy_PK = Masterdata.HierarchyLevel.Hierarchy_FK LEFT OUTER JOIN

    Masterdata.Dimension ON Masterdata.HierarchyLevel.Dimension_FK = Masterdata.Dimension.Dimension_PK

    WHERE (Masterdata.Hierarchy.Hierarchy_PK = @hierarchy_FK) AND (Masterdata.HierarchyLevel.[Level] = @hierarchyLevel)

    SET @resultPrefix = ''

    IF @showHierarchyPrefix = 1

    SET @resultPrefix = @hierarchyPrefix

    IF @showHierarchyPrefix = 1 AND @showLevelPrefix = 1

    SET @resultPrefix = @resultPrefix + '.'

    IF @showLevelPrefix = 1

    SET @resultPrefix = @resultPrefix + @levelPrefix

    IF @showLevelPrefix = 1 AND @showDimesionPrefix = 1

    SET @resultPrefix = @resultPrefix + '.'

    IF @showDimesionPrefix = 1

    SET @resultPrefix = @resultPrefix + @dimesionPrefix

    IF @showHierarchyPrefix = 1 OR @showLevelPrefix = 1 OR @showDimesionPrefix = 1

    SET @resultPrefix = @resultPrefix + '.'

    IF @resultPrefix is null

    RETURN 'Error Hierarchy level undefined'

    RETURN @resultPrefix

    END

    GO

    Here is the select statement with the group by case statements

    SELECT

    CASE WHEN y.TransactionType_FK IN (1, 5) THEN 'Act' ELSE 'Bud' END AS Scenario,

    CASE WHEN y.TransactionType_FK = 5 THEN 'YTD Journals Load' ELSE 'YTD Load' END AS [View],

    Year,

    Month,

    Period,

    pEntHier.Prefix + d.FullID AS Entity,

    CASE WHEN a.SourceSystem_FK = 13 THEN 'i' + pAccHier.Prefix ELSE pAccHier.Prefix + CASE WHEN LEFT(a.AccountNumber, 1)

    IN ('3', '4') THEN c.ID + '_' ELSE '' END END + a.AccountNumber AS Account,

    pCountPartHier.Prefix + cp.ID AS CounterPart,

    pProjHier.Prefix + pc.ID + '_' + p.ID AS Project,

    cu.ID AS Currency,

    y.Currency_FK,

    y.ExchangeRateType_FK,

    SUM(Amount_YTD) AS Amount_YTD

    FROM

    Fact2012Base.PostingYTD AS y

    LEFT OUTER JOIN

    DimensionBase.Project AS p

    ON p.Project_FK = y.Project_FK

    INNER JOIN

    DimensionBase.Company AS pc

    ON p.OwnerCompany_FK = pc.Company_FK

    INNER JOIN

    DimensionBase.Department AS d

    ON y.Department_FK = d.Department_FK

    INNER JOIN

    DimensionBase.Account AS a

    ON y.Account_FK = a.Account_FK

    INNER JOIN

    DimensionBase.Company AS c

    ON y.Company_FK = c.Company_FK

    AND y.Currency_FK = c.Currency_FK

    INNER JOIN

    DimensionBase.Currency AS cu

    ON y.Currency_FK = cu.Currency_FK

    LEFT OUTER JOIN

    DimensionBase.Company AS cp

    ON y.CounterpartCompany_FK = cp.Company_FK

    CROSS JOIN Masterdata.GetHierarchyPrefixRow(10, 2) AS pCountPartHier

    CROSS JOIN Masterdata.GetHierarchyPrefixRow(1, 3) AS pEntHier

    CROSS JOIN Masterdata.GetHierarchyPrefixRow(3, 2) AS pAccHier

    CROSS JOIN Masterdata.GetHierarchyPrefixRow(9, 1) AS pProjHier

    WHERE

    (y.Year >= 2010)

    AND (y.TransactionVersion_FK <> 192)

    AND (y.SourceSystem_FK > 0)

    AND (y.TransactionType_FK IN (1, 2, 5))

    AND (c.Country_FK = 31)

    AND (y.Amount_YTD <> 0)

    GROUP BY

    CASE WHEN y.TransactionType_FK IN (1, 5) THEN 'Act' ELSE 'Bud' END,

    CASE WHEN y.TransactionType_FK = 5 THEN 'YTD Journals Load' ELSE 'YTD Load' END,

    Year,

    Month,

    Period,

    pEntHier.Prefix + d.FullID,

    CASE WHEN a.SourceSystem_FK = 13 THEN 'i' + pAccHier.Prefix ELSE pAccHier.Prefix + CASE WHEN LEFT(a.AccountNumber, 1)

    IN ('3', '4') THEN c.ID + '_' ELSE '' END END + a.AccountNumber,

    pCountPartHier.Prefix + cp.ID,

    pProjHier.Prefix + pc.ID + '_' + p.ID,

    cu.ID,

    y.Currency_FK,

    y.ExchangeRateType_FK

  • I would try to SELECT INTO #table the result of the query including the calculated columns used for grouping (without grouping itself), then the final query would select from #table and group as required

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (2/10/2012)


    I would try to SELECT INTO #table the result of the query including the calculated columns used for grouping (without grouping itself), then the final query would select from #table and group as required

    Thanks for your tip.

    I just tried this. The select into #table takes about 2:30 to run, and the final group by query 0:40 sec.

    So end result is 3:10 which is marginally faster than doing the select in its original form. (3:30)

    It seems as the share amount of data to copy into the temp table is whats making the first step take so long. If i just select (without into and without group by) then the results start coming instantly, but the 1.8 million rows it returns takes 2:30 to load.

  • Ok, your "results start coming instantly" means nothing, as most likely they will take even longer to come out in full than SELECT INTO. 1.8 millions should not take 2.30 min to insert. Can you attach the query plan? Next suspect whould be cross joining to user-defined function...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Hmm.. Its a bit difficult to get the actual execution plan for you. When i turn on include actual execution plan the query just goes on forever. I've been waiting 16 mins now for a result but nothing. I'll let it run for another 15 then i have to include the estimated one instead...

  • Seeing your query I don' think your main problem is the CASE statement included in the GROUP BY, I think it more likely related to your CROSS JOIN's to the function and then the fact that you are grouping on data from the function calls. I'd be willing to bet you basically have created a cursor with each function call because SQL is calling the functions once for each row instead of doing it in a set based manner, that just a guess. I'd try putting the results of the function calls in table variables or temp tables since they are not being run based on data in any of the tables so that the functions are only called once each.

    I'd be interested to see what your performance would be if you eliminated the function calls.

    I'd also be interested in seeing what a server-side trace produced when this query is run.

  • Ok. I let the query run for an hour, no result. Seems like the actual executionplan is impossible to get.

    Anyways, here is the estimated one, if that helps..

  • erikpoker (2/10/2012)


    Ok. I let the query run for an hour, no result. Seems like the actual executionplan is impossible to get.

    Anyways, here is the estimated one, if that helps..

    I don't see the grouping happening in this query. I do see that it looks like the optimizer was smart enough to only call the cross joined functions once which is a good thing.

    You do have one other error in the query which I believe means you aren't getting the results you really want from the query. Although you have specified a LEFT OUTER JOIN to DimensionBase.Project the fact that you have that table INNER JOIN'd to DimensionBase.Company really converts the OUTER JOIN to an INNER JOIN. This is confirmed in the execution plan as you can see that optimizer has only implemented one OUTER JOIN and that is on the last LEFT OUTER JOIN to DimensionBase.Company AS cp ON y.CounterpartCompany_FK = cp.Company_FK. To actually get the first OUTER JOIN I believe you need to convert the INNER JOIN to DimensionBase.Company AS pc ON p.OwnerCompany_FK = pc.Company_FK to a LEFT OUTER JOIN and I think that will get you what you really want.

    I'm actually a little surprised you are getting an OUTER JOIN on the second one since you have a filter on that table in the WHERE clause, but I'm not smart enough to understand why that would happen. I may ask someone who is though.

  • The grouping is not happening in that queryplan. I attached the queryplan for the previous suggestion to first put the results in a temp table and then group the results. So the attached queryplan only applies to the first select into query, without any grouping.

    As for your suggestions on the outer joins, you are right, a bit of a mixup there, but not affecting things much performance wise.

  • erikpoker (2/10/2012)


    As for your suggestions on the outer joins, you are right, a bit of a mixup there, but not affecting things much performance wise.

    Yeah it's not a performance issue, but if you aren't getting the right data then does performance really matter? I'm normally concerned with getting the right results first and then tuning to get the right results as quickly as possible.

  • Jack Corbett (2/10/2012)

    Yeah it's not a performance issue, but if you aren't getting the right data then does performance really matter? I'm normally concerned with getting the right results first and then tuning to get the right results as quickly as possible.

    Lol. Actually it does belive it or not 🙂 I have a team of guys armed with excel and god-knows-what, whos job is to verify the numbers returned by this view. So my job is just to get the thing running as fast as humanly (or computerly) possible.

    Right now i cant see many more possibilities to speed this thingy up. Maybe an indexed view, or a prepopulated table, but i dont really wanna go there...

  • Can you post the execution plan for the group by?

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply