how to remove numbers from strings?

  • sheesh I go away for the weekend, and there's a testing war started? hehe...

    Jeff - for what it's worth: if you change your indexing scheme, OVER (Partition...) works a LOT better, with no appreciable gain on the INNER JOIN method.

    For example:

    changing the index to INCLUDE (TotalDue, SalesOrderID) cuts execution to 7 secs (7.227 secs, to be exact)

    changing the primary key/clustered index to CustomerID+SalesOrderID cuts execution to 5 secs (actually 4.933)

    No matter what, the INNER JOIN syntax seems to be stubbornly pegged right around 2.4 secs (at 2.303 secs with both indexes, down from 2.470 secs with your indexing).

    Now - keep in mind that I haven't spent enough time with PARTITION to figure out what best makes it tick, but the fact that I could improve it by 50% in a few minutes makes me wonder if we're running optimized code against non-optimized code (meaning - there might yet be other tricks to play that I haven't thought of).
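    For reference, here's roughly what those two index changes look like against the test table (my reconstruction - the exact names and options are guesses, and the PK change assumes the original clustered key on RowNum is dropped first):

    -- 1) Covering index: keep CustomerID as the key, carry the other two
    --    columns at the leaf level via INCLUDE (SQL Server 2005+):
    CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerID_Covering
        ON #SalesOrderHeader (CustomerID)
        INCLUDE (TotalDue, SalesOrderID)

    -- 2) Or re-cluster on CustomerID+SalesOrderID so each customer's rows are
    --    physically adjacent (drop the RowNum PK first; the columns may also
    --    need ALTER COLUMN ... NOT NULL, and the pair is assumed unique):
    ALTER TABLE #SalesOrderHeader
        ADD PRIMARY KEY CLUSTERED (CustomerID, SalesOrderID)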

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt, do you believe in miracles?

    How do you think OVER gets data?

    Can you post execution plans for both options of the query?

    Maybe this will help you believe less in marketing rubbish and more in logic.

    _____________
    Code for TallyGenerator

  • Sergiy - what I do believe is that it took a while for people to learn how to optimize 2000 when it came out, so to toss something out based on the first attempt at using it is what I call "not giving it a fair shake".

    There are some folks out there actually trying to use the new features, and I did find a few links today on trying to make them perform better, so I was going to read through them first before tossing out every new feature. I'll reserve my judgement on them until AFTER I've exhausted those. In particular - there's a SORT task that needs to be minimized, and there are a few arcane methods for doing that - just need some time to digest them to see if they could apply.

    I don't believe a single test is a basis for just about anything, especially when I'm not sure I'm comparing apples to apples, nor whether the test is a "fair" test of the feature. Jeff's had a few years to hone his technique, so his code's "got a headstart" on optimization - I'm just bringing up the fact that there MIGHT be some things to do to improve the new stuff.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt,

    you did not answer the question:

    How do you think OVER gets data?

    It's not about optimisation of queries.

    It's about number of queries you execute.

    "OVER" option executes 2 subqueries, "INNER" option just one.

    Performance cannot match.

    Note, we're not taking into consideration the overhead of translating OVER into pure T-SQL, which is then further processed by the optimizer.

    I'm not sure the T-SQL built by the auto-tools will always be the optimal one. I hope they use "WHERE EXISTS" instead of "IN (...)".
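    (To be clear about the rewrite I mean - a generic sketch, with #SomeList standing in for whatever the tool generates:)

    -- The "IN (...)" form:
    SELECT h.CustomerID, h.TotalDue
      FROM #SalesOrderHeader h
     WHERE h.CustomerID IN (SELECT s.CustomerID FROM #SomeList s)

    -- The equivalent "WHERE EXISTS" form, which older optimizers usually handled better:
    SELECT h.CustomerID, h.TotalDue
      FROM #SalesOrderHeader h
     WHERE EXISTS (SELECT 1 FROM #SomeList s WHERE s.CustomerID = h.CustomerID)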

    _____________
    Code for TallyGenerator

  • Well - I'm having some fun with having "lost" what I was testing earlier - but I will recreate later.

    What it seemed to tell me, though, was that it was performing three table scans (meaning - three separate pointers into the same temporary table object, or so it seems), which constituted 60 percent of the effort. Table scans can be "tamed" by indexing, but I haven't yet found the right combination of indexing to handle them.

    As to how it is done - I don't have a clue yet. I haven't found the "behind the scenes" description of how they tooled it out, so I'm trying to find out. It's very possible that it is three separate operations, but it looks more like three separate pointers into the same data, "slipstreamed" somehow. I also want to take out one of the Partition statements Jeff had, to see what THAT does to the execution plan (that was my second thing to pursue - whether the two PARTITION BY statements ran together, or ran separately and "merged" together, which could ALSO be why it was slower). I'm wondering if an inner join on a PARTITION might run a wee bit faster.

    Anyway - a few things to check into. And - I don't know.
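    (If anyone wants to look for the same spools and scans on their own box, the plan text can be captured with the standard SHOWPLAN switch - nothing exotic:)

    SET SHOWPLAN_TEXT ON   -- must be the only statement in its batch
    GO
    -- paste either test query here; it is compiled but NOT executed
    GO
    SET SHOWPLAN_TEXT OFF
    GO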

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Way too much pontificating about things, and way too little testing (I know, not everyone has the environment to test, but I'd personally be sure before I made bold claims that might not stand up). 🙂

    I finally got around to looking at the second test setup, and...

    The second test is flawed (well, the first is too, since it removed the ORDER BY, which is how it is used in our reports). The second test also added a highly improbable covering index to assist the INNER JOIN's derived table (let's face it, we're not going to create a covering index for every column of every derived table out there, especially for a report that runs in a couple of seconds either way).

    The two queries in the second test don't return the same results, as SQL Server happily keeps all of the customer rows together in a group when using OVER, but not in the INNER JOIN. In other words, the Over() version is all nice and neat, while the INNER JOIN requires an ORDER BY to make any sense at all. To be fair, I also required the OVER to do an ORDER BY, even though it's nice and clean on the first pass. It's probably safer to do so anyway, as caching could potentially shift the order of CustomerID, even though each group would remain together as a whole.

    I also figured we'd go ahead and take caching advantages out of the equation, as that's just good testing practice. Lowell, or anyone else who can test, feel free to verify my claims about the order of the data in the original second test for the audience at large.

    With the ORDER BYs returned to the code, the TotalDue column removed from the index, and caching taken out of the picture, here are the results of three passes. Even though the Over technically was faster 2 out of 3 times, I'm willing to call it a tie. 🙂

    Pass 1

    (1000000 row(s) affected)
    00:00:10:950 Duration (hh:mi:ss:mmm)
    ==============================================================================
    INNER JOIN method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:750 Duration (hh:mi:ss:mmm)
    ==============================================================================
    OverPartition method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:627 Duration (hh:mi:ss:mmm)
    ==============================================================================

    Pass 2

    (1000000 row(s) affected)
    00:00:10:623 Duration (hh:mi:ss:mmm)
    ==============================================================================
    INNER JOIN method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:720 Duration (hh:mi:ss:mmm)
    ==============================================================================
    OverPartition method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:750 Duration (hh:mi:ss:mmm)
    ==============================================================================

    Pass 3

    (1000000 row(s) affected)
    00:00:10:700 Duration (hh:mi:ss:mmm)
    ==============================================================================
    INNER JOIN method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:907 Duration (hh:mi:ss:mmm)
    ==============================================================================
    OverPartition method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:843 Duration (hh:mi:ss:mmm)
    ==============================================================================

    Here is the modified code if someone wants to run it, see what I changed, etc.

    --===== If they already exist, drop the temp tables
    IF OBJECT_ID('TempDB..#SalesOrderHeader') IS NOT NULL
        DROP TABLE #SalesOrderHeader
    IF OBJECT_ID('TempDB..#Test1Results') IS NOT NULL
        DROP TABLE #Test1Results
    IF OBJECT_ID('TempDB..#Test2Results') IS NOT NULL
        DROP TABLE #Test2Results

    --===== Create a variable to measure duration with
    DECLARE @StartDate DATETIME

    --===== Create a million row test table
    PRINT 'Creating test table...'
    SET @StartDate = GETDATE()
    SELECT TOP 1000000
            RowNum       = IDENTITY(INT,1,1)
           ,CustomerID   = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT)
           ,SalesOrderID = CAST(RAND(CAST(NEWID() AS VARBINARY))*10000000+1 AS INT)
           ,TotalDue     = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY)
      INTO #SalesOrderHeader
      FROM Master.dbo.SysColumns sc1
          ,Master.dbo.SysColumns sc2

    --===== A table is not properly formed unless a Primary Key has been assigned
    ALTER TABLE #SalesOrderHeader
        ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100

    --===== Add an index to the CustomerID/TotalDue columns just for grins
    --===== Removed the TotalDue from the index, since it quite frankly wouldn't be there in a production environment
    CREATE INDEX tmpSalesOrderHeader_CustomerID_TotalDue ON #SalesOrderHeader (CustomerID)

    PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
    PRINT REPLICATE('=',78)

    --===== Run duration test on INNER JOIN method
    PRINT 'INNER JOIN method...'
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS
    SET @StartDate = GETDATE()
    SELECT  H.CustomerID
           ,H.SalesOrderID
           ,H.TotalDue
           ,PercentageOfTotal     = H.TotalDue / T.Sum_TotalDue * 100.0
           ,DifferenceFromAverage = H.TotalDue - T.Avg_TotalDue
      INTO #Test1Results
      FROM #SalesOrderHeader H
     INNER JOIN
            (
             SELECT  CustomerID
                    ,SUM(TotalDue) Sum_TotalDue
                    ,AVG(TotalDue) Avg_TotalDue
               FROM #SalesOrderHeader
              GROUP BY CustomerID
            ) T
        ON T.CustomerID = H.CustomerID
     ORDER BY H.CustomerID
    PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
    PRINT REPLICATE('=',78)

    --===== Run duration test on OVER PARTITION method
    PRINT 'OverPartition method...'
    DBCC FREEPROCCACHE
    DBCC DROPCLEANBUFFERS
    SET @StartDate = GETDATE()
    SELECT  CustomerID
           ,SalesOrderID
           ,TotalDue
           ,PercentageOfTotal     = TotalDue / SUM(TotalDue) OVER(PARTITION BY CustomerID) * 100.0
           ,DifferenceFromAverage = TotalDue - AVG(TotalDue) OVER(PARTITION BY CustomerID)
      INTO #Test2Results
      FROM #SalesOrderHeader
     ORDER BY CustomerID
    PRINT CONVERT(VARCHAR(12),GETDATE()-@StartDate,114) + ' Duration (hh:mi:ss:mmm)'
    PRINT REPLICATE('=',78)

    Fire away!

  • I forgot to mention, the above two queries have identical actual execution plans. 🙂

  • David,

    Another false statement from you:

    [Code]
    --===== Add an index to the CustomerID/TotalDue columns just for grins
    --===== Removed the TotalDue from the index, since it quite frankly wouldn't be there in a production environment
    [/Code]

    If you have not created that index, SQL Server will create statistics on this column anyway.

    It's not as good as an index, but it WILL BE USED in any query, including reports.

    Of course, the only way for you to make OVER tie is to force table scans, when SQL Server has to collect statistics on the fly.

    It quite frankly wouldn't be the case in a production environment.

    Or would it? In yours?
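    (Don't take my word for it - auto-created column statistics are easy to spot by their _WA_Sys name prefix. A quick check, assuming the default AUTO_CREATE_STATISTICS setting:)

    -- List statistics on the test table; auto_created = 1 rows are the ones
    -- the server built for itself (names look like _WA_Sys_...)
    SELECT s.name, s.auto_created
      FROM tempdb.sys.stats s
     WHERE s.object_id = OBJECT_ID('tempdb..#SalesOrderHeader')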

    _____________
    Code for TallyGenerator

  • BTW,

    can you please post those identical execution plans?

    😉

    _____________
    Code for TallyGenerator

    And for a clean test you need to try reversing the order of the tests.

    With no index in place, the server is supposed to create statistics on those columns during the 1st query and use them for the second one.

    I don't see you dropping statistics between runs.
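    (Something like this between runs - the statistics name is auto-generated, so it has to be looked up first; the name below is a made-up example:)

    -- Find the auto-created statistics on CustomerID, then drop them:
    -- SELECT name FROM tempdb.sys.stats WHERE object_id = OBJECT_ID('tempdb..#SalesOrderHeader')
    DROP STATISTICS #SalesOrderHeader.[_WA_Sys_00000002_0123456A]   -- hypothetical name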

    _____________
    Code for TallyGenerator

    David McFarland (10/22/2007)

    The two queries in the second test don't return the same results, as SQL Server happily keeps all of the customer rows together in a group when using OVER, but not in the INNER JOIN.

    What is you definition of "the same result"?

    In other words, the Over() version is all nice and neat, while the INNER JOIN requires an ORDER BY to make any sense at all.

    For you, probably.

    CustomerID is unlikely to be used for ordering records in any report.

    It's some internal ID nobody cares about.

    In real life you need to sort by CustomerLastName (which is in another table), or CustomerFullName (which is computed on the fly), or by aggregated amounts.

    So, your OVER() "advantage" is quite useless for real-life tasks.

    Not to mention it's just coincidental.
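    (I mean something like this - the Customers table and its name columns are made up for illustration:)

    SELECT  h.CustomerID
           ,h.SalesOrderID
           ,h.TotalDue
      FROM #SalesOrderHeader h
     INNER JOIN Customers c            -- hypothetical lookup table
        ON c.CustomerID = h.CustomerID
     ORDER BY c.CustomerLastName       -- or a full name computed on the fly:
             -- c.CustomerLastName + ', ' + c.CustomerFirstName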

    _____________
    Code for TallyGenerator

  • Sergiy (10/22/2007)


    David,

    Another false statement from you:

    [Code]
    --===== Add an index to the CustomerID/TotalDue columns just for grins
    --===== Removed the TotalDue from the index, since it quite frankly wouldn't be there in a production environment
    [/Code]

    If you have not created that index, SQL Server will create statistics on this column anyway.

    It's not as good as an index, but it WILL BE USED in any query, including reports.

    I simply stated that people aren't going to put such a covering index on a query for a single report. That's a fact. We call that "gaming the system" where I come from. Heck, if you want to keep it, I'm fine with that. I'll just demonstrate the other advantage of Over that I noted: ease of modification. I'll make a change that will blow your covering index away, and while my change will take 10 seconds, yours will require a changed index and changed code, and will face the exact same issue when the query changes even more. You can try to get around it, but to accomplish the same thing, they run in the same time frame, with the same query.
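    To make that concrete (my own illustration, not from the original test): adding one more windowed aggregate to the Over version is literally one extra line, while the join version needs the derived table - and any covering index built for it - reworked as well:

    SELECT  CustomerID
           ,SalesOrderID
           ,TotalDue
           ,PercentageOfTotal     = TotalDue / SUM(TotalDue) OVER(PARTITION BY CustomerID) * 100.0
           ,DifferenceFromAverage = TotalDue - AVG(TotalDue) OVER(PARTITION BY CustomerID)
           ,DifferenceFromMax     = TotalDue - MAX(TotalDue) OVER(PARTITION BY CustomerID) -- the 10-second change
      FROM #SalesOrderHeader
     ORDER BY CustomerID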

    Of course, the only way for you to make OVER tie is to force table scans, when SQL Server has to collect statistics on the fly.

    It quite frankly wouldn't be the case in a production environment.

    Or would it? In yours?

    Both queries use a Clustered Index scan. One.

    David McFarland (10/22/2007)

    I'll make a change that will blow your covering index away, and while my change will take 10 seconds, yours will require a changed index and changed code, and will face the exact same issue when the query changes even more. You can try to get around it, but to accomplish the same thing, they run in the same time frame, with the same query.

    Try it.

    SQL Server will create statistics on used columns anyway.

    No matter how hard you try.

    Both queries use a Clustered Index scan. One.

    Do you know the difference between a table scan and a clustered index scan?
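    (For the audience: a heap - a table with no clustered index - shows up in a plan as a Table Scan; the same data with a clustered index shows up as a Clustered Index Scan. A minimal demo, table names mine:)

    CREATE TABLE #Heap      (ID INT NOT NULL)                          -- no clustered index: a heap
    CREATE TABLE #Clustered (ID INT NOT NULL PRIMARY KEY CLUSTERED)    -- clustered on ID
    GO
    SET SHOWPLAN_TEXT ON
    GO
    SELECT ID FROM #Heap        -- plan: Table Scan
    SELECT ID FROM #Clustered   -- plan: Clustered Index Scan
    GO
    SET SHOWPLAN_TEXT OFF
    GO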

    _____________
    Code for TallyGenerator

  • Sergiy (10/22/2007)


    BTW,

    can you please post those identical execution plans?

    😉

    Absolutely.

    Inner Join

    StmtText
    --------
    Table Insert(OBJECT:([#Test1Results]), SET:([#Test1Results].[CustomerID] = [tempdb].[dbo].[#SalesOrderHeader].[CustomerID] as [H].[CustomerID],[#Test1Results].[SalesOrderID] = [tempdb].[dbo].[#SalesOrderHeader].[SalesOrderID] as [H].[SalesOrderID],[#Test1Results].[TotalDue] = [tempdb].[dbo].[#SalesOrderHeader].[TotalDue] as [H].[TotalDue],[#Test1Results].[PercentageOfTotal] = [Expr1010],[#Test1Results].[DifferenceFromAverage] = [Expr1011]))
    |--Compute Scalar(DEFINE:([Expr1010]=CONVERT_IMPLICIT(numeric(19,4),[tempdb].[dbo].[#SalesOrderHeader].[TotalDue] as [H].[TotalDue]/[Expr1008],0)*(100.0), [Expr1011]=[tempdb].[dbo].[#SalesOrderHeader].[TotalDue] as [H].[TotalDue]-[Expr1009]))
    |--Top(ROWCOUNT est 0)
    |--Parallelism(Gather Streams, ORDER BY:([H].[CustomerID] ASC))
    |--Nested Loops(Inner Join)
    |--Table Spool
    | |--Segment
    | |--Sort(ORDER BY:([H].[CustomerID] ASC))
    | |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([H].[CustomerID]))
    | |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#SalesOrderHeader] AS [H]), WHERE:([tempdb].[dbo].[#SalesOrderHeader].[CustomerID] as [H].[CustomerID] IS NOT NULL))
    |--Nested Loops(Inner Join, WHERE:((1)))
    |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1023]=(0) THEN NULL ELSE [Expr1024] END, [Expr1009]=CASE WHEN [Expr1023]=(0) THEN NULL ELSE [Expr1024]/CONVERT_IMPLICIT(money,[Expr1023],0) END))
    | |--Stream Aggregate(DEFINE:([Expr1023]=COUNT_BIG([tempdb].[dbo].[#SalesOrderHeader].[TotalDue] as [H].[TotalDue]), [Expr1024]=SUM([tempdb].[dbo].[#SalesOrderHeader].[TotalDue] as [H].[TotalDue])))
    | |--Table Spool
    |--Table Spool

    Over

    StmtText
    --------
    Table Insert(OBJECT:([#Test2Results]), SET:([#Test2Results].[CustomerID] = [tempdb].[dbo].[#SalesOrderHeader].[CustomerID],[#Test2Results].[SalesOrderID] = [tempdb].[dbo].[#SalesOrderHeader].[SalesOrderID],[#Test2Results].[TotalDue] = [tempdb].[dbo].[#SalesOrderHeader].[TotalDue],[#Test2Results].[PercentageOfTotal] = [Expr1008],[#Test2Results].[DifferenceFromAverage] = [Expr1009]))
    |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(numeric(19,4),[tempdb].[dbo].[#SalesOrderHeader].[TotalDue]/[Expr1006],0)*(100.0), [Expr1009]=[tempdb].[dbo].[#SalesOrderHeader].[TotalDue]-[Expr1007]))
    |--Top(ROWCOUNT est 0)
    |--Parallelism(Gather Streams, ORDER BY:([tempdb].[dbo].[#SalesOrderHeader].[CustomerID] ASC))
    |--Nested Loops(Inner Join)
    |--Table Spool
    | |--Segment
    | |--Sort(ORDER BY:([tempdb].[dbo].[#SalesOrderHeader].[CustomerID] ASC))
    | |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([tempdb].[dbo].[#SalesOrderHeader].[CustomerID]))
    | |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#SalesOrderHeader]))
    |--Nested Loops(Inner Join, WHERE:((1)))
    |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012] END, [Expr1007]=CASE WHEN [Expr1011]=(0) THEN NULL ELSE [Expr1012]/CONVERT_IMPLICIT(money,[Expr1011],0) END))
    | |--Stream Aggregate(DEFINE:([Expr1011]=COUNT_BIG([tempdb].[dbo].[#SalesOrderHeader].[TotalDue]), [Expr1012]=SUM([tempdb].[dbo].[#SalesOrderHeader].[TotalDue])))
    | |--Table Spool
    |--Table Spool

    ETA: It didn't like being in a code window, as those don't have horizontal scroll bars here. You'll have to imagine the smilies aren't there due to that.

  • Sergiy (10/22/2007)


    And for a clean test you need to try reversing the order of the tests.

    With no index in place, the server is supposed to create statistics on those columns during the 1st query and use them for the second one.

    I don't see you dropping statistics between runs.

    Why certainly, I can do that.

    Pass 1

    (1000000 row(s) affected)
    00:00:10:737 Duration (hh:mi:ss:mmm)
    ==============================================================================
    OverPartition method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:893 Duration (hh:mi:ss:mmm)
    ==============================================================================
    INNER JOIN method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:937 Duration (hh:mi:ss:mmm)
    ==============================================================================

    Pass 2

    (1000000 row(s) affected)
    00:00:10:690 Duration (hh:mi:ss:mmm)
    ==============================================================================
    OverPartition method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:720 Duration (hh:mi:ss:mmm)
    ==============================================================================
    INNER JOIN method...
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    DBCC execution completed. If DBCC printed error messages, contact your system administrator.
    (1000000 row(s) affected)
    00:00:04:783 Duration (hh:mi:ss:mmm)
    ==============================================================================

    Hmm, no difference, I'm shocked! Okay, not really.
