REPLACE Multiple Spaces with One

  • I copied and pasted your exact code and deleted the CLR one. (I didn't run my function version.)

    I also tested on our new 2008 server instead; my previous results were from a 2005 server.

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    1433

    SQL:

    353

    What are your system specs? Are you Express/Enterprise/Standard?

    brigzy (11/16/2009)


    newjcb (11/16/2009)


    @brigzy

    Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then the CLR. Your RAID controller may be caching the temp table and causing all queries after the first to run under different IO circumstances.

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2706

    SQL:

    6206

    SQL:CLR

    440

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    AS

    BEGIN

    WHILE CHARINDEX('  ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,'  ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

  • It's nice, but in this test, it loses by a large margin (on my desktop machine):

    if object_id(N'tempdb..#T') is null

    create table #T (

    ID int identity primary key,

    Col1 varchar(100));

    if object_id(N'tempdb..#T2') is not null

    drop table #T2;

    if object_id(N'tempdb..#T3') is not null

    drop table #T3;

    set nocount on;

    if not exists

    (select *

    from #T)

    insert into #T (Col1)

    select 'a' + replicate(' ', abs(checksum(newid()))%98) + 'b'

    from dbo.Numbers N1

    cross join dbo.Numbers N2

    where N1.Number between 1 and 1000

    and N2.Number between 1 and 1000;

    set statistics time on;

    select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2

    into #T2

    from #T;

    select

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(Col1, replicate(' ', 10), ' '),

    replicate(' ', 9), ' '),

    replicate(' ', 8), ' '),

    replicate(' ', 7), ' '),

    replicate(' ', 6), ' '),

    replicate(' ', 5), ' '),

    replicate(' ', 4), ' '),

    replicate(' ', 3), ' '),

    replicate(' ', 2), ' ') as Col3

    into #T3

    from #T;

    set statistics time off;

    select *

    from #T2

    where Col2 like '% %';

    select *

    from #T3

    where col3 like '% %';

    SQL Server Execution Times:

    CPU time = 7859 ms, elapsed time = 7942 ms.

    SQL Server Execution Times:

    CPU time = 3797 ms, elapsed time = 3837 ms.

    10 runs, each time, the second one came in at about half the time of the other.

    Changed Col1 in #T to varchar(1000), modified the checksum modulus accordingly, and it came out at 67.304 seconds total, 66.375 CPU for the article method, and 15.815 total, 15.360 CPU, for the nested replaces method.

    If you are dealing with truly phenomenal numbers of duplicate characters, you can nest a replicate 100 inside the replicate 10, or a replicate 1000, and so on. I've found that, with up to hundreds or low thousands of characters, 10, 9, 8, ... 3, 2, is as far as you need to go. Up to 5 is almost always enough, but I'm into overkill.

    The reason for this speed advantage is simple enough. If you have 30 spaces (or whatever duplicate character you're getting rid of) in a row, then the article's method has to replace 15 of them, then replace 14 of those, then get rid of one more character, while the nested replicates method just has to replace 3 tens and 1 three, and it's done. Far fewer CPU cycles. For 9 duplicates, the nested replicates method does one replace and is done, while the alternating characters method still has to go through all three cycles. More CPU cycles, more RAM addressing, more RAM rewrites.
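GSquared's argument can be sanity-checked outside SQL Server. Here is a minimal Python sketch (my own illustration, not code from the thread) of both methods, confirming they produce identical output:

```python
def alternating_chars(s, marker="þ"):
    # Article's method: mark each space, drop marker+space pairs,
    # then drop any leftover markers. Three passes regardless of run length.
    return s.replace(" ", " " + marker).replace(marker + " ", "").replace(marker, "")

def nested_replicates(s):
    # GSquared's method: collapse runs of 10, 9, ..., 2 spaces down to one,
    # each replicate applied exactly once, mirroring the nested REPLACEs.
    for n in range(10, 1, -1):
        s = s.replace(" " * n, " ")
    return s

sample = "a" + " " * 30 + "b" + " " * 9 + "c"
print(alternating_chars(sample))   # a b c
print(nested_replicates(sample))   # a b c
```

For the 30-space run, the nested-replicates version needs only the 10-space and 3-space substitutions, which is exactly the replace-count argument made above.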

    If someone else can get different results, where the alternating characters method is actually faster, I'd love to see the test.

    Edit: Just realized the mod 98 should have had a +1 to actually fill up the column, but that's just to be more tidy. Doesn't actually affect the test.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I just got motivated and wrote two CLRs for this and tested on the same 2008 server from my above post with 100000 rows as per brigzy's example.

    Beginning execution loop

    Batch execution completed 100000 times.

    SQL:Looping

    11786

    SQL:

    3573

    SQL:CLRLOOP

    2650

    SQL:CLRONCE

    1873

    Code for CLRLOOP

    while (inData.IndexOf("  ") > -1)

    {

    inData = inData.Replace("  ", " ");

    }

    return inData;

    Code for CLRONCE:

    StringBuilder result = new StringBuilder();

    int pos = 0;

    int nextWriteStart = -1;

    int length = inData.Length;

    bool lastWasWhiteSpace = false;

    bool buffering = false;

    while (pos < length) {

    if (lastWasWhiteSpace && inData[pos] == ' ' && buffering) {

    result.Append(inData.Substring(nextWriteStart, (pos - nextWriteStart)));

    nextWriteStart = -1;

    buffering = false;

    } else if (nextWriteStart == -1 && (inData[pos] != ' ' || (!lastWasWhiteSpace && inData[pos] == ' '))) {

    nextWriteStart = pos;

    buffering = true;

    }

    if (inData[pos] == ' ')

    {

    lastWasWhiteSpace = true;

    }

    else

    {

    lastWasWhiteSpace = false;

    }

    pos++;

    }

    if (buffering)

    {

    result.Append(inData.Substring(nextWriteStart, (pos - nextWriteStart)));

    }

    return result.ToString();

    For the record, I purposely made it return a single space at the beginning and/or end, because the problem description really just says to remove more than one space, not leading and trailing spaces. But a trim at the return statement won't hurt anything.
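The single-pass idea behind CLRONCE translates to any language. This is a hypothetical Python sketch of the same scan (not the posted C#), keeping at most one space per run and, like the C# version, preserving a single leading/trailing space:

```python
def clean_once(s):
    # One forward pass: copy characters, emitting at most one space
    # per run of consecutive spaces.
    out = []
    last_was_space = False
    for ch in s:
        if ch == " ":
            if not last_was_space:
                out.append(" ")
            last_was_space = True
        else:
            out.append(ch)
            last_was_space = False
    return "".join(out)

print(repr(clean_once("  a   b c  ")))  # ' a b c '
```

The appeal of this shape is that it touches each character exactly once, with no repeated rescanning of the string.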

  • Nadrek (11/16/2009)


    I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.

    Note that the OX style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation, as the strings become longer (and thus fewer rows fit per page; if the pages were already full, they have to split).

    When I ran into Jeff's article, my first thought was to do a recursive replace of 32 blanks to 1 blank, followed by 16 blanks to 1 blank, etc., down to 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.

    BUT Jeff's code seems to do this even with 8000 blanks.

    It's now up to the performance wizards to declare who is the winner on one million records with one column containing 8000 blanks.
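Nadrek's depth calculation can be demonstrated with a short sketch (my own illustration): each "two spaces → one space" pass turns a run of n spaces into ceil(n/2), so ceil(log2(max length)) passes are enough:

```python
from math import ceil, log2

def collapse_spaces(s, max_len=8000):
    # Each "two spaces -> one space" pass halves a run (rounding up),
    # so ceil(log2(max_len)) passes -- 13 for 8000, since 2^13 = 8192 --
    # handle any run a VARCHAR(8000) column can hold.
    for _ in range(ceil(log2(max_len))):
        s = s.replace("  ", " ")
    return s

print(collapse_spaces("x" + " " * 8000 + "y"))  # x y
```

In T-SQL the loop would instead be written as 13 nested REPLACE() calls, but the arithmetic is the same.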

  • Michael Meierruth (11/16/2009)


    Nadrek (11/16/2009)


    I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.

    Note that the OX style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation, as the strings become longer (and thus fewer rows fit per page; if the pages were already full, they have to split).

    When I ran into Jeff's article, my first thought was to do a recursive replace of 32 blanks to 1 blank, followed by 16 blanks to 1 blank, etc., down to 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.

    BUT Jeff's code seems to do this even with 8000 blanks.

    It's now up to the performance wizards to declare who is the winner on one million records with one column containing 8000 blanks.

    To test 1-million at 8,000, I ran the below:

    if object_id(N'tempdb..#T') is null

    create table #T (

    ID int identity primary key,

    Col1 varchar(8000));

    if object_id(N'tempdb..#T2') is not null

    drop table #T2;

    if object_id(N'tempdb..#T3') is not null

    drop table #T3;

    set nocount on;

    if not exists

    (select *

    from #T)

    insert into #T (Col1)

    select 'a' + replicate(' ', abs(checksum(newid()))%7999) + 'b'

    from dbo.Numbers N1

    cross join dbo.Numbers N2

    where N1.Number between 1 and 1000

    and N2.Number between 1 and 1000;

    set statistics time on;

    select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2

    into #T2

    from #T;

    select

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(Col1, replicate(' ', 1000), ' '),

    replicate(' ', 10), ' '),

    replicate(' ', 9), ' '),

    replicate(' ', 8), ' '),

    replicate(' ', 7), ' '),

    replicate(' ', 6), ' '),

    replicate(' ', 5), ' '),

    replicate(' ', 4), ' '),

    replicate(' ', 3), ' '),

    replicate(' ', 2), ' ') as Col3

    into #T3

    from #T;

    set statistics time off;

    if exists

    (select *

    from #T2

    where Col2 like '% %')

    print 'Test 1 failed';

    if exists

    (select *

    from #T3

    where col3 like '% %')

    print 'Test 2 failed';

    Only change from my prior test was adding a replicate 1000 into the nested replaces.

    Results:

    SQL Server Execution Times:

    CPU time = 530859 ms, elapsed time = 559134 ms.

    SQL Server Execution Times:

    CPU time = 50828 ms, elapsed time = 134154 ms.

    The Alternating Characters method took over 10 times as much CPU time, and over 4 times as much total time, as the Nested Replace method. (It's a Core2 Duo, so CPU being higher than total isn't unusual.)

    Tested it with a replicate 100 between the 1000 and the 10, and that took the time to:

    SQL Server Execution Times:

    CPU time = 44484 ms, elapsed time = 120706 ms.

    Not much of an improvement, but it is better. Can be fine-tuned more, most likely. But with the speed difference I'm seeing already, I'm not too worried about making it "perfect".

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared: did you try hard-coded space strings instead of the pile of replicates? I would HOPE the optimizer expands those at compile time, but if it doesn't, hard-coding should shave off some CPU ticks.

    newjcb: is StringBuilder the most efficient mechanism for doing this string manipulation in CLR? Last I heard (years ago), it was a pretty bloated structure.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Sorry, have been out for a curry the last few hours... :hehe:

    What are your system specs? Are you Express/Enterprise/Standard?

    Vista, SQL Standard 2005 SP3 (local laptop install),

    Mem 2GB (1.5GB in use), Intel Core Duo

    I still get the following using batch & C# CLR I posted earlier

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2520

    SQL:

    6040

    SQL:CLR

    440

  • I'm going to run out of 'stars' at this rate 😉

  • Taking the ||*9*9|| technique down to C# is sweet!

    This brings the C# down to just one line of code.

    If the ||*9*9|| token were shorter, it could be even faster.

    Here are the test results:

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2556

    SQL:

    6433

    SQL:CLR looping

    443

    SQL:CLR Not looping using ||*9*9|| technique in C#

    310

    Here is the C#

    using System;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    [SqlFunction()]

    public static string fn_CleanStringNotLooping_CLR(string CleanMeUpScottie)

    {

    return CleanMeUpScottie.Replace(" "," ||*9*9||")

    .Replace("||*9*9|| ","").Replace("||*9*9||","");

    }

    [SqlFunction()]

    public static string fn_CleanStringLooping_CLR(string CleanMeUpScottie)

    {

    while (CleanMeUpScottie.IndexOf("  ", 0) != -1)

    {

    CleanMeUpScottie = CleanMeUpScottie.Replace("  ", " ");

    }

    return CleanMeUpScottie;

    }

    };

    SQL Batch

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    AS

    BEGIN

    WHILE CHARINDEX('  ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,'  ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringLooping_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR looping '

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringNotLooping_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR Not looping using ||*9*9|| technique in C#'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

    C++ anyone?

  • brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster. What I liked about this article was the encouragement to think set-based and apply some old-school creative thinking to an old problem. No doubt a thorough treatment of which solution is 'best' (for some well-defined value of 'best') would make an interesting article.

    Paul

  • Hi, how about this code? Will it work correctly or not?

    DECLARE @tmpstr2 Varchar(MAX)

    SELECT @tmpstr2=replace(OriginalString, char(10) + char(13), '') FROM @Demo

    SELECT @tmpstr2=replace(@tmpstr2,' ','|')

    SELECT @tmpstr2=replace(@tmpstr2,'|','')

    SELECT @tmpstr2

  • Paul White (11/16/2009)


    brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster.

    Yes, we did clarify expected/obvious:

    We might have expected C# to carry the day, but the Query/Function result is surprising.

    I think it is quite reasonable, and hopefully of general interest, to discuss the fastest C# technique on the subject. Seeing how simple the C# code is might also encourage pure T-SQL folk to venture into C#.

  • Jeff: Neat. Whether it's the best on SQL Server or not is only partly the point (for me). I just like the different approach to the problem: don't attack it directly; transform it into a smaller set of problems that _do_ lend themselves to a direct solution.

    To the CLR freaks: The OP stated it had to be SQL 2000.

  • SDM (11/17/2009)


    To the people who can program in a .NET language and recognise when to use it appropriately to complement T-SQL:

    Fixed that for ya! 😀

    SDM (11/17/2009)


    The OP stated it had to be SQL 2000.

    OP? It's an article! And where does it specify 2000? Read it again carefully 😛

  • brigzy (11/16/2009)


    We might have expected C# to carry the day, but the Query/Function result is surprising.

    Not really, no. Using the string "||*9*9||" as a replacement has a huge hidden cost, enough to make the normally less-efficient WHILE-loop-in-a-function approach seem faster. (Let us leave aside for a second the fact that the article stipulates that the string to be processed should not be made longer at any point.)

    If we use "þ" as the replacement character, we don't change the overall length of the string. This allows SQL Server to do an in-place substitution. When using the long string, the source string has to be split at the replacement point, the "||*9*9||" inserted, and then the remainder of the original string tacked onto the end. The point is that there is an awful lot of string manipulation to be done, along with some allocations and copies, which just aren't needed with the "þ" method.

    Using the single-character method, a quick test using your code produces a run time of around 4.9s for the function, and 3.2s for the article's method.
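Paul's point about the hidden cost is easy to quantify. A sketch (my own, not from the thread) comparing intermediate string lengths for a single-character token versus "||*9*9||":

```python
def clean(s, token):
    # Jeff's three-REPLACE pattern, parameterised by the marker token.
    return s.replace(" ", " " + token).replace(token + " ", "").replace(token, "")

s = "a" + " " * 50 + "b"
# The first REPLACE grows every space by len(token) characters, so the
# intermediate string balloons with a long token but barely grows with "þ".
print(len(s.replace(" ", " þ")))         # 102
print(len(s.replace(" ", " ||*9*9||")))  # 452
print(clean(s, "þ") == clean(s, "||*9*9||") == "a b")  # True
```

Both tokens give the same final answer; only the size of the intermediate strings, and hence the amount of copying, differs.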

    While I'm here, I'd just like to mention that the nested-replace method mentioned by Joe Celko and illustrated by G2 is very fast. Probably not as fast as a similarly constructed .NET method, but good nonetheless. The problem with it is complexity and verbosity, at least to my mind. Simply put, it's just not elegant. I like Jeff's method for the reasons mentioned previously.

    Paul

Viewing 15 posts - 61 through 75 (of 425 total)
