Replace non numeric characters in string

  • All right... You asked for it now.....:)

    First - the test scenario. Using Gus' table, and functions. I also threw in two variant to make sure that the system knew they were deterministic. Just added WITH SCHEMABINDING in there. There was another thread about this earlier seeming to point to the "flagged ones" performing better in certain cases.

    Just to play fair - we'll also throw in FREEPROCCACHE and DROPCLEANBUFFERS after each, so that noone gets crowded or gets to reuse anything from previously.

    Finally - just to catch the performance factor, and not be slowed down by returns to the screen - I will assign the values to a variable (meaning, it processes the replace then throws it away).

    Performed against 1M rows.

    Print replicate('=',60)

    print 'numberclean1'

    Print replicate('=',60)

    Declare @t varchar(100)

    set statistics IO on

    set statistics time on

    select top 1000000 @t=dbo.numberclean1(clean)

    from numberclean

    set statistics IO off

    set statistics time off

    go

    dbcc freeproccache

    dbcc dropcleanbuffers

    go

    Print replicate('=',60)

    print 'numberclean1determ'

    Print replicate('=',60)

    Declare @t varchar(100)

    set statistics IO on

    set statistics time on

    select top 1000000 @t=dbo.NumberClean1Determ(clean)

    from numberclean

    set statistics IO off

    set statistics time off

    go

    dbcc freeproccache

    dbcc dropcleanbuffers

    go

    Print replicate('=',60)

    print 'numberclean2'

    Print replicate('=',60)

    Declare @t varchar(100)

    set statistics IO on

    set statistics time on

    select top 1000000 @t=dbo.numberclean2(clean)

    from numberclean

    set statistics IO off

    set statistics time off

    go

    dbcc freeproccache

    dbcc dropcleanbuffers

    go

    Print replicate('=',60)

    print 'numberclean2determ'

    Print replicate('=',60)

    Declare @t varchar(100)

    set statistics IO on

    set statistics time on

    select top 1000000 @t=dbo.NumberClean2Determ(clean)

    from numberclean

    set statistics IO off

    set statistics time off

    go

    dbcc freeproccache

    dbcc dropcleanbuffers

    go

    Print replicate('=',60)

    print 'regexreplace'

    Print replicate('=',60)

    Declare @t varchar(100)

    set statistics IO on

    set statistics time on

    select top 1000000 @t=dbo.regexreplace(clean,'[^0-9]','')

    from numberclean

    set statistics IO off

    set statistics time off

    go

    dbcc freeproccache

    dbcc dropcleanbuffers

    go

    And now...for the results:

    ============================================================

    numberclean1

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 2241, physical reads 1, read-ahead reads 3024,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 90328 ms, elapsed time = 175244 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    numberclean1determ

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 2241, physical reads 1, read-ahead reads 3024,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 89234 ms, elapsed time = 175231 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    numberclean2

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 2241, physical reads 1, read-ahead reads 3024,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 107312 ms, elapsed time = 212591 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    numberclean2determ

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 2241, physical reads 1, read-ahead reads 3024,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 107656 ms, elapsed time = 209686 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    regexreplace

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 2241, physical reads 1, read-ahead reads 3024,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 31219 ms, elapsed time = 32343 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    As I said... you DID ask for it....:)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Did you try the stacked CTEs script that I tested last, or just the inline functions?

    Judging by the IO numbers, I'm going to assume you used the inline function that Jeff wrote.

    Looking at the CPU time, your CLR was marginally faster than the inline function.

    Try using the stacked CTEs on the same data. In my test, that was about 1/4th the CPU time as the the inline function, and 1/10th the total time.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/19/2008)


    Did you try the stacked CTEs script that I tested last, or just the inline functions?

    Judging by the IO numbers, I'm going to assume you used the inline function that Jeff wrote.

    Looking at the CPU time, your CLR was marginally faster than the inline function.

    Try using the stacked CTEs on the same data. In my test, that was about 1/4th the CPU time as the the inline function, and 1/10th the total time.

    Will do about the CTE's - but three times faster is Marginally??? The Regex function is the last one one the list.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Results (executed against all 4M rows):

    ============================================================

    CTE

    ============================================================

    Table 'Tally'. Scan count 4000000, logical reads 16750657, physical reads 1, read-ahead reads 8,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'NumberClean'. Scan count 3, logical reads 12822, physical reads 1, read-ahead reads 11686,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 147016 ms, elapsed time = 454307 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    regexreplace

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 17126, physical reads 1, read-ahead reads 17119,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 127844 ms, elapsed time = 138779 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    The CPU time isn't too bad - but the elapsed time and the IO's hurt, badly. Tally is getting hammered.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (3/19/2008)...but three times faster is Marginally??? The Regex function is the last one one the list.

    Sorry, didn't look far enough down on the list of results you provided. Just saw the top two.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/19/2008)


    Matt Miller (3/19/2008)...but three times faster is Marginally??? The Regex function is the last one one the list.

    Sorry, didn't look far enough down on the list of results you provided. Just saw the top two.

    No issue - I should have highlighted the cliff notes...:)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Yeah, the scan count on Numbers is totally insane. I saw the same thing when I originally put it together. Looks like the regex is probably best for this, at least so far.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • My next test has a major caveat on it: It won't work if there are two or more separate numbers in a field. For example, "123 dogs, 5 cats", it will return "123 dogs, 5". It will, however, work on "there are 123 dogs of various breeds".

    Unlike the other versions I've tested, it does work on numbers with commas and/or decimal points. For example, it will handle: "123.45", or "1,000", correctly. Complex numeric expressions, like "10^-23" will also survive this version. "Sqrt 16", won't.

    set statistics io on

    set statistics time on

    declare @Count bigint

    select @count =

    reverse(

    substring(

    reverse(

    substring(clean, patindex('%[0-9]%', clean), len(clean))),

    patindex('%[0-9]%',

    reverse(

    substring(clean, patindex('%[0-9]%', clean), len(clean)))),

    len(clean)))

    from dbo.numberclean

    =========================

    1st Run

    =========================

    Table 'NumberClean'. Scan count 1, logical reads 15626, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 11000 ms, elapsed time = 11061 ms.

    =========================

    2nd Run

    =========================

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 2 ms.

    Table 'NumberClean'. Scan count 1, logical reads 15626, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 10985 ms, elapsed time = 11221 ms.

    ==========================

    That's on 4-million rows, no indexes, etc. Same tables as my last tests.

    How does that do on your machine?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • For what it's worth, here's my entry:

    create function ExtractDigits (

    @Pattern varchar( 8000 )

    )

    RETURNS varchar( 8000 )

    as begin

    declare @Result varchar( 8000 );

    select @Result = '';

    select @Result = @Result + case

    when SubString( @Pattern, t.ZN, 1 ) like '[0-9\-+.]' escape '\'

    then SubString( @Pattern, t.ZN, 1 )

    else ''

    end

    from Utility.dbo.Tally t

    where t.ZN between 1 and Len( @Pattern );

    return @Result;

    end--Function

    The "between" is because my Tally table starts at zero. I've also added the characters '-', '+' and '.' to the list of acceptable characters -- or just use "like '[0-9]'" if you want digits only.

    I haven't tested it against the other entries to see how well it compares, but 96% is in one index seek (the Tally table) so it's not inefficient. However, it will return all digits no matter how they appear in the input string: 'abc123xyz890anything else5' will return '1238905'.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • GSquared (3/19/2008)


    That's on 4-million rows, no indexes, etc. Same tables as my last tests.

    How does that do on your machine?

    About like yours. It kills the regex replace handily. Same with a match function.

    Knowing the pattern allows you to come up with something better.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (3/19/2008)


    As I said... you DID ask for it....:)

    And now, I have a target... 😛 Dunno if I can beat the awesome speed of Regex, but I'll sure try...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Matt Miller (3/19/2008)


    ...Knowing the pattern allows you to come up with something better.

    Yeah.

    Of course, as in my prior example, turning "235 dogs, 5 cats", into "2355", is probably not a sane transformation anyway. In cases like that, the data would have to be parsed differently before it was stripped for numbers anyway. In a simple case, where there's one number surrounded by text/other symbols, I think the string functions are the fastest I'm going to come up with.

    If I knew even more about the pattern, it might be possible to strip more out of it. If, for example, the pattern was that the number is always the first part of the string and is always followed by a space before the text, that would be easier to remove, and somewhat faster.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/20/2008)


    Matt Miller (3/19/2008)


    ...Knowing the pattern allows you to come up with something better.

    Yeah.

    Of course, as in my prior example, turning "235 dogs, 5 cats", into "2355", is probably not a sane transformation anyway. In cases like that, the data would have to be parsed differently before it was stripped for numbers anyway. In a simple case, where there's one number surrounded by text/other symbols, I think the string functions are the fastest I'm going to come up with.

    If I knew even more about the pattern, it might be possible to strip more out of it. If, for example, the pattern was that the number is always the first part of the string and is always followed by a space before the text, that would be easier to remove, and somewhat faster.

    True - knowing the pattern (and knowing that the pattern actually holds throughout) is a HUGE advantage.

    But sometimes even knowing the pattern won't help. One of the other test Jeff and I ran over the summer was a redaction test (as in - "find every phone number/SSN/fill in your own pattern in a vary large varchar field, and replace it with something else".) Jeff and I managed to "tie" in my mind on that one, but the "harder the pattern to find, the better Regex does" from what I've seen. Ultimately because it's a better pattern matcher.

    In the meantime - let's wait and see what Jeff comes up with. Always a nail-biter:)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • By the way - here are the results for the "top 4", based on 4M rows... (Excluding the "I know the pattern" solution)

    Tomm - yours made the list. It's about on par with the other T-SQL function ones.

    ============================================================

    regexreplace

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 17126, physical reads 1, read-ahead reads 17119,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 145265 ms, elapsed time = 156782 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    Gsquared's CTE Solution

    ============================================================

    Table 'Tally'. Scan count 4000000, logical reads 16750660, physical reads 1, read-ahead reads 8,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'NumberClean'. Scan count 3, logical reads 12822, physical reads 2, read-ahead reads 11690,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 166719 ms, elapsed time = 441303 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ============================================================

    Gsquared's numberclean1

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 11684, physical reads 1, read-ahead reads 11690,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 375750 ms, elapsed time = 706387 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 30 ms.

    ============================================================

    Tomm's ExtractDigit

    ============================================================

    Table 'NumberClean'. Scan count 1, logical reads 11684, physical reads 1, read-ahead reads 11690,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 437281 ms, elapsed time = 850318 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • GSquared (3/20/2008)


    Matt Miller (3/19/2008)


    ...Knowing the pattern allows you to come up with something better.

    Yeah.

    Of course, as in my prior example, turning "235 dogs, 5 cats", into "2355", is probably not a sane transformation anyway. In cases like that, the data would have to be parsed differently before it was stripped for numbers anyway. In a simple case, where there's one number surrounded by text/other symbols, I think the string functions are the fastest I'm going to come up with.

    If I knew even more about the pattern, it might be possible to strip more out of it. If, for example, the pattern was that the number is always the first part of the string and is always followed by a space before the text, that would be easier to remove, and somewhat faster.

    If you go via a CLR regex function to render "235 dogs, 5 cats" to "2355" then [considering u r doing millions of recs] this a bit more advanced replace would probably give you better performance:

    regex.replace("235 dogs, 5 cats", @"\D+","")

    --returns: "2355"

    it allows the .NET regex engine to roll from Nth to (N+1)th occurrence of a digit [\d] without stopping for a replace, capture all the \D (non-digits) and replace them at once with an empty str.

Viewing 15 posts - 16 through 30 (of 81 total)

You must be logged in to reply to this topic. Login to reply