REPLACE Multiple Spaces with One

  • I copied and pasted your exact code and deleted the CLR one. (I didn't run my function version.)

    I also tested on our new 2008 server instead; my previous results were from a 2005 server.

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    1433

    SQL:

    353

    What are your system specs? Are you Express/Enterprise/Standard?

    brigzy (11/16/2009)


    newjcb (11/16/2009)


    @brigzy

    Before I go ahead and try again, can you first try switching the order around? Try running the loop version first, then the select, then the CLR. Your RAID controller may be caching the temp table and causing all queries after the first to run under different IO circumstances.

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2706

    SQL:

    6206

    SQL:CLR

    440

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    AS

    BEGIN

    WHILE CHARINDEX('  ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,'  ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

  • It's nice, but in this test, it loses by a large margin (on my desktop machine):

    if object_id(N'tempdb..#T') is null

    create table #T (

    ID int identity primary key,

    Col1 varchar(100));

    if object_id(N'tempdb..#T2') is not null

    drop table #T2;

    if object_id(N'tempdb..#T3') is not null

    drop table #T3;

    set nocount on;

    if not exists

    (select *

    from #T)

    insert into #T (Col1)

    select 'a' + replicate(' ', abs(checksum(newid()))%98) + 'b'

    from dbo.Numbers N1

    cross join dbo.Numbers N2

    where N1.Number between 1 and 1000

    and N2.Number between 1 and 1000;

    set statistics time on;

    select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2

    into #T2

    from #T;

    select

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(Col1, replicate(' ', 10), ' '),

    replicate(' ', 9), ' '),

    replicate(' ', 8), ' '),

    replicate(' ', 7), ' '),

    replicate(' ', 6), ' '),

    replicate(' ', 5), ' '),

    replicate(' ', 4), ' '),

    replicate(' ', 3), ' '),

    replicate(' ', 2), ' ') as Col3

    into #T3

    from #T;

    set statistics time off;

    select *

    from #T2

    where Col2 like '% %';

    select *

    from #T3

    where col3 like '% %';

    SQL Server Execution Times:

    CPU time = 7859 ms, elapsed time = 7942 ms.

    SQL Server Execution Times:

    CPU time = 3797 ms, elapsed time = 3837 ms.

    10 runs, each time, the second one came in at about half the time of the other.

    Changed Col1 in #T to varchar(1000), modified the checksum modulus accordingly, and it came out at 67.304 seconds total, 66.375 CPU for the article method, and 15.815 total, 15.360 CPU, for the nested replaces method.

    If you are dealing with truly phenomenal numbers of duplicate characters, you can nest a replicate 100 inside the replicate 10, or a replicate 1000, and so on. I've found that, with up to hundreds or low thousands of characters, 10, 9, 8, ... 3, 2, is as far as you need to go. Up to 5 is almost always enough, but I'm into overkill.

    The reason for this speed advantage is simple enough. If you have 30 spaces (or whatever duplicate character you're getting rid of) in a row, then the article's method has to replace 15 of them, then replace 14 of those, then get rid of one more character, while the nested replicates method just has to replace 3 tens and 1 three, and it's done. Far fewer CPU cycles. For 9 duplicates, the nested replicates method does one replace and is done, while the alternating characters method still has to go through all three cycles. More CPU cycles, more RAM addressing, more RAM rewrites.
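GSquared's argument can be sanity-checked outside SQL Server. Here is a minimal Python sketch (my own illustration, not code from the thread) of both methods, confirming they produce identical output:

```python
def alternating_chars(s, marker="þ"):
    # Article's method: mark each space, drop marker+space pairs,
    # then drop any leftover markers. Three passes regardless of run length.
    return s.replace(" ", " " + marker).replace(marker + " ", "").replace(marker, "")

def nested_replicates(s):
    # GSquared's method: collapse runs of 10, 9, ..., 2 spaces down to one,
    # each replicate applied exactly once, mirroring the nested REPLACEs.
    for n in range(10, 1, -1):
        s = s.replace(" " * n, " ")
    return s

sample = "a" + " " * 30 + "b" + " " * 9 + "c"
print(alternating_chars(sample))   # a b c
print(nested_replicates(sample))   # a b c
```

For the 30-space run, the nested-replicates version needs only the 10-space and 3-space substitutions, which is exactly the replace-count argument made above.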

    If someone else can get different results, where the alternating characters method is actually faster, I'd love to see the test.

    Edit: Just realized the mod 98 should have had a +1 to actually fill up the column, but that's just to be more tidy. Doesn't actually affect the test.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I just got motivated and wrote two CLRs for this and tested on the same 2008 server from my above post with 100000 rows as per brigzy's example.

    Beginning execution loop

    Batch execution completed 100000 times.

    SQL:Looping

    11786

    SQL:

    3573

    SQL:CLRLOOP

    2650

    SQL:CLRONCE

    1873

    Code for CLRLOOP

    while (inData.IndexOf("  ") > -1)

    {

    inData = inData.Replace("  ", " ");

    }

    return inData;

    Code for CLRONCE:

    StringBuilder result = new StringBuilder();

    int pos = 0;

    int nextWriteStart = -1;

    int length = inData.Length;

    bool lastWasWhiteSpace = false;

    bool buffering = false;

    while (pos < length) {

    if (lastWasWhiteSpace && inData[pos] == ' ' && buffering) {

    result.Append(inData.Substring(nextWriteStart, (pos - nextWriteStart)));

    nextWriteStart = -1;

    buffering = false;

    } else if (nextWriteStart == -1 && (inData[pos] != ' ' || (!lastWasWhiteSpace && inData[pos] == ' '))) {

    nextWriteStart = pos;

    buffering = true;

    }

    if (inData[pos] == ' ')

    {

    lastWasWhiteSpace = true;

    }

    else

    {

    lastWasWhiteSpace = false;

    }

    pos++;

    }

    if (buffering)

    {

    result.Append(inData.Substring(nextWriteStart, (pos - nextWriteStart)));

    }

    return result.ToString();

    For the record, I purposely made it return a single space at the beginning and/or end, because the problem description really just says to remove more than one space, not leading and trailing spaces. But a trim at the return statement won't hurt anything.
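The single-pass idea behind CLRONCE translates to any language. This is a hypothetical Python sketch of the same scan (not the posted C#), keeping at most one space per run and, like the C# version, preserving a single leading/trailing space:

```python
def clean_once(s):
    # One forward pass: copy characters, emitting at most one space
    # per run of consecutive spaces.
    out = []
    last_was_space = False
    for ch in s:
        if ch == " ":
            if not last_was_space:
                out.append(" ")
            last_was_space = True
        else:
            out.append(ch)
            last_was_space = False
    return "".join(out)

print(repr(clean_once("  a   b c  ")))  # ' a b c '
```

The appeal of this shape is that it touches each character exactly once, with no repeated rescanning of the string.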

  • Nadrek (11/16/2009)


    I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.

    Note that the OX style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation, as the strings become longer (and thus fewer rows fit per page; if the pages were already full, they have to split).

    When I ran into Jeff's article, my first thought was to do a recursive replace of 32 blanks to 1 blank, followed by 16 blanks to 1 blank, etc., down to 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.

    BUT Jeff's code seems to do this even with 8000 blanks.

    It's now up to the performance wizards to declare who is the winner on one million records with one column containing 8000 blanks.
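Nadrek's depth calculation can be demonstrated with a short sketch (my own illustration): each "two spaces → one space" pass turns a run of n spaces into ceil(n/2), so ceil(log2(max length)) passes are enough:

```python
from math import ceil, log2

def collapse_spaces(s, max_len=8000):
    # Each "two spaces -> one space" pass halves a run (rounding up),
    # so ceil(log2(max_len)) passes -- 13 for 8000, since 2^13 = 8192 --
    # handle any run a VARCHAR(8000) column can hold.
    for _ in range(ceil(log2(max_len))):
        s = s.replace("  ", " ")
    return s

print(collapse_spaces("x" + " " * 8000 + "y"))  # x y
```

In T-SQL the loop would instead be written as 13 nested REPLACE() calls, but the arithmetic is the same.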

  • Michael Meierruth (11/16/2009)


    Nadrek (11/16/2009)


    I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.

    Note that the OX style replacing, when X is multiple characters, can theoretically result in severe internal fragmentation, as the strings become longer (and thus fewer rows fit per page; if the pages were already full, they have to split).

    When I ran into Jeff's article, my first thought was to do a recursive replace of 32 blanks to 1 blank, followed by 16 blanks to 1 blank, etc., down to 2 blanks to 1 blank. This would be based on some maximum number of blanks I would consider reasonable.

    BUT Jeff's code seems to do this even with 8000 blanks.

    It's now up to the performance wizards to declare who is the winner on one million records with one column containing 8000 blanks.

    To test 1-million at 8,000, I ran the below:

    if object_id(N'tempdb..#T') is null

    create table #T (

    ID int identity primary key,

    Col1 varchar(8000));

    if object_id(N'tempdb..#T2') is not null

    drop table #T2;

    if object_id(N'tempdb..#T3') is not null

    drop table #T3;

    set nocount on;

    if not exists

    (select *

    from #T)

    insert into #T (Col1)

    select 'a' + replicate(' ', abs(checksum(newid()))%7999) + 'b'

    from dbo.Numbers N1

    cross join dbo.Numbers N2

    where N1.Number between 1 and 1000

    and N2.Number between 1 and 1000;

    set statistics time on;

    select replace(replace(replace(Col1, ' ', ' c'), 'c ', ''), 'c', '') as Col2

    into #T2

    from #T;

    select

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(Col1, replicate(' ', 1000), ' '),

    replicate(' ', 10), ' '),

    replicate(' ', 9), ' '),

    replicate(' ', 8), ' '),

    replicate(' ', 7), ' '),

    replicate(' ', 6), ' '),

    replicate(' ', 5), ' '),

    replicate(' ', 4), ' '),

    replicate(' ', 3), ' '),

    replicate(' ', 2), ' ') as Col3

    into #T3

    from #T;

    set statistics time off;

    if exists

    (select *

    from #T2

    where Col2 like '% %')

    print 'Test 1 failed';

    if exists

    (select *

    from #T3

    where col3 like '% %')

    print 'Test 2 failed';

    Only change from my prior test was adding a replicate 1000 into the nested replaces.

    Results:

    SQL Server Execution Times:

    CPU time = 530859 ms, elapsed time = 559134 ms.

    SQL Server Execution Times:

    CPU time = 50828 ms, elapsed time = 134154 ms.

    The Alternating Characters method took over 10 times as much CPU time, and over 4 times as much total time, as the Nested Replace method. (It's a Core2 Duo, so CPU being higher than total isn't unusual.)

    Tested it with a replicate 100 between the 1000 and the 10, and that took the time to:

    SQL Server Execution Times:

    CPU time = 44484 ms, elapsed time = 120706 ms.

    Not much of an improvement, but it is better. Can be fine-tuned more, most likely. But with the speed difference I'm seeing already, I'm not too worried about making it "perfect".

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared: did you try hard-coded space strings instead of the pile of replicates? I would HOPE the optimizer expands those at compile time, but if it doesn't, hard-coding should shave off some CPU ticks.

    newjcb: is StringBuilder the most efficient mechanism for doing this string manipulation in CLR? Last I heard (years ago), it was a pretty bloated structure.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Sorry, have been out for a curry the last few hours... :hehe:

    What are your system specs? Are you Express/Enterprise/Standard?

    Vista, SQL Standard 2005 SP3 (local laptop install),

    Mem 2GB (1.5GB in use), Intel Core Duo

    I still get the following using batch & C# CLR I posted earlier

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2520

    SQL:

    6040

    SQL:CLR

    440

  • I'm going to run out of 'stars' at this rate 😉

  • Taking the ||*9*9|| technique down to C# is sweet!

    This brings the C# down to just one line of code.

    If the ||*9*9|| token were shorter, it could be even faster.

    Here are the test results:

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2556

    SQL:

    6433

    SQL:CLR looping

    443

    SQL:CLR Not looping using ||*9*9|| technique in C#

    310

    Here is the C#

    using System;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    [SqlFunction()]

    public static string fn_CleanStringNotLooping_CLR(string CleanMeUpScottie)

    {

    return CleanMeUpScottie.Replace(" "," ||*9*9||")

    .Replace("||*9*9|| ","").Replace("||*9*9||","");

    }

    [SqlFunction()]

    public static string fn_CleanStringLooping_CLR(string CleanMeUpScottie)

    {

    while (CleanMeUpScottie.IndexOf("  ", 0) != -1)

    {

    CleanMeUpScottie = CleanMeUpScottie.Replace("  ", " ");

    }

    return CleanMeUpScottie;

    }

    };

    SQL Batch

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    AS

    BEGIN

    WHILE CHARINDEX('  ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,'  ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringLooping_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR looping '

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanStringNotLooping_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR Not looping using ||*9*9|| technique in C#'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

    C++ anyone?

  • brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster. What I liked about this article was the encouragement to think set-based and apply some old-school creative thinking to an old problem. No doubt a thorough treatment of which solution is 'best' (for some well-defined value of 'best') would make an interesting article.

    Paul

  • Hi, how about this code? Will it work correctly or not?

    DECLARE @tmpstr2 Varchar(MAX)

    SELECT @tmpstr2=replace(OriginalString, char(10) + char(13), '') FROM @Demo

    SELECT @tmpstr2=replace(@tmpstr2,' ','|')

    SELECT @tmpstr2=replace(@tmpstr2,'|','')

    SELECT @tmpstr2

  • Paul White (11/16/2009)


    brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster.

    Yes, we did clarify expected/obvious:

    We might have expected C# to carry the day, but the Query/Function result is surprising.

    I think it is quite reasonable, and hopefully of general interest, to discuss the fastest C# technique on the subject. Seeing how simple the C# code is might also encourage pure T-SQL folk to venture into C#.

  • Jeff: Neat. Whether it's the best on SQL Server or not is only partly the point (for me). I just like the different approach to the problem: don't attack it directly; transform it into a smaller set of problems that _do_ lend themselves to a direct solution.

    To the CLR freaks: The OP stated it had to be SQL 2000.

  • SDM (11/17/2009)


    To the people who can program in a .NET language and recognise when to use it appropriately to complement T-SQL:

    Fixed that for ya! 😀

    SDM (11/17/2009)


    The OP stated it had to be SQL 2000.

    OP? It's an article! And where does it specify 2000? Read it again carefully 😛

  • brigzy (11/16/2009)


    We might have expected C# to carry the day, but the Query/Function result is surprising.

    Not really, no. Using the string "||*9*9||" as a replacement has a huge hidden cost, enough to make the normally less-efficient WHILE-loop-in-a-function approach seem faster. (Let us leave aside for a second the fact that the article stipulates that the string to be processed should not be made longer at any point.)

    If we use "þ" as the replacement character, we don't change the overall length of the string. This allows SQL Server to do an in-place substitution. When using the long string, the source string has to be split at the replacement point, the "||*9*9||" inserted, and then the remainder of the original string tacked onto the end. The point is that there is an awful lot of string manipulation to be done, along with some allocations and copies, which just aren't needed with the "þ" method.

    Using the single-character method, a quick test using your code produces a run time of around 4.9s for the function, and 3.2s for the article's method.
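Paul's point about the hidden cost is easy to quantify. A sketch (my own, not from the thread) comparing intermediate string lengths for a single-character token versus "||*9*9||":

```python
def clean(s, token):
    # Jeff's three-REPLACE pattern, parameterised by the marker token.
    return s.replace(" ", " " + token).replace(token + " ", "").replace(token, "")

s = "a" + " " * 50 + "b"
# The first REPLACE grows every space by len(token) characters, so the
# intermediate string balloons with a long token but barely grows with "þ".
print(len(s.replace(" ", " þ")))         # 102
print(len(s.replace(" ", " ||*9*9||")))  # 452
print(clean(s, "þ") == clean(s, "||*9*9||") == "a b")  # True
```

Both tokens give the same final answer; only the size of the intermediate strings, and hence the amount of copying, differs.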

    While I'm here, I'd just like to mention that the nested-replace method mentioned by Joe Celko and illustrated by G2 is very fast. Probably not as fast as a similarly constructed .NET method, but good nonetheless. The problem with it is complexity and verbosity, at least to my mind. Simply put, it's just not elegant. I like Jeff's method for the reasons mentioned previously.

    Paul

Viewing 15 posts - 61 through 75 (of 425 total)
