April 30, 2015 at 11:39 am
One of my varchar columns in a table has multiple key words enclosed in a pattern of special characters.
Eg: William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world's pre-eminent [##dramatist##]. He is often called England's national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].
I need to write to query to find all distinct key words that are enclosed within [## and ##]. My query should yield the following results from the string in the example above
[##actor##]
[##dramatist##]
[##playwright##] -- 2 occurrances, but I need it only once in my result set
[##poems##]
[##poet##] -- 2 occurrances, but I need it only once in my result set
[##sonnets##]
[##verses##]
[##writer##]
I need to run this on a large table, so I am looking for the best possible way to minimize any performance issues.
Just give you sample code, I have provided below 2 separate snippets, one with table variable and another with temp table.
DECLARE @MyTable TABLE (MyString VARCHAR (8000))
INSERT @MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
SELECT * from @MyTable
IF OBJECT_ID('tempdb..#MyTable') IS NOT NULL
DROP TABLE #MyTable
CREATE TABLE #MyTable (MyString VARCHAR (8000))
INSERT #MyTable VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
SELECT * from #MyTable
Thanks in advance.
April 30, 2015 at 11:51 am
I would probably start with Dwain Camps' article here:
http://qa.sqlservercentral.com/articles/String+Manipulation/94365/
I might go about it by writing all the locations (use CharIndex) of a string to a temp table and then doing a select distinct on it. Not pretty for sure, but performance on something like this without a temp table would probably be hideous.
April 30, 2015 at 12:00 pm
Thanks for the link, @pietlinden. I will go through the article.
Just to re-phrase my original request, I am not looking for a single query. This is a one-time effort, so a T-SQL block with temp tables is just fine.
Thanks!
April 30, 2015 at 12:57 pm
Maybe the DelimitedSplit8k will be faster as there's no really a need for wildcards.
http://qa.sqlservercentral.com/articles/Tally+Table/72993/
SELECT SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3) AS keyword,
'[' + LEFT( Item, CHARINDEX( '#', Item, 3) + 2) AS keywordWrapped,
COUNT(*) AS Occurences
from #MyTable
CROSS APPLY dbo.DelimitedSplit8K( MyString, '[')
WHERE ItemNumber > 1
GROUP BY MyString,
SUBSTRING( Item, 3, CHARINDEX( '#', Item, 3) - 3),
LEFT( Item, CHARINDEX( '#', Item, 3) + 2)
April 30, 2015 at 1:35 pm
SELECT SUBSTRING(string, t.tally + 3, CHARINDEX('##]', string, t.tally + 3) - t.tally - 3)
FROM ( --#MyTable
SELECT CAST('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].' AS varchar(8000)) AS string
) AS test_data
INNER JOIN dbo.tally t ON
SUBSTRING(string, t.tally, 3) = '[##' AND
CHARINDEX('##]', string, t.tally + 3) > 0
Edit: Where "tally" is a standard tally table; I also named the column "tally". Naturally change that to match your own CTE or physical tally table.
SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!
May 2, 2015 at 3:10 am
For fun, yet another solution
😎
USE tempdb;
GO
SET NOCOUNT ON;
IF OBJECT_ID('dbo.TBL_SAMPLE_PARSE') IS NOT NULL
DROP TABLE dbo.TBL_SAMPLE_PARSE
CREATE TABLE dbo.TBL_SAMPLE_PARSE
(
SP_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TBL_SAMPLE_PARSE_SP_ID PRIMARY KEY CLUSTERED
,MyString VARCHAR (8000) NOT NULL
)
INSERT dbo.TBL_SAMPLE_PARSE (MyString)
VALUES ('William Shakespeare was an English [##poet##], [##playwright##], and [##actor##], widely regarded as the greatest [##writer##] in the English language and the world''s pre-eminent [##dramatist##]. He is often called England''s national [##poet##] and the "Bard of Avon". His extant works, including some collaborations, consist of about 38 plays, 154 [##sonnets##], two long narrative [##poems##], and a few other [##verses##], of which the authorship of some is uncertain. His plays have been translated into every major living language and are performed more often than those of any other [##playwright##].')
;WITH T(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))
,BASE_DATA AS
(
SELECT
SP.SP_ID
,CHARINDEX('[##',SP.MyString,NM.N) AS POS_FROM
,CHARINDEX('##]',SP.MyString,NM.N) AS POS_TO
,SP.MyString
FROM dbo.TBL_SAMPLE_PARSE SP
CROSS APPLY
(
SELECT TOP (LEN(SP.MyString))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
FROM T T1,T T2,T T3,T T4
) AS NM
WHERE CHARINDEX('[##',SP.MyString,NM.N) > 0
AND
(
NM.N = CHARINDEX('[##',SP.MyString,NM.N)
OR
NM.N = CHARINDEX('##]',SP.MyString,NM.N)
)
)
SELECT DISTINCT
BD.SP_ID
,SUBSTRING(BD.MyString,BD.POS_FROM,(BD.POS_TO-BD.POS_FROM) + 3) AS TOKEN_STR
FROM BASE_DATA BD
WHERE BD.POS_TO > BD.POS_FROM;
Results
SP_ID TOKEN_STR
----------- -----------------
1 [##actor##]
1 [##dramatist##]
1 [##playwright##]
1 [##poems##]
1 [##poet##]
1 [##sonnets##]
1 [##verses##]
1 [##writer##]
October 20, 2015 at 12:40 pm
Thanks for your valuable inputs, @pietlinden/@Luis Cazares/@ScottPletcher/@Eirikur Eiriksson
SQLCurious
October 20, 2015 at 1:48 pm
Using Eirkur's sample data and the aforementioned splitter, you could do this:
UPDATE dbo.TBL_SAMPLE_PARSE
SET MyString = REPLACE(REPLACE(MyString,',',' '),'.',' ')
SELECT DISTINCT Item
FROM TBL_SAMPLE_PARSE
CROSS APPLY dbo.delimitedSplit8K(MyString, ' ')
WHERE item LIKE '%[##%##]%' ESCAPE '[';
Edit: fixed one minor typo in my code.
-- Itzik Ben-Gan 2001
Viewing 8 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply