Finding a substring and extracting a number associated with it

  • quentin.harris (7/4/2016)


    Awesome.

    It works. Now have to get my head round it what it actually all does.

    Thank you!

    well to start with, I suggest that you run this

    SELECT i.rowid, i.TextString, s.ItemNumber, s.Item

    FROM #input i

    CROSS APPLY dbo.DelimitedSplit8K(i.TextString,' ') s

    then you can start to see how the rest builds up.

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Thank you so much for your time.

  • You're welcome, Quentin.

    I apologize for not posting a link to delimitedsplit8k. It is a tool which should be in everyone's toolkit. I should have posted more of an explanation as well. I was just ready to start my weekend. Also sorry for using LAG and TRY_CONVERT. I'm running on 2012, obviously.

    Here is the basic logic of my original solution.

    1. CTE: Read each row and use DelimitedSplit8k to parse it into individual strings.

    2. CTE2: Use the CASE statement to identify when a string is like 'Admission%' or 'Discharge%, and if so, store the preceding element (LAG) in the CountX column. If a string is a valid date, store it in the DateX column.

    3: The final query crosstabs the individual rows into columns on one row. The Group By Rowid keeps data for each original line together. The where clause eliminates strings that weren't significant.

    J Livingston: Sorry about the white space issue. It lines up fine in my editor. Remember that perfect speed is being there. šŸ˜‰

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Just realized what you meant by the white space issue. The unformatted lines might have more than one space between the number and the word "Admission(s)" or "Discharge(s)". Adding a WHERE clause to cte2 fixes this.

    ,cte2 as(select rowID, Item

    , casewhen Item like 'Admission%' or Item Like 'Discharge%'

    then LAG(item,1,null) over(order by rowid,itemnumber) else null end as CountX

    , TRY_CONVERT(date, replace(item,'.','')) as DateX

    from cte

    where item > '')

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thank you. I will see if i can place the new code in the right place and then run.

    Q

  • Hello

    Not sure if Iā€™m missing something but could you not simplify it to :

    select * into #input

    from (values (1,'There have been 5 admissions and 3 discharges since 04/14/2016.')

    ,(2,'There has been 1 admission and 2 discharges since 04/22/2016.')

    ,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')

    ,(4,'There have been 2 discharges and 2 admissions since 05/15/2016.')

    ,(5,'There have been 2 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);

    select * from #input;

    SELECT ip.Rowid

    , SUBSTRING(ip.TextString, PATINDEX('%[0-9][0-9]/[0-9][0-9]/20%',ip.TextString), 10) AS [Date]

    -- With the assumption you may have up to 9 of each

    , SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Admissions

    , SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Discharges

    -- With the assumption you may have up to 999 of each

    , REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Admissions

    , REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Discharges

    FROM #input AS ip

  • Black Robin (7/5/2016)


    Hello

    Not sure if Iā€™m missing something but could you not simplify it to :

    select * into #input

    from (values (1,'There have been 5 admissions and 3 discharges since 04/14/2016.')

    ,(2,'There has been 1 admission and 2 discharges since 04/22/2016.')

    ,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')

    ,(4,'There have been 2 discharges and 2 admissions since 05/15/2016.')

    ,(5,'There have been 2 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);

    select * from #input;

    SELECT ip.Rowid

    , SUBSTRING(ip.TextString, PATINDEX('%[0-9][0-9]/[0-9][0-9]/20%',ip.TextString), 10) AS [Date]

    -- With the assumption you may have up to 9 of each

    , SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Admissions

    , SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3) AS Discharges

    -- With the assumption you may have up to 999 of each

    , REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('admiss',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Admissions

    , REPLACE(REPLACE(REPLACE(REPLACE(SUBSTRING(REPLACE(ip.TextString, ' ', ''), CHARINDEX('dischar',REPLACE(ip.TextString, ' ', ''))-3, 3), 'n', ''), 'e', ''), 'd', ''), 's', '') AS Discharges

    FROM #input AS ip

    there are no doubt numerous ways to solve the OP question, particlulary when there appears to be "ALWAYS" a definitive way the string is constructed....(pls see earlier posts in this thread)......however, I was interested to see what I could come up with that deliverd the requried results, when the data in the string was a little "unstructred"....was only for my own benefit and a little bit of fun <grin>

    here is the data I played around with...and yes, there are permutations that my code will fall over on....but I have already said that .

    SELECT *

    INTO #input

    FROM (values

    ( 1, 'There were 5 admissions and 3 discharge since reporting 22/04/2016'),

    ( 2, 'today here were 6 discharges and 3 admission since reporting 21/04/2016'),

    ( 3, 'since reporting 30/04/2016 the discharges are 8 and the admissions are 10' ),

    ( 4, ' the discharges are 20 and the admissions are 30 on 10/05/2016 '),

    ( 5, ' the admission was 100 on 15/05/2016 and the discharge was 30 '),

    ( 6, ' on 15/08/2016 5 admissions and discharge 10' ),

    ( 7, ' 100 admission 200 discharges 25/12/2018' ),

    ( 8, ' discharge 500 on 01/01/2020 and 100 admission '),

    ( 9, ' discharge 1 2 admission 01/01/2020' ),

    ( 10, ' 20180101 100 admission 200 discharges' ),

    ( 11, '20160101 admission 1 2 discharges '),

    ( 12, 'here are the results up to 12/12/12 Admission 300 400 Discharge ' ) ,

    ( 13, ' Discharges 9000 20/May/2015 8000 Admissions ' ) ,

    ( 14, ' Discharges 99999 88888 Admissions 161216' ),

    ( 15, ' Discharges I think were 99 and maybe 88 possibly less Admissions on 2016-05-15' )

    )

    dt (rowID,TextString)

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Robin, you can absolutely do it your way. As JLS said, there are many ways to solve a problem. There are always tradeoffs though. For example, your code assumes that the date will always use slashes as separators (M/D/Y). The Try_Convert function will catch any valid (single-string) date format, but will not trap invalid dates.

    Your solution would in fact run faster than mine by about a second for every 10,000 rows, so for the problem as stated I would adopt it for production, because, hey, faster is faster.

    JLS: I want to try to solve your new input set when time allows.

    Fun problem, but again GET THE INPUTS STANDARDIZED.... this whole exercise should be unnecessary.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (7/5/2016)


    JLS: I want to try to solve your new input set when time allows.

    look fwd to the "challenge" Dixie....good fun me thinks

    just remember as per OP previous post .....

    I will try to adapt what has been posted to suit our sandpit environment where i have discovered that some functions are not available (LAG and TRY_CONVERT).

    Seems 2012 is not used for the sandpit which limits things.

    game on <grin>

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Another approach would be to split the text into word-level bi-grams using NGrams8K[/url]. See the article for the function and a description of what a word-level bi-gram is.

    Using Ngrams8K you can create the following functions:

    CREATE FUNCTION dbo.WBIGrams(@string varchar(8000))

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    /*

    SELECT TokenNumber, Token

    FROM dbo.WBIGrams('There have been 5 admissions and 3 discharges since 04/14/2016.')

    */

    WITH

    delim(RN,N) AS -- locate all of the spaces in the string

    (

    SELECT 0,0 UNION ALL

    SELECT ROW_NUMBER() OVER (ORDER BY position), position

    FROM dbo.NGrams8k(RTRIM(LTRIM(@string)),1)

    WHERE token = CHAR(32)

    ),

    tokens(tokenNumber, tokenStartPosition, token, tokenCount) AS -- Create the tokens

    (

    SELECT

    N1.RN+1,

    N1.N+1,

    SUBSTRING(@string, N1.N+1, ISNULL(N2.N-(N1.N+1), 8000)),

    (SELECT COUNT(*)-1 FROM delim)

    FROM delim N1

    LEFT JOIN delim N2 ON N2.RN = N1.RN+2

    )

    SELECT t.tokenNumber, tokenStartPosition, t.token

    FROM tokens t

    WHERE tokenNumber <= tokenCount;

    GO

    CREATE FUNCTION dbo.WhiteSpaceCleanup(@String varchar(max))

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    /*

    DECLARE

    @string varchar(8000) = 'ABC 123'+CHAR(13)+CHAR(10)+'xxx xxx xx xxx xx';

    SELECT * FROM dbo.WhiteSpaceCleanup(@string);

    --Results: ABC 123 xxx xxx xx xxx xx

    */

    SELECT NewString =

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(

    LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(REPLACE(

    @String,CHAR(9),' '),CHAR(10),' '),CHAR(13),' '),CHAR(160),' '))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ');

    First for updated sample data:

    select * into #input

    from (values

    (1,'There''s been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.')

    ,(2,'There has been 11 admission and 2 discharges since 04/22/2016.')

    ,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')

    ,(4,'There have been 2 discharges 50 admissions admissions since 05/15/2016.')

    ,(5,'There have been 16 discharges and 1 admission since 05/24/2016.')) dt (Rowid,TextString);

    ALTER TABLE #input ALTER COLUMN Rowid int NOT NULL;

    ALTER TABLE #input ADD CONSTRAINT pk_input PRIMARY KEY CLUSTERED (Rowid);

    Here's a few queries to understand my solution. This first query turns tabs, line feeds, carriage returns into spaces then removes duplicate spaces.

    -- (1) Remove whitespace

    SELECT RowID, NewString

    FROM #input i

    CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString);

    RowID NewString

    ------ -------------------------------------------------------------------------------------------------

    1 There's been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.

    2 There has been 11 admission and 2 discharges since 04/22/2016.

    3 There have been 2 admissions and 1 discharge since 04/29/2016.

    ...

    This second query builds off the first to "extract" the admissions and discharges

    -- (2) Get the date and the bigrams that contain the admissions or discharges

    WITH cleanedStrings AS

    (

    SELECT RowID, NewString

    FROM #input i

    CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)

    )

    SELECT

    RowID,

    token,

    ActivityDate = SUBSTRING(NewString,

    PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)

    from cleanedStrings i

    CROSS APPLY dbo.WBIGrams(i.NewString)

    WHERE token LIKE '[0-9]%'

    AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%');

    RowID token ActivityDate

    ----------- --------------- ---------------

    1 5 admissions 04/14/2016

    1 200 admissions, 04/14/2016

    1 3 discharges 04/14/2016

    2 11 admission 04/22/2016

    2 2 discharges 04/22/2016

    3 2 admissions 04/29/2016

    3 1 discharge 04/29/2016

    ...

    And here's the final solution:

    WITH

    cleanedStrings AS -- Let's first remove duplicate spaces, line feeds, tabs and carriage returns

    (

    SELECT RowID, NewString

    FROM #input i

    CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)

    )

    select

    i.RowID,

    Admission = SUM(CASE WHEN token LIKE '%admission%'

    THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),

    Discharge = SUM(CASE WHEN token LIKE '%discharge%'

    THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),

    ActivityDate = MAX

    (

    SUBSTRING(NewString,

    PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)

    )

    FROM cleanedStrings i

    CROSS APPLY dbo.WBIGrams(i.NewString)

    WHERE token LIKE '[0-9]%'

    AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%')

    GROUP BY i.Rowid;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hi Alan..have been following your recent scripts and am suitably impressed.

    its now

    bed o'clock here and busy tomorrow...but hoping I can work through and understand your code later tomorrow.

    JLS

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (7/5/2016)


    Hi Alan..have been following your recent scripts and am suitably impressed.

    Thanks a lot J. That means a lot!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hi,

    Thank you all. I am a little overwhelmed, in a very nice way, with all the help and new SQL to pick apart.

    You are an awesome crowd.

    šŸ˜€

    Q

  • J Livingston SQL (7/5/2016)


    Hi Alan..have been following your recent scripts and am suitably impressed.

    its now

    bed o'clock here and busy tomorrow...but hoping I can work through and understand your code later tomorrow.

    JLS

    Hi Alan

    have had a play with your code (I see its been marked as answer)

    as I have previously said....I was looking at expanding the solution to cover other permutations.

    I maybe be mistaken.....but please take a look at additional sample data in your code below.....doesnt deliver for me.

    thoughts?

    select * into #input

    from (values

    (1,'There''s been 5 admissions then another another 200 admissions, and 3 discharges since 04/14/2016.')

    ,(2,'There has been 11 admission and 2 discharges since 04/22/2016.')

    ,(3,'There have been 2 admissions and 1 discharge since 04/29/2016.')

    ,(4,'There have been 2 discharges 50 admissions admissions since 05/15/2016.')

    ,(5,'There have been 16 discharges and 1 admission since 05/24/2016.')

    -- additional rows

    ,( 10, ' 20180101 100 admission 200 discharges' )

    ,( 11, '20160101 admission 1 and 2 discharges ')

    ) dt (Rowid,TextString);

    ALTER TABLE #input ALTER COLUMN Rowid int NOT NULL;

    ALTER TABLE #input ADD CONSTRAINT pk_input PRIMARY KEY CLUSTERED (Rowid);

    WITH

    cleanedStrings AS -- Let's first remove duplicate spaces, line feeds, tabs and carriage returns

    (

    SELECT RowID, NewString

    FROM #input i

    CROSS APPLY dbo.WhiteSpaceCleanup(i.TextString)

    )

    select

    i.RowID,

    Admission = SUM(CASE WHEN token LIKE '%admission%'

    THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),

    Discharge = SUM(CASE WHEN token LIKE '%discharge%'

    THEN CAST(LEFT(token,CHARINDEX(' ', token)) AS int) END),

    ActivityDate = MAX

    (

    SUBSTRING(NewString,

    PATINDEX('%[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', NewString), 10)

    )

    FROM cleanedStrings i

    CROSS APPLY dbo.WBIGrams(i.NewString)

    WHERE token LIKE '[0-9]%'

    AND (token LIKE '%[0-9] admission%' OR token LIKE '%[0-9] discharge%')

    GROUP BY i.Rowid;

    drop table #input

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Hey JLS, I think I have a solution, but for some strange reason the website won't let me post or upload the code if I spell out TRY_CONVERT more than once.

    Do a search_replace on the code below and let me know.

    Key points:

    1. Expanded rules for what can be considered a date string.

    2. Used PATINDEX to determine if admissions came before discharges in CTE ([AdmissionFirst])

    3. Used ROW_NUMBER() in CTE2 to make sure that DateX was always the last row for a given RowID and that the order of integers in [textstring] was preserved.

    4. Used CASE statements in the final query to decide which integers were used for admissions and which for discharges, based on the value of the AdmissionFirst flag

    ---------------------------------------------------------------------------------------------------------------------------------------------------------

    -- Search/replace "try_C(" to the correct function

    ---------------------------------------------------------------------------------------------------------------------------------------------------------

    with cte as (select rowID, itemnumber,convert(varchar(50),item) as item

    , case when PATINDEX('%Admission%Discharge%',TextString) > 0 then 'Y' else '' end as AdmissionFirst

    , case when item like '%/%/%' or (try_C(date,Item) is not null and len(item) > 5) then item end as DateX

    from #input i

    cross apply DelimitedSplit8K(TextString, ' ')

    )

    ,cte2 as (select *, ROW_NUMBER() over(Partition by RowID order by rowID,datex, ItemNumber) as RowNum

    from cte

    where item > ''

    and (try_C(int,Item) is not null

    or Item like '%/%/%'

    or try_C(date,Item) is not null ))

    select RowID

    ,max(case when AdmissionFirst = 'Y' and RowNum = 1 then Item

    when AdmissionFirst <> 'Y' and RowNum = 2 then Item

    end) as Admissions

    ,max(case when AdmissionFirst = 'Y' and RowNum = 2 then Item

    when AdmissionFirst <> 'Y' and RowNum = 1 then Item

    end) as Discharges

    ,max(DateX) as DateX

    from cte2

    group by RowID

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply