Help getting last 2 entries from split string with variable number of splits

  • Hi y'all

    I have some 3rd party data with variable format as follows

    Type-Country-Company-Code-Identifier

    TT-CCC-AA-99-0000

    I am not interested in the Type, Country and Company parts.  Only the Code and Identifier is needed. However frequently the data is missing some of the data in this string,

    so it could be any of the following

    CCC-AA-99-0000

    -CCC-AA-99-0000

    TT-CCC-AA-0000

    -CCC-AA-0000

    The Identifier is always present and is numeric translatable to BIGINT

    The Code is sometimes present, and is numeric translatable to SMALLINT (max 65535)

    I want to extract the Code and Identifier from the string

    I have been experimenting with STRING_SPLIT - but this does not guarantee the order of splits,

    So I moved on to the Delimited8KSplit function (from this site - all credit to those who made this and shared it - thank-you)

    Here is what I have so Far.   It looks close

     

    DROP TABLE IF EXISTS #TmpData;
    CREATE TABLE #TmpData (
    Id INT NOT NULL IDENTITY(1,1),
    Name Varchar(20) NOT NULL,
    IdentificationString VARCHAR(30) )
    INSERT INTO #TmpData (Name, IdentificationString)
    VALUES
    ('Product1', 'PP-UK-MS-123-123456'),
    ('Product2', '-UK-MS-345-234567'),
    ('Product3', 'UK-MS-678-345678'),
    ('Product4', 'PP-UK-MS-123459'),
    ('Product5', '-UK-MS-234569'),
    ('Product6', 'UK-MS-234560'),
    ('Product7', '');

    SELECT T.Id, T.Name AS Product, Split.Item AS Identifer, MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
    FROM #TmpData AS T
    CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split
    WHERE ISNUMERIC(Split.Item) = 1
    GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
    HAVING MAX(Split.ItemNumber) = Split.ItemNumber
    ORDER BY T.Id


    -- Desired Output
    IdProduct Code Identifier
    1Product1123123456
    2Product2345234567
    3Product3678345678
    4Product4NULL123459
    5Product5NULL234569
    6Product6NULL234560
    7Product7NULLNULL

     

    • This topic was modified 3 years, 10 months ago by  tom 69406.
  • A bit crude but essentially need to find a way to reverse the item number

     

    DROP TABLE IF EXISTS #TmpData;
    CREATE TABLE #TmpData (
    Id INT NOT NULL IDENTITY(1,1),
    Name Varchar(20) NOT NULL,
    IdentificationString VARCHAR(30) )
    INSERT INTO #TmpData (Name, IdentificationString)
    VALUES
    ('Product1', 'PP-UK-MS-123-123456'),
    ('Product2', '-UK-MS-345-234567'),
    ('Product3', 'UK-MS-678-345678'),
    ('Product4', 'PP-UK-MS-123459'),
    ('Product5', '-UK-MS-234569'),
    ('Product6', 'UK-MS-234560'),
    ('Product7', '');

    ;with cte as
    (SELECT T.Id, T.Name AS Product, Split.Item AS Identifer, MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
    FROM #TmpData AS T
    CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split
    WHERE ISNUMERIC(Split.Item) = 1
    GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
    HAVING MAX(Split.ItemNumber) = Split.ItemNumber
    ), cte2 as
    (
    SELECT
    *, ROW_NUMBER() over (partition by id order by maxnum desc) as rn --add a rownumber in the descending order so its identifier as 1, code as 2
    from cte
    )
    select tmp.id, tmp.Name as Product, max(case when rn = 2 then Identifer end) AS Code, max(case when rn = 1 then Identifer end) as Identifier
    from #TmpData tmp
    left join cte2 c
    on tmp.Id = c.id
    group by tmp.id, tmp.name
    /*
    -- Desired Output
    IdProduct Code Identifier
    1Product1123123456
    2Product2345234567
    3Product3678345678
    4Product4NULL123459
    5Product5NULL234569
    6Product6NULL234560
    7Product7NULLNULL
    */
  • This may work.  Didn't really test it thoroughly.

    SELECT 
    T.Id,
    T.Name AS Product,
    REVERSE(Split.ItemValue) AS Identifer,
    Split.ItemNumber AS MaxNum,
    Split.ItemNumber
    FROM #TmpData AS T
    CROSS APPLY dbo.DelimitedSplit8K(REVERSE(T.IdentificationString), '-') Split
    WHERE Split.ItemNumber IN (1,2)
    AND Split.ItemValue LIKE '%[0-9]%'
    ORDER BY T.Id

     

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • This has less overhead, which may or not be a concern to you for this task, but I thought I'd post it just in case.

    ;WITH cte_find_last_2_dashes AS (
    SELECT *, CHARINDEX('-', REVERSE(IdentificationString)) AS last_dash,
    CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1) AS next_to_last_dash
    FROM #TmpData
    ),
    cte_get_last_2_strings AS (
    SELECT *, CASE WHEN next_to_last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) -
    next_to_last_dash + 2, next_to_last_dash - last_dash - 1) END AS next_to_last_string,
    CASE WHEN last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) -
    last_dash + 2, 10) END AS last_string
    FROM cte_find_last_2_dashes
    )
    SELECT
    Id, Name AS Product,
    CASE WHEN next_to_last_string LIKE '%[^0-9]%' THEN NULL ELSE next_to_last_string END AS Code,
    CASE WHEN last_string LIKE '%[^0-9]%' THEN NULL ELSE last_string END AS Identifier
    FROM cte_get_last_2_strings

     

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • ScottPletcher wrote:

    This has less overhead... 

    Just curious, Scott... how are you measuring "overhead" for this?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden wrote:

    ScottPletcher wrote:

    This has less overhead... 

    Just curious, Scott... how are you measuring "overhead" for this?

    Actually, for this one, I just looked at the query plans.  The other code is doing a lot of operations, including at least one sort.  They query plan for my code shows a single table scan with some computations.

    Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post.  Although I think any function invocation would be more overhead than just some simple calcs.

    • This reply was modified 3 years, 10 months ago by  ScottPletcher.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • ScottPletcher wrote:

    Jeff Moden wrote:

    ScottPletcher wrote:

    This has less overhead... 

    Just curious, Scott... how are you measuring "overhead" for this?

    Actually, for this one, I just looked at the query plans.  The other code is doing a lot of operations, including at least one sort.  They query plan for my code shows a single table scan with some computations.

    Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post.  Although I think any function invocation would be more overhead than just some simple calcs.

    On this type of thing, I totally agree. I'm playing with some things (including your code) on a typical "Million Row Table" I built to test performance.  My personal opinion is that the code you wrote is going to be very difficult to beat for performance and can probably only be tied at best.  Things like DelimitedSplit8k don't stand a chance.

    I'm also in the process of finding out just how expensive CROSS APPLY and ISNULL can be.  I knew there was a little cost but had no idea that (for example) ISNULL was so bloody expensive (just adding two doubled the CPU usage).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks everyone.

    As I'm developing under a deadline, I went with the first answer.  And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet.  But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.

     

     

  • tom 69406 wrote:

    Thanks everyone.

    As I'm developing under a deadline, I went with the first answer.  And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet.  But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.

    I hope that the performance issues that you push off because of "deadlines" don't actually cause you to have more deadlines.  It's a common issue known as "Technical Debt".  It's a viscous circle that can only be broken one way... do it right the first time.

    Didn't mean for that to be a lecture and certainly not a slam, especially since you probably weren't the one that came up with the "deadline".  Just wanted you to know.  In one shop I worked in, everything was a struggle.  Once we adopted the philosophy that if anything could go wrong, it needed to be fixed even before it went to QA, rework plummeted to nearly zero and, oddly enough, taking the little bit of time to "do it right the first time" caused us to be a whole lot more productive because, it turned out, that rework took 8 times longer than doing it right and about a hundred times longer than the little bit of extra time we spent to do that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • WITH cteParse(Id, Name, IdentificationString)
    AS (
    SELECT Id,
    Name,
    REPLACE(RIGHT(IdentificationString, CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1)), '-', '.') AS IdentificationString
    FROM #TmpData
    )
    SELECT Id,
    Name,
    TRY_PARSE(PARSENAME(IdentificationString, 2) AS SMALLINT) AS Code,
    TRY_PARSE(PARSENAME(IdentificationString, 1) AS BIGINT) AS Identifier
    FROM cteParse;

    N 56°04'39.16"
    E 12°55'05.25"

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply