Help getting last 2 entries from split string with variable number of splits

Question

Post reply

Help getting last 2 entries from split string with variable number of splits

tom 69406

SSC-Addicted

Points: 446
More actions
December 14, 2020 at 11:39 am
Go to Answer
#3820225
Hi y'all
I have some 3rd party data with variable format as follows
Type-Country-Company-Code-Identifier
TT-CCC-AA-99-0000
I am not interested in the Type, Country and Company parts. Only the Code and Identifier is needed. However frequently the data is missing some of the data in this string,
so it could be any of the following
CCC-AA-99-0000
-CCC-AA-99-0000
TT-CCC-AA-0000
-CCC-AA-0000
The Identifier is always present and is numeric translatable to BIGINT
The Code is sometimes present, and is numeric translatable to SMALLINT (max 65535)
I want to extract the Code and Identifier from the string
I have been experimenting with STRING_SPLIT - but this does not guarantee the order of splits,
So I moved on to the Delimited8KSplit function (from this site - all credit to those who made this and shared it - thank-you)
Here is what I have so Far. It looks close
```
DROP TABLE IF EXISTS #TmpData;
CREATE TABLE #TmpData (
Id INT NOT NULL IDENTITY(1,1),
Name Varchar(20) NOT NULL,
IdentificationString VARCHAR(30) )
INSERT INTO #TmpData (Name, IdentificationString)
VALUES 
('Product1', 'PP-UK-MS-123-123456'),
('Product2', '-UK-MS-345-234567'),
('Product3', 'UK-MS-678-345678'),
('Product4', 'PP-UK-MS-123459'),
('Product5', '-UK-MS-234569'),
('Product6', 'UK-MS-234560'),
('Product7', '');

SELECT T.Id, T.Name AS Product, Split.Item AS Identifer, MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split      
WHERE ISNUMERIC(Split.Item) = 1
GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
HAVING MAX(Split.ItemNumber) = Split.ItemNumber
ORDER BY T.Id


-- Desired Output
IdProduct   Code  Identifier
1Product1123123456
2Product2345234567
3Product3678345678
4Product4NULL123459
5Product5NULL234569
6Product6NULL234560
7Product7NULLNULL
```
- This topic was modified 3 years, 10 months ago by tom 69406.

Ant-Green

SSC Guru

Points: 112965

December 14, 2020 at 12:04 pm

Answer

#3820234

A bit crude but essentially need to find a way to reverse the item number

DROP TABLE IF EXISTS #TmpData;
CREATE TABLE #TmpData (
Id INT NOT NULL IDENTITY(1,1),
Name Varchar(20) NOT NULL,
IdentificationString VARCHAR(30) )
INSERT INTO #TmpData (Name, IdentificationString)
VALUES 
('Product1', 'PP-UK-MS-123-123456'),
('Product2', '-UK-MS-345-234567'),
('Product3', 'UK-MS-678-345678'),
('Product4', 'PP-UK-MS-123459'),
('Product5', '-UK-MS-234569'),
('Product6', 'UK-MS-234560'),
('Product7', '');

;with cte as
(SELECT T.Id, T.Name AS Product, Split.Item AS Identifer, MAX(Split.ItemNumber) AS MaxNum, Split.ItemNumber
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(T.IdentificationString, '-') Split      
WHERE ISNUMERIC(Split.Item) = 1
GROUP BY T.Id, T.Name, Split.Item, Split.ItemNumber
HAVING MAX(Split.ItemNumber) = Split.ItemNumber
), cte2 as
(
SELECT
*, ROW_NUMBER() over (partition by id order by maxnum desc) as rn  --add a rownumber in the descending order so its identifier as 1, code as 2
from cte
)
select tmp.id, tmp.Name as Product, max(case when rn = 2 then Identifer end) AS Code, max(case when rn = 1 then Identifer end) as Identifier
from #TmpData tmp
left join cte2 c
on tmp.Id = c.id
group by tmp.id, tmp.name
/*
-- Desired Output
IdProduct   Code  Identifier
1Product1123123456
2Product2345234567
3Product3678345678
4Product4NULL123459
5Product5NULL234569
6Product6NULL234560
7Product7NULLNULL
*/

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Michael L John One Orange Chip Points: 26551 More actions · Answer 1

This may work. Didn't really test it thoroughly.

SELECT 
T.Id, 
T.Name AS Product, 
REVERSE(Split.ItemValue) AS Identifer, 
Split.ItemNumber AS MaxNum, 
Split.ItemNumber
FROM #TmpData AS T
CROSS APPLY dbo.DelimitedSplit8K(REVERSE(T.IdentificationString), '-') Split      
WHERE Split.ItemNumber IN (1,2)
AND Split.ItemValue LIKE '%[0-9]%'
ORDER BY T.Id

Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/

ScottPletcher SSC Guru Points: 100262 More actions · Answer 2

This has less overhead, which may or not be a concern to you for this task, but I thought I'd post it just in case.

;WITH cte_find_last_2_dashes AS (
    SELECT *, CHARINDEX('-', REVERSE(IdentificationString)) AS last_dash,
        CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1) AS next_to_last_dash 
    FROM #TmpData
),
cte_get_last_2_strings AS (
    SELECT *, CASE WHEN next_to_last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) - 
            next_to_last_dash + 2, next_to_last_dash - last_dash - 1) END AS next_to_last_string,
        CASE WHEN last_dash = 0 THEN NULL ELSE SUBSTRING(IdentificationString, LEN(IdentificationString) - 
            last_dash + 2, 10) END AS last_string
    FROM cte_find_last_2_dashes
)
SELECT 
    Id, Name AS Product,
    CASE WHEN next_to_last_string LIKE '%[^0-9]%' THEN NULL ELSE next_to_last_string END AS Code,
    CASE WHEN last_string LIKE '%[^0-9]%' THEN NULL ELSE last_string END AS Identifier
FROM cte_get_last_2_strings

SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

ScottPletcher wrote:

This has less overhead...

Just curious, Scott... how are you measuring "overhead" for this?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

ScottPletcher SSC Guru Points: 100262 More actions · Answer 4

Jeff Moden wrote:

ScottPletcher wrote:
This has less overhead...
Just curious, Scott... how are you measuring "overhead" for this?

Actually, for this one, I just looked at the query plans. The other code is doing a lot of operations, including at least one sort. They query plan for my code shows a single table scan with some computations.

Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post. Although I think any function invocation would be more overhead than just some simple calcs.

This reply was modified 3 years, 10 months ago by ScottPletcher.

SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 5

ScottPletcher wrote:

Jeff Moden wrote:
ScottPletcher wrote:
This has less overhead...
Just curious, Scott... how are you measuring "overhead" for this?
Actually, for this one, I just looked at the query plans. The other code is doing a lot of operations, including at least one sort. They query plan for my code shows a single table scan with some computations.
Edit: btw, I was comparing to the post marked as "Answer", not the subsequent post. Although I think any function invocation would be more overhead than just some simple calcs.

On this type of thing, I totally agree. I'm playing with some things (including your code) on a typical "Million Row Table" I built to test performance. My personal opinion is that the code you wrote is going to be very difficult to beat for performance and can probably only be tied at best. Things like DelimitedSplit8k don't stand a chance.

I'm also in the process of finding out just how expensive CROSS APPLY and ISNULL can be. I knew there was a little cost but had no idea that (for example) ISNULL was so bloody expensive (just adding two doubled the CPU usage).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

tom 69406 SSC-Addicted Points: 446 More actions · Answer 6

Thanks everyone.

As I'm developing under a deadline, I went with the first answer. And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet. But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 7

tom 69406 wrote:

Thanks everyone.
As I'm developing under a deadline, I went with the first answer. And I'm developing with limited data, and we are expecting no more than a few thousand at a time, and only a few times per week, so there is no performance issue - yet. But I've put a link to Scott's answer as a comment in the code, so if it becomes a problem, a faster alternative is there.

I hope that the performance issues that you push off because of "deadlines" don't actually cause you to have more deadlines. It's a common issue known as "Technical Debt". It's a viscous circle that can only be broken one way... do it right the first time.

Didn't mean for that to be a lecture and certainly not a slam, especially since you probably weren't the one that came up with the "deadline". Just wanted you to know. In one shop I worked in, everything was a struggle. Once we adopted the philosophy that if anything could go wrong, it needed to be fixed even before it went to QA, rework plummeted to nearly zero and, oddly enough, taking the little bit of time to "do it right the first time" caused us to be a whole lot more productive because, it turned out, that rework took 8 times longer than doing it right and about a hundred times longer than the little bit of extra time we spent to do that.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 8

WITH cteParse(Id, Name, IdentificationString)
AS (
        SELECT  Id,
                Name,
                REPLACE(RIGHT(IdentificationString, CHARINDEX('-', REVERSE(IdentificationString), CHARINDEX('-', REVERSE(IdentificationString)) + 1)), '-', '.') AS IdentificationString
        FROM    #TmpData
)
SELECT  Id,
        Name,
        TRY_PARSE(PARSENAME(IdentificationString, 2) AS SMALLINT) AS Code,
        TRY_PARSE(PARSENAME(IdentificationString, 1) AS BIGINT) AS Identifier
FROM    cteParse;

N 56°04'39.16"
E 12°55'05.25"