Finding related records within the same table

Question

Finding related records within the same table

webskater

SSCarpal Tunnel

Points: 4214
More actions
August 4, 2016 at 8:11 am

#310219

CREATE TABLE #tblStakeholders
(
StakeholderID int,
StageID int,
StakeholderWasID int
)
GO
INSERT INTO #tblStakeholders (StakeholderID, StageID, StakeholderWasID)
SELECT 1, 1, 0 UNION ALL
SELECT 2, 1, 0 UNION ALL
SELECT 3, 1, 0 UNION ALL
SELECT 4, 2, 1 UNION ALL
SELECT 5, 2, 2 UNION ALL
SELECT 6, 2, 3 UNION ALL
SELECT 7, 3, 4 UNION ALL
SELECT 8, 3, 5 UNION ALL
SELECT 9, 3, 6
GO
CREATE TABLE #tblStakeholdersNotes
(
NoteID int,
StakeholderID int,
Note varchar(50)
)
GO
INSERT INTO #tblStakeholdersNotes (NoteID, StakeholderID, Note)
SELECT 1, 1, 'Note on Stage 1 for Stakeholder 1' UNION ALL
SELECT 2, 4, 'Note on Stage 2 for Stakeholder 1' UNION ALL
SELECT 3, 7, 'Note on Stage 3 for Stakeholder 1' UNION ALL
SELECT 4, 2, 'Note on Stage 1 for Stakeholder 2' UNION ALL
SELECT 5, 5, 'Note on Stage 2 for Stakeholder 2' UNION ALL
SELECT 5, 8, 'Note on Stage 3 for Stakeholder 2'
--work up to the first StakeholderID
DECLARE @StakeholderID int
;WITH rCTE(StakeholderID, StageID, StakeholderWasID, Sort) AS
(
SELECT StakeholderID, StageID, StakeholderWasID, 1
FROM #tblStakeholders
WHERE StakeholderID = 4
UNION ALL
SELECT e.StakeholderID, e.StageID, e.StakeholderWasID, 1 + Sort
FROM #tblStakeholders e
INNER JOIN rCTE c ON e.StakeholderID = c.StakeholderWasID
)
SELECT TOP 1 @StakeholderID = StakeholderID FROM rCTE ORDER BY Sort DESC
;WITH rCTE(StakeholderID, StageID, StakeholderWasID, Sort) AS
(
SELECT StakeholderID, StageID, StakeholderWasID, 1
FROM #tblStakeholders
WHERE StakeholderID = @StakeholderID
UNION ALL
SELECT e.StakeholderID, e.StageID, e.StakeholderWasID, 1 + Sort
FROM #tblStakeholders e
INNER JOIN rCTE c ON e.StakeholderWasID = c.StakeholderID
)
SELECT * FROM rCTE
SELECT * FROM #tblStakeholders
SELECT * FROM #tblStakeholdersNotes
DROP TABLE #tblStakeholders
DROP TABLE #tblStakeholdersNotes
In the Stakeholders table above, Stakeholder IDs 1, 4 and 7 are the same Stakeholder. They are related because the StakeholderWasID for StakeholderID 4 is 1 - and the StakeholderWasID for StakeholderID 7 is 4.
So, let's say I want to see the notes for StakeholderID 4. I need to return the notes from tblStakeholdersNotes for StakeholderIDs 1,4 and 7.
The cte code I have posted above gets me the dataset I need - but I have to use two Common Table Expressions - one to go up' the table looking for the first record related to the one I am working on. Then, another, to find all the records related 'down' from there. It works, but it feels clunky. Is there a better way of doing this? (The data structure I have is not my choice, I am lumbered with it following a last minute, radical change to how they want things done).

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

drew.allen SSC Guru Points: 76776 More actions · Answer 1

Your rCTE should start with the original stakeholders, instead of starting with the stakeholder that you are interested in, because you only need to traverse the tree in one direction that way. Then you will need to find the original stakeholder for the stakeholder you are interested in and filter on that.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 2

Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).

Jon-183221 SSCommitted Points: 1903 More actions · Answer 3

Hi,

Assume this is something that is live and you are trying to work around?

Obviously only seeing a snippet but if Stakeholder was a table and Stage was a table you could join the records and Notes could run off stakeholder. Therefore stakeholder id would never change, they would just get assigned to different stages and all notes would be held against one record.

In saying that, and knowing you probably have no control over the design now, will have a look and see what can be done.

Jon

drew.allen SSC Guru Points: 76776 More actions · Answer 4

sku370870 (8/5/2016)
Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).

Original stakeholders have StakeholderWasID = 0.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 5

drew.allen (8/5/2016)
sku370870 (8/5/2016)
Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).
Original stakeholders have StakeholderWasID = 0.
Drew

Yes, but if I have a StakeholderID of, say, 4 - I don't know which StakeholderID that has a StakeholderWasID of zero is the 'parent' of StakeholderID 4 - without working up the table to find out.

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 6

No, I am afraid it is too late for a proper data structure. The original requirement was that Stakeholders would only be on one stage - and that where displayed on other stages, the data would be fetched from the single stage they were on and displayed subject to various conditions. Then it was decided that Stakeholders would be added to the first stage - and then copied forwards to each successive stage as users felt the stage was complete ... so that Stakeholders added from outside the chronological process could only be added to the stage currently being worked on. As the StakeholderID is the Primary Key of the Stakeholders table, to copy stakeholders would have meant creating another unique ID for each Stakeholder and copying that from Stage to Stage ... but dozens of procedures reference the existing PK ... so I decided to simply copy the Stakeholders from one stage to another and link them by tracking their StakeholderID in StakeholderIDWas.

The Stakeholders only really exist in the Stakeholders table - in that it is not a simple list. A Stakeholder can be an organisation, a contact, a contact linked to an organisation they don't work for, an ad-hoc contact not stored in the system and virtual contacts where the only details we have are whatever someone enters on a web site where they register interest.

To make life more interesting this is a already a complete rewrite of an existing system and this has to be kept working too - again limiting the scope for major data structure changes.

drew.allen SSC Guru Points: 76776 More actions · Answer 7

sku370870 (8/5/2016)
drew.allen (8/5/2016)
sku370870 (8/5/2016)
Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).
Original stakeholders have StakeholderWasID = 0.
Drew
Yes, but if I have a StakeholderID of, say, 4 - I don't know which StakeholderID that has a StakeholderWasID of zero is the 'parent' of StakeholderID 4 - without working up the table to find out.

Look, I know it can be done, because I wrote a query that does it. I'm trying to show you the path to take to get there, without handing it to you on a platter, but you refuse to even consider what I've said.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Hugo Kornelis SSC Guru Points: 64748 More actions · Answer 8

First of all, I never buy the "we cannot change the data structure anymore" argument. The only argument I buy is that it would be too expensive, and the couterargument is an estimate of the cost of development plus the cost of future maintenance for the current data structure. (I have read that the cost of maintenance typically is about 80-90% of the total lifecost of any application - so typically, "quick-win" decision that come back to bite you later tend to continue biting for a long time, and contiue biting hard.

Also note that even when this table is really cast in stone, you can still add additional tables. The data here is clearly a hierarchy and there are lots of good ways to represent hierarchies. The most common are Adjacecncy List (what you have now), Nested Sets, and Materialized Path[/url]. It can sometimes be convenient to store both, with some mechanism (i.e. triggers or periodic recomputation) to keep them synchronized.

But let's not talk too much about this messy database design, I get a headache just trying to understand how three different stakeholderID values can all be the same stakeholder and I refuse to submit myself to that headache.

The solution Drew has in mind is absolutely elegant. If you read his posts you'll understand what he means: his suggestion is to write logic to convert the adjacency list you have to a materialized path for all rows, then use those materialized paths to find the top-level and then use that same materialized path to find all stakeholders with the same top-level. I do not know if this is faster than what you have now, that depends on a lot of factors. Most of all the amount of data in total vs the amount of data that you'll end up selecting. If there are millions of stakeholders and the tree is only a few levels deep, then you are trading a solution that walks a few steps in both directions but only accesses the rows needed for a solution that walks the entire data set in a single direction. On the other hand, if there are only a few separate hierarchies in the table and they are very deep, then traversing all of them once will probably be faster than walking up and down a single one.

If performance is relevant for your situation, then you will have to test both to find out. If maintainability is the most important factor, then go with the one you feel is easiest to understand and maintain for you and your successor as long as it gives at least the minimum required performance.

However, there is a more serious issue with this post. Unless I misread something, there is an inconsistency between your written explanation and the current code. And there is also a grey area.

The inconsistency can be seen by adding twoi rows (10, 3, 4) and (11, 2, 1) to the table. ID 10 is also a direct descendant of 4 so I assume (based on both current code and description) that this one, too, must be included. But what about ID 11? This is not a direct descendant of 4, but it's sibling - a child of the same parent. Your written description suggest that this one should not be selected. But your code will select it. Which one is correct? You (or your end user) will have to be the judge of that. And then you should definitely ensure that a case such as this occurs in your collection of test data.

The grey area can be seen by adding yet another row: (4, 2, 2). Now ID 4 has two parent IDs - it already had ID 1 as a parent and we added ID 2. How should this influence the end results?

(And then if you really want to make your head hurt, think about what happens if I introduce a loop in the table - e.g. by changing the (1, 1, 0) row to read (1, 1, 9). What should happen? What will the code actually do?)

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

drew.allen SSC Guru Points: 76776 More actions · Answer 9

Hugo is right that my approach may have a lot of reads, but there is a modification to my approach that will have many fewer reads, but it first requires getting to the point where you have my original approach.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 10

sku370870 (8/4/2016)
CREATE TABLE #tblStakeholders ( StakeholderID int, StageID int, StakeholderWasID int ) GO INSERT INTO #tblStakeholders (StakeholderID, StageID, StakeholderWasID) SELECT 1, 1, 0 UNION ALL SELECT 2, 1, 0 UNION ALL SELECT 3, 1, 0 UNION ALL SELECT 4, 2, 1 UNION ALL SELECT 5, 2, 2 UNION ALL SELECT 6, 2, 3 UNION ALL SELECT 7, 3, 4 UNION ALL SELECT 8, 3, 5 UNION ALL SELECT 9, 3, 6 GO CREATE TABLE #tblStakeholdersNotes ( NoteID int, StakeholderID int, Note varchar(50) ) GO INSERT INTO #tblStakeholdersNotes (NoteID, StakeholderID, Note) SELECT 1, 1, 'Note on Stage 1 for Stakeholder 1' UNION ALL SELECT 2, 4, 'Note on Stage 2 for Stakeholder 1' UNION ALL SELECT 3, 7, 'Note on Stage 3 for Stakeholder 1' UNION ALL SELECT 4, 2, 'Note on Stage 1 for Stakeholder 2' UNION ALL SELECT 5, 5, 'Note on Stage 2 for Stakeholder 2' UNION ALL SELECT 5, 8, 'Note on Stage 3 for Stakeholder 2' --work up to the first StakeholderID DECLARE @StakeholderID int ;WITH rCTE(StakeholderID, StageID, StakeholderWasID, Sort) AS ( SELECT StakeholderID, StageID, StakeholderWasID, 1 FROM #tblStakeholders WHERE StakeholderID = 4 UNION ALL SELECT e.StakeholderID, e.StageID, e.StakeholderWasID, 1 + Sort FROM #tblStakeholders e INNER JOIN rCTE c ON e.StakeholderID = c.StakeholderWasID ) SELECT TOP 1 @StakeholderID = StakeholderID FROM rCTE ORDER BY Sort DESC ;WITH rCTE(StakeholderID, StageID, StakeholderWasID, Sort) AS ( SELECT StakeholderID, StageID, StakeholderWasID, 1 FROM #tblStakeholders WHERE StakeholderID = @StakeholderID UNION ALL SELECT e.StakeholderID, e.StageID, e.StakeholderWasID, 1 + Sort FROM #tblStakeholders e INNER JOIN rCTE c ON e.StakeholderWasID = c.StakeholderID ) SELECT * FROM rCTE SELECT * FROM #tblStakeholders SELECT * FROM #tblStakeholdersNotes DROP TABLE #tblStakeholders DROP TABLE #tblStakeholdersNotes
In the Stakeholders table above, Stakeholder IDs 1, 4 and 7 are the same Stakeholder. They are related because the StakeholderWasID for StakeholderID 4 is 1 - and the StakeholderWasID for StakeholderID 7 is 4.
So, let's say I want to see the notes for StakeholderID 4. I need to return the notes from tblStakeholdersNotes for StakeholderIDs 1,4 and 7.
The cte code I have posted above gets me the dataset I need - but I have to use two Common Table Expressions - one to go up' the table looking for the first record related to the one I am working on. Then, another, to find all the records related 'down' from there. It works, but it feels clunky. Is there a better way of doing this? (The data structure I have is not my choice, I am lumbered with it following a last minute, radical change to how they want things done).

Hi there, sku370870,

First of all, thank you for the very well laid out readily consumable data. It makes things pretty easy to know what's going on.

Shifting gears to the problem at hand, there might be a pretty easy way of doing this that won't feel clunky and stands the chance of being smokin' hot for performance. And, I don't see anything wrong with your data... it's an Adjacency List Hierarchy and there is a way to turn it into something "Set Based".

I do have a couple of questions, though, just to help me with design and setting up of a more substantial test to verify performance...

1. How many rows are in the Stake Holders Table?

2. How often is the data in the Stake Holders Table updated?

3. How many rows are in the Stake Holders Notes Table?

4. How often is the data in the Stake Holders Notes Table updated?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

drew.allen (8/5/2016)
sku370870 (8/5/2016)
drew.allen (8/5/2016)
sku370870 (8/5/2016)
Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).
Original stakeholders have StakeholderWasID = 0.
Drew
Yes, but if I have a StakeholderID of, say, 4 - I don't know which StakeholderID that has a StakeholderWasID of zero is the 'parent' of StakeholderID 4 - without working up the table to find out.
Look, I know it can be done, because I wrote a query that does it. I'm trying to show you the path to take to get there, without handing it to you on a platter, but you refuse to even consider what I've said.
Drew

Perhaps if you posted the code that you're talking about, the OP might consider it. Sometimes a silver platter helps a lot. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 12

Hugo Kornelis (8/5/2016)
First of all, I never buy the "we cannot change the data structure anymore" argument. The only argument I buy is that it would be too expensive, and the couterargument is an estimate of the cost of development plus the cost of future maintenance for the current data structure.

Heh... old gaming trick. If you can't move forward, backward, to the left, to the right, up, or down, what's the only move left?

Realize that you're in a box and change color. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 13

Jeff Moden (8/5/2016)

sku370870 (8/4/2016)
I do have a couple of questions, though, just to help me with design and setting up of a more substantial test to verify performance...
1. How many rows are in the Stake Holders Table?
2. How often is the data in the Stake Holders Table updated?
3. How many rows are in the Stake Holders Notes Table?
4. How often is the data in the Stake Holders Notes Table updated?

1. At the moment, about 20,000. Each new project sees about 500 added.

2. Every day

3. About 100,000

4. Every day

Cheers

webskater SSCarpal Tunnel Points: 4214 More actions · Answer 14

drew.allen (8/5/2016)
sku370870 (8/5/2016)
drew.allen (8/5/2016)
sku370870 (8/5/2016)
Thanks for your reply. But when I am on any particular stage, all I have to hand is the StakeholderID for that stage - so I can't start with the original stakeholder - I have to 'look up the table' to find the first stakeholder. I was hoping there was some way of doing the cte where you could go find the Stakeholders going up and then down - as sorting them would not be a problem as I have a DateAdded field I can sort on (as Stages may not be sequential as they can be added out of order by users if they need to do the same stage twice).
Original stakeholders have StakeholderWasID = 0.
Drew
Yes, but if I have a StakeholderID of, say, 4 - I don't know which StakeholderID that has a StakeholderWasID of zero is the 'parent' of StakeholderID 4 - without working up the table to find out.
Look, I know it can be done, because I wrote a query that does it. I'm trying to show you the path to take to get there, without handing it to you on a platter, but you refuse to even consider what I've said.
Drew

I'm sorry, but I don't understand how to establish which is the parent with a StakeholderWasID of zero - without using the cte that I posted. If you do, I'd be very pleased to see it.