How to remove infinite loops in hierarchical data without using cursor?

  • Does anyone know a way to clean up data in a hierarchical table without using cursor?

    Here is the hierarchical table definition: NodeItem (NodeID int, ParentNodeID int) where ParentNodeID is also an entry in the same table.

    By walking up the hierarchy, I can show all the upper node items of a given node up to the root node level.

    The infinite loop problem comes up when there is data corruption, like ParentNodeID loops back to the starting NodeID after some steps. This data issue can be discovered easily by using the new CTE feature of SQL Server 2005 to display the hierarchical data all at once and setting the MAXRECURSION Option to a high value (like 50). If there is infinite loop, CTE will fail (or throw an exception).

    However, I'm struggling to identify the data involved in the infinite loop and clean them up. The only way I can think of is to put the entire NodeItem table into a cursor loop, and examine data row by row following the NodeID / ParentNodeID relationship in a temp table. The performance is quite bad because the table has over 1 million rows.

    My question: is there a non-cursor way to identify all the rows involved in infinite loop?

    Thanks.

  • There's always a non-cursor way, but without some sample data and sample code, you may never know that way...

    Copy in some sample data depicting your scenario it will help other readers determine how to help you.

    -

  • If I'm not wrong sample data may be like this...

    create table NodeItem

    (NodeID int, ParentNodeID int)

    go

    insert into NodeItem

    select 1 NodeID, null ParentNodeID union all

    select 11 , 1111 union all

    select 12 , 1 union all

    select 2 , null union all

    select 21 , 2 union all

    select 22 , 2 union all

    select 111 , 11 union all

    select 112 , 11 union all

    select 1111 , 111

    go

    select * from NodeItem

    go

    Close loop is created between these 11 >> 1111 >> 111 >> 11

    Expected output:

    11, 1111

    1111, 111

    111, 11

  • Use a while loop

  • I should have clarified that:

    (1) There may be many root nodes (root = dead end or no more parent entry found);

    (2) Not all paths will have the same depth (steps);

    (3) Infinite loop can be created in 0 or more steps

    0: A -> A;

    1: A -> B -> A;

    n: A -> B -> C -> ... -> A.

    Here are some sample data:

    create table NodeItem

    (NodeID int, ParentNodeID int)

    go

    insert into NodeItem

    select 1 NodeID, -1 ParentNodeID union all -- 1st root

    select 2 , NULL union all -- the 2nd root

    select 3 , -2 union all -- a 3rd root

    select 11 , 11 union all -- infinite loop: 0 step (11 -> 11)

    select 222 , 22 union all

    select 22 , 222 union all -- infinte loop: 1 step (22-> 222 -> 22)

    select 33 , 3 union all

    select 333 , 33

    go

    -- I use this CTE to discover whether there are infinite loop in data

    WITH HierarchyTree (NodeID, ParentNodeID, Step) AS

    (

    SELECT NodeID, ParentNodeID, 1

    FROM NodeItem

    UNION ALL

    SELECT r.NodeID, r.ParentNodeID, t.Step + 1

    FROM NodeItem r inner join HierarchyTree t on r.ParentNodeID = t.NodeID

    )

    SELECT Count(NodeID)

    From HierarchyTree

    Where Step = 50 -- a limit

    OPTION (HASH JOIN, MAXRECURSION 50);

    If the above CTE throws error, I know there are bad data. Since the table has millions of rows, looping row by row to find out whether a row is involved in infinite loop is a rather slow process. However, I haven't figured out a non-cursor way to list all those offending rows.

  • This should give you all of the invalid nodes

    WITH HierarchyTree (NodeID, ParentNodeID, Step) AS

    (

    SELECT NodeID, ParentNodeID, 1

    FROM NodeItem

    WHERE ParentNodeID = 0

    UNION ALL

    SELECT r.NodeID, r.ParentNodeID, t.Step + 1

    FROM NodeItem r

    inner join HierarchyTree t on r.ParentNodeID = t.NodeID

    )

    SELECT NodeID

    FROM NodeItem

    WHERE NodeID NOT IN (SELECT NodeID FROM HierarchyTree)

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Old Hand,

    Your CTE will not work in my instance because Root Nodes may not have ParentNodeID = 0. As I said, there might be many root nodes.

    Thanks anyway.

  • lxz20 (6/4/2009)


    Old Hand,

    Your CTE will not work in my instance because Root Nodes may not have ParentNodeID = 0. As I said, there might be many root nodes.

    Thanks anyway.

    So how do you identify a root node?

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Mark (6/4/2009)


    lxz20 (6/4/2009)


    Old Hand,

    Your CTE will not work in my instance because Root Nodes may not have ParentNodeID = 0. As I said, there might be many root nodes.

    Thanks anyway.

    So how do you identify a root node?

    I'll take a guess that root nodes don't have a valid parent, in which case this should work

    WITH HierarchyTree (NodeID, ParentNodeID, Step) AS

    (

    SELECT NodeID, ParentNodeID, 1

    FROM NodeItem

    WHERE ParentNodeID NOT IN (SELECT NodeID FROM NodeItem) OR ParentNodeID IS NULL

    UNION ALL

    SELECT r.NodeID, r.ParentNodeID, t.Step + 1

    FROM NodeItem r

    inner join HierarchyTree t on r.ParentNodeID = t.NodeID

    )

    SELECT NodeID

    FROM NodeItem

    WHERE NodeID NOT IN (SELECT NodeID FROM HierarchyTree)

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • The first thing to do is to select all rows where the parent ID is equal to the node ID. Eliminate/fix those, then add a constraint that blocks that from coming up again.

    Next is to do the same for ones where the child is the parent of the parent.

    After that, it's pretty much going to need to be a cursor.

    There are ways to prevent this kind of thing in the first place, but cleaning it up once it's already in there is a bit of a pain.

    One thing I always recommend on this kind of hierarchy, is have a "TopNodeID" type column. All levels in the same hierarchy have the same entry in that column. Speeds them up immensely. Makes it easy to prevent/fix these type of corruption issues too, since you can join on the TopNodeID column and find the loops right there. But since you don't have that, it doesn't help in this situation.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Old Hand,

    Your 2nd CTE still blew up with the following sample data:

    insert into NodeItem

    select 1 NodeID, -1 ParentNodeID -- 1st root

    union all select 2 , NULL -- the 2nd root

    union all select 3 , -2 -- a 3rd root

    union all select 11 , 11 -- infinite loop: 0 step (11 -> 11)

    union all select 22 , 2

    union all select 222 , 22

    union all select 22 , 222 -- infinte loop: 1 step (22-> 222 -> 22)

    union all select 2222 , 222

    union all select 33 , 3

    union all select 333 , 33

    union all select 3, 333 -- infinte loop: 2 steps (3 -> 333 -> 33 -> 3)

    go

    Since how data is put into this table is beyond my control, I'm wondering whether using cursor loop to clean up is the only way to go, like GSquared suggested.

  • You could potentially avoid a cursor in this. But it'll take more work and be less efficient.

    Cursors do have their uses. Limited, yes, but uses nonetheless.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • lxz20 (6/4/2009)


    Old Hand,

    Your 2nd CTE still blew up with the following sample data:

    insert into NodeItem

    select 1 NodeID, -1 ParentNodeID -- 1st root

    union all select 2 , NULL -- the 2nd root

    union all select 3 , -2 -- a 3rd root

    union all select 11 , 11 -- infinite loop: 0 step (11 -> 11)

    union all select 22 , 2

    union all select 222 , 22

    union all select 22 , 222 -- infinte loop: 1 step (22-> 222 -> 22)

    union all select 2222 , 222

    union all select 33 , 3

    union all select 333 , 33

    union all select 3, 333 -- infinte loop: 2 steps (3 -> 333 -> 33 -> 3)

    go

    Since how data is put into this table is beyond my control, I'm wondering whether using cursor loop to clean up is the only way to go, like GSquared suggested.

    NodeID 22 and 3 are repeated in your data, is that what you intended? These would be easy to identify and remove.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Yep, NodeIDs are repeated here as examples to show infinite loops in 0, 1, or 2 steps. These loops are actually easier to catch and removed. There are other instances where a node is NOT looped back to itself until like 7 or 10 steps later. I would like a non-cursor way to list / remove all those n step loops, if possible, for performance reason. Currently I have to use a cursor, going through row by row following the following pseudo-code. The performance is NOT good because I have to use 2 loops:

    Put all data into a cursor

    Begin cursor loop

      Fetch a row into a temp table

      Inner While Loop (while there is a new parent)

        Follow the chain to get a new parent row

        If parentNodeID = nodeID in temp table

          Infinite Loop -- do something here

          Break Inner While Loop

        Else

          Insert parent row into the same temp table

      End Inner While Loop

    End corsor loop

  • FYI, nesting loops is needed to list the offending data to send back to the other team so that they can clean it up at the source.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply