XQuery, output the integer that represents a singleton

  • Hi All,

    I'm currently working on shredding a large xml document (around half a GB) into relational format in order to allow easy comparison to other data I hold.

    However, there are a few issues with the structure of the document which make this slightly more complicated than I'd initally thought. (Unfortunately I have no control over the production of the document, it comes how it comes. 🙁 )

    Below is an example of the xml:-

    <ReportDetails>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Bob</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>

    <OtherName>Robert Smith</OtherName>

    <OtherName>Robert J Smith</OtherName>

    </OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>2</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>2</PersonDetailsCaseNo>

    <FirstName>Geoff</FirstName>

    <Surname>Jones</Surname>

    <Address>12 Tipton St, London</Address>

    <OtherNames>No Other Names</OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Susan</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>No Other Names</OtherNames>

    </PersonDetails>

    </ReportRequest>

    </ReportDetails>

    There are other elements as well such as case description etc, but not really relevant to the issue as the same solution here will be applicable to them.

    I'm currently extracting this information into various tables for each section of elements, one for the person details, one for the other names, one for the case details and so on, in order to move this to a relational format.

    However, the issue that I have is that it is possible for a case to be a joint case resulting, as shown above in the sample xml, with duplicated case numbers but with different details in the person section. This becomes especially complicated when it comes to shredding the Other Names section as it leaves me with no way in which to match the other names back to the correct node.

    I've tried including the xml fragment in the person table to allow subsequent querying of the other names section, but performance wise this ends up pretty much killing the server and turning a 20 minute shredding job into server meltdown! As such, I was wondering if it's possible to generate some sort of unique row id that will always be consistent from the xml document that I can pass into a header table and then use this when performing the subsequent shredding tasks. I know that if I run something like:-

    SELECT root.value('ReportRequest[3]/PersonDetails[1]/FirstName[1]', 'varchar(50)')

    FROM @xml.nodes('ReportDetails') root(root)

    then I will always get the third ReportRequest; but I'm not sure how to actually dynamically return the element number itself in a query if I was shredding the entire document to produce something along the following lines:-

    ElementID - CaseNo - FirstName - SurName

    1 - 1 - Bob - Smith

    2 - 2 - Geoff - Jones

    3 - 1 - Susan - Smith

    Which would then allow me to have a second OtherNames table which would be able to be joined to the first table along the lines of:-

    ElementID - CaseNo - OtherNames

    1 - 1 - Robert Smith

    etc

    Hope I've made all that clear, if not please ask and I'll try to expand further.

    Cheers in advance,

    Matthew

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Matthew Darwin (12/29/2010)


    I was wondering if it's possible to generate some sort of unique row id that will always be consistent from the xml document that I can pass into a header table and then use this when performing the subsequent shredding tasks. I know that if I run something like:-

    SELECT root.value('ReportRequest[3]/PersonDetails[1]/FirstName[1]', 'varchar(50)')

    FROM @xml.nodes('ReportDetails') root(root)

    then I will always get the third ReportRequest; but I'm not sure how to actually dynamically return the element number itself in a query if I was shredding the entire document

    Hi Matthew,

    From what I understand, this is in the XML specs, but not implemented in SQL Server.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Hi Wayne,

    Thanks for the response. Bit of a bummer then. 🙁 I'm trying to play around with ways in which to try and break each fragment down seperately but so far seems to be too heavy resource wise to be a viable solution.

    Cheers

    Matthew

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Superb! Having a little play on my small sample and does the trick perfectly!

    Going to try and apply it to the full document, will let you know how I get on.

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Just so it's clear, I thought I'd post the code that would populate the two tables. Works a charm, thanks very much!

    DECLARE @xml xml

    SET @xml =

    '<ReportDetails>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Bob</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>

    <OtherName>Robert Smith</OtherName>

    <OtherName>Robert J Smith</OtherName>

    </OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>2</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>2</PersonDetailsCaseNo>

    <FirstName>Geoff</FirstName>

    <SurName>Jones</SurName>

    <Address>12 Tipton St, London</Address>

    <OtherNames>No Other Names</OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Susan</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>No Other Names</OtherNames>

    </PersonDetails>

    </ReportRequest>

    </ReportDetails>'

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , pd.value('FirstName[1]', 'varchar(50)') AS FirstName

    , pd.value('SurName[1]', 'varchar(50)') AS SurName

    , pd.value('Address[1]','varchar(50)') AS Address

    , pd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , onm.value('.[1]', 'varchar(50)') AS OtherName

    , ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    CROSS APPLY pd.nodes('OtherNames/OtherName') onm(onm)

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Matthew,

    Thanks for following up with the code that shows it working - it will help out others in the future when they have a similar issue.

    How does this run performance-wise for you?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Performance wise it's not too bad; the first query takes around 15 seconds to produce ~300k rows.

    However, there's a slight tweak that needs making to the second query in order to ensure the unique rows are in alignment - the cross apply needs altering to an outer apply:-

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , onm.value('.[1]', 'varchar(50)') AS OtherName

    , ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    OUTER APPLY pd.nodes('OtherNames/OtherName') onm(onm)

    This slows it down somewhat, sadly (just run it and took just over 8 minutes 🙁 ), but is required otherwise those elements that don't have the OtherName element get missed from the DENSE_RANK generation and so throw the generated row numbers out of sequence:-

    DECLARE @xml xml

    SET @xml =

    '<ReportDetails>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Bob</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>

    <OtherName>Robert Smith</OtherName>

    <OtherName>Robert J Smith</OtherName>

    </OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>2</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>2</PersonDetailsCaseNo>

    <FirstName>Geoff</FirstName>

    <SurName>Jones</SurName>

    <Address>12 Tipton St, London</Address>

    <OtherNames>No Other Names</OtherNames>

    </PersonDetails>

    </ReportRequest>

    <ReportRequest>

    <ReportRequestCaseNo>1</ReportRequestCaseNo>

    <PersonDetails>

    <PersonDetailsCaseNo>1</PersonDetailsCaseNo>

    <FirstName>Susan</FirstName>

    <SurName>Smith</SurName>

    <Address>1B Shady St, London</Address>

    <OtherNames>

    <OtherName>Sue</OtherName>

    </OtherNames>

    </PersonDetails>

    </ReportRequest>

    </ReportDetails>'

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , pd.value('FirstName[1]', 'varchar(50)') AS FirstName

    , pd.value('SurName[1]', 'varchar(50)') AS SurName

    , pd.value('Address[1]','varchar(50)') AS Address

    , pd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    /*using cross apply for the OtherName elements throws the dense rank sequence out */

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , onm.value('.[1]', 'varchar(50)') AS OtherName

    , ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    CROSS APPLY pd.nodes('OtherNames/OtherName') onm(onm)

    /*Using outer apply means that dense rank is generated for everything*/

    SELECT

    root.value('ReportRequestCaseNo[1]','int') AS CaseNo

    , onm.value('.[1]', 'varchar(50)') AS OtherName

    , ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

    , DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

    FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

    CROSS APPLY root.nodes('PersonDetails') pd(pd)

    OUTER APPLY pd.nodes('OtherNames/OtherName') onm(onm)

    I guess in the circumstances I can cope with the performance, this only needs to be run once a day maximum and can be run out of hours; the resulting normalised data will then be used for querying.

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Can you post the schema of the actual table, a few rows of sample data, number of actual rows, and the query plans for both? I'm sure if the CROSS APPLY is taking only 15 seconds we can figure out some way to get the OUTER APPLY to take much less than 8 minutes.

    --
    Adam Machanic
    whoisactive

  • Will do, give me a few minutes and I'll get everything together.

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • The code to create the schema collection, objects and with a small sample xml is attached; just getting the query plans now.

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • And the two execution plans.

    Forgot to add total number of rows for the full document:-

    Individual Details - 289166

    Other Names - 294767

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • Just to be clear, you run one file at a time, and it takes 8 minutes?

    --
    Adam Machanic
    whoisactive

  • Not quite; I load the file into the dbo.eiirXML_TEST table, then perform the shredding from there.

    The query to populate the dbo.eiirIndividualDetails_TEST table takes 15 seconds, the query to populate the dbo.eiirOtherNames_TEST table was taking 8 minutes. I've just run an update statistics command (not sure how this applies to xml indexes) and run that second query again and it took 3:36 so I'm guessing maybe the statistics were a little off?

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • OK, I've run the two queries below:-

    Cross apply, takes 19 seconds to return 27,968 rows:-

    SELECT idvd.value('CaseNoIndividual[1]', 'int') AS CaseNoOtherNames

    , idvd.value('FirstName[1]', 'varchar(100)') AS FirstName

    , idvd.value('Surname[1]', 'varchar(100)') AS SurName

    , idvd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

    , otnm.value('.[1]', 'varchar(4000)') As OtherName

    , 1--@FileName

    , DENSE_RANK() OVER(ORDER BY rd) AS UniqueXMLRowID

    , ROW_NUMBER() OVER (PARTITION BY rd ORDER BY otnm) AS OtherNameSubID

    FROM dbo.eiirXML

    CROSS APPLY data.nodes('/ReportDetails/ReportRequest') AS rd(rd)

    CROSS APPLY rd.nodes('IndividualDetails') AS idvd(idvd)

    OUTER APPLY idvd.nodes('OtherNames/OtherName') AS otnm(otnm)

    WHERE ID = 1--@FileName

    Outer Apply, takes 3:38 to return 294,768 rows :-

    SELECT idvd.value('CaseNoIndividual[1]', 'int') AS CaseNoOtherNames

    , idvd.value('FirstName[1]', 'varchar(100)') AS FirstName

    , idvd.value('Surname[1]', 'varchar(100)') AS SurName

    , idvd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

    , otnm.value('.[1]', 'varchar(4000)') As OtherName

    , 1--@FileName

    , DENSE_RANK() OVER(ORDER BY rd) AS UniqueXMLRowID

    , ROW_NUMBER() OVER (PARTITION BY rd ORDER BY otnm) AS OtherNameSubID

    FROM dbo.eiirXML

    CROSS APPLY data.nodes('/ReportDetails/ReportRequest') AS rd(rd)

    CROSS APPLY rd.nodes('IndividualDetails') AS idvd(idvd)

    CROSS APPLY idvd.nodes('OtherNames/OtherName') AS otnm(otnm)

    WHERE ID = 1--@FileName

    I've also attached the query plans following running each, as well as the plan from running the two queries in the same batch.

    On both plans, the most expensive action shows as the Nested Loops (Inner join) at 77% with an estimated row count of 44,325,400, actual row count of 42,202; this is the same on both plans, yet the cross apply query takes 1% of the cost relative to the batch? I'll happily admit I get easily confused looking at query plans, but this really doesn't make much sense to me. 🙁

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply