XQuery, output the integer that represents a singleton

Question

XQuery, output the integer that represents a singleton

Matthew Darwin

SSCertifiable

Points: 5513
More actions
December 29, 2010 at 4:24 am

#242814

Hi All,
I'm currently working on shredding a large xml document (around half a GB) into relational format in order to allow easy comparison to other data I hold.
However, there are a few issues with the structure of the document which make this slightly more complicated than I'd initally thought. (Unfortunately I have no control over the production of the document, it comes how it comes. 🙁 )
Below is an example of the xml:-
<ReportDetails>
<ReportRequest>
<ReportRequestCaseNo>1</ReportRequestCaseNo>
<PersonDetails>
<PersonDetailsCaseNo>1</PersonDetailsCaseNo>
<FirstName>Bob</FirstName>
<SurName>Smith</SurName>
<Address>1B Shady St, London</Address>
<OtherNames>
<OtherName>Robert Smith</OtherName>
<OtherName>Robert J Smith</OtherName>
</OtherNames>
</PersonDetails>
</ReportRequest>
<ReportRequest>
<ReportRequestCaseNo>2</ReportRequestCaseNo>
<PersonDetails>
<PersonDetailsCaseNo>2</PersonDetailsCaseNo>
<FirstName>Geoff</FirstName>
<Surname>Jones</Surname>
<Address>12 Tipton St, London</Address>
<OtherNames>No Other Names</OtherNames>
</PersonDetails>
</ReportRequest>
<ReportRequest>
<ReportRequestCaseNo>1</ReportRequestCaseNo>
<PersonDetails>
<PersonDetailsCaseNo>1</PersonDetailsCaseNo>
<FirstName>Susan</FirstName>
<SurName>Smith</SurName>
<Address>1B Shady St, London</Address>
<OtherNames>No Other Names</OtherNames>
</PersonDetails>
</ReportRequest>
</ReportDetails>
There are other elements as well such as case description etc, but not really relevant to the issue as the same solution here will be applicable to them.
I'm currently extracting this information into various tables for each section of elements, one for the person details, one for the other names, one for the case details and so on, in order to move this to a relational format.
However, the issue that I have is that it is possible for a case to be a joint case resulting, as shown above in the sample xml, with duplicated case numbers but with different details in the person section. This becomes especially complicated when it comes to shredding the Other Names section as it leaves me with no way in which to match the other names back to the correct node.
I've tried including the xml fragment in the person table to allow subsequent querying of the other names section, but performance wise this ends up pretty much killing the server and turning a 20 minute shredding job into server meltdown! As such, I was wondering if it's possible to generate some sort of unique row id that will always be consistent from the xml document that I can pass into a header table and then use this when performing the subsequent shredding tasks. I know that if I run something like:-
SELECT root.value('ReportRequest[3]/PersonDetails[1]/FirstName[1]', 'varchar(50)')
FROM @xml.nodes('ReportDetails') root(root)
then I will always get the third ReportRequest; but I'm not sure how to actually dynamically return the element number itself in a query if I was shredding the entire document to produce something along the following lines:-
ElementID - CaseNo - FirstName - SurName
1 - 1 - Bob - Smith
2 - 2 - Geoff - Jones
3 - 1 - Susan - Smith
Which would then allow me to have a second OtherNames table which would be able to be joined to the first table along the lines of:-
ElementID - CaseNo - OtherNames
1 - 1 - Robert Smith
etc
Hope I've made all that clear, if not please ask and I'll try to expand further.
Cheers in advance,
Matthew
Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

WayneS SSC Guru Points: 95405 More actions · Answer 1

Matthew Darwin (12/29/2010)
I was wondering if it's possible to generate some sort of unique row id that will always be consistent from the xml document that I can pass into a header table and then use this when performing the subsequent shredding tasks. I know that if I run something like:-
SELECT root.value('ReportRequest[3]/PersonDetails[1]/FirstName[1]', 'varchar(50)')
FROM @xml.nodes('ReportDetails') root(root)
then I will always get the third ReportRequest; but I'm not sure how to actually dynamically return the element number itself in a query if I was shredding the entire document

Hi Matthew,

From what I understand, this is in the XML specs, but not implemented in SQL Server.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 2

Hi Wayne,

Thanks for the response. Bit of a bummer then. 🙁 I'm trying to play around with ways in which to try and break each fragment down seperately but so far seems to be too heavy resource wise to be a viable solution.

Cheers

Matthew

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Adam Machanic SSCoach Points: 15370 More actions · Answer 3

Perhaps this will help you?

http://sqlblog.com/blogs/adam_machanic/archive/2009/08/03/uniquely-identifying-xml-nodes-with-dense-rank.aspx

--
Adam Machanic
whoisactive

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 4

Superb! Having a little play on my small sample and does the trick perfectly!

Going to try and apply it to the full document, will let you know how I get on.

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 5

Just so it's clear, I thought I'd post the code that would populate the two tables. Works a charm, thanks very much!

DECLARE @xml xml

SET @xml =

'<ReportDetails>

<ReportRequest>

<ReportRequestCaseNo>1</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>1</PersonDetailsCaseNo>

<FirstName>Bob</FirstName>

<SurName>Smith</SurName>

<Address>1B Shady St, London</Address>

<OtherNames>

<OtherName>Robert Smith</OtherName>

<OtherName>Robert J Smith</OtherName>

</OtherNames>

</PersonDetails>

</ReportRequest>

<ReportRequest>

<ReportRequestCaseNo>2</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>2</PersonDetailsCaseNo>

<FirstName>Geoff</FirstName>

<SurName>Jones</SurName>

<Address>12 Tipton St, London</Address>

<OtherNames>No Other Names</OtherNames>

</PersonDetails>

</ReportRequest>

<ReportRequest>

<ReportRequestCaseNo>1</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>1</PersonDetailsCaseNo>

<FirstName>Susan</FirstName>

<SurName>Smith</SurName>

<Address>1B Shady St, London</Address>

<OtherNames>No Other Names</OtherNames>

</PersonDetails>

</ReportRequest>

</ReportDetails>'

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, pd.value('FirstName[1]', 'varchar(50)') AS FirstName

, pd.value('SurName[1]', 'varchar(50)') AS SurName

, pd.value('Address[1]','varchar(50)') AS Address

, pd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, onm.value('.[1]', 'varchar(50)') AS OtherName

, ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

CROSS APPLY pd.nodes('OtherNames/OtherName') onm(onm)

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

WayneS SSC Guru Points: 95405 More actions · Answer 6

Matthew,

Thanks for following up with the code that shows it working - it will help out others in the future when they have a similar issue.

How does this run performance-wise for you?

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 7

Performance wise it's not too bad; the first query takes around 15 seconds to produce ~300k rows.

However, there's a slight tweak that needs making to the second query in order to ensure the unique rows are in alignment - the cross apply needs altering to an outer apply:-

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, onm.value('.[1]', 'varchar(50)') AS OtherName

, ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

OUTER APPLY pd.nodes('OtherNames/OtherName') onm(onm)

This slows it down somewhat, sadly (just run it and took just over 8 minutes 🙁 ), but is required otherwise those elements that don't have the OtherName element get missed from the DENSE_RANK generation and so throw the generated row numbers out of sequence:-

DECLARE @xml xml

SET @xml =

'<ReportDetails>

<ReportRequest>

<ReportRequestCaseNo>1</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>1</PersonDetailsCaseNo>

<FirstName>Bob</FirstName>

<SurName>Smith</SurName>

<Address>1B Shady St, London</Address>

<OtherNames>

<OtherName>Robert Smith</OtherName>

<OtherName>Robert J Smith</OtherName>

</OtherNames>

</PersonDetails>

</ReportRequest>

<ReportRequest>

<ReportRequestCaseNo>2</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>2</PersonDetailsCaseNo>

<FirstName>Geoff</FirstName>

<SurName>Jones</SurName>

<Address>12 Tipton St, London</Address>

<OtherNames>No Other Names</OtherNames>

</PersonDetails>

</ReportRequest>

<ReportRequest>

<ReportRequestCaseNo>1</ReportRequestCaseNo>

<PersonDetails>

<PersonDetailsCaseNo>1</PersonDetailsCaseNo>

<FirstName>Susan</FirstName>

<SurName>Smith</SurName>

<Address>1B Shady St, London</Address>

<OtherNames>

<OtherName>Sue</OtherName>

</OtherNames>

</PersonDetails>

</ReportRequest>

</ReportDetails>'

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, pd.value('FirstName[1]', 'varchar(50)') AS FirstName

, pd.value('SurName[1]', 'varchar(50)') AS SurName

, pd.value('Address[1]','varchar(50)') AS Address

, pd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

/*using cross apply for the OtherName elements throws the dense rank sequence out */

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, onm.value('.[1]', 'varchar(50)') AS OtherName

, ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

CROSS APPLY pd.nodes('OtherNames/OtherName') onm(onm)

/*Using outer apply means that dense rank is generated for everything*/

SELECT

root.value('ReportRequestCaseNo[1]','int') AS CaseNo

, onm.value('.[1]', 'varchar(50)') AS OtherName

, ROW_NUMBER() OVER (PARTITION BY pd ORDER BY onm) AS OtherNameID

, DENSE_RANK() OVER (ORDER BY root) AS UniqueRootNodeID

FROM @xml.nodes('ReportDetails/ReportRequest') root(root)

CROSS APPLY root.nodes('PersonDetails') pd(pd)

OUTER APPLY pd.nodes('OtherNames/OtherName') onm(onm)

I guess in the circumstances I can cope with the performance, this only needs to be run once a day maximum and can be run out of hours; the resulting normalised data will then be used for querying.

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Adam Machanic SSCoach Points: 15370 More actions · Answer 8

Can you post the schema of the actual table, a few rows of sample data, number of actual rows, and the query plans for both? I'm sure if the CROSS APPLY is taking only 15 seconds we can figure out some way to get the OUTER APPLY to take much less than 8 minutes.

--
Adam Machanic
whoisactive

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 9

Will do, give me a few minutes and I'll get everything together.

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 10

The code to create the schema collection, objects and with a small sample xml is attached; just getting the query plans now.

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 11

And the two execution plans.

Forgot to add total number of rows for the full document:-

Individual Details - 289166

Other Names - 294767

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Adam Machanic SSCoach Points: 15370 More actions · Answer 12

Just to be clear, you run one file at a time, and it takes 8 minutes?

--
Adam Machanic
whoisactive

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 13

Not quite; I load the file into the dbo.eiirXML_TEST table, then perform the shredding from there.

The query to populate the dbo.eiirIndividualDetails_TEST table takes 15 seconds, the query to populate the dbo.eiirOtherNames_TEST table was taking 8 minutes. I've just run an update statistics command (not sure how this applies to xml indexes) and run that second query again and it took 3:36 so I'm guessing maybe the statistics were a little off?

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

Matthew Darwin SSCertifiable Points: 5513 More actions · Answer 14

OK, I've run the two queries below:-

Cross apply, takes 19 seconds to return 27,968 rows:-

SELECT idvd.value('CaseNoIndividual[1]', 'int') AS CaseNoOtherNames

, idvd.value('FirstName[1]', 'varchar(100)') AS FirstName

, idvd.value('Surname[1]', 'varchar(100)') AS SurName

, idvd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

, otnm.value('.[1]', 'varchar(4000)') As OtherName

, 1--@FileName

, DENSE_RANK() OVER(ORDER BY rd) AS UniqueXMLRowID

, ROW_NUMBER() OVER (PARTITION BY rd ORDER BY otnm) AS OtherNameSubID

FROM dbo.eiirXML

CROSS APPLY data.nodes('/ReportDetails/ReportRequest') AS rd(rd)

CROSS APPLY rd.nodes('IndividualDetails') AS idvd(idvd)

OUTER APPLY idvd.nodes('OtherNames/OtherName') AS otnm(otnm)

WHERE ID = 1--@FileName

Outer Apply, takes 3:38 to return 294,768 rows :-

SELECT idvd.value('CaseNoIndividual[1]', 'int') AS CaseNoOtherNames

, idvd.value('FirstName[1]', 'varchar(100)') AS FirstName

, idvd.value('Surname[1]', 'varchar(100)') AS SurName

, idvd.query('OtherNames/OtherName').value('count(*)', 'int') AS NumberOtherNames

, otnm.value('.[1]', 'varchar(4000)') As OtherName

, 1--@FileName

, DENSE_RANK() OVER(ORDER BY rd) AS UniqueXMLRowID

, ROW_NUMBER() OVER (PARTITION BY rd ORDER BY otnm) AS OtherNameSubID

FROM dbo.eiirXML

CROSS APPLY data.nodes('/ReportDetails/ReportRequest') AS rd(rd)

CROSS APPLY rd.nodes('IndividualDetails') AS idvd(idvd)

CROSS APPLY idvd.nodes('OtherNames/OtherName') AS otnm(otnm)

WHERE ID = 1--@FileName

I've also attached the query plans following running each, as well as the plan from running the two queries in the same batch.

On both plans, the most expensive action shows as the Nested Loops (Inner join) at 77% with an estimated row count of 44,325,400, actual row count of 42,202; this is the same on both plans, yet the cross apply query takes 1% of the cost relative to the batch? I'll happily admit I get easily confused looking at query plans, but this really doesn't make much sense to me. 🙁

Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]