Working with large de-normalized dataset

  • Hi,

    I'm trying to find a way to programmatically pivot a very large data set, the structure of which is:

    CREATE TABLE [dbo].[f_Data](

    [RecordID] [bigint] NOT NULL,

    [RecordTypeID] [int] NOT NULL,

    [DataVersionID] [int] NOT NULL,

    [VariableName] [nvarchar](128) NOT NULL,

    [VariableValue] [int] NULL,

    CONSTRAINT [PK_f_Data_1] PRIMARY KEY CLUSTERED

    (

    [RecordID] ASC,

    [VariableName] ASC

    )

    For each RecordID there are upwards of 2000 different variables that we would like to have a as columns, with a row for each RecordID. Now, I know I could do this using a PIVOT statement - but the prospect of typing out 2000 column names (for each different variable) isn't all that appealing. There's also an added difficulty in that there is a second table with additional data on the variables, such as the order that they should be listed in - this makes building a programmatic-yet-dynamic statement more difficult as well. Though I am not beyond doing that, I'm thinking there has to be an easier way (mainly because I can do exactly what I want in Excel with a few clicks). Any ideas?

  • Building a PIVOT query dynamically is a viable option. Using your DDL I created a simple PIVOT query:

    SELECT pvt.RecordID,

    pvt.Variable1,

    pvt.Variable2

    FROM (SELECT RecordID,

    VariableName,

    VariableValue

    FROM dbo.f_Data

    ) a PIVOT

    ( MAX(a.VariableValue) FOR a.VariableName IN (Variable1, Variable2) ) AS pvt ;

    When building the SELECT and PIVOT column list joining to the external table you mentioned that supplies the column-order for the variable-columns is an extra step in building the SQL string you'll eventually execute however it will not be prohibitively complex.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • As an alternative to the PIVOT syntax you could have a look at the CrossTab and DynamicCrossTab article referenced in my signature.

    It's not as easy as using Excel, but Excel isn't a RDBMS (at least not intentionally...).

    Edit: stuff like PIVOT usually is done at the presentation layer (e.g. ReportingServices) since it violates normalization.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Lutz has a point...dynamically generating a crosstab query would require approximately the same level of effort from you and would almost certainly perform better than a PIVOT. My code example as a crosstab:

    SELECT RecordID,

    SUM(CASE WHEN VariableName = 'Variable1' THEN VariableValue ELSE NULL END) AS [Variable1],

    SUM(CASE WHEN VariableName = 'Variable2' THEN VariableValue ELSE NULL END) AS [Variable2]

    FROM (SELECT RecordID,

    VariableName,

    VariableValue

    FROM dbo.f_Data

    ) d

    GROUP BY RecordID

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (2/22/2011)


    Building a PIVOT query dynamically is a viable option. Using your DDL I created a simple PIVOT query:

    SELECT pvt.RecordID,

    pvt.Variable1,

    pvt.Variable2

    FROM (SELECT RecordID,

    VariableName,

    VariableValue

    FROM dbo.f_Data

    ) a PIVOT

    ( MAX(a.VariableValue) FOR a.VariableName IN (Variable1, Variable2) ) AS pvt ;

    When building the SELECT and PIVOT column list joining to the external table you mentioned that supplies the column-order for the variable-columns is an extra step in building the SQL string you'll eventually execute however it will not be prohibitively complex.

    I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length. It does work fine for a subset of variables, but not the whole dataset. On top of that, I actually need to try and get a description field in the column name as well so the length is a big issue. The columns that will be included will likely be subject to change too - otherwise I would just write the big nasty query once and call it a day.

  • Kevin Dahl (2/22/2011)


    I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length.

    Use NVARCHAR(MAX) instead of NVARCHAR(4000)

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (2/22/2011)


    Kevin Dahl (2/22/2011)


    I have actually tried doing the pivot and building the IN clause dynamically (using this approach http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx), but it turns out that the column names actually end up overflowing the max varchar length.

    Use NVARCHAR(MAX) instead of NVARCHAR(4000)

    Great tips, thanks guys... The dynamic crosstab was much easier/simpler to build programmatically, and the NVARCHAR(MAX) worked fine as well.

    Ordinarily I would leave the summary of the data to the presentation layer, but this particular dataset is going into an OLAP cube.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply