Mastering Dimensions of Time

  • Jeff Moden (1/21/2014)


    Peter,

    I know that a non-persisted computed column saves space but it's also recalculated on every use and non-indexable. Since the data won't actually ever change, I'd recommend actually dedicating the space to hard data even above using a persisted computed column, which could also be indexed. It's also not so much space because even "big" calendar tables are relatively small. The performance gains could be quite substantial depending on how you're using the table.

    What about the ones that do change... such as below?

    DayPeriods AS datediff(dd,caldate,getdate()) ,

    YearPeriods AS datediff(yy,caldate,getdate())

    etc

    I can also do things like SchoolYearPeriods, 4-4-5 calculations, etc... and have them be consistent, so that a developer doesn't have to remember that the school year for a given company begins in July.

    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube. And with 2014, I'll very likely make that table a memory-only table.
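
    For context, here is a minimal sketch of the kind of calendar table being described (table and column names are illustrative, not the actual schema):

    CREATE TABLE dbo.Calendar
    (
        CalDate     date    NOT NULL PRIMARY KEY CLUSTERED,
        CalYear     int     NOT NULL,
        CalMonth    tinyint NOT NULL,
        SchoolYear  int     NOT NULL,  -- hard data, e.g. the school year that starts in July
        -- Non-persisted computed columns: recalculated on every reference and,
        -- because GETDATE() is nondeterministic, they cannot be persisted or indexed.
        DayPeriods  AS DATEDIFF(dd, CalDate, GETDATE()),
        YearPeriods AS DATEDIFF(yy, CalDate, GETDATE())
    );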



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • JohnFTamburo (1/21/2014)


    Jeff Moden (1/21/2014)


    I know it's a personal preference but I've worked with "DateDIM" tables that use date and time serial numbers and I've never understood why people would bother. Maybe it was to save on disk space (4-byte INT vs. 8-byte DATETIME) before extremely large hard disks became so relatively inexpensive, but I just don't see the need for it anymore. Add the DATE and TIME datatypes into the mix and I really don't see a need for a surrogate key for any type of calendar or time-based table.

    A rich date dimension is often necessary for the "slice and dice" that end-users will do with an SSAS cube. They may want to look at sales on Tuesday vs. sales on Thursday, or the average of sales on the first Saturday of the month vs. all the others. That is why one would refer out to a lush date table. Considering that, in OLTP, DATETIME often is populated with GETDATE() or VB.NET's Now(), using it as the reference key would require the application or ETL process to strip off at minimum the fractional 10,000ths of a second. At that point, why not shave off four bytes of what is in essence unnecessary storage?

    As for the data space savings, with SSAS cubes, RAM is also an issue as well as disk -- the need to cache rowsets may become critical. Saving an extra 4 bytes per row could bring a performance benefit.

    That said, using a surrogate INT that is an offset of a known unit from a known base date (which could be stored in a one row one column table) is an excellent idea. That should easily populate with no need to go (eww) RBAR.

    Thanks

    John.

    Understood on the RAM thing and I appreciate the feedback. But why INT? For a calendar table, SMALLDATETIME would work just fine in a pre-2008 environment, and certainly one of the newer DATE datatypes would work without the added complexity of having to look up dates through a key instead of using them directly.
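
    To make the comparison concrete, here is a sketch of the two shapes being debated (illustrative names, not anyone's actual schema):

    -- Surrogate key: every date filter has to go through the dimension.
    CREATE TABLE dbo.DimDate_Surrogate
    (
        DateKey int  NOT NULL PRIMARY KEY,  -- e.g. 20140121
        CalDate date NOT NULL UNIQUE
    );

    -- Natural key: the date itself (DATE is 3 bytes; SMALLDATETIME is 4 bytes pre-2008),
    -- so facts can be joined and filtered on the date directly.
    CREATE TABLE dbo.DimDate_Natural
    (
        CalDate date NOT NULL PRIMARY KEY
    );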

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • mtassin (1/21/2014)


    Jeff Moden (1/21/2014)


    Peter,

    I know that a non-persisted computed column saves space but it's also recalculated on every use and non-indexable. Since the data won't actually ever change, I'd recommend actually dedicating the space to hard data even above using a persisted computed column, which could also be indexed. It's also not so much space because even "big" calendar tables are relatively small. The performance gains could be quite substantial depending on how you're using the table.

    What about the ones that do change... such as below?

    DayPeriods AS datediff(dd,caldate,getdate()) ,

    YearPeriods AS datediff(yy,caldate,getdate())

    etc

    Now that would make an interesting performance test. Which is faster? A non-persisted computed column in a table or an iSF? I'll try that out as soon as I post this reply.
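
    For anyone following along, "iSF" here presumably refers to the inline scalar function pattern: an inline table-valued function that returns a single value and gets expanded into the calling query's plan. A minimal sketch of that alternative to the computed column (illustrative names):

    CREATE FUNCTION dbo.fn_DayPeriods (@CalDate date)
    RETURNS TABLE
    AS
    RETURN SELECT DATEDIFF(dd, @CalDate, GETDATE()) AS DayPeriods;
    GO
    -- Used set-based with CROSS APPLY:
    -- SELECT c.CalDate, dp.DayPeriods
    -- FROM dbo.Calendar c
    -- CROSS APPLY dbo.fn_DayPeriods(c.CalDate) dp;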

    I can also do things like SchoolYearPeriods, 4-4-5 calculations, etc... and have them be consistent, so that a developer doesn't have to remember that the school year for a given company begins in July.

    Yes, you can do 4-4-5 calculations, but those aren't going to change. They should be hard-coded in the table, and the stored procedure that adds extensions to the table should have the formulas in it. The columns of the table should not be non-persisted computed columns for such things.

    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube.

    If you can avoid the full table scan, why wouldn't you?

    And with 2014, I'll very likely stuff that table as a memory only table.

    I can see that becoming the mantra for forgetting how to write good code, just like the age-old excuse of "Well... it's for a set of rows that will never grow". 😉 Even scans in memory take a certain amount of time and I/O resources even when they're lock-free. I'm going through something similar with a lot of the code at work. Admittedly, it's not with 2014, but I've been able to make some really serious improvements just by changing some things like what we're talking about.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • @mtassin,

    I forgot to ask. Would you be looking up a DayPeriod to get the Date or would you be looking up the Date to get the DayPeriod?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (1/21/2014)

    See the example code that Luis posted above for one possibility.

    I did, and I read the article you published, and I really enjoyed reading it. Thank you for that interesting piece, and to davoscollective for pointing it out.

    What I liked most is the use of the Tally table, though I'd say there is a "downside": you need to have it in advance. I mean, it is not efficient if you create a #temp Tally table, because generating a large number of rows (I tried 10M) takes a while... apart from that, I liked it a lot.

    Jeff Moden (1/21/2014)

    Peter,

    I know that a non-persisted computed column saves space but it's also recalculated on every use and non-indexable. Since the data won't actually ever change, I'd recommend actually dedicating the space to hard data even above using a persisted computed column, which could also be indexed. It's also not so much space because even "big" calendar tables are relatively small. The performance gains could be quite substantial depending on how you're using the table.

    I completely agree that it's a better idea to have them written to disk, even more so in a highly denormalized environment such as a data warehouse, but I cannot agree that you cannot create an index on computed columns; that's a big "it depends".

    http://technet.microsoft.com/en-us/library/ms189292.aspx

    Even so, I'd write them instead of computing them.
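
    For readers following the "it depends": the deciding factor is determinism. A quick sketch, assuming a calendar table with a CalDate date column (illustrative names):

    -- Deterministic expression: can be persisted and indexed.
    ALTER TABLE dbo.Calendar
        ADD CalYearMonth AS (YEAR(CalDate) * 100 + MONTH(CalDate)) PERSISTED;
    CREATE INDEX IX_Calendar_YearMonth ON dbo.Calendar (CalYearMonth);

    -- An expression that references GETDATE() is nondeterministic and cannot be
    -- persisted or indexed, which is the case discussed elsewhere in this thread.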

  • Jeff Moden (1/21/2014)


    JohnFTamburo (1/21/2014)


    Jeff Moden (1/21/2014)


    I know it's a personal preference but I've worked with "DateDIM" tables that use date and time serial numbers and I've never understood why people would bother. Maybe it was to save on disk space (4-byte INT vs. 8-byte DATETIME) before extremely large hard disks became so relatively inexpensive, but I just don't see the need for it anymore. Add the DATE and TIME datatypes into the mix and I really don't see a need for a surrogate key for any type of calendar or time-based table.

    A rich date dimension is often necessary for the "slice and dice" that end-users will do with an SSAS cube. They may want to look at sales on Tuesday vs. sales on Thursday, or the average of sales on the first Saturday of the month vs. all the others. That is why one would refer out to a lush date table. Considering that, in OLTP, DATETIME often is populated with GETDATE() or VB.NET's Now(), using it as the reference key would require the application or ETL process to strip off at minimum the fractional 10,000ths of a second. At that point, why not shave off four bytes of what is in essence unnecessary storage?

    As for the data space savings, with SSAS cubes, RAM is also an issue as well as disk -- the need to cache rowsets may become critical. Saving an extra 4 bytes per row could bring a performance benefit.

    That said, using a surrogate INT that is an offset of a known unit from a known base date (which could be stored in a one row one column table) is an excellent idea. That should easily populate with no need to go (eww) RBAR.

    Thanks

    John.

    Understood on the RAM thing and I appreciate the feedback. But why INT? For a calendar table, SMALLDATETIME would work just fine in a pre-2008 environment, and certainly one of the newer DATE datatypes would work without the added complexity of having to look up dates through a key instead of using them directly.

    There are advantages to using a structured int; I mentioned some and will go into a bit more detail here. The date/datetime types in SQL Server are not suited to specifying a month without specifying a day. This is a common problem, and sadly there is no out-of-the-box solution that is satisfactory.

    In this scenario you have a few choices:

    1. Accept the code and data mess that results from storing months in date types: make the day part constant and then code against such a model (which gives you ambiguous dates);

    2. Do not use date types and store each part of a date individually (year + month column), which results in more complex code (scanning date ranges for example);

    3. Use calendar tables that frequently need to be joined to determine details;

    4. Use calendar tables that almost never need to be joined due to the int coding being meaningful and unambiguous.

    We went for option 4.

    It has the nice properties we want for just the cost of foreign key verification when inserting records and/or updating a date column. The tables recognize the real structure of a date, and at each level you can extend the tables with useful information, with leap years as the most obvious example. This simplifies SQL statements that make use of such information, but it does not require every query to reference these tables, as you seem to have assumed. If you only want to store months, you just reference the month table and data pollution is impossible. I find it easier to maintain than real date alternatives with multiple check constraints on each and every column that uses dates.
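
    A sketch of what such meaningful int coding might look like (the exact scheme here is an assumption for illustration, not necessarily the poster's):

    -- Month-level key encoded as year * 100 + month, e.g. 201401 = January 2014.
    CREATE TABLE dbo.RefMonth
    (
        MonthKey    int     NOT NULL PRIMARY KEY,  -- e.g. 201401
        CalYear     int     NOT NULL,
        CalMonth    tinyint NOT NULL,
        DaysInMonth tinyint NOT NULL               -- extension point, e.g. leap-year aware
    );

    -- A table that only needs month granularity stores just the month, unambiguously;
    -- foreign key verification on insert/update is the only ongoing cost.
    CREATE TABLE dbo.MonthlyBudget
    (
        MonthKey int   NOT NULL REFERENCES dbo.RefMonth (MonthKey),
        Amount   money NOT NULL
    );

    -- The coding stays readable without a join, e.g.:
    -- WHERE MonthKey BETWEEN 201401 AND 201412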

    As for computed columns, it certainly depends on use. We hardly ever need to access those columns, and certainly not for row-by-row comparisons. Computing them for the remaining uses is very fast; under the hood it is just a few integer additions and multiplications. Doing it this way allowed us to have slim indexes that cover access to all the columns and thus prevent bookmark lookups. If the need arises to have them efficiently searchable, marking them persisted plus creating an index is all that is needed. It is a starting point, as I said, extendable to whatever is needed for a project.

    It would be nice to have all this in a database type from the get-go and then only link to additional metadata tables as required; then it would align with your feel on the issue. Such a type would remove the foreign key verification overhead on inserts and updates. This can be done by creating a type with CLR integration, but I suspect downsides and some overhead with that too. At this point I do not feel that is worth an experiment for us.

  • Jeff Moden (1/21/2014)


    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube.

    If you can avoid the full table scan, why wouldn't you?

    Because with just 11,000 rows to represent about 30 years of data, I typically see the optimizer pick table scans for just about anything. I suppose this will be an issue when the table gets to 600 years of data and 220,000 rows or maybe at the 6000 year mark... but if we're still using the same SQL Server architecture in 600 years I probably won't care. With all of the calculated columns in my dimCalendar table, instead of sticking 727 rows on a page I might fit 72.
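
    For rough context on those page counts (assuming about 8,000 usable bytes per 8 KB page): 8,000 / 727 ≈ 11 bytes per row for a lean key-plus-a-few-columns layout, versus 8,000 / 72 ≈ 111 bytes per row once all the attribute columns are added. That is roughly ten times the pages per scan, but still only around 150 pages for 11,000 rows.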



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • To everyone that has commented about calculated columns, functions for values, and using something other than an Integer as a PK and row ID for Dimension tables.

    I appreciate your comments and understand your views.

    However, I will never use these types of columns, functions, calculations, etc. in any schema for any tables in any Data Warehouse project. They all fail Best Practice recommendations and basic requirements used for years to architect good Data Warehouse databases and tables.

    The purpose of any value in any table in our Data Warehouse projects is to store the information as static values.

    Anything that would be selected from our data should be selectable as an uncalculated value without using a function.

    All tables have only the data needed, they are not re-used over and over, and we set the column values according to the project requirements to minimize the number of columns and the knowledge needed to use them.

    That is what we do, and it works as well with 2 terabytes of data as it does with 2 megabytes.

    My article on the creation of our last project's set of Dim Time and Dim Date tables reflects this minimal attitude.

    We get the most bang for our buck with the least amount of time spent coding this way.

    This allows us to spend more time doing analysis of the data and building in automation and intelligence, and less time trying to resolve performance issues by re-architecting, re-designing the schema, re-coding, re-testing, and re-deploying a delivered solution.

    On my current project we have several other Fact and Dimension tables related to the Date and Time tables: Dim Holidays, Dim OfficeClosed, Fact Appointment, Dim PTO (a slowly changing dimension :hehe: ), etc...

    We never just download from Kimball et al. or recycle Dim tables that might have values and schema never used in the project or that are not part of the project's thought process.

    The time this adds is nothing compared to what you add trying to support schema and values that you invested no time or knowledge in creating.

    We never use anything other than Integer values for Primary Key, and all Data Warehouse tables that are Dimensions or are populated from a source have Surrogate Keys.

    We assume EVERY dimension table should have a schema that either supports it being a Slowly Changing Dimension, or can be updated to support an SCD without breaking any code accessing that table.

    These three rules and a few more save hundreds of frustrating hours with resolving performance issues, updating tables, adding new data, or removing unused data.
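
    For illustration, that kind of SCD-ready shape usually looks something like this generic sketch (not the poster's actual schema):

    CREATE TABLE dbo.DimExample
    (
        ExampleSK     int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- integer surrogate key
        ExampleBK     varchar(20)       NOT NULL,              -- business key from the source
        SomeAttribute varchar(50)       NOT NULL,
        EffectiveFrom date              NOT NULL,
        EffectiveTo   date              NULL,                  -- NULL = current row
        IsCurrent     bit               NOT NULL DEFAULT (1)
    );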

    Most importantly, we have yet to have an issue with interfacing OLAP, Power Center, Power Flow, QLIK, Pentaho, Cognos, etc...

    How do you really know you are on the right track when you do this kind of work?

    Here are my aha moments:

    You see the same 30K rows of data return in 2 seconds from the Data Warehouse instead of 2 minutes.

    You can turn on replication or build OLAP cubes without schema changes or creating custom data views.

    You can spend less than an hour adding new columns and values from the data source to your Fact and Dim tables and begin to populate them.

    A junior DBA, Developer, or even better Business Analyst can understand the Schema and use it for their needs.

    When these things happen you understand the value of an intelligent Business Intelligence.

    It is easy where I live to see what can happen when bad IT does unintelligent business with a bad solution made from untested new ideas.

    You count the number of people you know that used to work for Bridgestone.

    http://spectrum.ieee.org/riskfactor/computing/it/bridgestone-sues-ibm-for-fraud-in-600-million-lawsuit-over-failed-it-implementation

  • mtassin (1/22/2014)


    Jeff Moden (1/21/2014)


    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube.

    If you can avoid the full table scan, why wouldn't you?

    Because with just 11,000 rows to represent about 30 years of data, I typically see the optimizer pick table scans for just about anything. I suppose this will be an issue when the table gets to 600 years of data and 220,000 rows or maybe at the 6000 year mark... but if we're still using the same SQL Server architecture in 600 years I probably won't care. With all of the calculated columns in my dimCalendar table, instead of sticking 727 rows on a page I might fit 72.

    I agree that a single scan is probably fine in such a case. I have found that a lot of code leads to multiple scans, especially on non-persisted computed columns in things like calendar tables leading to Triangular or even full Cartesian joins behind the scenes.

    I think you missed my previous post, though. On the DayPeriod formula you posted... how would that be used in the Calendar Table? Would you do a lookup on DayPeriod to find a given CalDate or would you do a lookup on a CalDate to find the DayPeriod? It's important for the testing I'm doing. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (1/22/2014)


    mtassin (1/22/2014)


    Jeff Moden (1/21/2014)


    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube.

    If you can avoid the full table scan, why wouldn't you?

    Because with just 11,000 rows to represent about 30 years of data, I typically see the optimizer pick table scans for just about anything. I suppose this will be an issue when the table gets to 600 years of data and 220,000 rows or maybe at the 6000 year mark... but if we're still using the same SQL Server architecture in 600 years I probably won't care. With all of the calculated columns in my dimCalendar table, instead of sticking 727 rows on a page I might fit 72.

    Why not just create a nonclustered covering index to support this?

    Am I missing something that makes this obvious solution a bad one?

    With only 11,000 rows it should make a difference with very little storage usage.

    Not to mention what you could do with the In Memory options available in SQL 2012.
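
    A sketch of that suggestion (column names are assumptions about the calendar table):

    -- A narrow covering index so common lookups never have to scan the wide clustered index:
    CREATE NONCLUSTERED INDEX IX_dimCalendar_CalDate
        ON dbo.dimCalendar (CalDate)
        INCLUDE (CalYear, CalMonth, FiscalPeriod);  -- include only what the queries actually need

    -- Caveat: computed columns that reference GETDATE() cannot appear in any index,
    -- which is the limitation raised in the reply further down.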

  • Jeff,

    In reference to your comments about using date and time data types.

    How would you recommend we take a date value and a time value and convert them to a smalldatetime or datetime value?

    Our requirement is to do this with no explicit (executing the CAST or CONVERT functions) conversions on the data.

  • PHYData DBA (1/22/2014)


    Jeff,

    In reference to your comments about using date and time data types.

    How would you recommend we take a date value and a time value and convert them to a smalldatetime or datetime value?

    Our requirement is to do this with no explicit (executing the CAST or CONVERT functions) conversions on the data.

    Implicit conversions appear to work well. On SQL 2008 SP3:

    declare @D1 datetime
    set @D1 = GETDATE()
    select @D1          -- full datetime

    declare @D2 smalldatetime
    set @D2 = @D1       -- implicit conversion: datetime -> smalldatetime
    select @D2

    declare @D3 date
    set @D3 = @D1       -- implicit conversion: datetime -> date
    select @D3

    Generates results:

    2014-01-22 15:05:43.030

    (1 row(s) affected)

    2014-01-22 15:06:00

    (1 row(s) affected)

    2014-01-22

    (1 row(s) affected)

    So you need only define the column as smalldatetime or date and assign the value -- no explicit conversion needed.

    I won't comment about how silly I feel the "no explicit CAST or CONVERT" standard is...

    Thanks

    John.

  • PHYData DBA (1/22/2014)


    Jeff Moden (1/22/2014)


    mtassin (1/22/2014)


    Jeff Moden (1/21/2014)


    What I've seen on the query plan is that it typically grabs the few rows from the calendar table with a clustered index scan (yes, a terrible table scan) and then joins into everything else with index seeks. That certainly beats having to do nightly manipulation to calculate that kind of stuff and update the columns. It also allows for a pile of attributes that can be used to do parallel periods etc... all in SQL... and then pull all of that into an SSAS cube.

    If you can avoid the full table scan, why wouldn't you?

    Because with just 11,000 rows to represent about 30 years of data, I typically see the optimizer pick table scans for just about anything. I suppose this will be an issue when the table gets to 600 years of data and 220,000 rows or maybe at the 6000 year mark... but if we're still using the same SQL Server architecture in 600 years I probably won't care. With all of the calculated columns in my dimCalendar table, instead of sticking 727 rows on a page I might fit 72.

    Why not just create a nonclustered covering index to support this?

    Am I missing something that makes this obvious solution a bad one?

    With only 11,000 rows it should make a difference with very little storage usage.

    Not to mention what you could do with the In Memory options available in SQL 2012.

    I believe the problem is in all those calculated columns. If they contain GETDATE(), then they are nondeterministic, cannot be persisted, and cannot be indexed. If they're doing lookups on those columns, it will cause Cartesian products with table scans.
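
    A quick self-contained illustration of that limitation (demo table, not the actual dimCalendar):

    CREATE TABLE dbo.dimCalendar_demo (CalDate date NOT NULL PRIMARY KEY);

    -- Allowed: non-persisted, but recomputed on every reference and not indexable.
    ALTER TABLE dbo.dimCalendar_demo
        ADD DayPeriods AS DATEDIFF(dd, CalDate, GETDATE());

    -- Not allowed: GETDATE() makes the expression nondeterministic,
    -- so attempting to persist (or index) it raises an error.
    -- ALTER TABLE dbo.dimCalendar_demo
    --     ADD DayPeriods2 AS DATEDIFF(dd, CalDate, GETDATE()) PERSISTED;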

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • raulggonzalez (1/22/2014)


    Jeff Moden (1/21/2014)

    See the example code that Luis posted above for one possibility.

    I did, and I read the article you published, and I really enjoyed reading it. Thank you for that interesting piece, and to davoscollective for pointing it out.

    What I liked most is the use of the Tally table, though I'd say there is a "downside": you need to have it in advance. I mean, it is not efficient if you create a #temp Tally table, because generating a large number of rows (I tried 10M) takes a while... apart from that, I liked it a lot.

    You don't have to create a tally table if you don't have one. You could simply create a cteTally that requires 0 reads. You can increase the number of cross joins to generate even more numbers if needed.

    WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),                                       --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b),     --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b),     --10E+4 or 10,000 rows max
    cteTally(N) AS (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    SELECT *
    FROM cteTally
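
    One way that pattern gets used in this thread's context is to generate the rows for a calendar table. A sketch (the start date and row count are placeholders, and one extra cross-joined level is added to go past 10,000 rows):

    WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)),
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),   -- 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),   -- 10,000 rows
         E8(N) AS (SELECT 1 FROM E4 a, E4 b),   -- up to 100,000,000 rows, produced only as needed
         cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8)
    SELECT TOP (11000)
           DATEADD(dd, N - 1, CONVERT(date, '20000101')) AS CalDate
    FROM cteTally
    ORDER BY N;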

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • PHYData DBA (1/22/2014)


    Jeff,

    In reference to your comments about using date and time data types.

    How would you recommend we take a date value and a time value and convert them to a smalldatetime or datetime value?

    Our requirement is to do this with no explicit (executing the CAST or CONVERT functions) conversions on the data.

    As JohnFTamburo pointed out, you shouldn't need explicit conversions and, most of the time, any implicit conversions among the different date/time datatypes should still allow index seeks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 31 through 45 (of 57 total)

You must be logged in to reply to this topic. Login to reply