how big is too big

  • We are in the process of evaluating reporting tools and have been presented with the option of building a cube through Analysis Services 2008. The base fact table we would be reporting on is currently in Teradata and is approximately 1.2 terabytes. We've been told we should be able to load the entire thing into a single cube and see very quick responses to user queries at any level.

    Is this really possible? Will we really see huge performance boosts from a cube that's not really aggregated at all?

    Thanks!

  • Hey Jeremy,

    Just be wary of the nomenclature that people are using. Yes, you could load the entire table into a single cube, but in reality you're likely to be loading it into several partitions. Each partition can reasonably hold 20MM rows (according to MSFT best practices; for reference, in some client cubes we've had to load ~120MM rows per partition, with ~13 partitions in the cube, so well towards 1.5 billion rows). Although it sounds impressive, a table that's a terabyte in size could really vary in row count, since the size is driven by the width of each row (data-size wise).
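
    A quick back-of-the-envelope on that point (the 350-byte average row width below is purely a made-up number for illustration; plug in your real average):

    ```python
    # Rough sizing sketch: the row count depends entirely on average row width,
    # and the partition count follows from whatever per-partition target you pick.
    TABLE_BYTES = 1.2 * 1024**4          # ~1.2 TB fact table
    AVG_ROW_BYTES = 350                  # assumed average row width - check yours!
    ROWS_PER_PARTITION = 20_000_000      # the 20MM-per-partition guideline above

    est_rows = TABLE_BYTES / AVG_ROW_BYTES
    est_partitions = est_rows / ROWS_PER_PARTITION
    print(f"~{est_rows / 1e9:.1f} billion rows in ~{est_partitions:.0f} partitions")
    # At 350 bytes/row that's roughly 3.8 billion rows in just under 190 partitions;
    # halve or double the row width and those numbers move accordingly.
    ```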

    Why do you think that the cube would be 'not really aggregated at all'? There would be no reason *not* to have aggregations on this cube. If you can find a good partition candidate (a lot of times 'date' is used, since it gives clean breaks at a level such as year, month, or even day if you have that many records per day), then you could (would!?!) have aggregations per partition. Having partitions would also allow you to load selectively, i.e. don't load the entire 1.2 TB every time, just the current 'period' (day, week, month, whatever). Note also that the aggregations can be different per partition, and you have some level of control over this (e.g. maybe have a partition that holds 'last year' and 12 others that hold the current 12 months); you may want to aggregate differently on the 'historical' partition compared to the current ones.
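
    To make the 'load just the current period' idea concrete, here is the sort of logic that ends up behind each partition (fact table, column, and partition names are made up; in SSAS each partition gets a source query like this, and the nightly job only reprocesses the current month):

    ```python
    from datetime import date

    # Hypothetical fact table and date column names, purely for illustration.
    FACT_TABLE = "SALES_FACT"
    DATE_COLUMN = "TRANSACTION_DATE"

    def partition_source_query(year, month):
        """The slice of the fact table that one monthly partition is bound to."""
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        return (f"SELECT * FROM {FACT_TABLE} "
                f"WHERE {DATE_COLUMN} >= DATE '{start}' AND {DATE_COLUMN} < DATE '{end}'")

    # Full history lives across all the monthly partitions, but a nightly load
    # only reprocesses the current month instead of rereading the whole 1.2 TB.
    today = date.today()
    print("Reprocess only partition:", f"Sales_{today:%Y_%m}")   # hypothetical name
    print(partition_source_query(today.year, today.month))
    ```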

    Anyway, assuming a reasonable design and some good implementation, you should easily be able to accommodate this cube.

    HTH,

    Steve.

  • Thanks for the information, Steve. I'm still kinda new to the cube structure, so I'm probably way off on how the data is stored. My assumption was that if you have the data at the most detailed level in the cube, then it wouldn't be aggregated unless you created another, aggregated cube. Based on your statement it sounds like I'm incorrect. The table actually has about 3.5 billion records, and I'm just trying to figure out where the pitfalls in a solution like this might be. Currently everything is run relationally, and although a majority of the queries are under 2 minutes, performance is still a major concern. Users really want instantaneous responses.

    Thank you very much for taking the time to respond.

  • Hi Jeremy,

    Gotta start somewhere, right? 🙂

    If you're familiar with a star schema, a simple cube structure is very similar. So imagine your 3.5 B row table in the center and, let's say, 3 dimensions around it (the usual PK on the dim to FK on the fact). You will have selected some subset of the dimension fields to be attributes within the dimensions. In general, these are often 'grouping' type elements, simple examples being Country/State/City in a customer or vendor dimension, or ProductCategory/Group and maybe Type/Subtype in a Product dim. I'm guessing you have a lot (!) of these already in your Teradata setup.
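
    If it helps to picture it, here's a toy version of that star layout (all names invented, just to show where attributes like Country/State/City sit relative to the fact rows):

    ```python
    # Toy star schema: a fact table in the middle keyed to three dimensions.
    # Columns like Country/State/City become the 'attributes' of the dimensions.
    customer_dim = {
        101: {"Country": "USA", "State": "TX", "City": "Austin"},
        102: {"Country": "USA", "State": "CA", "City": "Fresno"},
    }
    product_dim = {
        7: {"ProductCategory": "Vehicles", "Group": "BigTrucks", "Subtype": "Diesel"},
    }
    date_dim = {
        20090615: {"Year": 2009, "Month": "June 2009", "Day": 15},
    }

    # Fact rows carry only foreign keys plus measures; the row width mentioned
    # earlier is driven by how many such columns (and which data types) you keep.
    fact_rows = [
        {"customer_key": 101, "product_key": 7, "date_key": 20090615, "sales_amt": 1250.00},
        {"customer_key": 102, "product_key": 7, "date_key": 20090615, "sales_amt": 980.00},
    ]
    ```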

    Continuing with the simple sample above, let's assume you can partition by month (assuming you have on average <= 20MM fact rows/month). So you have ~175 partitions. The partitions are where the aggregations 'live'. The dimensions are used to determine which aggregations will be created. The actual details around aggregate candidates and influencing the aggregation wizard are beyond this 3-paragraph response, but the info is definitely out there and available 😀 So let's just assume that SSAS decides to put an aggregate on State & ProdCatGroup and maybe another on Country & Subtype. During processing, these aggregates are populated with data. When querying the cube, any query that requests (simplistically) say June 2009 as a filter, TX as the state, and BigTrucks as the ProdCatGroup will internally i) execute over only the June 2009 partition of the cube (cutting the dataset down to 20MM rows or less) and then ii) likely use the aggregation that exists for the selected combination, meaning we likely read a single row (or at worst a much smaller subset of rows) to retrieve the result. That's the basic theory of how MOLAP assists you in enhancing query performance.
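
    In pseudo-ish Python, the idea (and only the idea - this is not how the engine is actually implemented, and the numbers are invented) looks like this:

    ```python
    # Conceptual model only: partition elimination by month, then a lookup against
    # a pre-built aggregation instead of a scan over the detail rows.
    partitions = {
        "2009-06": {
            # each partition also holds its ~20MM detail rows; omitted here
            "aggs": {("State", "ProdCatGroup"): {("TX", "BigTrucks"): 48_200_000.0,
                                                 ("CA", "BigTrucks"): 31_900_000.0}},
        },
        "2009-05": {
            "aggs": {("State", "ProdCatGroup"): {("TX", "BigTrucks"): 45_100_000.0}},
        },
    }

    def query_sales(month, state, prod_cat_group):
        partition = partitions[month]                       # i) only this partition is touched
        agg = partition["aggs"][("State", "ProdCatGroup")]  # ii) a matching aggregation exists
        return agg[(state, prod_cat_group)]                 # one cell read, not millions of rows

    print(query_sales("2009-06", "TX", "BigTrucks"))        # 48200000.0
    ```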

    Note there are other ways to achieve a similar kind of performance improvement; e.g. Oracle allows for the same type of mechanism directly within the RDBMS using materialized views (where the query optimizer will use a [pre-created] aggregate table to answer a query if one exists).

    Returning to the discussion on aggregations on partitions, you do have a reasonable level of control over what gets built, from hand-selecting/building them through to collecting query/usage info and letting the aggregation design wizard determine the best aggs to create to support most user queries.
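
    The usage-based side of that boils down to 'count which attribute combinations users actually ask for and build aggregations for the popular ones' - conceptually something like this (the real wizard also weighs estimated aggregation size and attribute relationships):

    ```python
    from collections import Counter

    # Pretend usage log: the attribute combination each user query grouped/filtered on.
    query_log = [
        ("Month", "State", "ProdCatGroup"),
        ("Month", "State", "ProdCatGroup"),
        ("Month", "State", "ProdCatGroup"),
        ("Month", "Country", "Subtype"),
        ("Year", "ProductCategory"),
    ]

    # The most-requested combinations become the aggregation candidates.
    for combo, hits in Counter(query_log).most_common(2):
        print(f"build aggregation on {combo} (requested {hits} times)")
    ```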

    If you end up implementing this yourself (i.e. versus having a consulting group come through), I can ping you some links re: aggregations etc. And yes, a lot of things like partition management (who wants to add/remove/etc. by hand in a group of 175 partitions?!?!) can be automated through the use of SSIS.
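
    For flavour, the decision that automation has to make is just a rolling-window diff - which monthly partitions should exist versus which already do (the real job would issue the create/drop/process commands from SSIS; the window length and partition names here are arbitrary):

    ```python
    from datetime import date

    def rolling_window(today, months_to_keep):
        """Month keys (newest first) that should exist as partitions right now."""
        keys, y, m = [], today.year, today.month
        for _ in range(months_to_keep):
            keys.append(f"{y:04d}-{m:02d}")
            y, m = (y - 1, 12) if m == 1 else (y, m - 1)
        return keys

    existing = {"2009-06", "2009-05", "2009-04"}            # partitions already in the cube
    wanted = set(rolling_window(date(2009, 7, 15), months_to_keep=3))

    print("create + process:", sorted(wanted - existing))   # ['2009-07']
    print("drop or archive: ", sorted(existing - wanted))   # ['2009-04']
    ```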

    HTH,

    Steve.

  • Realizing that I just wrote 'War & Peace' - the key takeaways are i) use partitions to store your data in the cube, which gives you logical separation and smaller subsets of data to work with (for loading and querying); and ii) aggregations should be added to each partition, giving you quick access to aggregated results based on the low-level fact data.

    One thing I missed mentioning earlier: the query engine in AS is aware of things like a user querying June and May for TX & Trucks (obviously assuming an agg exists, etc. etc.), and it can sum the results of the aggregations from both the June and May partitions to provide the query result. This will obviously be significantly faster than having to create a result by aggregating on the fly across either one large fact table or even two smaller ones.
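
    In the same spirit as the earlier sketch, a May + June query just sums one pre-built cell per partition rather than re-aggregating any detail rows:

    ```python
    # Same conceptual model as before: each monthly partition exposes a pre-built
    # cell for (State, ProdCatGroup); a May+June query sums one cell per partition.
    monthly_agg = {
        "2009-05": {("TX", "BigTrucks"): 45_100_000.0},
        "2009-06": {("TX", "BigTrucks"): 48_200_000.0},
    }

    def query_sales(months, state, prod_cat_group):
        # One small lookup per partition, summed - no detail rows touched on the fly.
        return sum(monthly_agg[m][(state, prod_cat_group)] for m in months)

    print(query_sales(["2009-05", "2009-06"], "TX", "BigTrucks"))   # 93300000.0
    ```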

    Steve.

  • This is all extremely helpful. Hopefully I'll know in the next couple of weeks whether this is the direction the business is set on.

    Many thanks!

    Jeremy

  • Steve,

    We'd love some basic articles on how this stuff works if you're interested. You wrote some great descriptions above.
