What would be appropriate or best practices in DW design for handling delete and update processes...

  • I have a few questions:

    1) What would be the appropriate design for both source and target DW to handle hard deletes that occur regularly on the source system?

    The only idea I can come up with is to create a table on both the source and target DW that would contain either the IDs or the dates it needs to delete on, put an index on it, and then drive the deletions from a join or lookup to that table?

    2) What if I have a big table of 500 mil rows and have to do daily updates of 200-300 mil records - how can I do this efficiently? Right now, doing these updates using looping and TOP takes a long time.

    3) How do you handle backfill of the table when you need to add a new column to this big table?

    Any input would be greatly appreciated. Thanks.

  • 1. Managing deletes in source

    Your idea is basically how replication works - you record the operation (Insert, Update, Delete) in a table on the source, then you apply those changes to the target by joining the log table to the original table. When the apply succeeds, you mark the log rows as 'complete - don't copy again'.

    Note that you can include Inserts and Updates as well; then you only have to load the changed records, not the whole lot.
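
    A minimal sketch of what that log table and the apply step might look like, assuming invented names (dbo.ChangeLog, dbo.TargetFact, a single-column BusinessKey) and that the pending log rows have already been staged where the target can join to them:

        -- Hypothetical change-log table maintained on the source system
        CREATE TABLE dbo.ChangeLog (
            LogID       bigint IDENTITY(1,1) PRIMARY KEY,
            BusinessKey int       NOT NULL,             -- key of the affected row
            Operation   char(1)   NOT NULL,             -- 'I', 'U' or 'D'
            ChangedAt   datetime2 NOT NULL DEFAULT SYSDATETIME(),
            Processed   bit       NOT NULL DEFAULT 0    -- 0 = still to be applied
        );
        CREATE INDEX IX_ChangeLog_Pending ON dbo.ChangeLog (Processed) INCLUDE (BusinessKey, Operation);

        -- On the target side: apply the recorded deletes by joining to the log...
        DELETE t
        FROM dbo.TargetFact AS t
        JOIN dbo.ChangeLog  AS l
          ON l.BusinessKey = t.BusinessKey
        WHERE l.Operation = 'D'
          AND l.Processed = 0;

        -- ...then flag those log rows as 'complete - don't copy again'
        UPDATE dbo.ChangeLog
        SET Processed = 1
        WHERE Operation = 'D'
          AND Processed = 0;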

    2. Big table: why are you using looping? That's a pretty general statement, and looping is something that generally causes a lot of uproar on this forum.

    You can do things like partition your table (split it among different filegroups), load a partition, then switch the partition in instantly - see the sketch at the end of this point. Generally, if there is anything 'predictable' about your data - e.g. 'I only load the last 30 days' - then you can use some smarts (such as partitioning) to load it quicker.

    Keep in mind that if you are loading all of these rows as part of a 'bulk' update, i.e. loading the last 30 days of data, then you can get smart by using the method in question 1 - record the inserts in a log table, then load only the data that was inserted, not the whole lot.
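
    Here is a rough, hedged illustration of the partition-switch idea. It assumes a hypothetical fact table dbo.FactSales partitioned on DateKey by a partition function pfDate, plus a staging table dbo.FactSales_Staging with identical structure, indexes and a matching check constraint on the same filegroup - none of these names come from the thread:

        -- 1. Bulk load the new day's data into the empty, constraint-checked staging table.
        INSERT INTO dbo.FactSales_Staging (DateKey, CustomerKey, Amount)
        SELECT DateKey, CustomerKey, Amount
        FROM dbo.SourceExtract
        WHERE DateKey = 20120123;

        -- 2. Switch the loaded staging table into the matching partition of the big table.
        --    This is a metadata-only operation, so it completes almost instantly.
        ALTER TABLE dbo.FactSales_Staging
            SWITCH TO dbo.FactSales PARTITION $PARTITION.pfDate(20120123);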

    3. What do you mean by backfill? What's the issue? Do you mean that when you add a column, you need to decide what to put in the historical rows where nothing was actually captured previously? That's up to your business users - what do they want to see?

  • sqlblue (1/22/2012)


    1) what would be the appropriate design for both source and target DW to handle hard deletes regularly on the source system?

    Too many questions in a single post; let me take the first one.

    A "delete" on the source system usually does not means a "delete" on the DWH side will occur.

    The source system - let's say an OLTP system doing billing - will probably keep online one or two months' worth of invoices, which are sent on a daily basis via the ETL process to the DWH. Once a month or so, invoices older than (let's say) two months get purged on the source OLTP system.

    Does that mean that invoices have to be deleted from the DWH? (Please say no here.)

    The DWH is supposed to be a repository where nothing gets out. If you want to register the fact that a particular customer is no longer a customer, you have techniques like Type 2 SCD to deal with your DIM tables, and temporal (or bitemporal if needed) FACT tables to let users know the validity range of any given row.
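
    For illustration only, a bare-bones Type 2 SCD structure for a customer dimension; the table and column names are invented, not from the thread:

        -- Each version of a customer gets its own row and validity range.
        CREATE TABLE dbo.DimCustomer (
            CustomerSK   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key referenced by the facts
            CustomerID   int          NOT NULL,          -- business (natural) key from the source
            CustomerName varchar(100) NOT NULL,
            IsActive     bit          NOT NULL,          -- e.g. 0 once the customer is purged at source
            ValidFrom    date         NOT NULL,
            ValidTo      date         NULL               -- NULL = current version
        );

        -- Registering "this customer is no longer a customer" without deleting anything:
        UPDATE dbo.DimCustomer
        SET ValidTo = '20120122'
        WHERE CustomerID = 42 AND ValidTo IS NULL;

        INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, IsActive, ValidFrom, ValidTo)
        VALUES (42, 'Acme Ltd', 0, '20120123', NULL);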

    Hope this helps.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • 1) what would be the appropriate design for both source and target DW to handle hard deletes regularly on the source system?

    As Paul already said, a delete on the source doesn't mean a delete from the DW. It's a one-way data flow. The retention period of the most granular data in the DW is decided by the CIO/CTO, and is usually 20 years or so. After that it qualifies for archival (not purging).

    2) what if I have a big table of 500 mil rows, and have to do daily updates of 200-300 mil records, how can I do this efficiently? Right now, to do these updates using looping and top takes a long time.

    You can do it in batches. A batch could be 1,000 rows (or a size based on your experience with your DW). I am guessing a bad DW design or ETL design here, because I haven't seen 200-million-row updates in daily loads. Can you please explain your business case?
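
    A hedged sketch of that batching pattern, with an invented staging table (dbo.UpdateStaging) and key; the batch size is something you would tune for your own environment:

        DECLARE @BatchSize int = 1000;   -- tune this; 1000 is only the example size from the post above

        WHILE 1 = 1
        BEGIN
            -- Update a limited slice per pass so locks and the transaction log stay manageable.
            UPDATE TOP (@BatchSize) f
            SET f.Amount = s.Amount
            FROM dbo.BigFact       AS f
            JOIN dbo.UpdateStaging AS s
              ON s.BusinessKey = f.BusinessKey
            WHERE f.Amount <> s.Amount;   -- only touch rows whose value actually differs

            IF @@ROWCOUNT = 0 BREAK;      -- nothing left to update
        END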

    3) How do you handle backfill of the table when you need to add a new column to this big table?

    You can add a column to a multi-million-row table with a default value or as NULLable. Just identify a time when the load on the DW is comparatively low, because it might slow down other reports.
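
    For example (the table and column names are placeholders), the ALTER itself can be cheap, but whether applying a default rewrites existing rows depends on the SQL Server version and edition, so test on a copy first:

        -- Nullable column: a metadata-only change, existing rows are left untouched.
        ALTER TABLE dbo.BigFact ADD NewAttribute varchar(20) NULL;

        -- NOT NULL column with a default applied to existing rows; on older versions
        -- this updates every row, so schedule it for a quiet window.
        ALTER TABLE dbo.BigFact ADD LoadSource varchar(10) NOT NULL
            CONSTRAINT DF_BigFact_LoadSource DEFAULT ('LEGACY');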

  • Thank you all for taking your valuable time to provide your thoughts and inputs. I'm sorry if I have not been clear in questions 2 and 3, so let me clarify them; overall, though, I think you have mostly addressed and answered my questions.

    #2. When I mentioned in my question that I'm using looping and TOP, I meant the batch process. Doesn't batching mean taking TOP(10000) (or however many records) and looping the process in a WHILE loop until there are no more rows? But I agree the partition option is much better.

    #3. In this question, by backfilling the table I meant updating the new column: since it is a new column I want to add, I would need to update it with some values, but doing that on a big table worries me because it may slow down other processes or cause other performance issues. So, besides updating it to a default value in the same table, I thought that maybe I could create a new table with the same structure as the existing one but with the new column added, populate the data, and just rename the tables afterward, since an insert may be faster than an update? And then add the indexes after?
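
    For what it's worth, here is roughly what that "build a new table, load it, swap the names" idea could look like; every object name is a placeholder, and you would want to test it against an in-place update, since minimal logging for the insert depends on the recovery model and hints:

        -- 1. Create the new structure with the extra column already in place.
        CREATE TABLE dbo.BigFact_New (
            BusinessKey  int         NOT NULL,
            Amount       money       NOT NULL,
            NewAttribute varchar(20) NULL        -- the column being backfilled
        );

        -- 2. Load the old rows plus the backfill value in one pass (TABLOCK helps minimal logging).
        INSERT INTO dbo.BigFact_New WITH (TABLOCK) (BusinessKey, Amount, NewAttribute)
        SELECT BusinessKey, Amount, 'LEGACY'
        FROM dbo.BigFact;

        -- 3. Build the indexes on the new table, then swap the names during a quiet window.
        CREATE CLUSTERED INDEX CIX_BigFact_New ON dbo.BigFact_New (BusinessKey);

        BEGIN TRANSACTION;
            EXEC sp_rename 'dbo.BigFact',     'BigFact_Old';
            EXEC sp_rename 'dbo.BigFact_New', 'BigFact';
        COMMIT;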

    #1. I agree with you all on this option. You are also right that data may be deleted on the source but not in the DW.

    Thanks again.

  • sqlblue (1/23/2012)


    2) what if I have a big table of 500 mil rows, and have to do daily updates of 200-300 mil records, how can I do this efficiently? Right now, to do these updates using looping and top takes a long time.

    #2. When I mentioned in my question that I'm using looping and TOP, I meant the batch process. Doesn't batching mean taking TOP(10000) (or however many records) and looping the process in a WHILE loop until there are no more rows?

    The "top" technique is usually used for deleting rows but it is very inneficient for the described scenario where 300,000 rows have to be updated out of a total of 500M rows.

    Not knowing the environment, it is really difficult to say what would be the best approach... if the base table were partitioned in a way where the 300,000 candidate rows were located in a few partitions holding just a few million rows, it might make sense to do a full scan of those partitions; but on a 500M-row non-partitioned table, an update by a very restrictive key - unique if possible - would probably be the most efficient.
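
    As a hedged illustration of that last point (all names invented): if the changed rows arrive in a staging table keyed the same way as the big table, a single set-based update through the unique key avoids repeatedly rescanning rows that are not affected:

        -- One set-based pass, relying on the unique / clustered key for the join.
        UPDATE f
        SET f.Amount      = s.Amount,
            f.LastUpdated = SYSDATETIME()
        FROM dbo.BigFact      AS f
        JOIN dbo.DailyChanges AS s
          ON s.BusinessKey = f.BusinessKey;   -- BusinessKey assumed unique on both sides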

