Using Schemas to stage data

  • I am a developer working on a solution to pull together data about my company's different sales channels from several sources into a unified database.

    Since each channel/source has its own transformation rules, each has its own load process. Entities represented in one channel could also turn out to be the same entities in another channel, so there is some cross-channel integrity concern.

    My plan is to load the data into some staging tables, verify integrity, and then load the data into the operational tables. This gives the load process plenty of time to run, and if it fails I don't need to back anything out (I just don't load the operational tables).

    One approach, floated by a DBA at my company, is to create schemas with identical tables so that we can do the following:

    +set DBO as non-updatable (if possible)

    +update/reload tables in WRK schema

    +run load process on tables in WRK schema

    +if the load is successful

    --+Transaction?

    ----+ALTER DBO.<tables> to STAGE.<tables>

    ----+ALTER WRK.<tables> to DBO.<tables>

    --+ALTER STAGE.<tables> to WRK.<tables> - it is nice to leave the data here to help evaluate exactly what changed overnight (a rough sketch of these steps follows this list)
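
    To make the swap concrete, here is a rough sketch of what I have in mind, using a single hypothetical Orders table (the table name is a placeholder, and this still needs testing):

    BEGIN TRANSACTION;
        -- move yesterday's operational table out of DBO
        ALTER SCHEMA STAGE TRANSFER DBO.Orders;
        -- promote the freshly loaded work table into DBO
        ALTER SCHEMA DBO TRANSFER WRK.Orders;
    COMMIT TRANSACTION;

    -- keep yesterday's data around as the new work copy
    ALTER SCHEMA WRK TRANSFER STAGE.Orders;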

    Does this seem like a reasonable approach to staging a data load? We also considered a separate database with the tables, but simply transferring the tables between schemas allows for a much faster "load" of the operational data.

  • If, as appears to be the case, you are not planning to add or update data in the operational tables but to fully repopulate them each time, then the proposed process may work just fine.

    Something else to take into consideration is indexing - the system may need extra indexes on the operational tables that it does not need on the staging tables.
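
    For example, the work tables could be kept lean during the load and the heavier indexes built only just before the swap - something along these lines, where table, column, and index names are placeholders only:

    -- bulk load WRK.Orders with minimal indexing, then add whatever the
    -- operational workload needs shortly before the swap
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
        ON WRK.Orders (CustomerId)
        INCLUDE (ChannelCode, OrderDate);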

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Thanks. There will be some minor updating of the operational tables. The reload/update work tables step would ensure that the work tables are in the same state as the operational tables before the load process begins.

    To be honest, the operational tables exist already. So far we are only pulling data from one sales channel. We are trying to design a staging solution because we had one failed load that resulted in quite a bit of down time. As we add additional channels, I anticipate greater risk to our load process.

    The things that I really like about the schema swap solution so far are:

    +It leaves the operational tables in yesterday's valid state if anything goes wrong in the load process. We can even run validation tests on the data post-load before declaring it DBO-ready (a rough example of such a check is sketched after this list).

    +It should be really fast

    +The load process should have no deadlock problems unless it deadlocks itself.
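
    As an example of the kind of post-load validation I have in mind, something like this would run against the WRK tables before the swap (table and column names are hypothetical):

    -- abort the swap if any order points at a customer that was not loaded
    IF EXISTS (SELECT 1
               FROM WRK.Orders AS o
               LEFT JOIN WRK.Customers AS c ON c.CustomerId = o.CustomerId
               WHERE c.CustomerId IS NULL)
        RAISERROR('Orphaned orders found in WRK - aborting swap.', 16, 1);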

    The things that cause me concern with the schema swap solution so far are:

    +What happens to DBO table constraints when a table's schema is altered to WRK? Do these need to be rebuilt to refer to the new schema? (I plan to compare constraints before and after with the query sketched after this list.)

    +I feel like I must be overlooking some kind of best-practice answer to staging a data load. I would think that staging a data load is a common enough need that there would be best-practice answers, and I half expected ETL tools to have staging-specific functionality.
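
    For the constraint question, my plan is to compare the output of something like this before and after a test swap, so I can see which schema each side of every foreign key ends up in (just a sketch):

    SELECT fk.name AS constraint_name,
           sp.name AS parent_schema, tp.name AS parent_table,
           sr.name AS referenced_schema, tr.name AS referenced_table
    FROM sys.foreign_keys AS fk
    JOIN sys.tables  AS tp ON tp.object_id = fk.parent_object_id
    JOIN sys.schemas AS sp ON sp.schema_id = tp.schema_id
    JOIN sys.tables  AS tr ON tr.object_id = fk.referenced_object_id
    JOIN sys.schemas AS sr ON sr.schema_id = tr.schema_id
    ORDER BY parent_schema, parent_table;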

  • How big are those tables?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Currently the tables run around 100MB. As we add additional channels this may grow to as much as 500MB.

    One other plus for the process is that it leaves the operational tables readable by applications the whole time the update is happening. This means that even if the update process took an hour or more, it would have little impact on the applications reading the database.

  • I would not suggest putting them in a schema in the same database as your operational tables, no. You're going to hit the log and data files on your main database in a way you don't need to. Better to separate that processing out so that it doesn't impact the end users who depend on the database for operations. Yes, you're still going to take a hit when you move from staging to the main production tables, but it'll be less than you'd get with both in the same database.
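
    Roughly the kind of separation I mean - a sketch only, with placeholder database, table, and column names:

    -- do the heavy lifting in a dedicated staging database with minimal logging
    CREATE DATABASE StagingDB;
    ALTER DATABASE StagingDB SET RECOVERY SIMPLE;

    -- ...bulk load and validate in StagingDB.dbo.Orders...

    -- the only hit on the operational database is the final, comparatively small push
    INSERT INTO OperationalDB.dbo.Orders (OrderId, CustomerId, ChannelCode, OrderDate)
    SELECT OrderId, CustomerId, ChannelCode, OrderDate
    FROM StagingDB.dbo.Orders;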

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt

    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software
