Incremental Loading using SSIS

  • I usually use stored procedures as the source. I also use them for the transformations in the staging tables. It makes editing and troubleshooting possible without editing the packages. But I never use them for a full insert statement in an SSIS package.

  • RonKyle (12/16/2015)


    I used to avoid that by using a staging database

    I use those, too, though I'm glad to hear I'm not the only one. I've been told by other BI developers that it's an unnecessary database. The only data in want in the warehouse is clean data. It's not just the transaction log, it's the backup as well.

    It's not unnecessary in the slightest. That's like saying your toolbox is unnecessary because it's in the way. You still need it onsite to store the tools that you using to build a house. The staging database or table is a necessary evil to ensure you have everything you need to load and transform good clean data into the source database.

    That being said, I'm a hybrid of SSIS and TSQL too. I use SSIS to rip through large flat files that stream 200+ million rows into a staging database. Once it's in the source system, I use stored procedures to take it from there. I've found that SSIS is good for streaming and combining multiple data sources into the source system. I've also found that leverage the database engine with TSQL to be good for updating, sorting, joining data during the final stages of my data warehouse loads.

    At the end of the day, it really just depends. It's silly not to explore either option. They are both great tools.

  • Another use for the staging database is a way to see the extracted data when troubleshooting. It takes up space but it's an invaluable tool.

    RonKyle - why won't you use stored procedures to do a final insert?

  • RonKyle - why won't you use stored procedures to do a final insert?

    I do use stored procedures for the final insert in the sense that there is a stored procedure that is a SELECT. What it's not is an INSERT INTO ... SELECT statement. The select statement marshals the rows, joining to all the dimension table via the business key to retrieve the identity number associated with that row. The insert is done via a data flow. With large numbers of rows I have found this easily the faster method.

  • RonKyle (12/19/2015)


    RonKyle - why won't you use stored procedures to do a final insert?

    I do use stored procedures for the final insert in the sense that there is a stored procedure that is a SELECT. What it's not is an INSERT INTO ... SELECT statement. The select statement marshals the rows, joining to all the dimension table via the business key to retrieve the identity number associated with that row. The insert is done via a data flow. With large numbers of rows I have found this easily the faster method.

    Maybe I am overthinking it but I am very familiar with lots of SSIS and T-SQL techniques for loading data and still, after several Q&A's, I cannot picture what you're describing. Are you saying you load data into a staging table on the target instance, then pull the data back out from there and stream it back into final tables in that same instance as opposed to writing an INSERT...SELECT on that instance? Could you post a screenshot of a Package where you employ this technique or point to an article that describes the technique?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo (12/19/2015)


    RonKyle (12/19/2015)


    RonKyle - why won't you use stored procedures to do a final insert?

    I do use stored procedures for the final insert in the sense that there is a stored procedure that is a SELECT. What it's not is an INSERT INTO ... SELECT statement. The select statement marshals the rows, joining to all the dimension table via the business key to retrieve the identity number associated with that row. The insert is done via a data flow. With large numbers of rows I have found this easily the faster method.

    Maybe I am overthinking it but I am very familiar with lots of SSIS and T-SQL techniques for loading data and still, after several Q&A's, I cannot picture what you're describing. Are you saying you load data into a staging table on the target instance, then pull the data back out from there and stream it back into final tables in that same instance as opposed to writing an INSERT...SELECT on that instance? Could you post a screenshot of a Package where you employ this technique or point to an article that describes the technique?

    Well, I could be wrong, but it sounds like he is loading the data into memory within SSIS first, then writing that data to a staging table second, then using the SELECT statement to join the data in order to attach the identity number third, then returning just that field back to SSIS fourth and then joining the field back onto the record set in SSIS fifth and writing only the data with the identity number to the primary table sixth while truncating the staging table seventh?

    That's just a wild guess though based on what he said.

  • That's just a wild guess though based on what he said.

    That's pretty wild. Not it, though. I'll post tonight or tomorrow. I need to screen a couple of things out cannot spare that time at the moment.

  • RonKyle (12/19/2015)


    RonKyle - why won't you use stored procedures to do a final insert?

    I do use stored procedures for the final insert in the sense that there is a stored procedure that is a SELECT. What it's not is an INSERT INTO ... SELECT statement. The select statement marshals the rows, joining to all the dimension table via the business key to retrieve the identity number associated with that row. The insert is done via a data flow. With large numbers of rows I have found this easily the faster method.

    This just sounds like a straight INSERT into a fact table of new rows, after having populated all associated dim tables. It would be fast. But how does it handle updates? I'm guessing that there are none.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • This just sounds like a straight INSERT into a fact table of new rows, after having populated all associated dim tables. It would be fast. But how does it handle updates? I'm guessing that there are none.

    Yes, your are correct on the first part. I consider it pretty straightforward. The update portion isn't as correct. Some of the fact table are transactional and are insert only. Some key ones, however, are accumulating snapshots. As those are fairly low row counts (rarely exceeding 10,0000 per update) I use the SSIS tool. It's RBAR, but given the low record count to update, it finishes in seconds and so far hasn't been worth my time to design a method that would scale well. I also use the SSIS update tool for dimensions as they are also low record counts for any given update. Most of them are 1SCD, so it's straightforward.

    Had a lot of emergencies yesterday so I didn't get it posted last night. May not be able to get to it today, but will do so tomorrow.

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply