Simple Method for Importing Ragged files

  • Definitely an ingenious solution; however, it can quickly become a poor approach when having to deal with larger import file -think telephone call records for instance.

    I've found that the best approach is -and forgive me for still using DTS! :)- the judicious application of modest VB scripting to quickly manipulate suspicous data, or 'cleansing' done directly within the package and temp table.

  • brewmanz.sqlservercentral (3/5/2008)[hrI was hoping for the import of the classic ragged-right file - a variable number of fields in records that are wanted.

    For your definition of "ragged right", is each row supposed to have the same number of "fields" in the same position in each row but because the right hand fields aren't necessarily populated nor padded, that's what makes it ragged?

    Do you have an example ragged right file and a record layout that I could play with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • admin (3/5/2008)


    Definitely an ingenious solution; however, it can quickly become a poor approach when having to deal with larger import file -think telephone call records for instance.

    I've found that the best approach is -and forgive me for still using DTS! :)- the judicious application of modest VB scripting to quickly manipulate suspicous data, or 'cleansing' done directly within the package and temp table.

    DTS is IMO a preferred method for doing quick imports with a lot less overhead. However, there are various scripting tools available within SSIS also. Also, since DTS is being deprecated for SQL Server 2008, it may be time to bite the bullet. 🙂

  • admin (3/5/2008)


    Definitely an ingenious solution; however, it can quickly become a poor approach when having to deal with larger import file -think telephone call records for instance.

    I've found that the best approach is -and forgive me for still using DTS! :)- the judicious application of modest VB scripting to quickly manipulate suspicous data, or 'cleansing' done directly within the package and temp table.

    Actually, I've used the method in this article for CDR (Call Detail Records) cleansing... Consider that bulk insert will import 5.1 million rows in 60 seconds and BCP will do an export in about the same time... that means you can do a 5.1 million row "clean up" of this nature at the rate of 5.1 million rows in about 3 minutes (1 export, 2 imports).

    Of course, CDRs are normally in a fixed field format and it's a lot cheaper to just input the file into a single column and use substring to split the records whilst ignoring the "short rows". Works nasty fast. Another advantage is for CDR files like what some of the "Bell" companies send... they send a mixed bag... they use a certain "record indicator" in the same postion across multiple record types that have different layouts (CDRs vs Tax Records vs records counts for both). A simple substring "detector" allows me to filter out all but the rows that I want to split.

    Heh... and forgive me for not using DTS! I've actually never learned how to use it because I can normally beat the guys at work that do use it for performance.

    And, no... I'm not making fun of anyone who uses DTS... if you take it that way, then I apologize. I'm just saying I've always been able to beat DTS with T-SQL especially on CDR files.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Paul Ibison (3/4/2008)


    Comments posted to this topic are about the item

    Pretty good article, Paul... straight and to the point.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Paul,

    Once that data was imported into SQL why go back to a flat file? I would have thought that you could have started digesting the data straight away. One less round trip. Even if you are aggregating multiple input files into a larger batch you could just suck the fitting rows into the aggregate table.

    ATBCharles Kincaid

  • Thanks Jeff. The name makes sense but I can find no reference to it. Can you provide a link to the "mixed rowtype" definition (search for "mixed rowtype" in Google and get 0 results).

    Cheers,

    Paul

  • Oops. I reread the original. You are getting the column names from the header row and letting BCP or the like do the column splitting for you.

    Sorry, had not had my first cup yet.

    ATBCharles Kincaid

  • Jeff Moden (3/5/2008)


    And, no... I'm not making fun of anyone who uses DTS... if you take it that way, then I apologize. I'm just saying I've always been able to beat DTS with T-SQL especially on CDR files.

    No apologie needed!:P

  • paul.ibison (3/5/2008)


    Thanks Jeff. The name makes sense but I can find no reference to it. Can you provide a link to the "mixed rowtype" definition (search for "mixed rowtype" in Google and get 0 results).

    Cheers,

    Paul

    Heh... apparently.... not. 😀 I've never had a problem with folks understanding me when I say "Mixed RowType" but, you're correct... it doesn't appear in Google even if I split the word "RowType" into two. Looks like I've coined another phrase.

    I did find a couple of close references... but they don't come near the simplicity of "Mixed RowType"....

    http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=/dbp/rbafoformt.htm

    http://accessblog.net/2007/03/export-mixed-type-data-to-excel.html

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • admin (3/5/2008)


    Jeff Moden (3/5/2008)


    And, no... I'm not making fun of anyone who uses DTS... if you take it that way, then I apologize. I'm just saying I've always been able to beat DTS with T-SQL especially on CDR files.

    No apologie needed!:P

    Thank you for your understanding... pretty cool especially for someone relatively new to the forum. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • This looks like a traditional header/footer formatted text files. It doesn't seem to account for additional information in the header and/or footer that applies to all rows nested within them. I don't think you can account for these types of files with this method. It also seems to think that all header/footer combinations in these files will contain the exact same content in the exact same format all the way through the file. For many header/footer files I've had to import this is not a safe assumption.

  • I had a question about right-ragged file formats in a previous post.

    I ended up having to use a script because of the llimitations of the conditional split.

    (see post http://qa.sqlservercentral.com/Forums/Topic355490-148-1.aspx)

  • That's certainly another way to do it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I agree that this is not a ragged file problem, but an exercise in stripping the first and last rows. A conditional split would be the approach I would also take, or simply import with a delimited field mask for the "," into a staging table with nullable fields and then delete for null data. I have an aversion to bringing data into the system, cleaning it and then extracting it just to import again. I see no reason to create a cleaned extract, and several reasons not to. That will create a file management problem eventually, especially with large datasets.

    Once the data is in the RDBMS why go outside again? All the tools for transformation are there in SSIS and SSMS. I see this as a simplistic and inappropriate use of the tools available. The desire to make it easier for developers down the road can be accomplished inside the package without complex logic. And what "code" would need to be made can be made dynamic by using column indexes instead of names.

    Brandon Forest

    Sr. SQL DBA

Viewing 15 posts - 16 through 30 (of 51 total)

You must be logged in to reply to this topic. Login to reply