Importing Unstructured Flat File Data with Header, Detail and Trailer

  • I have a flat file (pipe-delimited) which I want to import into a table. The file is based on Header, Detail and Trailer records and is structured as:

    H|20120531|SAPECC6|
    D|CostCentreID|CostCentreName|CostCentreDescription|
    D|700|Name1|Description 1|
    D|701|Name2|
    D|702|Name3|Description 3|
    D|703|Name4|
    T|4|

    The first row is the Header (H), which contains the ExtractDate and the SourceSystemName.

    The second row contains the column names for the Detail rows.

    There are 4 Detail (D) rows containing the data.

    The Trailer (T) row contains the count of the Detail rows.

    The file has to be treated as an "unstructured file" because some Detail rows are ragged, e.g. the 2nd and 4th Detail rows do not have a Description. How can the data be imported into a destination table which looks like this:

    CostCentreID CostCentreName CostCentreDescription SourceSystem ExtractDate
    700          Name1          Description1          SAPECC6      2012-05-31
    701          Name2                                SAPECC6      2012-05-31
    702          Name3          Description3          SAPECC6      2012-05-31
    703          Name4                                SAPECC6      2012-05-31

    And finally, I want to validate the number of rows loaded against the Trailer record, which said 4 (this acts as a reconciliation).

    Can anyone provide me with a full working solution based on the requirement above? Any help will be very much appreciated.

    Thanks

  • rka (5/31/2012)

    Can anyone provide me with a full working solution based on the requirement above? Any help will be very much appreciated.

    Thanks

    Depends on how much you are paying.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I managed to write the code. I have 4 outputs from the script component: ValidRows, HeaderRows, FooterRows and ErrorRows

    The code is as follows:

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            If Tokenise(Row.Column0, "|", 1) = "H" Then
                'Per the header layout, token 2 is the ExtractDate and token 3 is the SourceSystemName
                Row.ExtractDate = Tokenise(Row.Column0, "|", 2)
                Row.sourceName = Tokenise(Row.Column0, "|", 3)
                Row.DirectRowToHeaderRows()
            ElseIf Tokenise(Row.Column0, "|", 1) = "D" Then
                Row.Col1 = Tokenise(Row.Column0, "|", 2)
                Row.Col2 = Tokenise(Row.Column0, "|", 3)
                Row.Col3 = Tokenise(Row.Column0, "|", 4)
                Row.DirectRowToValidRows()
            ElseIf Tokenise(Row.Column0, "|", 1) = "T" Then
                Row.TotalRows = Tokenise(Row.Column0, "|", 2)
                Row.DirectRowToFooterRows()
            Else
                Row.DirectRowToErrorRows()
            End If
        End Sub

        'Private function that parses one token (column) out of the whole row
        Private Function Tokenise(ByVal input As String, ByVal delimiter As String, ByVal token As Integer) As String
            Dim tokenArray As String()
            tokenArray = input.Split(delimiter.ToCharArray) 'Split the string by the delimiter
            If tokenArray.Length < token Then 'Protect against a request for a token that doesn't exist
                Return ""
            Else
                Return tokenArray(token - 1)
            End If
        End Function

    End Class

    The issue I am having now is how to create a Cross Join between the Detail rows and the Header row so that I can get output like:

    CostCentreID CostCentreName CostCentreDescription SourceSystem ExtractDate
    700          Name1          Description1          SAPECC6      2012-05-31
    701          Name2                                SAPECC6      2012-05-31
    702          Name3          Description3          SAPECC6      2012-05-31
    703          Name4                                SAPECC6      2012-05-31

    How do I do the Cross Join without using a Merge Join, etc.? Is there any way to do this in the Script Task by writing to a variable and then using the variable to create a Derived Column?

    Thanks

  • Points for showing your work so far!

    I would urge you to look into the Conditional Split Transformation for SSIS. Also, I think tokenize is spelled with a z.

    If you do stick with a Script Component, then in Input0_ProcessInputRow you ought to store the result of a single call to Tokenise in a variable and do all subsequent comparisons against that variable. You are tokenizing each incoming row many, many times when you should only need to do it once.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • rka (6/1/2012)


    I managed to write the code. I have 4 outputs from the script component: ValidRows, HeaderRows, FooterRows and ErrorRows

    The code is as follows:

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            If Tokenise(Row.Column0, "|", 1) = "H" Then
                'Per the header layout, token 2 is the ExtractDate and token 3 is the SourceSystemName
                Row.ExtractDate = Tokenise(Row.Column0, "|", 2)
                Row.sourceName = Tokenise(Row.Column0, "|", 3)
                Row.DirectRowToHeaderRows()
            ElseIf Tokenise(Row.Column0, "|", 1) = "D" Then
                Row.Col1 = Tokenise(Row.Column0, "|", 2)
                Row.Col2 = Tokenise(Row.Column0, "|", 3)
                Row.Col3 = Tokenise(Row.Column0, "|", 4)
                Row.DirectRowToValidRows()
            ElseIf Tokenise(Row.Column0, "|", 1) = "T" Then
                Row.TotalRows = Tokenise(Row.Column0, "|", 2)
                Row.DirectRowToFooterRows()
            Else
                Row.DirectRowToErrorRows()
            End If
        End Sub

        'Private function that parses one token (column) out of the whole row
        Private Function Tokenise(ByVal input As String, ByVal delimiter As String, ByVal token As Integer) As String
            Dim tokenArray As String()
            tokenArray = input.Split(delimiter.ToCharArray) 'Split the string by the delimiter
            If tokenArray.Length < token Then 'Protect against a request for a token that doesn't exist
                Return ""
            Else
                Return tokenArray(token - 1)
            End If
        End Function

    End Class

    The issue I am having now is how to create a Cross Join between the Detail rows and the Header row so that I can get output like:

    CostCentreID CostCentreName CostCentreDescription SourceSystem ExtractDate
    700          Name1          Description1          SAPECC6      2012-05-31
    701          Name2                                SAPECC6      2012-05-31
    702          Name3          Description3          SAPECC6      2012-05-31
    703          Name4                                SAPECC6      2012-05-31

    How do I do the Cross Join without using a Merge Join, etc.? Is there any way to do this in the Script Task by writing to a variable and then using the variable to create a Derived Column?

    Thanks

    BWAAA-HAAAA! This isn't directed at you. I'm just amazed how little SSIS actually does for folks. I've always been made to believe that it's borderline "magic" and that BI can't actually exist without it. I see that notion is borderline horse feathers. I mean really... All that work and it still doesn't do what you want it to. This is why I steer clear of things like SSIS for most stuff... the added complexity just isn't needed.

    With that thought in mind, let's scrap the notion of having to really go outside of the data engine to SSIS and let's scrap the idea of programming in a language other than T-SQL. About the only thing we're going to do outside of the data engine is to store a generic BCP format file for these types of files. You should store the attached BCP format file in a utility folder somewhere that SQL Server can see. I store mine in D:\Util.

    Next, find out where your file is stored and change the source file in the BULK INSERT command to that full path and file name. In SQL Server 2005, this has to be hard-coded, or you can make the full BULK INSERT command a bit of dynamic SQL so that the file and path names can be variables. My example is hard-coded just for study purposes.

    Here's the T-SQL to do the same thing that you're trying to use SSIS and VB to do... It's just not that hard to do and I really don't understand why people go through the pain of using SSIS and scripts and whatever else to do something so simple.

    --===== Conditionally drop temp tables to make reruns in SSMS easier
         IF OBJECT_ID('TempDB..#RawData' ,'U') IS NOT NULL DROP TABLE #RawData;
         IF OBJECT_ID('TempDB..#SplitData','U') IS NOT NULL DROP TABLE #SplitData;

    --===== Explicitly create the raw data table for the BULK INSERT
     CREATE TABLE #RawData
            (
            RowNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
            Data   VARCHAR(8000)
            )
    ;
    --===== Do the import into just the data column as controlled by the format file.
       BULK INSERT #RawData
       FROM 'C:\Temp\SomeFile.txt'
       WITH (
            BATCHSIZE    = 100000,
            CODEPAGE     = 'RAW',
            DATAFILETYPE = 'char',
            FORMATFILE   = 'D:\Util\Generic09_PipeCrLf.txt',
            TABLOCK
            )
    ;
    --===== Split the data and load it into a working table
     SELECT rd.RowNum, split.ItemNumber, split.Item
       INTO #SplitData
       FROM #RawData rd
      CROSS APPLY dbo.DelimitedSplit8K(Data,'|') split
    ;
    --===== Get the desired common data from the header and remember it
    DECLARE @ExtractDate  DATETIME,
            @SourceSystem VARCHAR(128)
    ;
     SELECT @ExtractDate  = MAX(CASE WHEN ItemNumber = 2 THEN Item ELSE '1900' END),
            @SourceSystem = MAX(CASE WHEN ItemNumber = 3 THEN Item ELSE '' END)
       FROM #SplitData
      WHERE RowNum = 1
    ;
    --===== Recombine the "ragged" data using a traditional high speed cross tab.
     SELECT CostCentreID          = MAX(CASE WHEN ItemNumber = 2 THEN Item ELSE 0  END),
            CostCentreName        = MAX(CASE WHEN ItemNumber = 3 THEN Item ELSE '' END),
            CostCentreDescription = MAX(CASE WHEN ItemNumber = 4 THEN Item ELSE '' END),
            SourceSystem          = @SourceSystem,
            ExtractDate           = @ExtractDate
       FROM #SplitData
      WHERE RowNum > 2
      GROUP BY RowNum
     HAVING MAX(CASE WHEN ItemNumber = 1 THEN Item ELSE '' END) = 'D'
    ;

    Here's what the output looks like...

    CostCentreID CostCentreName CostCentreDescription SourceSystem ExtractDate
    ------------ -------------- --------------------- ------------ -----------------------
    700          Name1          Description 1         SAPECC6      2012-05-31 00:00:00.000
    701          Name2                                SAPECC6      2012-05-31 00:00:00.000
    702          Name3          Description 3         SAPECC6      2012-05-31 00:00:00.000
    703          Name4                                SAPECC6      2012-05-31 00:00:00.000

    (4 row(s) affected)
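
    The original post also asked for a reconciliation of the number of rows loaded against the Trailer row. That step isn't shown above, but a minimal sketch working from the same #SplitData table might look like the following. It assumes the trailer is the last line of the file and that its count is the second token (ItemNumber = 2) after the split:

    --===== Reconcile the number of Detail data rows against the Trailer count.
    --      A sketch only: assumes the trailer is the last row of the file.
    DECLARE @ExpectedRows INT,
            @ActualRows   INT;

     SELECT @ExpectedRows = MAX(CASE WHEN ItemNumber = 2 THEN Item ELSE '0' END)
       FROM #SplitData
      WHERE RowNum = (SELECT MAX(RowNum) FROM #SplitData);

     SELECT @ActualRows = COUNT(*)
       FROM #SplitData
      WHERE ItemNumber = 1
        AND Item       = 'D'
        AND RowNum     > 2; --Excludes the column-name row, which also starts with "D"

         IF @ExpectedRows <> @ActualRows
            RAISERROR('Trailer count (%d) does not match the %d Detail rows loaded.',16,1,
                      @ExpectedRows,@ActualRows);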

    You'll need my 8k splitter for this. You can pick that up in the "resources" section near the bottom of the following article.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    If you REALLY need to tokenize the column names, let me know and we can, with a little bit of help from dynamic SQL, automatically determine all of the column names, automatically determine the number of columns, etc., etc.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • It was just too much fun to resist. 😀 Here's the version with the 100% dynamic column name resolution. I also added a potwad of documentation.

    /*************************************************************************************
     This is what a typical file looks like for this problem.
    --------------------------------------------------------------------------------------
    H|20120531|SAPECC6|
    D|CostCentreID|CostCentreName|CostCentreDescription|
    D|700|Name1|Description 1|
    D|701|Name2|
    D|702|Name3|Description 3|
    D|703|Name4|
    T|4|
    --------------------------------------------------------------------------------------
     This is what the generic format file looks like. Note that anything after the
     designated number of rows in a BCP format file is treated as comment lines for
     documentation purposes.
    --------------------------------------------------------------------------------------
    9.0
    2
    1 SQLCHAR 0 0    ""      0 RowNum ""
    2 SQLCHAR 0 8001 "|\r\n" 2 Data   SQL_Latin1_General_CP1_CI_AS
    --------------------------------------------------------------
     This is a BCP format file that allows a text file to be
     handled as if it were a single column and allows input to a
     table with the following structure. The RowNum column is
     skipped because of the "0" column number in the format above.

     There are 3 characters (|CrLf) at the end of each line.
     The 8001 will cause an input to a VARCHAR(8000) to fail if the
     line is actually too long.

     CREATE TABLE #RawData
            (
            RowNum INT IDENTITY(1,1),
            Data   VARCHAR(8000)
            )
    ;
    --Jeff Moden
    --------------------------------------------------------------------------------------
     And now for the code. With the help of the generic BCP format file we made for this
     type of file (each line ends with a PIPE character followed by a carriage return (\r)
     and a line feed (\n)), the code will read the file, extract some common information
     from the first line (the header), extract the column names from the 2nd line, and then
     dynamically and correctly extract all of the data from the file as a single result
     set.
    --Jeff Moden
    *************************************************************************************/

    --====================================================================================
    --      Presets
    --====================================================================================
    --===== Environmental presets
        SET NOCOUNT ON;

    --===== Conditionally drop temp tables to make reruns in SSMS easier
         IF OBJECT_ID('TempDB..#RawData' ,'U') IS NOT NULL DROP TABLE #RawData;
         IF OBJECT_ID('TempDB..#SplitData','U') IS NOT NULL DROP TABLE #SplitData;

    --===== Explicitly create the raw data table for the BULK INSERT.
         -- The #SplitData table will be created on-the-fly.
     CREATE TABLE #RawData
            (
            RowNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
            Data   VARCHAR(8000)
            )
    ;
    --===== Declare some obviously named variables.
         -- Note that the variables that begin with "@p" could be parameters for a
         -- stored procedure.
    DECLARE @ExtractDate   DATETIME,
            @SourceSystem  VARCHAR(128),
            @ImportSQL     NVARCHAR(4000),
            @ResultSQL     NVARCHAR(MAX),
            @pFullFileName NVARCHAR(258),
            @pBCPFileName  NVARCHAR(258)
    ;
    --===== Preset the parameter variables.
         -- Normally, these would be parameters passed into a stored procedure.
     SELECT @pFullFileName = 'C:\Temp\SomeFile.txt',
            @pBCPFileName  = 'D:\Util\Generic09_PipeCrLf.txt'
    ;
    --===== Prepare the dynamic BULK INSERT code with a couple of "place holders" for the
         -- file names.
     SELECT @ImportSQL = '
    --===== Do the import into just the data column as controlled by the format file.
       BULK INSERT #RawData
       FROM <<@pFullFileName>>
       WITH (
            BATCHSIZE    = 100000,
            CODEPAGE     = ''RAW'',
            DATAFILETYPE = ''char'',
            FORMATFILE   = <<@pBCPFileName>>,
            TABLOCK
            )
    ;
    '
    ;
    --===== Make the necessary substitutions in the dynamic BULK INSERT code.
     SELECT @ImportSQL =
            REPLACE(
                REPLACE(
                    @ImportSQL
                ,'<<@pFullFileName>>',QUOTENAME(@pFullFileName,''''))
            ,'<<@pBCPFileName>>',QUOTENAME(@pBCPFileName,''''))
    ;
    --====================================================================================
    --      Import the data and split it on the designated delimiters.
    --      In this case, it's PIPE (|) symbols.
    --====================================================================================
    --===== Execute the prepared BULK INSERT statement
       EXEC (@ImportSQL)
    ;
    --===== Split the data and load it into a working table
     SELECT rd.RowNum, split.ItemNumber, split.Item
       INTO #SplitData
       FROM #RawData rd
      CROSS APPLY dbo.DelimitedSplit8K(Data,'|') split
    ;
    --===== Get the desired common data from the header and remember it
     SELECT @ExtractDate  = MAX(CASE WHEN ItemNumber = 2 THEN Item ELSE '1900' END),
            @SourceSystem = MAX(CASE WHEN ItemNumber = 3 THEN Item ELSE '' END)
       FROM #SplitData
      WHERE RowNum = 1
    ;
    --====================================================================================
    --      Using the column headers that we find in row 2 from the split data,
    --      create the final result dynamic SQL and then execute it.
    --      Basically, this just creates a nice Cross Tab using all of the column names.
    --====================================================================================
     SELECT @ResultSQL = ' SELECT ' + CHAR(10)
    ;
     SELECT @ResultSQL = @ResultSQL
          + SPACE(8)
          + split.Item
          + ' = MAX(CASE WHEN ItemNumber = '
          + CAST(ItemNumber AS VARCHAR(10))
          + ' THEN Item ELSE '''' END)'
          + ',' + CHAR(10)
       FROM #SplitData split
      WHERE RowNum = 2
        AND ItemNumber >= 2
      ORDER BY ItemNumber
    ;
     SELECT @ResultSQL = @ResultSQL
          + ' SourceSystem = ' + QUOTENAME(@SourceSystem,'''') + ',
        ExtractDate = ' + QUOTENAME(@ExtractDate,'''') + '
       FROM #SplitData
      WHERE RowNum > 2
      GROUP BY RowNum
     HAVING MAX(CASE WHEN ItemNumber = 1 THEN Item ELSE '''' END) = ''D''
    ;
    '
    ;
    --===== All set. Produce the final result by executing the SQL we just made.
       EXEC (@ResultSQL)
    ;
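
    For reference, here is roughly what the generated @ResultSQL ends up containing for the sample file above. This is reconstructed by hand, so treat it as illustrative only; the exact whitespace, and the format of the ExtractDate literal, depend on the string concatenation and on the implicit DATETIME-to-string conversion done by QUOTENAME:

     SELECT
            CostCentreID = MAX(CASE WHEN ItemNumber = 2 THEN Item ELSE '' END),
            CostCentreName = MAX(CASE WHEN ItemNumber = 3 THEN Item ELSE '' END),
            CostCentreDescription = MAX(CASE WHEN ItemNumber = 4 THEN Item ELSE '' END),
     SourceSystem = 'SAPECC6',
        ExtractDate = 'May 31 2012 12:00AM'
       FROM #SplitData
      WHERE RowNum > 2
      GROUP BY RowNum
     HAVING MAX(CASE WHEN ItemNumber = 1 THEN Item ELSE '' END) = 'D'
    ;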

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • For the record, what the OP wants to do can be done without the use of any custom Script Tasks or Script Components, i.e. using only in-the-box SSIS objects.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/2/2012)


    For the record, what the OP wants to do can be done without the use of any custom Script Tasks or Script Components, i.e. using only in-the-box SSIS objects.

    Outstanding. Any chance of seeing that particular setup?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (6/2/2012)


    opc.three (6/2/2012)


    For the record, what the OP wants to do can be done without the use of any custom Script Tasks or Script Components, i.e. using only in-the-box SSIS objects.

    Outstanding. Any chance of seeing that particular setup?

    I won't produce the package, but the pseudocode will go something like this:

    Control Flow
    -- Data Flow Task 1
    -- -- Flat File Source reads the input file, putting the entire line into a single column
    -- -- Conditional Split Transformation separates the Header, Data and Trailer lines, piping Data lines to a Flat File Destination (intermediate file) and capturing the relevant Header and Trailer values in variables. Some say a RAW file is the way to go, but I am not well versed in using them; they are still on my list to experiment with. Flat Files as intermediate files have worked fine for me thus far to pass data between Data Flows in a scenario like this.
    -- -- Row Count Transformation captures the number of rows sent to the Data Line Output of the Conditional Split Transformation and stores the value in a variable
    -- Data Flow Task 2 (sequential, after Data Flow Task 1 succeeds AND the Row Count variable matches the Trailer Record count variable, validating that the number of rows matches what the Trailer said)
    -- -- Flat File Source reads the intermediate file and parses it as pipe-delimited into columns
    -- -- Derived Column Transformation adds columns to each row with the relevant values from the Header and Trailer rows previously stored in variables
    -- -- OLE DB Destination stores the parsed data to the database table

    I hope that addresses all the requirements. A package like this takes about 30 minutes to develop for a simple flat file and I only bother the database engine when I'm ready to store data. I can also process rows wider than 8000 bytes with ease and do not have to worry about parallelism messing with the order of the rows being imported into the table as they are ordered in the file. I can also do anything else to this data prior to involving the database engine including running it through a third-party tool to standardize it (think address and phone number) or enrich it per any other external data or service (think xml file, remote DB instance or web service).

    Can we do all that in T-SQL? Sure, albeit using features I would rather not enable or use from the database engine, as you know my stance on most of those.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/2/2012)


    I won't produce the package...

    Um... ok. Thanks for the pseudo code but I'm curious as to why not? The T-SQL to do the job has been posted.

    I hope that addresses all the requirements. A package like this takes about 30 minutes to develop for a simple flat file and I only bother the database engine when I'm ready to store data. I can also process rows wider than 8000 bytes with ease and do not have to worry about parallelism messing with the order of the rows being imported into the table as they are ordered in the file. I can also do anything else to this data prior to involving the database engine including running it through a third-party tool to standardize it (think address and phone number) or enrich it per any other external data or service (think xml file, remote DB instance or web service).

    Will the split end up with an extra column at the end because of the trailing delimiter in the data?

    Also, how are you distinguishing between row 2 and the other rows that begin with "D"?

    I suppose it's possible but I've never seen BULK INSERT "messing with the order of the rows" due to parallelism. Perhaps it's because I normally tell it to load the whole file as a single batch.

    Can we do all that in T-SQL? Sure, albeit using features I would rather not enable or use from the database engine, as you know my stance on most of those.

    Since the only out-of-the-ordinary feature used in the T-SQL was BULK INSERT, and then only because whatever proc uses it has to have BULKADMIN privs, are you saying that you'd rather not use BULK INSERT?

    Anyway, thanks for the pseudocode. I'll give that a whirl. A side benefit of the SSIS job might be that its row splitter runs faster than a splitter function followed by a Cross Tab (and it should, if it's using the guts of BCP).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Wow.... you made it sound so easy with your pseudocode that I finally decided to give the SSIS tutorial a try. I sure hope that SSIS is better than that bloody tutorial because, right now, I'm pretty well put out. I'm doing the 2005 version because that's what I have at work. The first step is to open a file they give you. Now, when I install SQL Server, I tell it to install EVERYTHING, including AdventureWorks and all sample files. The sample file they tell me to look for isn't anywhere on my hard disks.

    I also see why you only posted pseudo code. There's a whole lot of steps to doing what you did.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I hope that addresses all the requirements. A package like this takes about 30 minutes to develop for a simple flat file and I only bother the database engine when I'm ready to store data. I can also process rows wider than 8000 bytes with ease and do not have to worry about parallelism messing with the order of the rows being imported into the table as they are ordered in the file. I can also do anything else to this data prior to involving the database engine including running it through a third-party tool to standardize it (think address and phone number) or enrich it per any other external data or service (think xml file, remote DB instance or web service).

    Will the split end up with an extra column at the end because of the trailing delimiter in the data?

    I went to produce the package I outlined using the sample file and realized I missed something very important in the original post... the meaning behind the "unstructured file" comment in the OP. There are missing delimiters, and SSIS does not take kindly to flat files that are not actually flat. So we must standardize the file before the SSIS Flat File Connector will parse it. My pseudocode has changed to the following:

    Control Flow
    -- Data Flow Task 1
    -- -- Flat File Source reads the input file, putting the entire line into a single column
    -- -- Conditional Split Transformation separates the Header, Data and Trailer lines, piping Data lines to a Flat File Destination (intermediate file #1) and capturing the relevant Header and Trailer values in variables (using a Script Component; not sure if this can be done with a native component, but it's a simple bit of code in SSIS 2005 and is reduced to 1 line in SSIS 2008)
    -- -- Row Count Transformation captures the number of rows sent to the Data Line Output of the Conditional Split Transformation and stores the value in a variable
    -- Data Flow Task 2 (sequential, after Data Flow Task 1 succeeds AND the Row Count variable matches the Trailer Record count variable, validating that the number of rows matches what the Trailer said)
    -- -- Flat File Source reads intermediate file #1, putting the entire line into a single column
    -- -- Derived Column Transformation appends delimiters as needed to standardize the file
    -- -- Flat File Destination writes intermediate file #2
    -- Execute SQL Task clears the staging table (optional depending on how downstream processes function; they may clear the table for us, added for ease of unit testing)
    -- Data Flow Task 3
    -- -- Flat File Source reads intermediate file #2 and parses it as pipe-delimited into columns
    -- -- Derived Column Transformation adds columns to each row with the relevant values from the Header and Trailer rows previously stored in variables
    -- -- OLE DB Destination stores the parsed data to the database table

    Jeff Moden (6/2/2012)


    Wow.... you made it sound so easy with your pseudocode that I finally decided to give the SSIS tutorial a try. I sure hope that SSIS is better than that bloody tutorial because, right now, I'm pretty well put out. I'm doing the 2005 version because that's what I have at work. The first step is to open a file they give you. Now, when I install SQL Server, I tell it to install EVERYTHING, including AdventureWorks and all sample files. The sample file they tell me to look for isn't anywhere on my hard disks.

    Sometimes it can seem that way. We may breeze through concepts familiar to us; however, when they are new to someone, they can take hours to get working and days or longer to fully internalize. At least that's how it worked for me when I first read your Tally Table article 🙂

    I also see why you only posted pseudo code. There's a whole lot of steps to doing what you did.

    Nah, I didn't post code along with it because I was on my tablet yesterday and wanted to provide something to continue our conversation.

    It is your lucky day rka 😉

    Jeff, I have attached a sample package that does what I outlined in the pseudocode above. It is done in SSIS 2005, pointing to a SQL 2005 database. The test file and a SQL file to create the staging table are included. Set the value of the FlatFileDirectory variable to point to the location on your machine where the Data Files will reside. I tucked them into the TestFiles directory under the project folder for easy bundling.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/3/2012)


    It is your lucky day rka 😉

    Jeff, I have attached a sample package that does what I outlined in the pseudocode above. It is done in SSIS 2005, pointing to a SQL 2005 database. The test file and a SQL file to create the staging table are included. Set the value of the FlatFileDirectory variable to point to the location on your machine where the Data Files will reside. I tucked them into the TestFiles directory under the project folder for easy bundling.

    Very cool. Thanks, Orlando. I'll have a look.

    BTW... didn't need to do anything additional in the T-SQL for handling the ragged right nature of the file. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (6/3/2012)


    opc.three (6/3/2012)


    It is your lucky day rka 😉

    Jeff, I have attached a sample package that does what I outlined in the pseudocode above. It is done in SSIS 2005, pointing to a SQL 2005 database. The test file and a SQL file to create the staging table are included. Set the value of the FlatFileDirectory variable to point to the location on your machine where the Data Files will reside. I tucked them into the TestFiles directory under the project folder for easy bundling.

    Very cool. Thanks, Orlando. I'll have a look.

    You're very welcome. Please do, and let me know if you need clarification on anything. There are many ways to improve the performance of the package by collapsing some operations. Note that intermediate files are not necessary at all, but they do make development easier. If trying to reap all of the performance advantages, no intermediate files would be used.

    BTW... didn't need to do anything additional in the T-SQL for handling the ragged right nature of the file. 😉

    You ought to put that one in the brochure, in the section that talks about file formats that are not ragged right, but under the cons. There is no negative feedback when a column that was supplied yesterday is not supplied today. I am not so sure that is a good thing in all circumstances :ermm:

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (6/3/2012)


    BTW... didn't need to do anything additional in the T-SQL for handling the ragged right nature of the file. 😉

    You ought to put that one in the brochure in the section that talks about file formats that are not ragged right, but under cons. No negative feedback when a column that was supplied yesterday is not supplied today. I am not so sure that is a good thing in all circumstances :ermm:

    Even SSIS has to load it to make that determination, Orlando. 😉 Since it was an expected condition, I could ignore the supposed fault. If it were not an expected condition, I'd do what any good programmer would do... test for expected conditions. Yep... a little extra code for something that SSIS already tests for. Will SSIS allow you to NOT test for it if it IS an expected condition? I believe the answer is "yes", but you have to select "Ragged Right".
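
    If someone did want that test in the T-SQL version, a minimal sketch against the #SplitData table from the earlier posts might look like the one below. It assumes a fully populated Detail row splits into exactly 4 tokens, since the trailing pipe is consumed by the format file's row terminator:

    --===== Flag any "ragged" Detail data rows that are missing columns.
    --      A sketch only: assumes a complete Detail row yields exactly 4 tokens.
     SELECT RowNum, ColumnsFound = MAX(ItemNumber)
       FROM #SplitData
      WHERE RowNum > 2
      GROUP BY RowNum
     HAVING MAX(CASE WHEN ItemNumber = 1 THEN Item ELSE '' END) = 'D'
        AND MAX(ItemNumber) <> 4;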

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
