Flat File Import Problems

  • Hi Guys,

    I have a flat file that I want to import. The problem is that the detination table has 32 columns, but the flat file sometimes has 32, and sometimes has 29.. within the same file. Previously in SQL 2000 DTS the data pump would handle this and pad the additional columns with NULL values if they werent preset. In SSIS the data flow task keeps looking for the next delimiter (which is pipe in this case), even if it finds the end of row delimiter first.

    To clarify, if the flat file row has 29 columns, and ssis is expecting 32 rows, so it looks for the next delimiter which occurs on the NEXT ROW and includes the next row information in column 30, then it gets the next colum and puts it into col 31, and finally, it gets the rest of the row (pipes and all) and puts that into col 32.

    a 4 column example. The flat file looks like this:

    Col1a|Col2a{LF}

    Col1b|Col2b|Col3b|Col4b{LF}

    Col1c|Col2c|Col3c|Col4c{LF}

    And when imported to a 4 column table, we get this:

    Col1     Col2     Col3           Col4

    Col1a   Col2a   {LF}Col1b  Col2b|Col3b|Col4b{LF}

    Col1c   Col2c   Col3c        Col4c{LF}

    But what we want is

    Col1     Col2     Col3           Col4

    Col1a   Col2a    NULL        NULL

    Col1b   Col2b    Col3b       Col4b

    Col1c   Col2c    Col3c       Col4c

    Is there a task that will handle these exceptions? Or am I going to have to write some VB?

    Obviously we have pushed back to try and get the additional delimiters added to rows with less than 32 Cols in the text file, but this is not possible.

    I guess the options are to cleanse the input file before it gets into data flow, or correct it on the fly using transformation scripts...

     

    Any suggestions on what I should do here?

  • Ok, I thought since no one seems to know about what to do with these file imports I thought I should post my findings.

    There were a couple of options that I could have gone for. The simple one would be to use the "right ragged" file format. The problem with this is that it only handles 1 additional column past the number of defined cols. We can then take this "extra" column and split it using a contditional split. Our conditional split would have to be able to handle an abritary number of extra columns, which is where this method comes unstuck.

    Instead I opted for a script at the control flow level. The script parses the input file and pads the rows that are short of columns with null columns, and the rows that have too many columns it truncates. (It also logs these to an external file for checking). It knows how many columns there should be by counting the number of columns in the SQL destination table.

    Once it has cleansed the input file, it is then passed to the data flow section where it will always have the same number of columns.

    Script Below:

    Option Strict Off

    Imports System

    Imports System.Data

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Runtime

    Imports System.IO

    Public Class ScriptMain

    'L.Mason 2007

    'Working Directory

        Dim Dir As String = "\\myUNCinputDirectory"

        'Name of Input file

        Dim InputFile As String = "filetoparse.txt"

        'Name of repaired Input file

        Dim OutputFile As String = "workingtempfile.txt"

        'Name of SQL table the data is going in to

        Dim DestinationTable As String = "Destination Table"

        'Name of Error file for inputing errorneous lines from the input file. ErrorsYYYYMMDD.txt

        Dim ErrorFile As String = "Errors" & Year(Now) & Right("0" & Month(Now), 2) & Right("0" & Day(Now), 2) & ".txt"

        'Define delimiter String

        Dim strDelim As String = "|"

        Public Sub Main()

            'Declare and Set Database Objects

            Dim rstResults = CreateObject("ADODB.Recordset")

            Dim conn = CreateObject("ADODB.Connection")

            conn.Open("Driver={SQL Server};" & _

                               "Server={SQLServer};" & _

                               "Database={SQLDatabase};" & _

                               "UID=;" & _

                               "PWD=;")

            'Set up SQL Query to populate resultset with column data using sproc SP_Columns

            Dim strSQL As String = "exec sp_columns @table_name='" & DestinationTable & " '"

            'Declare file Variables and Objects

            Dim ch As Char() 'Set up array for storage of parsed tokens

            Dim arr() As String 'Set up array for storage of parsed tokens

            Dim ErrorOutput As String = Dir & ErrorFile

            Dim Input As String = Dir & InputFile

            Dim Output As String = Dir & OutputFile

            Dim fFile1 As New FileInfo(Input)

            Dim fFile2 As New FileInfo(Output)

            Dim fErrorFile As New FileInfo(ErrorOutput)

            Dim oWriteError As New System.IO.StreamWriter(ErrorOutput)

            Dim oWrite As New System.IO.StreamWriter(Output)

            Dim oFile As System.IO.File

            Dim oFile2 As System.IO.File

            Dim oRead As System.IO.StreamReader

            'Declare general variables

            Dim ColAmt As Integer 'Number of columns in destination table

            Dim i As Integer 'loop counter i

            Dim strErrorFileLine As String

            Dim strInputString As String

            Dim OutputLine As String

            Dim arrLength As Integer

            Dim intLineNumber As Integer = 0 'Line number counter

            Dim ErrorCounterPadded As Integer = 0

            Dim ErrorCounterTruncated As Integer = 0

            '****************'

            'Begin Processing'

            '****************'

            'populate results set with data from strSQL query

            rstResults.open(strSQL, conn)

            'this will get the number of columns in the desired destination table

            ColAmt = 0

            While Not rstResults.Eof

                ColAmt = ColAmt + 1

                rstResults.movenext()

            End While

            conn.close()

            conn = Nothing

            rstResults = Nothing

            'Load file into oRead.

            oRead = oFile.OpenText(Input)

            While oRead.Peek() > -1

                'increment line number counter (is there a .net type for this?)

                intLineNumber = intLineNumber + 1

                'get next line from file

                strInputString = oRead.ReadLine

                'Parse line according to delimiter

                arr = strInputString.Split(strDelim.ToCharArray())

                'Check if the current line has less columns than expected.

                If arr.Length < ColAmt Then

                    arrLength = arr.Length

                    'make the short array = to the expected length

                    ReDim Preserve arr(ColAmt - 1)

                    'Send a line to error file

                    strErrorFileLine = "Line " & intLineNumber & " has less Cols than expected:" & strDelim & strInputString

                    oWriteError.WriteLine(strErrorFileLine)

                    'increment the error counter variable

                    ErrorCounterPadded = ErrorCounterPadded + 1

                    'pad added cols with NULLs

                    For i = (arrLength) To (ColAmt - 1)

                        arr(i) = Nothing

                    Next

                End If

                'check for input line with MORE cols than expected

                If arr.Length > ColAmt Then

                    'write the line with too many cols to error file

                    strErrorFileLine = "Line " & intLineNumber & " has more Cols than expected:" & strDelim & strInputString

                    oWriteError.WriteLine(strErrorFileLine)

                    'increment the error counter variable

                    ErrorCounterTruncated = ErrorCounterTruncated + 1

                    'then remove the extra cols and insert as per normal

                    ReDim Preserve arr(ColAmt - 1)

                End If

                'build output line from array

                For i = 0 To ColAmt - 2

                    OutputLine = OutputLine & CStr(arr(i)) & strDelim

                Next

                OutputLine = OutputLine & CStr(arr(ColAmt - 1))

                'write line to output file

                oWrite.WriteLine(OutputLine)

                'clear line variable for next pass

                OutputLine = Nothing

            End While

            'write out summary lines to error file

            If ErrorCounterTruncated > 0 Then

                strErrorFileLine = "Total number of TRUNCATED input rows: " & CStr(ErrorCounterTruncated) & " These will need to be investiaged. The data will be in, but the extra columns will be truncated"

                oWriteError.WriteLine(strErrorFileLine)

            End If

            If ErrorCounterPadded > 0 Then

                strErrorFileLine = "Total number of PADDED input rows: " & CStr(ErrorCounterPadded) & " These should be OK, the missing cols were padded with NULLs"

                oWriteError.WriteLine(strErrorFileLine)

            End If

            'close file system read/write objects

            oRead.Close()

            oWrite.Close()

            oWriteError.Close()

            'rename working file to original filename        fFile1.Delete()

            fFile2.MoveTo(Input)

            'if there were no errors, we can delete the transerror file

            If ErrorCounterTruncated + ErrorCounterPadded = 0 Then

                fErrorFile.Delete()

            End If

            'done!

            Dts.TaskResult = Dts.Results.Success

        End Sub

    End Class

    An additional question I now have is: can the number of columns in the destination table be found in an easier way? At the moment I use ADODB object to pass a query to the SQL DB which executes the SP_Columns stored proc and stores the result in a resultset type. The result set is then looped until the EOF, each time incrementing a counter (lines 71 to 79 in the code). This counter gives me the number of columns as the number of rows returned from the SP_Columns sproc = the number of cols in the queried table. It seems like a very crude way of doing this, and I would have thought that there must be a better way of doing this. I tried using a SQL task at the control flow level, but found that I could get the column count out ok, but I couldnt pass the variable on to the script to make use of it.

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply