Excel Source - Overide External Columns Datatype

  • HI All,

    I'm having a problem with an ETL project I'm working on. The file I receive has multiple sheets, which are all in the same format. I then need to import them into one table on the SQL server.

    The problem I'm facing is that the company we receive the file from have a habit of filling "N/A", "Unknown", and other strings in numeric fields. Also some of the sheets won't have a column populated one month, as it's not applicable. This means that, according to the Excel Source, the data type of the column has changed (if they're all NULLs, then Excel suddenly thinks it's a int, not a string, for example).

    What I want to therefore do, is use the excel source and treat EVERY column as a Unicode string, and then have a data conversion tasks that makes them the correct data type. Any conversions that fails will be NULL (as I don't want someone's age to be "TBC", and I actually have our Age column as a integer field on the server). When I change the DataType in the External Columns menu though, it changes it back when you open it back up. How do I change the datatypes, and actually get it to save it?

    I could change all the output columns to Unicode Strings, but if I recall correctly, I had conversion errors when I tried that (I could be wrong, so correct me. This is a project that got put away for a few months and has worked it way back up to the top of the daisy chain).

    Many thanks for your help!

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Try updating the connection string to include "IMEX=1" - this should tell the OLEDB provider to treat all cells as text. FOr more info - see https://www.connectionstrings.com/excel/

  • happycat59 (10/7/2015)


    Try updating the connection string to include "IMEX=1" - this should tell the OLEDB provider to treat all cells as text. FOr more info - see https://www.connectionstrings.com/excel/

    Thanks Happy.

    I had already done this, but it was still trying to parse some decimals as Currency, and others as Dates (which was inconsistent depending on which sheet I was getting the data from). I ended up using a Script Component as a source, and creating the connection in there to pull all the data through as a String, and then using a Data Conversion Transformation to get them to the right type.

    Excel was a bit janky, and sometimes sent dates as ints, and ints as dates, but some tryparse's and dateadd's/maths fixed that. A struggle, but got the result I wanted.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • I understand your issues - Excel is probably my least preferred data source. Unfortunately, Excel does not really care too much about data types - underneath the wrappers, I suspect that each cell is just a bunch of bytes and the display format is used as the type that people think is stored in the cell.

    The oledb provider needs to work with that sort of type-less data and struggles big time. What it actually does is parse the first few rows (by default 8 rows but this is configurable using a registry setting) to try to work out what data type to use. This is why you get different data types for a column depending on what is in the first few rows. Not very useful when SSIS really needs the data type to be the same each time otherwise it gets upset.

  • I had the same issue a few weeks ago, the reason is simple,the excel source determines datatypes based on the values in the first 8 rows. IMEX on its own changes nothing about that.

    IMEX=1 will only treat the column as text if you have a text value within those 8 rows.

    IMEX=0 will take the datatype that seems to be prevalent the most in those 8 rows.

    You can change the amount of rows it samples in the registry of the OS but I don't think you ever want to mess with the registry.

    I'll post a script during my lunch break that can convert a excell tab into an csv file.

  • Ok I've used the following script to convert excel tabs/sheets into csv files.

    Use a for each file loop to loop through the excel files,you can use an ado.net for loop to loop through the tabs/sheets.

    The script takes the following variables

    Read Only

    User::CSV_Loc = location where you want the csv file

    User::DataFile_Input_Path = location of where the excel file sits

    User::Input_Share_Root = root folder of the location

    User::SourceFile = name of the excel file

    Read Write

    User::csv_FileName = name of the csv file

    written in Microsoft Visual Basic 2008

    Imports System

    Imports System.Data

    Imports System.Math

    Imports Microsoft.SqlServer.Dts.Runtime

    Imports System.Data.OleDb

    Imports System.IO

    .

    .

    .

    Public Sub Main()

    Dim sourceFile As String

    Dim worksheetName As String

    Dim targetFile As String

    Dim fileName As String

    fileName = Replace(Replace(Dts.Variables("SourceFile").Value.ToString(), "filename", ""), ".xlsx", "")

    sourceFile = Dts.Variables("Input_Share_Root").Value.ToString() + Dts.Variables("DataFile_Input_Path").Value.ToString() + Dts.Variables("SourceFile").Value.ToString()

    targetFile = Dts.Variables("CSV_Loc").Value.ToString() + fileName + ".csv"

    worksheetName = "sheetname"

    convertExcelToCSV(sourceFile, worksheetName, targetFile)

    Dts.Variables("csv_FileName").Value = "fileName + ".csv"

    Dts.TaskResult = ScriptResults.Success

    End Sub

    Private Shared Sub convertExcelToCSV(ByVal sourceFile As String, ByVal worksheetName As String, ByVal targetFile As String)

    Dim strConn As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" & sourceFile & "';Extended Properties=""Excel 12.0 Xml;HDR=YES;"""

    Dim conn As OleDbConnection = Nothing

    Dim wrtr As StreamWriter = Nothing

    Dim cmd As OleDbCommand = Nothing

    Dim da As OleDbDataAdapter = Nothing

    Try

    conn = New OleDbConnection(strConn)

    conn.Open()

    cmd = New OleDbCommand("SELECT * FROM [" & worksheetName & "$]", conn)

    cmd.CommandType = CommandType.Text

    wrtr = New StreamWriter(targetFile)

    da = New OleDbDataAdapter(cmd)

    Dim dt As DataTable = New DataTable()

    da.Fill(dt)

    Dim rowString As String = ""

    For z As Integer = 0 To dt.Columns.Count - 1

    rowString &= "----;"

    Next z

    wrtr.WriteLine(rowString)

    For x As Integer = 0 To dt.Rows.Count - 1

    rowString = ""

    For y As Integer = 0 To dt.Columns.Count - 1

    rowString &= "" & dt.Rows(x)(y).ToString() & ";"

    Next y

    wrtr.WriteLine(rowString)

    Next x

    ' MessageBox.Show("Done! Your " & sourceFile & " has been converted into " & targetFile & "")

    '' wrtr.Close()

    conn.Close()

    Catch exc As Exception

    ' MessageBox.Show("- An unexpected error occure" + vbNewLine + "- Please contact application administrator")

    End Try

    wrtr.Close()

    wrtr.Dispose()

    conn.Close()

    conn.Dispose()

    End Sub

    End Class

    Quickly following script in Visual C# 2008

    using System;

    using System.Data;

    using Microsoft.SqlServer.Dts.Runtime;

    using System.Windows.Forms;

    using System.Text.RegularExpressions;

    using System.IO;

    .

    .

    .

    public void Main()

    {

    string sheetName = Dts.Variables["User::SheetName"].Value.ToString();

    string pattern = Dts.Variables["User::PatternToMatch"].Value.ToString();

    Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);

    Match match = Regex.Match(sheetName, pattern);

    //Dts.Variables["User::ProcessTheSheet"].Value = match.Success;

    bool a_1 = Regex.IsMatch(sheetName, @"[0-9]");

    if (match.Success)

    {

    if (a_1 == true)

    {

    Dts.Variables["User::ProcessTheSheet"].Value = true;

    Dts.Variables["User::SheetName"].Value = sheetName.Replace("'", "");

    // tw.WriteLine("Y:" + Dts.Variables["User::SheetName"].Value);

    }

    else

    {

    Dts.Variables["User::ProcessTheSheet"].Value = false;

    // tw.WriteLine("N:" + Dts.Variables["User::SheetName"].Value);

    }

    }

    else

    {

    Dts.Variables["User::ProcessTheSheet"].Value = false;

    // tw.WriteLine("N:" + Dts.Variables["User::SheetName"].Value);

    }

    // tw.Close();

    Dts.TaskResult = (int)ScriptResults.Success;

    }

    This script will give true false when the tab/sheetname matches a pattern,you can use that to filter out the tabs you want.

  • Thanks Resender.

    I've included the script that one of the guys passed to be as well. I have a feeling he found it only, but there's no annotations to say where from (he would of left them in, so sorry to whoever created it that you don't have your kudos):

    #region Namespaces

    using System;

    using System.Data;

    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    using Microsoft.SqlServer.Dts.Runtime.Wrapper;

    using System.IO;

    using System.Data.SqlClient;

    using System.Data.OleDb;

    using System.Diagnostics;

    #endregion

    [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]

    public class ScriptMain : UserComponent

    {

    private OleDbDataReader excelReader;

    private OleDbConnection excelConnection;

    private OleDbCommand excelCommand;

    public override void PreExecute()

    {

    base.PreExecute();

    // Open connection

    GetDataFromExcelToReader(Variables.FilePath);

    }

    public override void PostExecute()

    {

    base.PostExecute();

    excelReader.Close();

    excelConnection.Close();

    }

    private void GetDataFromExcelToReader(string p_strFileName)

    {

    string l_strConnectionString;

    if (File.Exists(p_strFileName))

    {

    string extension = Path.GetExtension(p_strFileName);

    extension = extension.Replace(".", "");

    if (extension.ToLower() == "xlsx")

    {

    l_strConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" +

    "Data Source=" + p_strFileName + ";Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\";";

    }

    else

    {

    l_strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" +

    "Data Source=" + p_strFileName + ";Extended Properties=\"Excel 8.0;HDR=NO;IMEX=1\";";

    }

    excelConnection = new OleDbConnection(l_strConnectionString);

    excelConnection.Open();

    excelCommand = excelConnection.CreateCommand();

    excelCommand.CommandText = "SELECT * FROM [" + "RTT$A8:BO" /*Variables.SheetName.ToString()*/ + "]";

    excelCommand.CommandType = CommandType.Text;

    excelReader = excelCommand.ExecuteReader();

    }

    }

    public override void CreateNewOutputRows()

    {

    int intInputDate;

    DateTime dtOutputDate;

    int IntOutputDate;

    DateTime dtInputDate;

    int intLineNumber = 7;

    while (excelReader.Read())

    {

    intLineNumber++;

    ComponentMetaData.FireInformation(0, "RTT - FD", "Reading Line " + intLineNumber.ToString(), string.Empty, 0, true);

    RTTBuffer.AddRow();

    if (excelReader.FieldCount > 0)

    {

    if (DateTime.TryParse(excelReader[0].ToString(), out dtInputDate))

    {

    dtInputDate = DateTime.Parse(excelReader[0].ToString());

    IntOutputDate = int.Parse(((dtInputDate - DateTime.Parse("01/01/1900")).TotalDays).ToString());

    RTTBuffer.YOA = IntOutputDate.ToString();

    }

    else

    {

    RTTBuffer.YOA = excelReader.FieldCount > 0 ? excelReader[0].ToString() : "";

    }

    }

    RTTBuffer.ClaimNo = excelReader.FieldCount > 1 ? excelReader[1].ToString() : "";

    RTTBuffer.ClaimNumber = excelReader.FieldCount > 2 ? excelReader[2].ToString() : "";

    RTTBuffer.Ref = excelReader.FieldCount > 3 ? excelReader[3].ToString() : "";

    RTTBuffer.PolicyNumber = excelReader.FieldCount > 4 ? excelReader[4].ToString() : "";

    if (excelReader.FieldCount > 5)

    {

    if (int.TryParse(excelReader[5].ToString(), out intInputDate))

    {

    intInputDate = int.Parse(excelReader[5].ToString());

    dtOutputDate = DateTime.Parse("01/01/1900");

    dtOutputDate = dtOutputDate.AddDays(intInputDate);

    RTTBuffer.PolicyInceptionDate = dtOutputDate.ToString();

    }

    else

    {

    RTTBuffer.PolicyInceptionDate = excelReader.FieldCount > 5 ? excelReader[5].ToString() : "";

    }

    }

    }

    }

    }

    There's a part commented out in the GetDataFromExcelToReader, so that if you'd like to have the sheet name in a variable, rather than hard coded, you can simply replace that part. The if statement is in there, as the Reader had problems with some integers and Dates (it sometimes processed dates as ints, and ints as dates). Therefore it has a tryparse to check if it comes back as I expect it. The first if statement if hopeing for an Integer, and the second a DateTime.

    If you have questions, feel free to ask, and i'll try and answer.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply