Examine Data Values to Determine Data Type of Column

  • Hi,

    Say I have a table that just got created from a flat file bulk load. All the columns in the table are currently set to varchar(max), which is a huge waste of space. I need to programmatically determine what the data type and length should correctly be set to, so that we use our space efficiently. In other words, some of the columns within the table should be set to int, or smalldatetime, or varchar(5), or whatever.

    So I need some t-sql which will find the max length of each column and also the column's data type based on the data itself, NOT on how the data type is currently set. Then I'd alter each column's settings to make it so.

    There must be something already written to do this, so thought I'd check.

    Thanks,

    Paul

  • Hi Paul

    Sorry for saying this, but it sounds like a coding-for-fun project. :crazy:

    This might be possible with hundreds or thousands of lines of SQL using things like ISNUMERIC() and ISDATE(), but it will be very slow due to all the conversions and ALTER TABLE statements you need to create and execute.

    The main question is: why don't you know the format of your data? Usually you create a table with the correct data types and use a format file to load the data into the correct format.
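
    For what it's worth, ISNUMERIC() is looser than most people expect, which is part of why this approach gets messy. A few values it happily accepts, even though they would fail a CAST to int:

    ```sql
    -- All of these return 1, despite not being usable as integers:
    SELECT ISNUMERIC('$')   AS DollarSign,   -- currency symbol alone
           ISNUMERIC('.')   AS LonePeriod,   -- decimal point alone
           ISNUMERIC(',')   AS LoneComma,    -- thousands separator alone
           ISNUMERIC('1e4') AS SciNotation;  -- scientific notation (float)
    ```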

    Greets

    Flo

  • All the columns in the table are currently set for varchar(max) which is a huge waste of space.

    I'm not sure but is this true? I thought the "var" implies that the space reserved for those columns would be based on the data entered into those columns (or the max length of all data elements).

  • rjv_rnjn (9/1/2009)


    All the columns in the table are currently set for varchar(max) which is a huge waste of space.

    I'm not sure but is this true? I thought the "var" implies that the space reserved for those columns would be based on the data entered into those columns (or the max length of all data elements).

    A 10-digit Int takes less space than a 10-character varchar. Same for datetime data.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (9/1/2009)


    rjv_rnjn (9/1/2009)


    All the columns in the table are currently set for varchar(max) which is a huge waste of space.

    I'm not sure but is this true? I thought the "var" implies that the space reserved for those columns would be based on the data entered into those columns (or the max length of all data elements).

    A 10-digit Int takes less space than a 10-character varchar. Same for datetime data.

    Ah! So foolish of me to ask that (what was I thinking). Thanks.

  • The first part of the solution would be to check for the existence in each column of rows that do not pass IsDate and IsNumeric.

    You could probably leave all of those as varchar. If you want to narrow them down from varchar(max), which you probably don't really need to do, but it might be tidier, you could just use Max(Len(ColumnName)) to get the numbers.

    The trick would be columns that pass both IsDate and IsNumeric. If they just pass IsNumeric, and not IsDate, it would be easy enough to check for the existence of decimal places (charindex for a period), and to check max values and max decimal places for each column. From that, you could have it decide on those.

    On the ones that pass both IsDate and IsNumeric, for all rows in the whole column, you might have a problem. It's no big deal if all your dates are formatted. For example, IsDate = 1 for 1/1/1900, but IsNumeric = 0 for the same value. But just 1900 all by itself could be a year, and is definitely a number, so it passes both checks.

    Once you've queried that, it would be easy enough to generate a dynamic table creation script.
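
    A rough sketch of those checks for a single column might look like the following. The table and column names (dbo.StagingTable, SomeColumn) are placeholders, not anything from the original post:

    ```sql
    -- Profile one column: max length, and whether every non-null row
    -- passes ISNUMERIC / ISDATE, plus a decimal-point check.
    SELECT
        MAX(LEN(SomeColumn)) AS MaxLen,
        SUM(CASE WHEN ISNUMERIC(SomeColumn) = 0 THEN 1 ELSE 0 END) AS NonNumericRows,
        SUM(CASE WHEN ISDATE(SomeColumn)    = 0 THEN 1 ELSE 0 END) AS NonDateRows,
        SUM(CASE WHEN CHARINDEX('.', SomeColumn) > 0 THEN 1 ELSE 0 END) AS RowsWithDecimal
    FROM dbo.StagingTable
    WHERE SomeColumn IS NOT NULL;
    -- NonNumericRows = 0 and RowsWithDecimal = 0 suggests an integer type;
    -- NonDateRows = 0 suggests a date/time type;
    -- otherwise fall back to varchar(MaxLen).
    ```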

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    Seems like it would be a heck of a lot easier to just know your source data, and where it is ultimately going to be stored. There is a problem with programmatically saying "The maximum length here is 25, so make the column varchar(25)." Later on, it may need to hold a 28-character string. Similarly, integer data may ultimately need to be numeric, smallint may need to be int, etc.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

    Well, let's just say that we have sort of inherited some very interesting processes. Quite often we receive flat files of data, possibly exported from Excel or God knows where. We then need to figure out what the data type and length of each column should be, based on the data. There is no dictionary that tells us what to expect for data. We simply need to figure it out and place it in a db accordingly.

    I actually was already doing the max(len(column)) thing, but that didn't help me account for the columns which were other than varchar. I'm really surprised this doesn't exist out there already. Seems like a rather common thing to do.

    I like the ISDATE and ISNUMERIC options presented above; that would get us in the ballpark, though still not to the point of knowing for certain whether it was datetime or smalldatetime, or int vs. bigint. Sounds like more testing of the data with lots of code to best determine what the data type should be.

    Actually, it does sound like sort of a fun little SQL project; wish I had the time to fool with it!

    Maybe someone will chime in who has something already built. If not, maybe I'll attempt to put something together.
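
    If it helps anyone get started, here's a rough dynamic-SQL sketch along those lines. It builds one profiling SELECT per varchar column of a given table; the table name 'StagingTable' is a placeholder:

    ```sql
    -- Generate and run a profiling query for each varchar column:
    -- max length, and whether every non-null value is numeric / a date.
    DECLARE @sql nvarchar(max) = N'';

    SELECT @sql = @sql +
        N'SELECT ''' + c.COLUMN_NAME + N''' AS ColumnName, ' +
        N'MAX(LEN(' + QUOTENAME(c.COLUMN_NAME) + N')) AS MaxLen, ' +
        N'MIN(ISNUMERIC(' + QUOTENAME(c.COLUMN_NAME) + N')) AS AllNumeric, ' +
        N'MIN(ISDATE('    + QUOTENAME(c.COLUMN_NAME) + N')) AS AllDates ' +
        N'FROM ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME) +
        N' WHERE ' + QUOTENAME(c.COLUMN_NAME) + N' IS NOT NULL; '
    FROM INFORMATION_SCHEMA.COLUMNS c
    WHERE c.TABLE_NAME = 'StagingTable'
      AND c.DATA_TYPE = 'varchar';

    EXEC sp_executesql @sql;
    ```

    From those results you could decide per column whether to try int, a date type, or just a narrower varchar, keeping in mind the ISNUMERIC quirks mentioned earlier in the thread.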

    Thanks,

    Paul

    Well, let's just say that we have sort of inherited some very interesting processes. Quite often we receive flat files of data, possibly exported from Excel or God knows where. We then need to figure out what the data type and length of each column should be, based on the data. There is no dictionary that tells us what to expect for data. We simply need to figure it out and place it in a db accordingly.

    For as long as I've been doing this, I've had files passed to me with a request that they be loaded. I've generally refused without an understanding of the content and data types of the columns; this is such a bad practice I can't even begin to cover my issues with it.

    Unfortunately, you've likely set a precedent with your users on this...

    Are your users creating this data or is it from outside enterprises?

    CEWII

    I have the same issue, but mine doesn't stem from a flat file; it comes from an XML document which doesn't have any data typing... (even if it did, it would be very basic). But when we import the XML, we want to take the data and put it into the correct column type in an auto-generated table.

    I am also wondering why there is no way to detect at least the basic column type (varchar, numeric, int, bit). I guess you could do a CASE statement to figure it out, then add some ranking to the candidate types.
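
    Something like this, maybe? A sketch of a CASE-based ranking for a single value, with the checks ordered from narrowest to widest type. The sample values are just for illustration, and the VALUES row constructor needs SQL Server 2008 or later:

    ```sql
    -- Rank each value into the narrowest candidate type that fits it.
    SELECT val,
           CASE
               WHEN val IN ('0', '1')             THEN 'bit'
               WHEN val NOT LIKE '%[^0-9-]%'
                    AND ISNUMERIC(val) = 1        THEN 'int'      -- digits only
               WHEN ISNUMERIC(val) = 1            THEN 'numeric'  -- has decimals etc.
               WHEN ISDATE(val) = 1               THEN 'datetime'
               ELSE 'varchar'
           END AS CandidateType
    FROM (VALUES ('1'), ('42'), ('3.14'), ('2009-09-01'), ('hello')) v(val);
    ```

    Across a whole column you'd then take the widest type any row needed.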


    Over 12yrs in IT and 10yrs happily stuck with SQL.
    - SQL 2008/R2/2012/2014/2016/2017
    - Oracle 8/9/10/11
    - MySQL 4/5 and MariaDB

    Our data comes from many sources; the best answer I can give is "God knows where". I've got to find the time to build something which will help me sort this out.

  • Hi Paul

    Maybe you don't know where your data comes from, but I'm quite sure that you know its required destination.

    I think a completely generic approach makes it impossible to handle some failures within your source data. What about a hybrid solution?

    Some configuration tables for your different data sources: one table which describes the main information for a data source, like format (XML, CSV, XLS, ...), and another table which describes the names and types of the columns within the file. This is still a generic approach, but you are able to handle wrongly formatted data at the column level, and it seems much more maintainable.
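
    As a sketch, the two configuration tables might look something like this (all names are invented for illustration):

    ```sql
    -- One row per incoming data source, with its file format.
    CREATE TABLE dbo.DataSource (
        DataSourceId  int IDENTITY PRIMARY KEY,
        SourceName    varchar(100) NOT NULL,
        SourceFormat  varchar(10)  NOT NULL   -- 'XML', 'CSV', 'XLS', ...
    );

    -- One row per expected column of each source, with its target type.
    CREATE TABLE dbo.DataSourceColumn (
        DataSourceId  int NOT NULL REFERENCES dbo.DataSource,
        ColumnName    sysname NOT NULL,
        TargetType    varchar(50) NOT NULL,   -- e.g. 'int', 'varchar(25)'
        PRIMARY KEY (DataSourceId, ColumnName)
    );
    ```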

    Greets

    Flo

