convert blob to table

  • Hello,

    I have stored a CSV file in a BLOB field.

    I would like to convert this BLOB field into a new table that has the same layout as the CSV file.

    So my CSV-file:

    name, place, age

    erik, amsterdam, 34

    janet, new-york,30

    paula, berlin, 29

    This is stored as binary data in a BLOB field in my database. Now I want to convert this field into a new table with 3 columns (name, place, age) and the three records (the rows of the CSV).

    Of course I could use OPENROWSET to bulk insert the CSV file directly into a table, but I am calling this from an ASP.NET application and then I need to use a linked server to be able to do that, because OPENROWSET is not enabled, I think...
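
    (For reference, roughly what I had in mind is something like this; the path and format file are just placeholders:)

    -- Rough sketch only: the path and format file are placeholders,
    -- and the call needs appropriate bulk-load permissions on the server.
    SELECT *
    INTO   dbo.ImportedCsv
    FROM   OPENROWSET(BULK 'C:\incoming\sample.csv',
                      FORMATFILE = 'C:\incoming\sample.fmt',
                      FIRSTROW = 2) AS rows;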

    Thanks in Advance,

    Erik

  • Google: TSQL split string

    There are loads of threads about stuff like this.

    Have you considered using XML, or not doing this at all(!)?
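
    If you do go the split-string route, one of the standard tricks you'll find uses XML under the covers. A rough sketch of my own (it assumes the values contain nothing that needs XML escaping):

    -- Rough sketch of the XML split trick (assumes no characters needing XML escaping).
    DECLARE @csv VARCHAR(8000), @x XML;
    SELECT @csv = 'erik, amsterdam, 34';
    SELECT @x = CAST('<v>' + REPLACE(@csv, ',', '</v><v>') + '</v>' AS XML);

    SELECT LTRIM(n.value('.', 'VARCHAR(255)')) AS item
    FROM @x.nodes('/v') AS t(n);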

    I hope you aren't using a production (rather than admin) app to generate permanent tables!

    Have fun and enjoy the comma in "Washington, DC" !

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Hello Tim,

    Actually I was thinking of generating tables in the production environment, but I guess I should reconsider. Thanks for the advice; I will leave this solution path.

    Erik

  • It is an intriguing problem. Such a device can be got to work perfectly happily, and I'd hate to think anyone was put off trying. Am I right in believing that this CSV list is stored in a TEXT field in a table? If so, then I may have some old function lying around that might do the job. (...sounds of rummaging around in the dark recesses of an old tool box...)

    ...if you could, please, elaborate slightly...

    Hmm... well, here goes anyway. Assuming the CSV is coming from a text field in a table, here are the two functions and a sample to illustrate a way of doing this without too much sweat. I've used your example to test it out. (Functions borrowed from Phil Factor, with permission.)

    Obviously it looks a bit long-winded, but once you have the functions, further jobs like this are dead quick to do!

    -- Splits a TEXT value into a table of rows, one row per delimiter-separated chunk.
    CREATE FUNCTION dbo.uftSplitTextToTable
    (
     @TheText Text,
     @RowDelimiter VARCHAR(10)
    )
    RETURNS
    @Results TABLE
    (
     SeqNo INT IDENTITY(1, 1),
     Item VARCHAR(255)
    )
    AS
    BEGIN
    DECLARE @Pos INT,        --index of current search
           @WhereWeAre INT,--index into string so far
           @ii INT, --the number of words found so far
           @size INT,
           @LenDelimiter INT --the length of the delimiter
    
    IF @TheText IS NULL  RETURN
    IF DATALENGTH(@TheText)=0 RETURN
    
    SELECT @pos=8000,        --dummy value so the WHILE loop is entered
           @ii=0,
           @WhereWeAre=1,
           @size=DATALENGTH(@TheText),
           @LenDelimiter=LEN(REPLACE(@rowDelimiter,' ','!')) --LEN ignores trailing spaces, so mask them first
    
    WHILE @pos>0
           BEGIN
           SELECT @pos=PATINDEX('%['+@RowDelimiter+']%',
                                   SUBSTRING(@TheText,@whereWeAre,8000))
           IF @pos>0 
                   BEGIN
                   INSERT INTO @Results(item) SELECT SUBSTRING(@TheText,@whereWeAre,@pos-1)
                   SELECT @WhereWeAre= @WhereWeAre+@pos+@LenDelimiter-1
                   END
           END
    INSERT INTO @Results(item) SELECT SUBSTRING(@TheText,@whereWeAre,@size-@whereWeAre+1)
    
    RETURN 
    END
    
    GO
    
    -- Returns the @which'th @Delimiter-separated item from @String.
    CREATE FUNCTION dbo.ufsSplit
    (
    @String VARCHAR(8000),
    @which INT,
    @Delimiter VARCHAR(10) = ','
    )
    
    RETURNS VARCHAR(8000) AS 
    BEGIN 
    DECLARE @ii INT
    DECLARE @Substring VARCHAR(8000)
    
    SELECT @ii=1, @Substring=''
    
    
    WHILE @ii <= @which 
           BEGIN 
           
           IF (@String IS NULL OR @Delimiter IS NULL )
                   BEGIN
                   SELECT @Substring=''
                    BREAK 
                   END
           
           IF CHARINDEX(@Delimiter,@String) = 0 
                   BEGIN 
                   SELECT @subString = @string
                   SELECT @String=''
                   END 
           ELSE
                   BEGIN
                   SELECT @subString = SUBSTRING( @String, 1, CHARINDEX( @Delimiter, @String )-1)
                   SELECT @String = SUBSTRING( @String, CHARINDEX( @Delimiter, @String )+LEN(@delimiter),LEN(@String))
                   END
           SELECT @ii=@ii+1
           END 
    
    
    RETURN (@subString) 
    END 
    
    
    SELECT         [name]=CONVERT(VARCHAR(20),LTRIM(dbo.ufsSplit(item,1,','))),
           [place]=CONVERT(VARCHAR(20),LTRIM(dbo.ufsSplit(item,2,','))),
           [age]=CONVERT(INT,dbo.ufsSplit(item,3,','))
     FROM dbo.uftSplitTextToTable ('erik, amsterdam, 34
    janet, new-york,30
    paula, berlin, 29','
    ')
    
    
    /*---------------------------------------------------------
    name                 place                age         
    -------------------- -------------------- ----------- 
    erik                 amsterdam            34
    janet                new-york             30
    paula                berlin               29
    
    (3 row(s) affected)*/
  • Dead quick to write, dead slow to run and debug (and what about scalability beyond 8000 characters, or embedded commas?). Admin procedures are allowed to be slow (if necessary); production ones aren't (and it shouldn't be necessary). It is an intriguing (and much-discussed) problem, but if you are at the design phase and can avoid the problem altogether, you should do so, even if it means your code doesn't get to be so interesting.

    One way of looking at it is that the only reason for wanting to wait until runtime to specify your schema is to increase adaptability. But the more adaptable (in the sense of changeable, unpredictable) your schema is, the more work you will have to do to make your DML adaptable. You end up using dynamic SQL, and doing everything at one remove. It's a bit(!) like putting a postcard on the ground, climbing up a ladder and using a string-puppet to write your message, with a little puppet pen held in its wobbly little hand. OK bizarre analogy over.

    If you need adaptability, there are better ways of implementing it. You generally know quite a lot about the data you are going to store (that's how you can get it into a database instead of leaving it in a word document), so you should make use of the information you have to make the schema stable. Relational DBs aren't made to be altered as a part of their normal running. I think you would have a wide range of problems with creating tables at run time, and I would avoid it if possible.

    What is the 'problem space' that got you to this point? How did you envisage the CSVs getting into the text column in the first place? I'm sure the kind folk on these boards would love to help with a juicy design/datamodelling problem. As the previous post exemplifies, there is no shortage of ingenuity here!

    BTW, I think Phil Factor may have had help from his Irish/West African friend Paddy N'Dex (groan)

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • I'm very happy you are giving me all this information. I think I should tell you what my intentions are and why I came up with this not-so-smart solution.

    I am working with scientific data that is based on text strings (DNA, biology). I am using Integration Services (SSIS) to process this data (don't ask me why, it would be another long story). The data, which resides in CSV files (or XLS or whatever the scientist comes up with), always has a different shape: 1 column, 20 columns, different headers and/or column types. So I thought it would be wise to let the user upload his data in CSV format, create a table directly from this CSV (OPENROWSET), configure SSIS for this specific table, and call SSIS on it to store the results in another newly created table.

    The interface the scientists have is an ASP.NET 2.0 website. I ran into the problem of not being able to call OPENROWSET from the website, even after I enabled OPENROWSET in the Surface Area Configuration tool. Something about needing a linked server to access OLE DB.
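
    (For what it's worth, I believe the Surface Area Configuration option I toggled is equivalent to something like this via sp_configure; I may still be missing a permission for the website's login:)

    -- What I believe the Surface Area Configuration option toggles (run as sysadmin).
    -- The ASP.NET login may still need extra rights, e.g. ADMINISTER BULK OPERATIONS
    -- for OPENROWSET(BULK ...).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
    RECONFIGURE;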

    My solution was now to upload the CSV-file into one field of a table. Create a table from this field and run SSIS. I was hoping not to run into denied access things anymore.

    Maybe this shows you why I wanted to do what I am doing. Any other ideas?

    Eriok

  • Ah. That explains it nicely. I was once faced with a very similar problem in processing mechanical test data for engineers. The test equipment, depending on its age, would export its data in a variety of formats and a huge variety of data schemas. It was no use refusing to take anything but XML: a new test machine could cost a fortune. Having a separate pre-processing system in a procedural language (we used C) turned out to be very restrictive.

    After a bit of experimentation, we decided to read every file into a temporary table, line by line. The table had just two columns: an identity field and the contents of the row (as big as the current varchar limit would allow). You can achieve this by using xp_cmdshell with the MS-DOS TYPE command (nowadays one uses CURL, so as to be able to get a file over a network or across the internet). You can then process each line using a stored procedure, depending on the source of the data. Because of the strange nature of some of the data, and the requirement to re-format some of it (such as dates/times or decimal representations), we avoided bulk import methods.
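
    Roughly, a sketch of what I mean (table and file names are just placeholders, and xp_cmdshell has to be enabled):

    -- A sketch only: table and file names are placeholders; xp_cmdshell must be enabled.
    CREATE TABLE #RawLines
    (
        LineNo   INT IDENTITY(1, 1),    -- preserves the original line order
        LineText VARCHAR(8000) NULL     -- one raw line of the incoming file
    );

    -- Read the file line by line via the DOS TYPE command.
    INSERT INTO #RawLines (LineText)
    EXEC master..xp_cmdshell 'TYPE C:\incoming\sample.csv';

    -- xp_cmdshell output usually includes a trailing NULL row; strip it before parsing.
    DELETE FROM #RawLines WHERE LineText IS NULL;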

    This turned out to be very simple and satisfactory, and we were able to extend the system rapidly when the engineers came up with a string of further similar requirements. It also allowed us to debug any problems very quickly.

    I therefore stand by the solution I've posted as a sketch of a perfectly satisfactory system. Phil Factor (who hates being teased about his name) has a blog entry that you might find useful, and from which I pinched (with permission) some of the code above:

    Getting Stuff into SQL Server

  • I can only say: THANKS A LOT!!

    This helped me and I will set it up the same way,

    Erik

  • Not so fast! I agree the solution is workable and suitable (unavoidable) for some situations. Problem-solvers prefer situations with awkward or unusual constraints. It means we have to use ingenuity and makes things more interesting. But if you are in a design phase, it may be possible to avoid the constraints on your solution space, and allow yourself access to a simpler and more robust design route. Remember the old Chinese curse: 'may you live in interesting times'.

    I'm not quite sure whether you want to use the DB to store CSVs which can then be retrieved and opened in Excel, etc. If so, putting the data in a BLOB would be fine, as you don't need to look at what is inside; just extract the binary data to a file and point your user at it. (If you are using IE and the clients have Excel, you can view an Excel file in IE.)

    If on the other hand you want to decompose the data into a relational form, as this line suggests you might:

    >you can then process each line using a stored procedure, depending on the source of the data.

    then you will presumably have some way of telling what schema the data should be converted to (data types, etc.). So my point is: why not do this when the data is input into the system, rather than when it is extracted? If the data needs to be updated, having it already stored in relational tables will make things a lot easier. And by performing the conversion before you store the data, you also validate it, ensuring that it does indeed have the content you think it does. So something like Andrew's suggestion could be suitable as part of the parsing process that separates the values out, along with further routine(s) that process the individual string values and store the resulting data in the appropriate columns.

    You could have a process that generates a new table for the file content data each time a new file format is specified, which includes a file_id FK pointing to a record holding details of the file (dates, user, format etc.), and possibly a row_order column. That way the table can be used to store all data of that format, along the lines of the sketch below. You avoid the problems attendant on creating tables on the fly to hold individual recordsets (e.g. possible contention issues, due to intent-exclusive table locking on sysrowsetcolumns, sysrowsets, sysallocunits, syshobtcolumns, syshobts, sysserefs, sysschobjs, syscolpars, sysidxstats).
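
    A minimal sketch of the sort of thing I mean (all names made up):

    -- A minimal sketch, all names made up: one header table for uploaded files,
    -- plus one content table per file *format* (not per file).
    CREATE TABLE dbo.UploadedFile
    (
        file_id     INT IDENTITY(1, 1) PRIMARY KEY,
        file_name   VARCHAR(255) NOT NULL,
        uploaded_by VARCHAR(128) NOT NULL,
        uploaded_at DATETIME NOT NULL DEFAULT GETDATE(),
        file_format VARCHAR(50) NOT NULL   -- which layout this file follows
    );

    -- Content table for one known format (the name/place/age example).
    CREATE TABLE dbo.PersonFileContent
    (
        file_id   INT NOT NULL REFERENCES dbo.UploadedFile (file_id),
        row_order INT NOT NULL,            -- preserves the original row order
        [name]    VARCHAR(20) NOT NULL,
        place     VARCHAR(20) NOT NULL,
        age       INT NOT NULL,
        PRIMARY KEY (file_id, row_order)
    );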

    In the circumstances Andrew describes,

    >It was no use refusing to take anything but XML.

    But in your case, it might be a feasible solution. It would certainly help to avoid the problems that can arise when data is divorced from its metadata. 

    Or perhaps (conceivably) DTS? Good old-fashioned bulk insert into a staging database (a quick sketch below)? There are quite a few possibilities, and choosing the right approach now, based on a careful assessment of the requirements (and the likelihood of unknown unknowns), could save you a lot of trouble in the future. By all means go down the route you are currently contemplating, but not before considering the alternatives.
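
    For the staging route, something roughly like this (path and table name are placeholders, and the file has to be readable by the SQL Server service account):

    -- A quick sketch of the staging route; path and table name are placeholders,
    -- and the file must be readable by the SQL Server service account.
    BULK INSERT Staging.dbo.PersonUpload
    FROM 'C:\incoming\sample.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);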

    [Thanks are due to the First Horseman of the Apocalypse for the expression 'unknown unknowns']

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant
