Parsing a Text field using TSQL

  • I am looking over a database that I have inherited, and there is one field that I need to work with that is being a pain.

    This is a Text field with multiple "records" with multiple values in each appended into it. These records / values are of the format:

    AAAA;BBBB;CCCC|aaa;BBBBBBB;xxxxx|aaaaaaa;bbb;xx

    Each "record" is separated by a pipe character, and each field is separated by a semicolon.

    Now, in cases where the Datalength of the field is 8k or less, this is relatively simple to parse. However, parsing more than 8k is difficult, and made more so by the fact that the READTEXT function will not place the text read into a variable. So, my question is:

    How would you parse this data, when more than 8k exists in the TEXT field, using only T-SQL.

    Thanks in advance,

    Brendt

  • This is the best I have come up with so far. The first version works for a single row, the second is a version which processes a whole rowset. The code assumes an element length < 100, but this could be up to 8000. It also only splits data once, so you would have to add code to do the second step (splitting the row elements into individual data values) - but that would probably be easier since (we hope) rows will be <8001 chars.

     

    /*******************************************************/
    /**********SPLIT DELIMITED LISTS IN TEXT FIELD**********/
    /***********Copyright (c) Tim WIlkinson 2006************/
    /************All non-commercial use allowed*************/
    /*******************************************************/
     
    --this version loops through the text in a single cell
    --and produces a numbered recordset
    --containing the split values in left-right order.

    declare

    @separator varchar(3)

    declare

    @retval table(item_order int identity, val varchar(100))

    declare

    @text table(txt text)

    declare

    @id int

    declare @i int, @patindex int, @curpos int, @lastpos int, @len int, @val varchar(100)

    select

    @id = 8, @separator = ' '

    --replace reference to the source table. Replace 'de.reference' with a unique key,
    --and 'de.displaytext' with the text column you want to process.

    insert

    @text select DisplayText from dbo.Diary_Entry where reference = @id

    select

    @i = 1, @len = datalength(txt), @curpos = 1 from @text

    while

    @curpos <= @len

    begin

    select @lastpos=@curpos

    ,@patindex=patindex('%'+@separator+'%',substring(txt,@lastpos+1,101))

    ,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end

    ,@val=substring(txt,@lastpos,@patindex)

    ,@curpos=@lastpos+@patindex+1

    from @text
    insert @retval(val) values(@val)

    end
    select

    * from @retval order by item_order

    go

    --this one splits multiple records and produces a recordset with
    --both the id of the record containing the source data,
    --and an item_order number representing the order of elements in the list.
    --it involves making N passes through the data,
    --where N = MAX(number of elements in the list)

    declare

    @separator varchar(3)

    declare

    @retval table(rowid int, item_order int, val varchar(100))

    declare

    @text table(txt text)

    declare

    @val varchar(100), @lastpos int, @len int, @patindex int, @curpos int, @i int, @j-2 int

    declare @vars table(Vlastpos int, Vpatindex int, Vpatindex2 int, Vcurpos int, Vlen int,
    Vkey int, Vtext text, vVal varchar(100), vCtr int)

    select

    @separator = ' ', @j-2 = 0, @i = 1

    insert

    @vars(Vcurpos, Vlen, Vkey, Vtext)

    --replace reference to the source table. Replace 'de.reference' with a unique key,
    --and 'de.displaytext' with the text column you want to process.

    select 1, datalength(de.displaytext), de.reference, de.displaytext from dbo.Diary_Entry de

    while

    @i > 0

    begin

    update v

    set @lastpos=Vcurpos
    ,@len = Vlen
    ,@patindex=patindex('%'+@separator+'%',substring(vtext,@lastpos+1,101))

    ,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end

    ,@curpos=@lastpos+@patindex+1

    ,Vval=substring(vtext,@lastpos,@patindex)

    ,Vcurpos=@curpos

    ,Vlen = @len - @curpos
    ,vCtr = @j-2
    from @vars v
    where Vlen > 0
    select @i = @@rowcount, @j-2 = @j-2 + 1
    insert @retval(rowid, val, item_order) select vkey, vval, @j-2 from @vars where vCtr = @j-2-1

    end
    select

    * from @retval order by rowid, item_order

    go

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Unfortunately, the rows include some with Text of over 30kb size. So, although this would work for about 65% of the rows, I am still stymied as to how to handle the remainder of the rows.

    Thanks for your efforts, though. I appreciate them.

  • Looks like you have some import files stuck in your DB. You want to extract them and use an import process: bcp out, bcp in. Or perhaps ask the DTS fiddlers forum for newly wrapped ways of doing it.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Just out of interest, is the problem with the 30K text values just a performance issue, are your embedded 'rows' > 8K, or is a variable overflowing, or none of the above?

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • Data in text column is not actually data. You need to make data from it before you can use it.

    Use this procedure to recover your data from the text field and place into appropriate tables.

    IF NOT EXISTS (SELECT name FROM sysobjects WHERE  name = N'ListDelimitedLong' AND type = 'P')

    EXEC (' CREATE PROCEDURE dbo.ListDelimitedLong

      @List text,

      @Delim char(1)

     AS

     SELECT @Delim')

    GO

    ALTER PROC dbo.ListDelimitedLong

     @List text, -- A delimiter-separated list of values, this parameter may be up to 2GB in length

     @Delim char(1)

    AS

    SET NOCOUNT ON

    DECLARE @ptrval binary(16)

    DECLARE @One int

    SET @One = 1

    DECLARE @LeftPointer int, @RightPointer int, @SubStr varchar(50)

    SET @LeftPointer = 0

    DECLARE @val TABLE (IntVal bigint )

    CREATE TABLE #TXTList (List text)

    INSERT INTO #TXTList

    SELECT @List

    SELECT @ptrval = TEXTPTR(List) FROM #TXTList

    WHILE @RightPointer < (SELECT DATALENGTH(List) FROM #TXTList) OR @RightPointer IS NULL

    BEGIN

     SELECT @RightPointer = CHARINDEX(@Delim, SUBSTRING(List, @LeftPointer+@One, DATALENGTH(List)) )

           + @LeftPointer - @One FROM #TXTList

     IF @RightPointer <= @LeftPointer

      SELECT @RightPointer = DATALENGTH(List) FROM #TXTList

     SELECT @SubStr = SUBSTRING(List, @LeftPointer + @One, @RightPointer-@LeftPointer)

     FROM #TXTList

     SELECT @LeftPointer = @RightPointer + DATALENGTH(@Delim)

    END

     SELECT DISTINCT IntVal

     FROM @val

    GO

    _____________
    Code for TallyGenerator

  • But you can't do it setwise, of course.

    MS have deprecated use of READTEXT with text pointer saying it will be phased out, and recommended using SUBSTRING instead. Does anyone know any more about this?

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • You can use Substring to access blocks of text from a text column (or ntext). The blocks are limited by the maximum varchar sizes, which are varchar=8000 and nvarchar=4000. 

    The simplest way would be to loop until the amount of text retrieve with the substring() function is less than the block size (i.e. the length of the text requested in the substring function).  I use Datalength to do this because Len() doesn't count trailing spaces.

    The problem with using Substring() in a loop is that the data may be variable length, and thus a "record" could span two blocks of text.  You could use CharIndex to locate the pipes symbols and incorporate that into the loop.

    Anyway, here's a simple loop that prints out the text in query analyzer (I have set the maximum number of characters per columns in Tools|Options |Results to 8000):

    DECLARE @buffer varchar(4000)

          , @blocksize int

          , @blockPtr int

    SET @blocksize = 4000

    SET @buffer = ''

    SET @blockPtr = 1

    WHILE 1=1

    BEGIN

      SELECT @buffer = Substring(TextColumn, @blockPtr, @blockSize)

        FROM MyTable

       WHERE pkey= <A_Key_Value>

      PRINT @buffer

      IF DataLength(@buffer) < @blocksize

      BEGIN

        BREAK

      END

      SET @blockPtr = @blockPtr + @blocksize

    END --WHILE

  • This one will fetch a single field value at a time and sort out the row numbering in the process. So as long as none of your data values (+ trailing separator!) exceed 8000 chars, you'll be OK on that score.

    --These two lines:

    ,@patindexF=patindex('%'+@separatorF+'%',substring(vtext,@lastpos,8000))

    ,@patindexR=patindex('%'+@separatorR+'%',substring(vtext,@lastpos,8000))

    --require altering to allow a full 8000 characters in your data values.

    More than that in a single field, and the SQL approach is reaching its limits.

    You don't get a recordset per input row; the data is unpivoted into sourcerecordID, recordID, fieldID, value. You would need to create appropriate tables and pivot the data into those.

    declare

    @separatorR varchar(3),@separatorF varchar(3)

    --field delimiter, row delimiter
    --Set up for testing on natural language text:
    --sentence = record, word = field value.
    --note this means it can cope with ragged 'recordsets'.

    select

    @separatorF = ' ', @separatorR = '. '

    -------------------------------------------------------

    declare

    @retval table(sourcerowid int, recordid int, field_order int, val varchar(8000))

    declare

    @val varchar(8000), @lastpos int, @len int, @patindex int, @patindexF int, @patindexR int

    ,

    @curpos int, @i int, @j-2 int, @seplenF int, @seplenR int, @incrlen int, @LastField bit, @FirstField bit

    declare @vars table(Vlastpos int, Vcurpos int, Vlen int, Vkey int, Vtext text, vVal varchar(8000),vFieldCtr int, vrowCtr int, vCtr int, vNewRow bit)

    select

    @separatorF = ' ', @separatorR = '.', @j-2 = 0, @i = 1, @seplenF = datalength(@separatorF), @seplenR = datalength(@separatorR)

    insert

    @vars(Vcurpos, Vlen, Vkey, Vtext, vFieldCtr, vRowCtr, vNewRow)

    --replace reference to the source table. Replace 'de.reference' with a unique key,
    --and 'de.displaytext' with the text column you want to process.

    select 1, datalength(de.displaytext), de.reference, de.displaytext,0,0,1 from dbo.Diary_Entry de

    ----------------------------------------------------------------

    while

    @i > 0

    begin

    update v
    set @FirstField=vNewRow
    ,@lastpos=Vcurpos
    ,@len = Vlen
    ,@patindexF=patindex('%'+@separatorF+'%',substring(vtext,@lastpos,8000)) --should = max field len + len of largest delimiter.
    ,@patindexR=patindex('%'+@separatorR+'%',substring(vtext,@lastpos,8000))
    ,@LastField= case when @patindexF = 0 then 1 when @patindexR > 0 and @patindexF > @patindexR then 1 else 0 end
    ,@patindex=case @LastField when 1 then @patindexR else @patindexF end
    ,@incrlen=case @LastField when 1 then @seplenR else @seplenF end
    ,@patindex=case @patindex when 0 then @len-@lastpos else @patindex end
    ,@curpos=@lastpos+@patindex+@incrlen-1
    ,Vval=substring(vtext,@lastpos,@patindex-1)
    ,Vcurpos=@curpos
    ,Vlen = @len - @curpos
    ,vCtr = @j-2
    ,vRowCtr = case @FirstField when 1 then vRowCtr+1 else vRowCtr end
    ,vFieldCtr = case @FirstField when 1 then 1 else vFieldCtr + 1 end
    ,vNewRow = @LastField
    from @vars v
    where Vlen > 0
    insert @retval(sourcerowid, val, recordid, field_order) select vkey, vval, vRowCtr, vFieldCtr
    from @vars where vCtr = @j-2
    select @i = @@rowcount, @j-2 = @j-2 + 1

    end
    select

    * from @retval order by sourcerowid, recordid, field_order

    go

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply