Identifying non-letters and non-integers

  • All

    We have to implement some validation on data sent to us to ensure that we only allow letters (English alphabet) and integers. Some fields also allow other characters. The field is of variable length, so we used the REPLICATE function in the following code to build a string for use by the LIKE operator:

    SELECTL6

    FROM[tablename]

    WHEREL6 NOT LIKE REPLICATE('[0-9A-Z]',LEN(L6))

    We thought that this was working fine, but have now found out that this code has allowed through some characters which should not be valid. All of the characters below are deemed by SQL Server 2005 to be between 0 and 9 or A and Z, but should not be allowed by our system:

    ƒŠŒ™šœŸª²³¹º¼½¾ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

    SQL Server even thinks that the "squared" "²" is the same as "2":

    select case when '²'='2' then 'y' else 'n' end

    returns "y"

    The proposed solution is to build a table containing the ASCII values of the acceptable characters for each field (or "All" where characters are universally accepted), then use lots of SUBSTRING and ASCII functions, similar to this:

    SELECTL6

    FROM[tablename]

    WHERE(ASCII(LEFT(L6,1)) NOT IN (SELECT ASCIICode FROM ValidCharsPerField WHERE FieldNumber IN ('L6','All')))

    OR (LEN(L6)>1 AND ASCII(SUBSTRING(L6,2,1)) NOT IN (SELECT ASCIICode FROM ValidCharsPerField WHERE FieldNumber IN ('L6','All')))

    OR (LEN(L6)>2 AND ASCII(SUBSTRING(L6,3,1)) NOT IN (SELECT ASCIICode FROM ValidCharsPerField WHERE FieldNumber IN ('L6','All')))

    This sample code only checks the first 3 characters; we have up to 80 characters to check, depending on the field, so the WHERE clause is going to be quite long.

    Have any of you come across this before?

    Is there some way that we can get SQL Server to classify these non-standard characters as not being between 0 to 9 or A to Z, perhaps by amending the collation on the table that the data is imported to?

    The code below should be useful if anyone wants to test some code:

    create table testdata (L6 varchar(80))

    insert into testdata

    select '12345' as data union all --should pass

    select 'test4' union all --should pass

    select 'kj&98' union all --should fail

    select 'tËst' union all --should fail

    select 'ad9½A' union all --should fail

    select '5m²' --should fail

    CREATE TABLE ValidCharsPerField

    (FieldNumber VARCHAR(5),

    PrintableCharacter CHAR(1),

    ASCIICode int,

    CharacterType VARCHAR(6))

    declare @loopcounter int

    --insert numbers

    set @loopcounter=48

    while @loopcounter<58

    begin

    INSERT INTO ValidCharsPerField SELECT 'All',char(@loopcounter),@loopcounter,'Number'

    set @loopcounter=@loopcounter+1

    end

    --insert letters

    set @loopcounter=65

    while @loopcounter<91

    begin

    INSERT INTO ValidCharsPerField SELECT 'All',char(@loopcounter),@loopcounter,'Letter'

    set @loopcounter=@loopcounter+1

    end

    --insert more letters

    set @loopcounter=97

    while @loopcounter<123

    begin

    INSERT INTO ValidCharsPerField SELECT 'All',char(@loopcounter),@loopcounter,'Letter'

    set @loopcounter=@loopcounter+1

    end

    Thanks all

    Alun

  • Sorry, I should have said that we are currently using Latin1_General_CI_AS collation.

    Thanks

  • Would a binary collation work?

    SELECT L6

    FROM testdata

    WHERE L6 LIKE '%[^0-9A-Za-z]%' COLLATE Latin1_General_BIN

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Hi.

    Yes, that works perfectly. I think I'll need to read up about binary collations. I'd never heard of them!

    Thanks very much for your help

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply