Why so may BIN collations?

  • Hello! I want discuss binary collation in SQL but first - two important notes:

    1. Let's talk about SQL Collations only, NOT Windows one.

    2. Let's talk about char/varchar types only, NO unicode in this topic.

    Have in mind these notes my question is: why we need all that stuff like Latin1_General_BIN, Greek_BIN, Turkish_BIN, ... and many other _BINs? As I understand the algorithm is identical for all: take the code (ASCII) for first symbol in first word (column) and take the same for first symbol in second word (column). The lesser "win" (S[0x53] will be "on top", s[0x73] will be bottom). Nice and clear. So, if we replace all that _BINs with one like All_BIN, will this step "good" or "bad"? I don't see any drawback for All_BIN, and what about you?

    Thanks!

  • Hi Shcherbunov,

    I apologise, I don't understand the question.

    If you could post the code I would be glad to have a look.

    KR,

    Iulian

  • I guess it’s for Logical Grouping of collations only.

    Per BOL:

    Binary collations sort data based on the sequence of coded values defined by the locale and data type. A binary collation in SQL Server defines the locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity. For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data.

    More: http://msdn.microsoft.com/en-us/library/ms143350.aspx

  • If you could post the code I would be glad to have a look.

    There is no "error demonstration" code, my question is more philosophical and theoretical. But, the closest I can invent:

    CREATE TABLE T1 (Col1 char(1))

    GO

    SELECT * FROM T1 ORDER BY Col1 COLLATE Cyrillic_General_BIN

    The question is: what will be different if I use Turkish_BIN? Hungarian_BIN? Latin1_General_BIN? They all use the same algorithm: compare the byte value of char (code in ASCII table) and make the char with smaller byte value "win". Why I should choose the "right" collation at all? They all the same! Why I can't just write '...COLLATE Do_ASCII_like_sort'?

  • Dev (1/29/2012)


    I guess it’s for Logical Grouping of collations only.

    Hmm... don't quite grasp you thought, sorry. Do you want to say that all these numerous BINs just for management purpose? No real (functional) difference in work? And I can use ANY _BIN in ANY T-SQL command? Yes, I am myself see the state of affairs like this, but doubt - does my point of view is correct?

    P.S. Just would remind - we talk about NOT unicode symbols only, i.e. char/varchar. Unicode symbol is different story, I know.

  • Please see this article on msdn: http://msdn.microsoft.com/en-us/library/ms143350.aspx

    It states that:

    "...For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data."

    I hope this answer your question. Let me know if this helps.

    KR,

    Iulian

  • Iulian -207023 (1/29/2012)I hope this answer your question.

    To be honest - it doesn't. πŸ™ Let's take this sentence:

    For binary collations on Unicode data types, the locale is not considered in data sorts

    Since I from the beginning of post warn "NO Unicode!" we must rewrite this sentence like:

    For binary collations on NON-Unicode data types, the locale IS considered in data sorts.

    Correct? Then, the next question - why the hell binary collations DO IT? It must take byte 0xE4 in left hand, byte 0xE9 in right, and say: "0xE4 will be first". Finish! Why to take into account locale (code page, in other words)? E4<E9 in russian, english, turkish, zimbabwean and even Martian languages!

  • Shcherbunov Neil (1/29/2012)


    The question is: what will be different if I use Turkish_BIN? Hungarian_BIN? Latin1_General_BIN? They all use the same algorithm: compare the byte value of char (code in ASCII table) and make the char with smaller byte value "win". Why I should choose the "right" collation at all? They all the same! Why I can't just write '...COLLATE Do_ASCII_like_sort'?

    You are misunderstanding how collations work. The collation defines how the available characters sort in each language/locale, so there is a different map from character code point to sort position in each collation. BIN and BIN2 just defines the rules for looking up a character code point in the map. The following script demonstrates the difference, feel free to change the collations used to any two you want to compare:

    WITH

    -- On the fly numbers table

    N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

    N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

    N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

    N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

    Numbers AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

    SELECT

    n,

    CHAR(n),

    ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE SQL_Latin1_General_CP437_BIN),

    ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE SQL_Latin1_General_CP850_BIN)

    FROM Numbers

    WHERE n BETWEEN 0 AND 255

    ORDER BY n;

    Partial output:

  • Another example:

    WITH

    -- On the fly numbers table

    N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

    N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

    N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

    N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

    Numbers AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

    SELECT

    n,

    CHAR(n),

    ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE Greek_BIN2),

    ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE Hungarian_BIN2)

    FROM Numbers

    WHERE n BETWEEN 0 AND 255

    ORDER BY n;

  • Ahhh! Thanks to all of you guys! And many thanks to SQL Kiwi personally. With his last 2 replays I, at last(!), capture the essence of things. Look, it's interesting!! It turned out that line of code

    SELECT ASCII(CHAR(167) COLLATE SQL_Latin1_General_CP850_BIN)

    not always return you 167! By no means! On my computer it's return 245, on you - only you can say. And now I can explain this "miracle", step-by-step:

    -SQL engine take the char (167) and take the current code page (the one you set in Control Panel->Region and Language->Change system locale). In my case code page=1251, Cyrillic.

    -SQL engine find this char (SECTION SIGN, or paragraph) in this code page and "place the mark" for itself: this char has unicode codepoint U+00A7

    -SQL engine take code page 850 (as I point in T-SQL) and "scan" it, in an attempt to find NOT the ASCII-code(167), BUT unicode codepoint U+00A7

    -SQL engine discover: U+00A7 in CP850 is F5(245) ASCII code

    -SQL engine return: 245

    And yes, of course

    SELECT ASCII(CHAR(167) COLLATE Cyrillic_General_BIN)

    return 167, for sure. But, again, this "sure" for my computer only!

    The bottom line: in SQL Server (2008+, at least; think 2005 is same) you CAN'T "eliminate" unicode completely. Even if you work with char/varchar/SQL collations only.

  • I missed to highlight the correct sentence in my last post (and yes, the note about Unicode as well).

    For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page.

    Paul has already explained it nicely... πŸ™‚

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply