Why so may BIN collations?

Question

Post reply

Why so may BIN collations?

Shcherbunov Neil

SSCommitted

Points: 1525
More actions
January 29, 2012 at 1:53 am

#267340

Hello! I want discuss binary collation in SQL but first - two important notes:
1. Let's talk about SQL Collations only, NOT Windows one.
2. Let's talk about char/varchar types only, NO unicode in this topic.
Have in mind these notes my question is: why we need all that stuff like Latin1_General_BIN, Greek_BIN, Turkish_BIN, ... and many other _BINs? As I understand the algorithm is identical for all: take the code (ASCII) for first symbol in first word (column) and take the same for first symbol in second word (column). The lesser "win" (S[0x53] will be "on top", s[0x73] will be bottom). Nice and clear. So, if we replace all that _BINs with one like All_BIN, will this step "good" or "bad"? I don't see any drawback for All_BIN, and what about you?
Thanks!

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Iulian -207023 SSCertifiable Points: 7510 More actions · Answer 1

Hi Shcherbunov,

I apologise, I don't understand the question.

If you could post the code I would be glad to have a look.

KR,

Iulian

Devendrakumar Shirbad SSC-Forever Points: 42493 More actions · Answer 2

I guess it’s for Logical Grouping of collations only.

Per BOL:

Binary collations sort data based on the sequence of coded values defined by the locale and data type. A binary collation in SQL Server defines the locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity. For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data.

More: http://msdn.microsoft.com/en-us/library/ms143350.aspx

Shcherbunov Neil SSCommitted Points: 1525 More actions · Answer 3

If you could post the code I would be glad to have a look.

There is no "error demonstration" code, my question is more philosophical and theoretical. But, the closest I can invent:

CREATE TABLE T1 (Col1 char(1))

GO

SELECT * FROM T1 ORDER BY Col1 COLLATE Cyrillic_General_BIN

The question is: what will be different if I use Turkish_BIN? Hungarian_BIN? Latin1_General_BIN? They all use the same algorithm: compare the byte value of char (code in ASCII table) and make the char with smaller byte value "win". Why I should choose the "right" collation at all? They all the same! Why I can't just write '...COLLATE Do_ASCII_like_sort'?

Shcherbunov Neil SSCommitted Points: 1525 More actions · Answer 4

Dev (1/29/2012)
I guess it’s for Logical Grouping of collations only.

Hmm... don't quite grasp you thought, sorry. Do you want to say that all these numerous BINs just for management purpose? No real (functional) difference in work? And I can use ANY _BIN in ANY T-SQL command? Yes, I am myself see the state of affairs like this, but doubt - does my point of view is correct?

P.S. Just would remind - we talk about NOT unicode symbols only, i.e. char/varchar. Unicode symbol is different story, I know.

Iulian -207023 SSCertifiable Points: 7510 More actions · Answer 5

Please see this article on msdn: http://msdn.microsoft.com/en-us/library/ms143350.aspx

It states that:

"...For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data."

I hope this answer your question. Let me know if this helps.

KR,

Iulian

Shcherbunov Neil SSCommitted Points: 1525 More actions · Answer 6

Iulian -207023 (1/29/2012)I hope this answer your question.

To be honest - it doesn't. 🙁 Let's take this sentence:

For binary collations on Unicode data types, the locale is not considered in data sorts

Since I from the beginning of post warn "NO Unicode!" we must rewrite this sentence like:

For binary collations on NON-Unicode data types, the locale IS considered in data sorts.

Correct? Then, the next question - why the hell binary collations DO IT? It must take byte 0xE4 in left hand, byte 0xE9 in right, and say: "0xE4 will be first". Finish! Why to take into account locale (code page, in other words)? E4<E9 in russian, english, turkish, zimbabwean and even Martian languages!

Paul White SSC Guru Points: 150457 More actions · Answer 7

Shcherbunov Neil (1/29/2012)
The question is: what will be different if I use Turkish_BIN? Hungarian_BIN? Latin1_General_BIN? They all use the same algorithm: compare the byte value of char (code in ASCII table) and make the char with smaller byte value "win". Why I should choose the "right" collation at all? They all the same! Why I can't just write '...COLLATE Do_ASCII_like_sort'?

You are misunderstanding how collations work. The collation defines how the available characters sort in each language/locale, so there is a different map from character code point to sort position in each collation. BIN and BIN2 just defines the rules for looking up a character code point in the map. The following script demonstrates the difference, feel free to change the collations used to any two you want to compare:

WITH

-- On the fly numbers table

N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

Numbers AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

SELECT

n,

CHAR(n),

ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE SQL_Latin1_General_CP437_BIN),

ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE SQL_Latin1_General_CP850_BIN)

FROM Numbers

WHERE n BETWEEN 0 AND 255

ORDER BY n;

Partial output:

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150457 More actions · Answer 8

Another example:

WITH

-- On the fly numbers table

N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),

N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),

N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),

N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N3 AS R),

Numbers AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) AS n FROM N4)

SELECT

n,

CHAR(n),

ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE Greek_BIN2),

ROW_NUMBER() OVER (ORDER BY CHAR(n) COLLATE Hungarian_BIN2)

FROM Numbers

WHERE n BETWEEN 0 AND 255

ORDER BY n;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Shcherbunov Neil SSCommitted Points: 1525 More actions · Answer 9

Ahhh! Thanks to all of you guys! And many thanks to SQL Kiwi personally. With his last 2 replays I, at last(!), capture the essence of things. Look, it's interesting!! It turned out that line of code

SELECT ASCII(CHAR(167) COLLATE SQL_Latin1_General_CP850_BIN)

not always return you 167! By no means! On my computer it's return 245, on you - only you can say. And now I can explain this "miracle", step-by-step:

-SQL engine take the char (167) and take the current code page (the one you set in Control Panel->Region and Language->Change system locale). In my case code page=1251, Cyrillic.

-SQL engine find this char (SECTION SIGN, or paragraph) in this code page and "place the mark" for itself: this char has unicode codepoint U+00A7

-SQL engine take code page 850 (as I point in T-SQL) and "scan" it, in an attempt to find NOT the ASCII-code(167), BUT unicode codepoint U+00A7

-SQL engine discover: U+00A7 in CP850 is F5(245) ASCII code

-SQL engine return: 245

And yes, of course

SELECT ASCII(CHAR(167) COLLATE Cyrillic_General_BIN)

return 167, for sure. But, again, this "sure" for my computer only!

The bottom line: in SQL Server (2008+, at least; think 2005 is same) you CAN'T "eliminate" unicode completely. Even if you work with char/varchar/SQL collations only.

Devendrakumar Shirbad SSC-Forever Points: 42493 More actions · Answer 10

I missed to highlight the correct sentence in my last post (and yes, the note about Unicode as well).

For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page.

Paul has already explained it nicely... 🙂